A pipeline to relate Single Nucleotide Polymorphisms (SNPs) to a continuous phenotype using either a Random Forest approach from the MUVR
package or a canonical GWAS pipeline from RAINBOWR
.
Both methods will perform a Genome Wide Analysis (GWAS) using genetic variant (Variant Call Format) and phenotype information.
- PLINK
- 1. Inputs and outputs
- 2. Installation
- 3. Test
- 4. References
- 5. Troubleshooting
- 6. Useful links
mydata is the prefix of the .ped and .map files.
plink --file mydata --pheno pheno.txt --pheno-name Phenotype --perm --threads 20 --assoc
Both MUVR Random Forest and RAINBOWR methods use the same input files (VCF and phenotype). They also have similar outputs (table of significant SNPs) with some disctint graphs and tables.
A Variant Call Format (VCF) file that contains the nucleotidic variations is needed. The VCF file has to be of version 4.0 and higher. It will be converted to a genotype matrix using vcfR where genotypes are represented by:
- "0" for genome positions homozygous for the reference allele.
- "2" for genome positions homozygous for the alternative allele.
- "1" for genome positions heterozygous (one reference allele, one alternative allele).
Some useful tools to work with VCF files:
VCFtools can be used to select split the VCF file into chromosomes: vcftools --gzvcf <input_file.vcf.gz> --chr 1
A tabulated separated file containing two columns:
- The identifier for each individual. Column name should be
id
. - A phenotype column that contains the phenotypic values. Column name should be
phenotype
Table of most important SNPs: a table of the most important SNPs related to the phenotype along with their p-values.
- Plot of the model Q2: a plot of the model Q2 metric compared to a distribution of random Q2 values obtained using N permutations (e.g. N = 100).
-
Manhattan plot: a plot of the SNP chromosome position (x-axis) against the
$$-log_{10}$$ of the computed p-value of each SNP.
If not already done, install R (version >= 3.5) and RStudio for your Operating System.
In the Shell, type git clone /~https://github.com/SilkeAllmannLab/gwas.git
π π That's it! π π
All input data are to be found in data/
.
The VCF data from a study from Arouisse et al. 2019.
The VCF file itself can be downloaded directly from FigShare (file is called "Arabidopsis_2029_Maf001_Filter80").
Number of SNPs per chromosome:
chrom | # of SNPs |
---|---|
chr01 | 734,401 |
chr02 | 513,750 |
chr03 | 616,986 |
chr04 | 496,272 |
chr05 | 661,335 |
Several files were then generated from this initial big VCF file:
- Chromosome VCF files: for instance, for chromosome 1,
chr01.Arabidopsis_2029_Maf001_Filter80.vcf.gz
- Randomly subsampled chromosome VCF files: 10% or more of the initial SNPs were subsampled and extracted on a per-chromosome basis. For instance, for chromosome 1,
chr01.subsampled_10_percent.Arabidopsis_2029_Maf001_Filter80.vcf.gz
- Test file: a small subset available for testing purposes. It can be found at
data/Arabidopsis_2029_Maf001_Filter80.1000lines.vcf
.
Examples of code used:
To subsample 10% of the original VCF file restricted to chromosome 4 (from ~490,000 variants to 49,000):
gzip -c -d VCF_chr04.recode.vcf.gz |tail -n +16|shuf -n 49000 > chr04.49k_SNPs.no_header.vcf \
&& cat VCF.header.txt chr04.49k_SNPs.no_header.vcf > chr04.49k_SNPs.vcf \
&& gzip chr04.49k_SNPs.vcf \
&& rm chr04.49k_SNPs.no_header.vcf
To create one VCF file for each Arabidopsis chromosome:
for i in {1..5};
do echo "working on chromosome $i";
vcftools --gzvcf Arabidopsis_2029_Maf001_Filter80.vcf.gz --chr $i --recode --recode-INFO-all --out VCF_chr0$i;
done
A root_data_fid_and_names.tsv
file contains the genotype line identifier and the phenotypic values. Arabidopsis ecotypes were treated with 2-E-hexanal and their main root length measured. A response ratio was then calculated for each of the ecotypes by comparing the 2-E-hexanal treatment with a mock (methanol).
- Open a new Shell window.
- Execute the random forest script:
Rscript random_forest/gwas.R --help
to see the full list of arguments.
- Open RStudio
- Run the
rainbowr/rainbowr.R
script. Make sure you specify a valid VCF file path and phenotype file.
- Marc Galland (@mgalland)
- Martha van Os (@MvanOs)
- Machiel Clige (@MCligge)
The R package vcfR is heavily described here.
The RAINBOWR package for R is described here.
Sometimes, you might have issues with the installation of the Rccp
package which is a dependency of RAINBOWR
.
In the Shell, try: apt-get install r-base-dev
There are 4 important parameters to set when building an RF model:
-
nTree: Number of trees in the forest
-
mTry: Number of variants evaluated at each node of a tree
-
maxD: Maximum depth of a tree to grow
-
minNS: Minimum number of samples in a node to be processed