From SNPs to phenotype

A pipeline to relate Single Nucleotide Polymorphisms (SNPs) to a continuous phenotype using either a Random Forest approach from the MUVR package or a canonical GWAS pipeline from RAINBOWR.

Both methods will perform a Genome Wide Analysis (GWAS) using genetic variant (Variant Call Format) and phenotype information.

PLINK
1. Inputs and outputs
2. Installation
- Install RStudio
- Clone the repository
3. Test
- Example datasets
  - VCF dataset
  - Phenotypes
- Test runs
  - Random Forest
  - RAINBOWR GWAS
4. References
5. Troubleshooting
- RAINBOWR package
6. Useful links

PLINK

mydata is the prefix of the .ped and .map files.

plink --file mydata --pheno pheno.txt --pheno-name Phenotype --perm --threads 20 --assoc

1. Inputs and outputs

Both MUVR Random Forest and RAINBOWR methods use the same input files (VCF and phenotype). They also have similar outputs (table of significant SNPs) with some disctint graphs and tables.

Inputs

VCF file

A Variant Call Format (VCF) file that contains the nucleotidic variations is needed. The VCF file has to be of version 4.0 and higher. It will be converted to a genotype matrix using vcfR where genotypes are represented by:

"0" for genome positions homozygous for the reference allele.
"2" for genome positions homozygous for the alternative allele.
"1" for genome positions heterozygous (one reference allele, one alternative allele).

Some useful tools to work with VCF files:

VCFtools can be used to select split the VCF file into chromosomes: vcftools --gzvcf <input_file.vcf.gz> --chr 1

Phenotype file

A tabulated separated file containing two columns:

The identifier for each individual. Column name should be id.
A phenotype column that contains the phenotypic values. Column name should be phenotype

Common outputs

Table of most important SNPs: a table of the most important SNPs related to the phenotype along with their p-values.

Distinct outputs

Random Forest

Plot of the model Q2: a plot of the model Q2 metric compared to a distribution of random Q2 values obtained using N permutations (e.g. N = 100).

GWAS RAINBOWR

Manhattan plot: a plot of the SNP chromosome position (x-axis) against the $$-log_{10}$$ of the computed p-value of each SNP.

2. Installation

Install RStudio

If not already done, install R (version >= 3.5) and RStudio for your Operating System.

Clone the repository

In the Shell, type git clone /~https://github.com/SilkeAllmannLab/gwas.git

🎉 🎊 That's it! 🎉 🎊

3. Test

Example datasets

All input data are to be found in data/.

VCF dataset

The VCF data from a study from Arouisse et al. 2019.
The VCF file itself can be downloaded directly from FigShare (file is called "Arabidopsis_2029_Maf001_Filter80").

Number of SNPs per chromosome:

chrom	# of SNPs
chr01	734,401
chr02	513,750
chr03	616,986
chr04	496,272
chr05	661,335

Several files were then generated from this initial big VCF file:

Chromosome VCF files: for instance, for chromosome 1,chr01.Arabidopsis_2029_Maf001_Filter80.vcf.gz
Randomly subsampled chromosome VCF files: 10% or more of the initial SNPs were subsampled and extracted on a per-chromosome basis. For instance, for chromosome 1,chr01.subsampled_10_percent.Arabidopsis_2029_Maf001_Filter80.vcf.gz
Test file: a small subset available for testing purposes. It can be found at data/Arabidopsis_2029_Maf001_Filter80.1000lines.vcf.

Examples of code used:

To subsample 10% of the original VCF file restricted to chromosome 4 (from ~490,000 variants to 49,000):

gzip -c -d VCF_chr04.recode.vcf.gz |tail -n +16|shuf -n 49000 > chr04.49k_SNPs.no_header.vcf \
  && cat VCF.header.txt chr04.49k_SNPs.no_header.vcf > chr04.49k_SNPs.vcf \
  && gzip chr04.49k_SNPs.vcf \
  && rm  chr04.49k_SNPs.no_header.vcf

To create one VCF file for each Arabidopsis chromosome:

for i in {1..5}; 
  do echo "working on chromosome $i";
  vcftools  --gzvcf Arabidopsis_2029_Maf001_Filter80.vcf.gz  --chr $i  --recode --recode-INFO-all --out  VCF_chr0$i; 
done

Phenotypes

A root_data_fid_and_names.tsv file contains the genotype line identifier and the phenotypic values. Arabidopsis ecotypes were treated with 2-E-hexanal and their main root length measured. A response ratio was then calculated for each of the ecotypes by comparing the 2-E-hexanal treatment with a mock (methanol).

Test runs

Random Forest

Open a new Shell window.
Execute the random forest script: Rscript random_forest/gwas.R --help to see the full list of arguments.

RAINBOWR GWAS

Open RStudio
Run the rainbowr/rainbowr.R script. Make sure you specify a valid VCF file path and phenotype file.

4. References

✍️ Authors

Marc Galland (@mgalland)
Martha van Os (@MvanOs)
Machiel Clige (@MCligge)

vcfR

The R package vcfR is heavily described here.

RAINBOWR

The RAINBOWR package for R is described here.

5. Troubleshooting

RAINBOWR package

Sometimes, you might have issues with the installation of the Rccp package which is a dependency of RAINBOWR.

In the Shell, try: apt-get install r-base-dev

6. Useful links

There are 4 important parameters to set when building an RF model:

nTree: Number of trees in the forest
mTry: Number of variants evaluated at each node of a tree
maxD: Maximum depth of a tree to grow
minNS: Minimum number of samples in a node to be processed

Name		Name	Last commit message	Last commit date
Latest commit History 130 Commits
data		data
img		img
info		info
scripts		scripts
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.travis.yml		.travis.yml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
config.yml		config.yml
plink_help.txt		plink_help.txt
plink_to_genotype_matrix.R		plink_to_genotype_matrix.R
rainbowr_gwas.R		rainbowr_gwas.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

From SNPs to phenotype

PLINK

1. Inputs and outputs

Inputs

VCF file

Phenotype file

Common outputs

Distinct outputs

Random Forest

GWAS RAINBOWR

2. Installation

Install RStudio

Clone the repository

3. Test

Example datasets

VCF dataset

Phenotypes

Test runs

Random Forest

RAINBOWR GWAS

4. References

✍️ Authors

vcfR

RAINBOWR

5. Troubleshooting

RAINBOWR package

6. Useful links

About

Releases 1

Packages

Contributors 3

Languages

License

SilkeAllmannLab/gwas

Folders and files

Latest commit

History

Repository files navigation

From SNPs to phenotype

PLINK

1. Inputs and outputs

Inputs

VCF file

Phenotype file

Common outputs

Distinct outputs

Random Forest

GWAS RAINBOWR

2. Installation

Install RStudio

Clone the repository

3. Test

Example datasets

VCF dataset

Phenotypes

Test runs

Random Forest

RAINBOWR GWAS

4. References

✍️ Authors

vcfR

RAINBOWR

5. Troubleshooting

RAINBOWR package

6. Useful links

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 3

Languages

Packages