Processing scripts and documentation for the soybean downsampling sub-project
This project had two major goals:
- Perform variant calling on the eight whole-genome accessions from Kono et al. 2016 against a newer version of the soybean reference for use as a resource in future projects.
- Determine the tradeoff between sequencing depth and the power to detect variants in soybean.
To accomplish these goals, three of the accessions sequenced to ~30x were downsampled in 5x steps down to 10x, and variant discovery was performed at each depth level. The final calls were compared to genotypes from the SoySNP50k chip to determine accuracy and detection power.
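The depth ladder can be sketched as a small calculation: the fraction of reads to keep is the target depth divided by the measured mean coverage. This is a minimal illustration, not the project's own code. The function name `keep_fractions` and the target levels 25x, 20x, 15x, and 10x are assumptions based on the description above.

```shell
#!/usr/bin/env bash
# Print the fraction of reads to keep for each target depth.
# Argument: the measured mean coverage of the full-depth sample.
# The target depths below are assumed from the "5x steps down to 10x" description.
keep_fractions() {
    local mean_cov="$1" target
    for target in 25 20 15 10; do
        # Skip targets that are not below the measured coverage.
        awk -v t="$target" -v m="$mean_cov" \
            'BEGIN { if (t < m) printf "%dx\t%.4f\n", t, t / m }'
    done
}

keep_fractions 30
```

For a sample at exactly 30x, the 25x level corresponds to keeping a fraction of 0.8333 of the reads, and the 10x level to 0.3333.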
Raw FASTQ files were downloaded from the SRA using the script SRA_download.sh, which relies on Tom Kono's SRA_Fetch.sh. The list of SRR run numbers that were downloaded is found in the file soy_run_numbers.txt. The .sra files were validated and split into forward and reverse FASTQ files using FASTQ_dumper.sh.
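The validate-and-split step could look roughly like the sketch below. It is not the content of FASTQ_dumper.sh. It assumes the SRA Toolkit commands `vdb-validate` and `fastq-dump` are on the PATH and that each run was saved as `<run>.sra`. The function name `split_runs` and the `DRY_RUN` switch are hypothetical.

```shell
#!/usr/bin/env bash
# Validate and split every run listed in a file of SRR numbers (one per line).
# By default the commands are only printed; set DRY_RUN=0 to execute them.
# Assumes each download was saved as <run>.sra in the current directory.
split_runs() {
    local list="$1" outdir="$2" run
    while read -r run || [ -n "$run" ]; do
        [ -z "$run" ] && continue
        if [ "${DRY_RUN:-1}" -eq 1 ]; then
            echo "vdb-validate ${run}.sra && fastq-dump --split-files --gzip --outdir ${outdir} ${run}.sra"
        else
            # --split-files writes <run>_1.fastq.gz and <run>_2.fastq.gz for paired runs.
            vdb-validate "${run}.sra" &&
                fastq-dump --split-files --gzip --outdir "$outdir" "${run}.sra"
        fi
    done < "$list"
}

# Example: split_runs soy_run_numbers.txt fastq
```

Printing the commands first makes it easy to check the run list before starting a long extraction.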
Links to the SRA accessions for the samples:
- IA3023
- M92_220
- Noir
- Minsoy
- Archer
- Williams
- Williams 82 ISU
- Glycine Soya single
- Glycine Soya 35bp paired
- Glycine Soya 76bp paired
After splitting, the FASTQ files for M92_220 were concatenated with zcat. This was not necessary for the other samples.
zcat SRR1164607_1.fastq.gz \
    SRR1164608_1.fastq.gz \
    SRR1164609_1.fastq.gz \
    SRR1164610_1.fastq.gz \
    SRR1164611_1.fastq.gz \
    SRR1164612_1.fastq.gz \
    SRR1164613_1.fastq.gz \
    SRR1164614_1.fastq.gz > M92_220_1.fastq
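The command above covers the forward reads. A hedged sketch of the same concatenation done for both read directions is shown here. The function name `concat_sample` is hypothetical, and `gzip -dc` is used in place of `zcat` only because it behaves the same on Linux and macOS. Both directions must list the runs in the same order so that read pairs stay in sync.

```shell
#!/usr/bin/env bash
# Concatenate the per-run FASTQ files of one sample, once per read direction.
# Arguments: sample name, then the run numbers in the order to concatenate.
# Expects <run>_1.fastq.gz and <run>_2.fastq.gz in the current directory.
concat_sample() {
    local sample="$1" mate run
    shift
    for mate in 1 2; do
        # The same run order is used for both mates, keeping pairs aligned.
        for run in "$@"; do
            gzip -dc "${run}_${mate}.fastq.gz"
        done > "${sample}_${mate}.fastq"
    done
}

# Example: concat_sample M92_220 SRR1164607 SRR1164608 SRR1164609
```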
The FASTQ files were adapter-trimmed, mapped to the reference, and converted to BAM format using commit e82460c of sequence_handling. The config file used to run sequence_handling is found in Config. The following handlers were executed in the listed order:
- Quality_Assessment: Quality summary output for each FASTQ is located in SOYDOWN_quality_summary.txt
- Adapter_Trimming
- Read_Mapping: Different read mapping parameters were used to compensate for different read lengths, and are summarized in Read_Mapping_Parameters.txt. The two different Williams samples and the three different Glycine soya samples were merged after read mapping using samtools merge.
- SAM_Processing: Mapping summary output for each BAM is located in SOYDOWN_mapping_summary.txt
- Coverage_Mapping: Coverage summary output for each BAM is located in SOYDOWN_coverage_summary.txt. The mean coverage statistic was used as the basis for downsampling each sample. Downsample.sh was run on each raw SAM file using the percentages found in Downsampling_Percentages.xlxs. The downsampled SAM files were processed to BAM files using SAM_Processing and the downsampled coverage was double-checked using Coverage_Mapping.
- Haplotype_Caller
- Genotype_GVCFs: After Genotype_GVCFs, the VCF parts for each sample were merged into a single file using VCFtools vcf-concat
- Variant_Filtering: Different filtering parameters were used for each depth level and are summarized in Variant_Filtering_Parameters.txt
- Variant_Analysis: Heterozygosity, missingness, Ts/Tv, MAF histogram, and SNP count outputs for each depth level are located here
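The merging and downsampling steps in the list above can be sketched with standard tools. This is an illustration under stated assumptions, not the content of Downsample.sh, whose method is not described here. The sketch uses `samtools view -s`, which takes a single SEED.FRACTION value: the integer part seeds the random number generator and the decimal part is the fraction of read pairs kept. All function names, the default seed, and the file names in the examples are hypothetical.

```shell
#!/usr/bin/env bash
# Build the SEED.FRACTION argument for `samtools view -s`.
# Arguments: integer seed, fraction to keep (must be below 1).
subsample_arg() {
    local seed="$1" frac="$2"
    awk -v s="$seed" -v f="$frac" 'BEGIN { printf "%.4f\n", s + f }'
}

# Randomly keep a fraction of the read pairs from a SAM file, header included.
# Arguments: input SAM, output SAM, fraction to keep, optional seed (assumed default 42).
downsample_sam() {
    local input="$1" output="$2" frac="$3" seed="${4:-42}"
    samtools view -h -s "$(subsample_arg "$seed" "$frac")" -o "$output" "$input"
}

# Merge several BAM files from one accession into one.
# Arguments: output BAM, then the input BAMs.
merge_bams() {
    local output="$1"
    shift
    samtools merge "$output" "$@"
}

# Concatenate the VCF parts of one sample; parts must share the same columns.
# Arguments: output VCF, then the VCF parts in genomic order.
concat_vcf_parts() {
    local output="$1"
    shift
    vcf-concat "$@" > "$output"
}

# Examples (file names are placeholders):
#   downsample_sam sample.sam sample_10x.sam 0.3333
#   merge_bams Williams_merged.bam Williams.bam Williams82_ISU.bam
#   concat_vcf_parts sample.vcf sample_part1.vcf sample_part2.vcf
```

Fixing the seed makes a downsampled file reproducible, and rerunning the coverage summary on the result confirms that the target depth was reached.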
The final VCF file for all eight samples at their full depth can be downloaded here. (Not available yet)
To be updated later.
- Put final VCF file on DRUM