Soy_Downsampling

Processing scripts and documentation for the soybean downsampling sub-project

Project Description

This project had two major goals:

Perform variant calling on the eight whole-genome accessions from Kono et al. 2016 against a newer version of the soybean reference for use as a resource in future projects.
Determine the tradeoff between sequencing depth and the power to detect variants in soybean.

To accomplish these goals, three of the accessions sequenced at ~30x were downsampled in 5x increments down to 10x, and variant discovery was performed at each depth level. The final calls were compared to genotypes from the SoySNP50k chip to determine accuracy and detection power.
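
As a rough illustration of the arithmetic behind downsampling, the fraction of reads to keep is the target depth divided by the sample's mean coverage. The sketch below uses a hypothetical helper name, `downsample_fraction`, and an illustrative mean coverage of exactly 30x. The percentages actually used in this project are the ones recorded in the repository, not these.

```shell
# Hypothetical helper (not part of this repository): print the fraction of
# reads to keep so that a sample with mean coverage $1 lands near depth $2.
downsample_fraction() {
    awk -v mean="$1" -v target="$2" 'BEGIN { printf "%.4f\n", target / mean }'
}

# Illustrative only: assumes a mean coverage of exactly 30x.
for target in 25 20 15 10; do
    echo "${target}x: $(downsample_fraction 30 "$target")"
done
```

In practice the measured mean coverage of each sample would replace the 30 used here, so each sample gets its own set of fractions.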

Raw Samples

Raw FASTQ files were downloaded from the SRA using the script SRA_download.sh, which relies on Tom Kono's SRA_Fetch.sh. The list of SRR run numbers that were downloaded is found in the file soy_run_numbers.txt. The .sra files were validated and split into forward and reverse FASTQ files using FASTQ_dumper.sh.
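
The validate-and-split step can be sketched with the standard sra-tools commands `vdb-validate` and `fastq-dump`. This is not the content of FASTQ_dumper.sh, which is not reproduced here; the function name `validate_and_split` and the output directory argument are assumptions made for illustration.

```shell
# Sketch only: validate one downloaded .sra file, then split it into gzipped
# forward (_1) and reverse (_2) FASTQ files. Requires sra-tools on the PATH.
# Usage: validate_and_split FILE.sra [OUT_DIR]
validate_and_split() (
    sra_file="$1"
    out_dir="${2:-.}"
    # Stop before splitting if the archive fails validation.
    vdb-validate "$sra_file" || exit 1
    fastq-dump --split-files --gzip --outdir "$out_dir" "$sra_file"
)
```

A call such as `validate_and_split SRR1164607.sra fastq` would be expected to write `SRR1164607_1.fastq.gz` and `SRR1164607_2.fastq.gz` into `fastq/` for a paired-end run.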

Links to the SRA accessions for the samples:

After splitting, the FASTQ files for M92_220 were concatenated with zcat. This was not necessary for the other samples.

zcat SRR1164607_1.fastq.gz \
  SRR1164608_1.fastq.gz \
  SRR1164609_1.fastq.gz \
  SRR1164610_1.fastq.gz \
  SRR1164611_1.fastq.gz \
  SRR1164612_1.fastq.gz \
  SRR1164613_1.fastq.gz \
  SRR1164614_1.fastq.gz > M92_220_1.fastq
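
The command above covers the forward reads only. A minimal sketch of the same concatenation written as a reusable function follows, with a quick sanity check on the result. The function names `concat_mate` and `fastq_is_whole` are hypothetical, and the sketch assumes the reverse reads follow the same naming pattern with `_2` in place of `_1`.

```shell
# Concatenate the gzipped FASTQ files of several runs for one mate (1 or 2)
# into a single uncompressed FASTQ, preserving the order of the runs.
# Usage: concat_mate SAMPLE MATE RUN [RUN ...]
concat_mate() (
    sample="$1"
    mate="$2"
    shift 2
    for run in "$@"; do
        zcat "${run}_${mate}.fastq.gz"
    done > "${sample}_${mate}.fastq"
)

# A FASTQ record is four lines long, so a complete file has a line count
# that is a multiple of four. Returns success if that holds.
fastq_is_whole() (
    lines=$(wc -l < "$1")
    [ $((lines % 4)) -eq 0 ]
)
```

For example, `concat_mate M92_220 2 SRR1164607 SRR1164608` followed by `fastq_is_whole M92_220_2.fastq` would build and check a reverse-read file from the first two runs. Keeping the run order identical for both mates matters, because paired reads are matched by position in the two files.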

sequence_handling

The FASTQ files were trimmed of adapters, read mapped, and converted to BAM format using commit e82460c of sequence_handling. The config file used to run sequence_handling is found in Config. The following handlers were executed in the listed order:

Quality_Assessment: Quality summary output for each FASTQ is located in SOYDOWN_quality_summary.txt
Adapter_Trimming
Read_Mapping: Read mapping parameters were adjusted to compensate for differing read lengths and are summarized in Read_Mapping_Parameters.txt. The two Williams samples and the three Glycine soja samples were each merged after read mapping using samtools merge.
SAM_Processing: Mapping summary output for each BAM is located in SOYDOWN_mapping_summary.txt
Coverage_Mapping: Coverage summary output for each BAM is located in SOYDOWN_coverage_summary.txt. The mean coverage statistic was used as the basis for downsampling each sample: Downsample.sh was run on each raw SAM file using the percentages found in Downsampling_Percentages.xlsx. The downsampled SAM files were then converted to BAM files using SAM_Processing, and the downsampled coverage was verified using Coverage_Mapping.
Haplotype_Caller
Genotype_GVCFs: After Genotype_GVCFs, the VCF parts for each sample were merged into a single file using VCFtools vcf-concat.
Variant_Filtering: Different filtering parameters were used for each depth level and are summarized in Variant_Filtering_Parameters.txt
Variant_Analysis: Heterozygosity, missingness, Ts/Tv, MAF histogram, and SNP count outputs for each depth level are located here
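
The downsampling step described under Coverage_Mapping can be approximated with `samtools view -s`, which keeps a random proportion of alignments. This is a stand-in for illustration, not the content of Downsample.sh. The function name `downsample_sam`, its argument order, and the default seed of 42 are all assumptions.

```shell
# Sketch only: keep roughly FRACTION of the reads in a SAM file.
# samtools view -s takes SEED.FRACTION, so seed 42 with fraction 0.5
# is written as 42.5. FRACTION must be below 1 (for example 0.3333).
# Usage: downsample_sam IN.sam FRACTION OUT.sam [SEED]
downsample_sam() (
    in_sam="$1"
    fraction="$2"
    out_sam="$3"
    seed="${4:-42}"
    # Drop the leading "0" so that 0.3333 becomes .3333 before joining.
    # -h keeps the header so the output is still a valid SAM file.
    samtools view -h -s "${seed}${fraction#0}" -o "$out_sam" "$in_sam"
)
```

Using a fixed seed makes the subsample reproducible, and because `samtools view -s` selects by read name, both mates of a pair are kept or dropped together.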

The final VCF file for all eight samples at their full depth can be downloaded here. (Not available yet)

Results

To be updated later.

To-Do

Put final VCF file on DRUM
