Recommended_Workflow

Skylar Wyant edited this page Dec 6, 2017 · 15 revisions

Workflow

To start, run Quality_Assessment on your raw FASTQ files. The Quality_Assessment handler runs FastQC on a series of samples and outputs metrics used for quality control. It accepts FASTQ, SAM, and BAM files as input and outputs a summary table and individual HTML files for visualization. The Quality_Assessment handler depends on FastQC and GNU Parallel.
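As a rough illustration of how per-sample quality metrics can be tallied, the sketch below counts PASS/WARN/FAIL statuses in the tab-separated `summary.txt` that FastQC writes for each input. It is not the handler's own code, and the layout of the handler's summary table is not described here; the function name and the file names in the example are hypothetical.

```python
from collections import Counter

def tally_fastqc_summary(text):
    """Count PASS/WARN/FAIL statuses in the text of a FastQC summary.txt."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each row is: status <TAB> module name <TAB> input file name
        status, _module, _filename = line.split("\t", 2)
        counts[status] += 1
    return counts

# Hypothetical example content for one sample
example = (
    "PASS\tBasic Statistics\tsample_R1.fastq\n"
    "WARN\tPer base sequence quality\tsample_R1.fastq\n"
    "FAIL\tAdapter Content\tsample_R1.fastq\n"
    "PASS\tSequence Length Distribution\tsample_R1.fastq\n"
)
counts = tally_fastqc_summary(example)
print(dict(counts))
```

A sample with a FAIL for adapter content is the kind of case that the adapter trimming step is meant to address.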

The Adapter_Trimming handler uses Scythe to trim specific adapter sequences from FASTQ files. This handler differentiates between forward, reverse, and single-end FASTQ files automatically. The Adapter_Trimming handler depends on Scythe and GNU Parallel.
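One way such automatic differentiation could work is by matching file name suffixes. The sketch below is only an illustration under that assumption; the suffixes `_R1.fastq.gz` and `_R2.fastq.gz`, the function name, and the sample names are hypothetical, and the handler may use a different naming scheme.

```python
def classify_fastq(names, forward="_R1.fastq.gz", reverse="_R2.fastq.gz"):
    """Group FASTQ file names into forward, reverse, and single-end lists."""
    groups = {"forward": [], "reverse": [], "single": []}
    for name in sorted(names):
        if name.endswith(forward):
            groups["forward"].append(name)
        elif name.endswith(reverse):
            groups["reverse"].append(name)
        else:
            # Anything without a paired-end suffix is treated as single-end
            groups["single"].append(name)
    return groups

# Hypothetical file names
files = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "s2.fastq.gz"]
groups = classify_fastq(files)
print(groups)
```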

After Adapter_Trimming, run Quality_Assessment again on the trimmed FASTQ files to confirm that adapter contamination was removed.

The Read_Mapping handler maps sequence reads to a reference genome using BWA-MEM. This handler uses Torque Task Arrays, part of the Portable Batch System. The Read_Mapping handler depends on the Burrows-Wheeler Aligner.
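To illustrate the task array idea, the sketch below picks one sample per array task. It assumes that the scheduler exposes the task's index through the `PBS_ARRAYID` environment variable and that indices are zero-based; the function name and sample names are hypothetical, and this is not the handler's own code.

```python
import os

def sample_for_task(samples, environ=None):
    """Return the sample assigned to the current array task."""
    environ = os.environ if environ is None else environ
    # Assumption: PBS_ARRAYID holds this task's zero-based index
    index = int(environ["PBS_ARRAYID"])
    return samples[index]

samples = ["sample_A", "sample_B", "sample_C"]  # hypothetical sample names
# Simulate the environment of the second task in the array
picked = sample_for_task(samples, {"PBS_ARRAYID": "1"})
print(picked)
```

Because each task handles exactly one sample, the scheduler can run all samples independently and in parallel.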

The SAM_Processing handler converts the SAM files produced by read mapping with BWA into BAM format using SAMtools. During conversion, it also sorts and deduplicates the data to produce the finished BAM file. Alignment statistics are generated for both the raw and finished BAM files. The SAM_Processing handler depends on SAMtools and GNU Parallel.

The Coverage_Mapping handler generates coverage histograms and summary statistics from BAM files using BEDTools. Plots of coverage are generated using R based on coverage maps. The Coverage_Mapping handler depends on BEDTools, R, and GNU Parallel.
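The relationship between a coverage histogram and its summary statistics can be sketched as follows. The example assumes the histogram has already been reduced to `(depth, number of bases)` pairs; the function name and the numbers are made up for illustration, and the statistics the handler actually reports may differ.

```python
def summarize_histogram(hist):
    """Return (mean depth, fraction of bases covered at least once).

    hist is a list of (depth, n_bases) pairs.
    """
    total = sum(n for _, n in hist)
    mean_depth = sum(depth * n for depth, n in hist) / total
    covered = sum(n for depth, n in hist if depth > 0) / total
    return mean_depth, covered

# Hypothetical histogram: 10 bases at depth 0, 20 at depth 1, and so on
hist = [(0, 10), (1, 20), (2, 50), (3, 20)]
mean_depth, covered = summarize_histogram(hist)
print(mean_depth, covered)
```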

To begin the variant discovery process from your finished BAM files, the Haplotype_Caller handler uses GATK to generate genomic VCF files for each sample.

The Genotype_GVCFs handler converts the GVCF files for the entire dataset into VCF files broken up by chromosome or chromosome part using GATK. Breaking the output into chromosome parts allows the process to be split into a task array and greatly speeds up processing time.
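The splitting idea can be sketched as dividing each chromosome into fixed-size parts, with one part per task in the array. The chromosome names, lengths, and part size below are hypothetical, and the handler may define its parts differently. Intervals are written as `chrom:start-end` with 1-based, inclusive coordinates.

```python
def split_into_parts(chrom_lengths, max_part):
    """Split each chromosome into consecutive intervals of at most max_part bases."""
    parts = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + max_part - 1, length)
            parts.append(f"{chrom}:{start}-{end}")
            start = end + 1
    return parts

# Hypothetical chromosome lengths, kept tiny for readability
parts = split_into_parts({"chrA": 250, "chrB": 100}, max_part=100)
for task_index, interval in enumerate(parts):
    print(task_index, interval)
```

Each interval can then be processed by its own task, so wall-clock time is governed by the largest part instead of the whole genome.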

The Create_HC_Subset handler creates a single VCF file that contains only the high-confidence sites for your samples. This filtering is performed in multiple steps using several user-defined parameters, and percentile tables are generated before and after filtering. Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file.
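A before-and-after percentile table can be thought of as the same set of percentiles computed on a metric before and after a cutoff is applied. The sketch below uses a nearest-rank percentile on made-up quality values with a hypothetical cutoff of 30; the metrics, cutoffs, and percentile method used by the handler are not specified here.

```python
def percentile(values, p):
    """Nearest-rank percentile (p from 1 to 100) of a non-empty list."""
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding floating-point rounding
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def percentile_table(values, points=(10, 50, 90)):
    """Map each requested percentile to its value."""
    return {p: percentile(values, p) for p in points}

quals = [5, 12, 18, 25, 30, 33, 41, 47, 52, 60]  # hypothetical site qualities
before = percentile_table(quals)
kept = [q for q in quals if q >= 30]  # hypothetical user-defined cutoff
after = percentile_table(kept)
print("before:", before)
print("after: ", after)
```

Comparing the two tables shows how far a given cutoff shifts the distribution, which helps when choosing the user-defined parameters.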

The Variant_Recalibrator handler uses GATK and user-provided prior sets of "truth" variants to create a model that attempts to separate true variants from false positives. An unfiltered VCF file with the FILTER field annotated is generated.

The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several user-defined parameters, and percentile tables are generated before and after filtering. Variant_Filtering depends on VCFtools and vcflib for manipulating the VCF file.

The Variant_Analysis handler uses a variety of dependencies to produce statistics about the input VCF file. Information generated by the handler includes heterozygosity summaries, missingness summaries, a minor allele frequency histogram, the Ts/Tv ratio, and the raw count of SNPs. Additional information is output for barley samples. Variant_Analysis depends on VCFtools, vcflib, molpopgen, Python3, GNU Parallel, BCFtools, R, TeX Live, and the Enthought Python Distribution.
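For readers unfamiliar with two of these statistics, the sketch below shows how the Ts/Tv ratio and a minor allele frequency can be computed from simple inputs. It is a minimal illustration and not the handler's implementation; the function names and example data are hypothetical.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) SNP pairs."""
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv

def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1 - freq)

# Hypothetical SNPs: four transitions and two transversions
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G"), ("T", "C")]
ratio = ts_tv_ratio(snps)
maf = minor_allele_frequency(alt_count=15, total_alleles=20)
print(ratio, maf)
```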