Skip to content

Genomics_DB_Import

Chaochih Liu edited this page Feb 12, 2020 · 2 revisions

Basic Usage

The Genomics_DB_Import handler uses the Genome Analysis Toolkit (GATK) to import GVCFs into a GenomicsDB workspace prior to running Genotype_GVCFs. Genomics_DB_Import merges GVCF files from multiple samples before joint genotyping. It was added in GATK 4 and has the same functionality as CombineGVCFs in previous versions of GATK.

Because this step pools together all of your samples into one file, it is essential that all samples are included for this step. Automatically breaking the process into chromosome parts or smaller regions for each chromosome allows the job to be "parallelized across regions" by running as a task array and speeds up computing time. However, be careful not to break up genes. For exome sequencing, you can split by exons. For whole genome sequencing, GATK recommends splitting by natural boundaries (places where there are Ns in the reference); GATK's Picard has a tool called ScatterIntervalsByNs that will do this.

To run Genomics_DB_Import, all common variables and handler-specific variables must be defined within the configuration file. NOTE: Genomics_DB_Import shares many of its variables with Genotype_GVCFs, so you will need to define all of the variables under the Genotype_GVCFs section of the config in addition to the variables under the Genomics_DB_Import section to run this handler. Once the variables have been defined, Genomics_DB_Import can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Genomics_DB_Import Config

Where Config is the full file path to the configuration file.

To re-run specific task arrays, you can use the same config file and an OPTIONAL -t flag as follows:

./sequence_handling Genomics_DB_Import Config -t 1-5,10,12

Where -t expects a range of arrays and/or comma separated list of specific arrays to re-run. IMPORTANT: No spaces are allowed when providing range or list of arrays, and (currently) the -t flag must be provided as the 3rd argument on the command line (as shown above). If the -t flag is not included, the DEFAULT runs all chromosomes or custom intervals provided.

Handler-Specific Variables

The following are a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

Variables specific to Genomics_DB_Import:

Variable Function
GDBI_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
GDBI_QUEUE Which queue are we submitting the job to?

Shared variables between Genomics_DB_Import and Genotype_GVCFs handler:

Variable Function
GVCF_LIST A list of full file paths to the GVCF files. This can be generated with a one liner find $(pwd -P) -name "*.g.vcf" | sort -V > gvcf_list.txt or using the sample_list_generator.sh script.
REF_DICT The reference dictionary, which should end in .dict.
NUM_CHR The number of chromosomes or chromosome parts the reference has. It is an integer value which varies per species. For barley: 15 (7*2 chromosome parts + chrUn) For soybean: 20 (this excludes scaffolds)
CUSTOM_INTERVALS Leave blank if you do not wish to call SNPs on non-chromosomal sequence. The full file path to a list of the names of any and all scaffolds or parts of the reference not covered by the chromosomes above. It should be a file ending in .intervals containing one scaffold name per line. SAMtools style intervals are also acceptable, one per line (ex: chr1:100-200).
PLOIDY The sample ploidy. Highly inbred samples (most barleys) will have a ploidy of 1.
THETA Genotype_GVCFs uses the THETA parameter under Haplotype_Caller. The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008 For soybean: 0.001

Output

Genomics_DB_Import generates gendb://my_database workspace file. If you did not parallelize across regions, you will have a single workspace file output. If you parallelized across regions, you will have one workspace file for each region with a unique name (don't worry, the Genotype_GVCFs handler is set up to handle this as long as you use the naming scheme output from Genomics_DB_Import). The workspace files can be found at:

${OUT_DIR}/Genotype_GVCFs/gendb_workspaces

After you finish running Genomics_DB_Import and confirmed all workspaces were created, you can directly run Genotype_GVCFs.

Dependencies

Genomics_DB_Import handler depends on GATK v4.1.2 or greater. In addition, PBS is required for basic operation, but we are in the process of adding an option to run sequence_handling without PBS. Please check the dependencies page to ensure that you are using the required version of each dependency.