Skip to content
This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Reference Alignment

zskidmor edited this page Apr 22, 2014 · 13 revisions

in progress

Contents

Overview

The Reference Alignment pipeline performs basic alignment and variant detection on sequence data for a single sample.

Guides

Inputs

Instrument Data

The Instrument Data records corresponding to the sequence data to align.

Reference Sequence Build

The Reference Sequence Build to which to align the Instrument Data.

dbSNP Build

The dbSNP Imported Variation List Build to use

Genotype Microarray Model

The Genotype Microarray Model for the Genotype Data for the Subject of this model

Annotation Reference Build

The Annotation Build to use when annotating the variants.

Target Region Set Name

The name of the Target Region Set Feature List used to perform capture to generate the Instrument Data. This should match the target_region_set_name property on the Instrument Data.

Region of Interest Set Name

The name of the Feature List used to perform the Coverage analysis.

Coverage Options

ROI Track Name

For a multi-tracked BED file, this controls which track is used to perform the coverage analysis. Valid values are "tiled_region" and "target_region".

Short ROI Names

This boolean flag controls whether or not the "name" column is replaced in Coverage analysis with "short" names (currently "r" followed by a sequential number).

Merge ROI Set

This boolean flag controls whether or not overlapping and adjacent regions in the BED file are combined before coverage analysis is performed.

Data Products

The top of a Reference Alignment build directory looks like this:

Build directory listing

alignments@  build.xml  logs/  reference_coverage@  reports/  variants/

The "logs" and "reports" directories as well as the "build.xml" file are standard products of running a Build in the GMS.

Example Alignments Directory

Merged AlignmentResult directory

4bf260ad9a244e20b4fd4bbc830b7ac1.bam      4bf260ad9a244e20b4fd4bbc830b7ac1.bam.flagstat  H_patient1-sample1-lib1-2893653597.log
4bf260ad9a244e20b4fd4bbc830b7ac1.bam.bai  4bf260ad9a244e20b4fd4bbc830b7ac1.bam.md5       H_patient1-sample1-lib1lib1-2893653597.metrics

Merged Alignment BAM File

The primary data product from running alignment is a merged BAM File (in this case, "4bf260ad9a244e20b4fd4bbc830b7ac1.bam".) It will also have an index file along side it that downstream tools use for faster lookups inside the BAM file.

Bam File Metadata

MD5 File

An MD5 of the BAM file is generated:

4bf260ad9a244e20b4fd4bbc830b7ac1.bam.md5

355a84d163e30d36ca915677f3c97728  /path/to/4bf260ad9a244e20b4fd4bbc830b7ac1.bam

The path listed in this file will reflect a location in which the BAM file was being worked on in the process of running and may not reflect its current location on disk. In general it is superior to keep track of the ID of the merged Alignment Result rather than the path where the BAM file currently happens to reside.

Flagstat File

The otuput of running samtools flagstat on the BAM file is also saved for future reference:

4bf260ad9a244e20b4fd4bbc830b7ac1.bam.flagstat

Picard Library Duplication Metrics

Two files will be produced for each library contained in the merged BAM if duplication handling was performed. The "H_patient1-sample1-lib1-2893653597.log" will typically be empty and can be disregarded. The corresponding .metrics file will contain the result of running Picard's DuplicationMetrics for the library:

H_patient1-sample1-lib1.metrics

Example Variant Detection Output Directory

The output of the Variant Detection API is a directory of symlinks and subdirectories with a few bookkeeping files:

Example output directory

./        dispatcher.cmd               indels.hq.bed@     indels.lq.v1.bed@  snvs.detailed.vcf.gz@      snvs.hq.v2.bed@  snvs.vcf.gz          sv/
../       indel/                       indels.hq.v1.bed@  indels.lq.v2.bed@  snvs.detailed.vcf.gz.tbi@  snvs.lq.bed@     snvs.vcf.gz.bed@     svs.hq@
cnv/      indels.detailed.vcf.gz@      indels.hq.v2.bed@  indels.vcf.gz      snvs.hq.bed@               snvs.lq.v1.bed@  snvs.vcf.gz.v1.bed@  workflow.xml
cnvs.hq@  indels.detailed.vcf.gz.tbi@  indels.lq.bed@     snv/               snvs.hq.v1.bed@            snvs.lq.v2.bed@  snvs.vcf.gz.v2.bed

The symlinks point to the "final" files inside the subdirectories. If only a detector was run this is the detector output; if filters were run, this is the filter output; and if combiners were run this is the combination output.

The subdirectories themselves contain symlinks to the Software Result output directory for each detector and combiner that was run:

Example snv subdirectory

./                                                   sniper-0.7.3-a5de64567b157457613af1127b6563ab@
../                                                  unionunique-intersect-snv_samtools_r963__8-snv_sniper_0.7.3__4-snv_varscan-somatic-validation_2.2.6__3@
intersect-snv_samtools_r963__8-snv_sniper_0.7.3__4@  varscan-somatic-validation-2.2.6-a9094f13b415c7c967433c8742871425@
samtools-r963-67b654281932619f0b6c5f28fae43235@

The detector links are named for the detector name, version, and a hash of the parameters as specified in the strategy. The combiner links are named for the workflow steps of the detectors that they combine.

Inside an individual detector output directory are whatever output files it generates as well as symlinks to the output directory of a filter applied to that detector. All detectors produce at a minimum a variant_type.hq file representing the native output of running the detector.

Example detector directory

./   snvs.hq       snvs.hq.v1.bed@  snvs.vcf.gz.tbi@
../  snvs.hq.bed@  snvs.hq.v2.bed   snvs.vcf.gz@        false-positive-v1-d41d8cd98f00b204e9800998ecf8427e@

If a second filter is applied, its symlink will appear in the output directory of the first filter. Each filter should produce a minimum of two files, a variant_type.hq and a variant_type.lq file. These are similarly converted to BED files and into a single VCF with a filter column. Filters may require that their detectors produce a file in a particular format in order to run. Most operate on the BED or VCF directly, but a few are specific to particular detectors.

SNVS and Indels

Most detectors and filters should generate both BED files and VCFs.

The same BED file goes by three names, variant_type.hq.bed, variant_type.hq.v1.bed, and variant_type.hq.v2.bed. The first is intended to be a symlink to the highest version of BED file available (currently V2). Since V2 is backwards compatible with V1, the V1 file also symlinks to the V2 file. For the filters, there are a similar set of three files with "lq" instead of "hq" for those variants that failed to pass the filter.

SVs

No BED files or VCFs are generated for SVs. They all use a variation on the output format of Breakdancer.

CNVs

No BED files or VCFs are generated for CNVs. Some CNV detectors produce an empty cnvs.hq file and are only used for the other reports they generate.

Bookkeeping Files

The "dispatcher.cmd" file contains the PERL5LIB that was used when running variant detection and the command-line that would be used to generate the directory again.

dispatcher.cmd

/etc/perl /usr/local/lib/site_perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10.1 /usr/share/perl/5.10.1 /usr/local/lib/site_perl
===============================================
gmt detect-variants2 dispatcher --output-directory /path/to/variants --reference-build 106942997 --aligned-reads-input /path/to/75609fb1cd7b4dbabbe3d279e89ff136.bam --control-aligned-reads-input /path/to/84d056b908ec4acabec79dad50c1c78d.bam --aligned-reads-sample TEST-patient1-somval_tumor1 --control-aligned-reads-sample TEST-patient1-somval_normal1 --snv-detection-strategy '(samtools r963 [-A -B] filtered by snp-filter v1 intersect sniper 0.7.3 [-q 1 -Q 15] filtered by false-positive v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15] then somatic-score-mapping-quality v1 [--min-mapping-quality 40 --min-somatic-score 40]) unique union varscan-somatic-validation 2.2.6 [--min-var-freq 0.08 --p-value 0.10 --somatic-p-value 0.01 --validation] filtered by varscan-high-confidence v1 then false-positive v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]' --sv-detection-strategy '(breakdancer 1.2 [-g -h:-a -t -q 10 -d] filtered by novo-realign v1 then tigra-validation v1 union breakdancer 1.2 [-g -h:-a -q 10 -o] filtered by tigra-validation v1) union squaredancer 0.1 filtered by tigra-validation v1' --indel-detection-strategy '(gatk-somatic-indel 5336 union pindel 0.5 filtered by pindel-somatic-calls v1 then pindel-read-support v1) union varscan-somatic-validation 2.2.6 filtered by varscan-high-confidence-indel v1 then false-indel v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]' 

The "workflow.xml" file is used internally for Workflow to run the requested detectors, filters, and combiners.

There are two additional files later added to the variants directory for annotation:

Example Annotator Output

The annotation process results in two files. The names are customizable, for example Reference Alignment uses:

Annotation output files

filtered.variants.pre_annotation          filtered.variants.post_annotation

Annotate Adaptor Output

The annotator runs on a file format that is slightly different from the standard BED files produced by the Variant Detection API. Given this input:

indels.hq.bed

1       109745943       109745945       AA/*    41      1

the corresponding adapted output is:

filtered.variants.pre_annotation

1       109745944       109745945       AA      -

The annotator format uses a one-based start coordinate and splits the reference and variant positions to separate columns. The additional columns of the BED file are ignored.

Annotate Transcript Variants Output

Running the annotator on this adapted file adds the annotation data available from the chosen Annotation Build:

filtered.variants.post_annotation

chromosome_name start   stop    reference       variant type    gene_name       transcript_name transcript_species      transcript_source       transcript_version      strand  transcript_status       trv_type        c_position amino_acid_change       ucsc_cons       domain  all_domains     deletion_substructures  transcript_error        default_gene_name       gene_name_source        ensembl_gene_id
1       109745944       109745945       AA      -       DEL     PSMA5   NM_002790       human   genbank 54_36p  -1      reviewed        3_prime_flanking_region c.*214  NULL    0.022:0.021     -       -       -       no_errors  Unknown Unknown Unknown
1       109745944       109745945       AA      -       DEL     SORT1   NM_002959       human   genbank 54_36p  -1      reviewed        5_prime_flanking_region c.-3907 NULL    0.022:0.021     -       -       -       no_errors  Unknown Unknown Unknown

Example Coverage Report Directory

Running the Reference Coverage analysis produces a directory of files:

Coverage result directory

4bf260ad9a244e20b4fd4bbc830b7ac1-ea2eb5fde49b41adaa1fbb9145cc2a57-wingspan_500-alignment_summary.tsv     ea2eb5fde49b41adaa1fbb9145cc2a57.bed
4bf260ad9a244e20b4fd4bbc830b7ac1-ea2eb5fde49b41adaa1fbb9145cc2a57-wingspan_0-alignment_summary-v2.tsv    wingspan_0/
4bf260ad9a244e20b4fd4bbc830b7ac1-ea2eb5fde49b41adaa1fbb9145cc2a57-wingspan_0-alignment_summary.tsv       wingspan_500/
4bf260ad9a244e20b4fd4bbc830b7ac1-ea2eb5fde49b41adaa1fbb9145cc2a57-wingspan_500-alignment_summary-v2.tsv

The files are named for the Merged Alignment Result's ID and the Feature List's ID.

Processed Bed File

The processed BED File (here ea2eb5fde49b41adaa1fbb9145cc2a57.bed) is produced according to the selected Coverage Options. This file is used as the list of regions to generate the remaining files.

Alignment Summary TSVs

There will be Alignment Summary TSV files produced for each requested wingspan. Each is formatted like the following:

4bf260ad9a244e20b4fd4bbc830b7ac1-ea2eb5fde49b41adaa1fbb9145cc2a57-wingspan_0-alignment_summary-v2.tsv

total_bp        total_aligned_bp        total_unaligned_bp      total_duplicate_bp      paired_end_bp   read_1_bp       read_2_bp       mapped_paired_end_bp    proper_paired_end_bp    singleton_bp    total_target_aligned_bp unique_target_aligned_bp        duplicate_target_aligned_bp     total_off_target_aligned_bp     unique_off_target_aligned_bp    duplicate_off_target_aligned_bp
11035644200     10644323947     391320253       953153277       11035644200     5517822100      5517822100      10906482400     10764388400     142094600       6427456805      5823315543      604141262       4216867142
      3867855127      349012015

Wingspan Directory

Each wingspan directory will contain files with more detailed stats:

Example wingspan directory

4bf260ad9a244e20b4fd4bbc830b7ac1_ea2eb5fde49b41adaa1fbb9145cc2a57_STATS.tsv  4bf260ad9a244e20b4fd4bbc830b7ac1_ea2eb5fde49b41adaa1fbb9145cc2a57_STATS.txt

The TXT file contains a summary of the coverage report:

4bf260ad9a244e20b4fd4bbc830b7ac1_ea2eb5fde49b41adaa1fbb9145cc2a57_STATS.txt

The TSV file has the statistics collected for each region of the processed BED file:

4bf260ad9a244e20b4fd4bbc830b7ac1_ea2eb5fde49b41adaa1fbb9145cc2a57_STATS.tsv

Build Processes

Processing Profiles

Processing Profile: 2635769 "Nov 2011 Default Reference Alignment" (Current Default)

Description

Read Alignment:

Each lane or sub-lane of data is aligned to GRCh37-lite with bwa v0.5.9. Defaults are used in both bwa aln and bwa samse/sampe with the exception that for bwa aln four threads are utilized (-t 4) and bwa's built in quality-based read trimming (-q 5). If the median insert size of the data is available then the median + 5 standard deviations is passed as the maximum insert size to bwa sampe (-a parameter) otherwise this value is specified as 600. ReadGroup entries are added to resulting SAM files using gmt sam add-read-group-tag. This SAM file is then converted to a BAM file using Samtools vr963, name sorted (samtools sort -n), mate pairings assigned (samtools fixmate), resorted by position (samtools sort), and indexed using gmt sam index-bam.

Read Duplication Marking and Merging:

Reads from multiple lanes, but the same sequencing library are merged, if necessary, using Picard v1.46 MergeSamFiles and duplicates are then marked per library using Picard MarkDuplicates v1.4.6. Lastly, each per-library BAM with duplicates marked is merged together to generate a single BAM file for the sample. For MergeSamFiles we run with SORT_ORDER=coordinate and MERGE_SEQUENCE_DICTIONARIES=true. For both tools, ASSUME_SORTED=true and VALIDATION_STRINGENCY=SILENT are specified. All other parameters are set to defaults. Samtools flagstat is run on each BAM file generated (per-lane, per-library, and final merged).

Coverage Reporting: Genotype Concordance: Variant Calling: Variant Annotation:

Tutorials

How do I import external data for reference alignment?

Do you have BAM format files for import?

A. Yes. Should we ask users to validate before import....

A. No, convert from FASTQ to BAM, see below.

A. My data was sequenced on the PacBio.

Here is an example PacBio lister to show the raw primary FASTQ and the CCS FASTQ files per movie:

$ pac-bio input list --filter sample_name~NF1% --show +primary_fastq_files,ccs_fastq_files

How do I convert from FASTQ to BAM format before import?

$ gmt picard fastq-to-sam --use-version=1.82 --fastq=/gscmnt/gc2122/production/paresults/TD13-0003_659/A01_1/Analysis_Results/m130130_003332_00116_c100448422550000001523062202151375_s1_p0.fastq
--platform=pacbio --platform-unit=10044842255000000152306220215137.5 --read-group-name=TD13-0003_A01_s1 --run-date='2013-01-31 14:17:13'
--description='PacBio TD13-0003 A01 NF1_1.5Kb_0.9nM Movie 1' --comment=NF1_1.5Kb_0.9nM --predicted-insert-size=1500 --sequencing-center=TGI \ --sample-name=H_IJ-NA12878 --library-name=H_IJ-NA12878-extlibs --quality-format=Standard --output=$BAM_PATH

Does a sample exist in the database for the external sequence data?

Yes, I know the sample name.

Here is an example sample name that already exists. More information about the sample can be found through the command line lister:

$ genome sample list --filter name=H_IJ-NA12878

ID           NAME           SPECIES_NAME   PATIENT_COMMON_NAME   COMMON_NAME   TISSUE_LABEL   TISSUE_DESC   EXTRACTION_TYPE   EXTRACTION_LABEL   EXTRACTION_DESC

--           ----           ------------   -------------------   -----------   ------------   -----------   ---------------   ----------------   ---------------

2376483307   H_IJ-NA12878   human          <NULL>                <NULL>        <NULL>         <NULL>        genomic dna       NA12878            <NULL>

No, how do I create a new sample?

TO DO

I have all the required information, how do I actually import a BAM file?

$ genome instrument-data import basic --source-files=$BAM_PATH --sample=H_IJ-NA12878 --import-source-name=TGI_PacBio_RS --description='PacBio TD13-0003 A01 NF1_1.5Kb_0.9nM Movie 1'

'sample' may require verification...

Resolving parameter 'sample' from command argument 'H_IJ-NA12878'... found 1

Import instrument data...

Resolve library...

Sample name: H_IJ-NA12878

Sample id: 2376483307

Library name: H_IJ-NA12878-extlibs

Library id: 2891323975

Resolve library...done

Validate source files...

Source file(s): /tmp/test.bam

Source file count: 1

Format: bam

Kilobytes requested: 318632

Validate source files...done

Checking if source files were previously imported...

Source files were NOT previously imported!

Create instrument data...

Instrument data id: 2891323976

Allocation (eea8c19297dd4623ad3752f10954ce82) created at /gscmnt/gc1400/info/instrument_data/imported/2891323976

Allocation id: eea8c19297dd4623ad3752f10954ce82

Allocation path: /gscmnt/gc1400/info/instrument_data/imported/2891323976

Allocation tmp path: /gscmnt/gc1400/info/instrument_data/imported/2891323976/tmp

Create instrument data...done

Transfer bam file and run flagstat...

Source file: /tmp/test.bam

Temp bam file: /gscmnt/gc1400/info/instrument_data/imported/2891323976/tmp/all_sequences.bam

Copy...

Copy...done

Run flagstat...

Flagstat file: /gscmnt/gc1400/info/instrument_data/imported/2891323976/all_sequences.bam.flagstat

RUN: samtools flagstat /gscmnt/gc1400/info/instrument_data/imported/2891323976/tmp/all_sequences.bam > /gscmnt/gc1400/info/instrument_data/imported/2891323976/all_sequences.bam.flagstat

Run flagstat...done

Move tmp bam file to permenant file...

Bam file: /gscmnt/gc1400/info/instrument_data/imported/2891323976/all_sequences.bam

Move tmp bam file to permenant file...done

Transfer bam file and run flagstat...done

Update properties on instrument data...

Bam path: /gscmnt/gc1400/info/instrument_data/imported/2891323976/all_sequences.bam

Read count: 75153

Is paired end: 0

Update properties on instrument data...done

Remove tmp dir...

rmdir /gscmnt/gc1400/info/instrument_data/imported/2891323976/tmp

Remove tmp dir...done

Reallocate...

Setting kilobytes_requested to 267648 based on du for allocation eea8c19297dd4623ad3752f10954ce82

Reallocate...done

Import instrument data...done

How do I align non-human reads and call variants?

How do I make a (imported) reference sequence model?

TO DO (see GMS-Reference Sequence)

How do I align reads and call variants?

Step 2: Create a reference alignment and variant-calling processing profile

Example to align genome reads from a cat breed to cat reference and call variants using VarScan and samtools.

Either locate a processing profile that you want to use, or create a new one.  The processing profile is a collection of the parameters that you want the pipeline to use.  It can be re-used for many different projects.  If you find a processing profile that is close to what you want to use but needs a few tweaks, use the --based-on option (see `genome processing-profile create reference-alignment -h` for more info).

Example create command:

genome processing-profile create reference-alignment --based-on 2774195 --snv-detection-strategy " varscan 2.2.9 [--min-coverage 3 --min-var-freq 0.20 --p-value 0.10 --strand-filter 1 --map-quality 10] filtered by false-positive v1 [--min-var-count 2 --max-mm-qualsum-diff 100 --bam-readcount-version 0.4 --bam-readcount-min-base-quality 15] unique union samtools r963 filtered by snp-filter v1 then false-positive v1 [--max-mm-qualsum-diff 100 --bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]" --indel-detection-strategy "varscan 2.2.9 [--min-coverage 3 --min-var-freq 0.20 --p-value 0.10 --strand-filter 1 --map-quality 10] filtered by false-indel v1 [--min-var-count 2 --max-mm-qualsum-diff 100 --bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]" --merger-name picard --merger-version 1.46 --name "mmontagu align and detect variants" --samtools-version r599

  1. Locate instrument data ID

  2. Define a genome model with appropriate processing profile ID and reference sequence

genome model define reference-alignment --processing-profile 2779197 --subject-name 'FCBK-081219_ARND4_pool_tube1' --model-name 'FCBK Asian to Felis catus 6.2' --instrument-data 2857675652,2857676651,2857676887,2857676888,2857676889,2857679396,2857679850 --reference-sequence-build 131815546

  1. Retrieve Model ID after running model define and start build

genome model build start 2890120880

How do I annotate?

Step 3: Match variants to coordinates

Variant Effect Predictor (VEP) matches variants to coordinates (genome db ensembl vep)

genome db ensembl vep --input-file <VARIANT.CALLING.OUTPUT.bed> --output-file <OUTPUT.NAME> --version 2_7 --gtf-file <LOCATION.FILENAME.gtf> --gtf-cache --species <GENUS_SPECIES> --reference-build-id <IMPORTED.REFERENCE.SEQUENCE.ID> --ensembl-annotation-build-id ENSEMBL.ANNOTATION.BUILD.ID>

Example to run a final filtering step and annotate variants.

Run a clustered variant filtering step for the cat variant calls before annotating (using parameters recommended by Dan Kolboldt; can we add this into processing profile?):

--max-variants 5
--window-size 500

using the following command:

gmt somatic filter-variant-clusters --variant-file <snvs.hq.bed> --output-file snv.cluster.filter --max-variants 5 --window-size 500 > clusters.summary
  1. Run VEP

genome db ensembl vep --input-file /gscmnt/ams1156/info/model_data/2889900372/build131861968/variants/snvs.hq.bed --output-file snvs.hq.annotated --version 2_7 --gtf-file /gscmnt/gc6136/assembly/cat/cat_ensembl70_all_genes.gtf --gtf-cache --species felis_catus --reference-build-id 131815546 --ensembl-annotation-build-id 124434505

CONSIDER: VEP v2.7 can output a vcf file, but this version needs an ensembl api of 69 or higher and this means --ensembl-annotation-build-id 131184146

Note about gtf format to be compatible with VEP: The GTF file have its features sorted in chromosomal order, and must contain the following entry types and data:

* "exon" feature lines with have "transcript_id", "gene_id" and "exon_number" in the description field
* "CDS" feature lines must correspond to coding exons, with "transcript_id" and "exon_number" populated in the description field, and the "frame" column populated with the read frame of the exon
* the "source" column of the GTF should indicate the biotype of the exon/CDS - this should be "protein_coding" for protein coding transcripts

How do I align reads with BWA-SW instead of regular BWA (aka BWA-short)?

Q: Do you have an existing Reference Alignment model?

A : Yes.

genome model copy MODEL processing_profile=2787013

A : No, see instructions for creating new models .

How do I define a region of interest for evaluating coverage?

BED format files are used to define regions of interest. The GMS will use these files to evaluate coverage of the defined regions, please see Targeted Selection Coverage for additional details.

How do I import a BED file for evaluating coverage?

There is a genome command that will create a 'feature-list'. The command takes a BED format file. The valid types of BED files are listed in the command usage.

genome feature-list create --help

What format BED file should I use?

The preference for BED format is in the following order:

1. 'multi-tracked BED' files are a convention used by several Exome providers that support multiple UCSC browser tracks contained within a single file.  Typically the tracks are 'target_regions' and 'tiled_regions'.  The target_regions represent the targeted exons in an Exome product.  The 'tiled_regions' represent the actual placement of hybridization probes.  This is the ideal format for BED files.
2. 'true-BED' format follows all of the defined specifications made by the UCSC Genome Browser BED format.  This includes using zero based start positions and one based end positions.
3. 1-based format means the start position is still in one based genomic coordinates rather than using zero based start positions.
4. 'multi-tracked 1-based' files mean there are multiple tracks for 'target_region' and 'tiled_region' in addition to having one based start positions.
5. 'unknown' means the format is unknown but still needs to be determined in order for analysis to proceed.

How do I use a feature-list to generate coverage reports?

A model input exists called region_of_interest_set_name. When defining a model the name or id of a feature-list is provided as the value for this input:

genome model define reference-alignment --region-of-interest-set-name=$FEATURE_LIST ...

How do I update the feature-list?

You may want to update the region_of_interest_set_name or a reference alignment model in order to evaluate a different region. To do this you need to execute the following commands:

genome model input remove --name=region_of_interest_set_name $MODEL --values=$CURRENT_FEATURE_LIST

genome model input add --name=region_of_interest_set_name $MODEL --values=$NEW_FEATURE_LIST

What is the difference between target_region_set_name and region_of_interest_set_name?

The 'target_region_set_name' is the name of the target region for which instrument data was generated. Each instrument data that has had targeted selection will have a barcoded target_region_set_name. This model input is used to filter the instrument data assigned to a model by this property. Only instrument data that has the same target_region_set_name will be assigned to a model with the defined model input.

The 'region_of_interest_set_name' is the name of a set of regions for which coverage and variant detection will be evaluated. This is often the same value as 'target_region_set_name' since coverage of the regions targeted by hybridization is often desired. However, it is possible to evaluate coverage and variants within any arbitrary set of genomic coordinates using a 'region_of_interest_set_name' that corresponds to a valid feature-list.

How do I list the BAM location for a model?

Models do not have results of their own--they are merely a container for builds. Thus we need to select a build from the model. "last_succeeded_build" (and "last_complete_build", which is synonymous) give us the most recent build that completed successfully.

"whole_rmdup_bam_file" is something particular to reference alignment builds. Both reference alignment and rna-seq have a merged_alignment_result (which is itself a Genome::InstrumentData::AlignmentResult::Merged). Looking in that module we see that there is a bam_path accessor.

$ genome model list --filter id=abcd1234 --show id,name,last_complete_build.merged_alignment_result.bam_path

Known Issues

Clone this wiki locally