- Overview
- How To Run
- Flow Diagram
- Pipeline Steps
- Inputs
- Outputs
- Testing and Validation
- References
- License
The call-gSV nextflow pipeline, calls structural variants (SVs) and copy number variants (CNVs) utilizing Delly and Manta. Additionally, the pipeline can also regenotype previously identified SVs or CNVs with Delly. It is suitable for detecting copy-number variable deletion and tandem duplication events as well as balanced rearrangements such as inversions or reciprocal translocations and validates the output quality with BCFtools. The pipeline has been engineered to run in a 4 layer stack in a cloud-based scalable environment of CycleCloud, Slurm, Nextflow and Docker. Additionally it has been validated with the SMC-HET dataset and the GRCh38 reference genome, using paired-end FASTQ's that were back-extracted from BAMs created by BAM Surgeon.
Config File | Available Node cpus / memory | Designated Process 1; cpus / memory | Designated Process 2; cpus / memory | Designated Process 3; cpus / memory |
---|---|---|---|---|
F2.config |
2 / 3 GB | call_gSV_Delly; 1 / 2 GB | call_gCNV_Delly; 1 / 2 GB | call_gSV_Manta; 1 / 2 GB* |
F16.config |
16 / 26 GB | call_gSV_Delly; 1 / 8 GB | call_gCNV_Delly; 1 / 8 GB | call_gSV_Manta; 6 / 8 GB |
F32.config |
32 / 62.8 GB | call_gSV_Delly; 1 / 15 GB | call_gCNV_Delly; 1 / 15 GB | call_gSV_Manta; 12 / 15 GB |
F72.config |
72 / 136.8 GB | call_gSV_Delly; 1 / 30 GB | call_gCNV_Delly; 1 / 30 GB | call_gSV_Manta; 24 / 30 GB |
M64.config |
64 / 950 GB | call_gSV_Delly; 1 / 60 GB | call_gCNV_Delly; 1 / 60 GB | call_gSV_Manta; 30 / 60 GB |
* Manta SV calling wouldn't work on an F2 node due to incompatible resources. In order to test the pipeline for tasks not relevant to Manta, please set run_manta = false
in the sample specific config file.
Currently supported Nextflow versions: v23.04.2
Below is a summary of how to run the pipeline. See here for full instructions.
Pipelines should be run WITH A SINGLE SAMPLE AT TIME. Otherwise resource allocation and Nextflow errors could cause the pipeline to fail.
-
The recommended way of running the pipeline is to directly use the source code located here:
/hot/software/pipeline/pipeline-call-gSV/Nextflow/release/
, rather than cloning a copy of the pipeline.- The source code should never be modified when running our pipelines
-
Create a config file for input, output, and parameters. An example for a config file can be found here. See Nextflow Config File Parameters for the detailed description of each variable in the config file.
- Do not directly modify the source
template.config
, but rather you should copy it from the pipeline release folder to your project-specific folder and modify it there
- Do not directly modify the source
-
Create the input YAML using the template. See Input YAML for a detailed description.
- Again, do not directly modify the source template YAML file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.
-
The pipeline can be executed locally using the command below:
nextflow run path/to/main.nf -config path/to/sample-specific.config
- For example,
path/to/main.nf
could be:/hot/software/pipeline/pipeline-call-gSV/Nextflow/release/5.0.0-rc.1/main.nf
path/to/sample-specific.config
is the path to where you saved your project-specific copy of template.configpath/to/input.yaml
is the path to where you saved your sample-specific copy of call-gSV-input.yaml
To submit to UCLAHS-CDS's Azure cloud, use the submission script here with the command below:
python path/to/submit_nextflow_pipeline.py \
--nextflow_script path/to/main.nf \
--nextflow_config path/to/sample-specific.config \
--nextflow_yaml path/to/input.yaml \
--pipeline_run_name <sample_name> \
--partition_type F16 \
--email <your UCLA email, jdoe@ucla.edu>
In the above command, the partition type can be changed based on the size of the dataset. An F16 node is generally recommended for larger datasets like A-full.
Note: Because this pipeline uses an image stored in the GitHub Container Registry, you must follow the steps listed in the Docker Introduction on Confluence to set up a PAT for your GitHub account and log into the registry on the cluster before running this pipeline.
The "discovery" branch of the call-gSV pipeline allows you to identify germline SVs and CNVs utilizing either Delly or Manta. After variants are identified, basic quality checks are performed on the outputs of the processes.
The first step of the pipeline requires an aligned and sorted BAM file and BAM index as an input for variant calling with Delly or Manta. Delly combines short-range and long-range paired-end mapping and split-read analysis for the discovery of balanced and unbalanced SVs at single-nucleotide breakpoint resolution (deletions, tandem duplications, inversions and translocations.) SVs are called, annotated and merged into a single BCF file. A default exclude map of Delly can be incorporated as an input which removes the telomeric and centromeric regions of all human chromosomes since these regions cannot be accurately analyzed with short-read data. Manta calls SVs and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.
Currently the following filters are applied by Delly when calling SVs. Parameters with a "call-gSV default" can be updated in the nextflow.config file.
Parameter | Delly default | call-gSV default | Description |
---|---|---|---|
svtype |
ALL | SV type to compute (DEL, INS, DUP, INV, BND, ALL) | |
map-qual |
1 | 20 | Minimum paired-end (PE) mapping quality |
qual-tra |
20 | Minimum PE quality for translocation | |
mad-cutoff |
9 | Insert size cutoff, median+s*MAD (deletions only) | |
minclip |
25 | Minimum clipping length | |
min-clique-size |
2 | Minimum PE/SR clique size | |
minrefsep |
25 | Minimum reference separation | |
maxreadsep |
40 | Maximum read separation |
The second step of the pipeline identifies CNVs. To do this, Delly requires an aligned and sorted BAM file, as well as the BCF output from the SV calling step (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample specific nextflow config file.
Parameter | Delly default | call-gSV default | Description |
---|---|---|---|
quality |
10 | Minimum mapping quality | |
ploidy |
2 | Baseline ploidy | |
sdrd |
2 | Minimum SD read-depth shift | |
cn-offset |
0.100000001 | Minimum CN offset | |
cnv-size |
1000 | Minimum CNV size | |
window-size |
10000 | Window size | |
window-offset |
10000 | Window offset | |
fraction-window |
0.25 | Minimum callable window fraction [0,1] | |
scan-window |
10000 | Scanning window size | |
fraction-unique |
0.800000012 | Uniqueness filter for scan windows [0,1] | |
mad-cutoff |
3 | Median + 3 * mad count cutoff | |
percentile |
0.000500000024 | Excl. extreme GC fraction |
For Delly, VCF files are generated from the BCFs to run the vcf-validate command from VCFTools and vcfstats from RTGTools. Outputs from both provide preliminary summary statistics that can be viewed and evaluated in preparation for downstream cohort-wide re-calling and re-genotyping. In the Manta branch of the pipeline, a stats directory is generated under the specific output directory /Manta-/results/stats where information can be found regarding the SVs identified.
The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.
Similar to the "discovery" process, the first step of the regenotyping pipeline requires an aligned and sorted BAM file, BAM index, and a merged sites BCF (from the merge-SVsites pipeline) as inputs for SV regenotyping with Delly. The provided sample is genotyped with the merged sites list. SVs are annotated and merged into a single BCF file. A default exclude map of Delly can be incorporated as an input which removes the telomeric and centromeric regions of all human chromosomes since these regions cannot be accurately analyzed with short-read data.
The second possible step of the regenotyping pipeline requires an aligned and sorted BAM file, BAM index, and a merged sites BCF as an input, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
Field | Type | Description |
---|---|---|
sample_id | string | Sample ID |
normal | path | Set to absolute path to input BAM |
---
input:
BAM:
normal:
- "/path/to/input/BAM"
Note: This pipeline is designed to detect germline SVs. To maintain consistency with other Boutros Lab Nextflow pipelines, the input YAML format mirrors that of other somatic or germline variant calling pipelines. However, it's important to note that the sample type tags, whether labeled as
normal
ortumor
, do NOT influence the germline SV/CNV calling processes in this pipeline.
Input Parameter | Required | Type | Description |
---|---|---|---|
dataset_id |
yes | string | Boutros lab dataset id. |
blcds_registered_dataset |
yes | boolean | Affirms if dataset should be registered in the Boutros Lab Data registry. Default value is false . |
variant_type |
yes | list | List containing variant types to call. Default is ["gSV", "gCNV"] |
run_discovery |
yes | boolean | Specifies whether or not to run the "disovery" branch of the pipeline. Default value is true . (either run_discovery or run_regenotyping must be true ) |
run_regenotyping |
yes | boolean | Specifies whether or not to run the "regenotyping" branch of the pipeline. Default value is false . (either run_discovery or run_regenotyping must be true ) |
merged_sites |
yes | path | The path to the merged sites.bcf file. Must be populated if running the regenotyping branch. |
run_delly |
true | boolean | Whether or not the workflow should run Delly (either run_delly or run_manta must be set to true ) |
run_manta |
true | boolean | Whether or not the workflow should run Manta (either run_delly or run_manta must be set to true ) |
run_qc |
no | boolean | Optional parameter to indicate whether subsequent quality checks should be run on Delly outputs. Default value is false . |
reference_fasta |
yes | path | Absolute path to the reference genome FASTA file. The reference genome is used by Delly for SV calling. |
exclusion_file |
yes | path | Absolute path to the delly reference genome exclusion file utilized to remove suggested regions for SV calling. On Slurm, an HG38 exclusion file is located at /hot/resource/tool-specific-input/Delly/hg38/human.hg38.excl.tsv |
mappability_map |
yes | path | Absolute path to the delly mappability map to support GC and mappability fragment correction in CNV calling |
map_qual |
no | path | minimum paired-end (PE) mapping quaility threshold for Delly. |
save_intermediate_files |
yes | boolean | Optional parameter to indicate whether intermediate files will be saved. Default value is false . |
output_dir |
yes | path | Absolute path to the directory where the output files to be saved. |
work_dir |
optional | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds , the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively. |
docker_container_registry |
optional | string | Registry containing tool Docker images. Default: ghcr.io/uclahs-cds |
An example of the NextFlow Input Parameters Config file can be found here.
To optionally update the base resource (cpus or memory) allocations for processes, use the following structure and add the necessary parts to the input.config file. The default allocations can be found in the node-specific config files
base_resource_update {
memory = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
cpus = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
}
Note Resource updates will be applied in the order they're provided so if a process is included twice in the memory list, it will be updated twice in the order it's given. Examples:
- To double memory of all processes:
base_resource_update {
memory = [
[[], 2]
]
}
- To double memory for
call_gSV_Delly
and triple memory forrun_validate_PipeVal
andcall_gSV_Manta
:
base_resource_update {
memory = [
['call_gSV_Delly', 2],
[['run_validate_PipeVal', 'call_gSV_Manta'], 3]
]
}
- To double CPUs and memory for
call_gSV_Manta
and double memory forrun_validate_PipeVal
:
base_resource_update {
cpus = [
['call_gSV_Manta', 2]
]
memory = [
[['call_gSV_Manta', 'run_validate_PipeVal'], 2]
]
}
Output | Description |
---|---|
.bcf |
Binary VCF output format with SVs if found. |
.vcf |
VCF output format with SVs if found. If output by Manta, these VCFs will be compressed. |
.bcf.csi |
CSI-format index for BAM files. |
.validate.txt |
output file from vcf-validator. |
.stats.txt |
output file from RTG Tools. |
report.html , timeline.html and trace.txt |
A Nextflow report, timeline and trace files. |
*.log.command.* |
Process and sample specific logging files created by nextflow. |
*.sha512 |
generates SHA-512 hash to validate file integrity. |
Testing was performed leveraging aligned and sorted BAMs generated using bwa-mem2-2.1
against reference GRCh38 (SMC-HET was aligned against hs37d5):
- A-mini: BWA-MEM2-2.1_TEST0000000_TWGSAMIN000001-T001-S01-F.bam
- A-partial: BWA-MEM2-2.1_TEST0000000_TWGSAPRT000001-T001-S01-F.bam
- A-full: a-full-CPCG0196-B1.bam*
- A-partial: CPCG0196-B1-downsampled-a-partial-sorted.bam*
- SMC-HET: HG002.N.bam
* In Delly v1.1.3
, a coverage check
has been introduced which checks for coverage quality in a given window before CNV calling. Successful CNV calling was observed on samples with coverages across the genome, such as, a-full-CPCG0196-B1.bam
and CPCG0196-B1-downsampled-a-partial-sorted.bam
(WGS samples). For more details, please refer to Discussion #64.
Test runs for the A-mini/partial/full samples were performed using the following reference files
- reference_fasta: /hot/resource/reference-genome/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta
- exclusion_file: /hot/resource/tool-specific-input/Delly/GRCh38/human.hg38.excl.tsv
- mappability_map: /hot/resource/tool-specific-input/Delly/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.r101.s501.blacklist.gz
Testing was performed primarily in the Boutros Lab Slurm Development cluster but additional functional tests were performed on the SGE cluster on 2/26/2021 and the Slurm Covid cluster. Metrics below will be updated where relevant with additional testing and tuning outputs.
Test Case | Test Date | Node Type | Duration | CPU Hours | Virtual Memory Usage (RAM) -peak rss |
---|---|---|---|---|---|
A-mini | 2021-02-12 | F2 | 1m 29s | a few seconds | 208.8 MB |
A-partial | 2021-02-10 | F72 | 42m 5s | 48.8 | 8.9 GB |
A-full | 2021-02-10 | F72 | 7h 10m 43s | 509.0 | 10.9 GB |
SMC-HET | 2021-02-12 | F72 | 3h 9m 60s | 223.5 | 8.9 GB |
Metrics below are based on the integration of Delly v1.13 in the call-gSV
pipeline.
Test Case | Test Date | Node Type | Duration | CPU Hours | Virtual Memory Usage (RAM) -peak rss |
---|---|---|---|---|---|
CPCG0196-B1 A-partial | 2022-08-08 | F72 | 1h 9m 15s | 2.2 | 10.85 GB |
CPCG0196-B1 A-full | 2022-08-06 | F72 | 21h 3m 19s | 37.3 | 24.68 GB |
Metric | A-mini | A-partial | A-full | SMC-HET | Source |
---|---|---|---|---|---|
Count Pass | 3 | 2593 | 62704 | 15196 | grep -c -w "PASS" filename.vcf -1 |
Count Deletion | 2 | 1475 | 49433 | 9317 | grep -c -w "SVTYPE=DEL" filename.vcf |
Count Duplication | 1 | 170 | 2311 | 1705 | grep -c -w "SVTYPE=DUP" filename.vcf |
Count Inversion | 0 | 317 | 2801 | 2197 | grep -c -w "SVTYPE=INV" filename.vcf |
Count Translocation | 0 | 384 | 7439 | 0 | grep -c -w "SVTYPE=BND" filename.vcf |
Count Insertion | 0 | 267 | 1265 | 2059 | grep -c -w "SVTYPE=INS" filename.vcf |
PRECISE Calls | 3 | 1850 | 11541 | 8267 | grep -c -w "PRECISE" filename.vcf |
IMPRECISE Calls | 2 | 764 | 51709 | 7012 | grep -c -w "IMPRECISE" filename.vcf |
Failed Filters | 0 | 653 | 44991 | 2619 | .stats.txt |
Passed Filters | 3 | 1959 | 18257 | 12658 | .stats.txt |
SV breakends | 0 | 219 | 1124 | 0 | .stats.txt |
Symbolic SVs | 2 | 1559 | 12500 | 11156 | .stats.txt |
Same as reference | 1 | 263 | 4595 | 1471 | .stats.txt |
Missing Genotype | 0 | 8 | 38 | 31 | .stats.txt |
Total Het/Hom ratio | (2/0) | 1.00 (843/845) | 2.37 (9580/4044) | 1.86 (7251/3905) | .stats.txt |
Breakend Het/Hom ratio | (0/0) | 0.84 (59/70) | 13.41 (1046/78) | (0/0) | .stats.txt |
Symbolic SV Het/Hom ratio | (2/0) | 1.01 (784/775) | 2.15 (8534/3966) | 1.86 (7251/3905) | .stats.txt |
Duplicate entries | 0 errors total | 1 error chr8:3893339 | 1 error chr1:16050024 | 1 error chr1:187464829 | .validate.txt |
Note, per Nature the following benchmarks exist for the human genome: “Structural variants affect more bases: the typical genome contains an estimated 2,100 to 2,500 structural variants (∼1,000 large deletions, ∼160 copy-number variants, ∼915 Alu insertions, ∼128 L1 insertions, ∼51 SVA insertions, ∼4 NUMTs, and ∼10 inversions), affecting ∼20 million bases of sequence.”
Included is a template for validating your input files. For more information on the tool check out: /~https://github.com/uclahs-cds/public-tool-PipeVal
- Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):i333-i339. doi:10.1093/bioinformatics/bts378
- Chen, X. et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32, 1220-1222. doi:10.1093/bioinformatics/btv710
- VCFtools - vcf-validator
- Real Time Genomics RTG Tools Operations Manual - vcfstats
- Boutros Lab -CallSV Quality Control pipeline
- The 1000 Genomes Project Consortium., Corresponding authors., Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393
Please see list of Contributors at GitHub.
The pipeline-call-gSV is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
The pipeline-call-gSV takes BAM and BCF files and utilizes Delly to call/regenotype gSV/gCNV.
Copyright (C) 2021-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.