nSPECTRa is a nextflow workflow that implements several methods to compute the mutation spectra for a given set of variants and multiple genome alignments which includes the ancestral genomes.
The workflow comes with an anaconda environment which delivers most of the dependencies. However, you need to install:
- anaconda
- nextflow
- beagle v5 or newer jar file. The workflow can download this automatically if not specified.
- relate software suite Most of the remaining dependencies are downloaded by nSPECTRa at runtime.
Check the CHANGELOG for updates to this workflow.
To run the workflow effectively, you'll need:
- A vcf file with the genotypes for the samples to analyse, tbi-indexed
- An HAL alignment archive inclusive of the ancestral genome and the same genome used to call the variants in the VCF
- If running relate, the population lists (one text file with the IDs of the samples in each population; e.g. Pop1.txt includes the list of samples in Pop1) or a poplist prepared as described in the relate documentation
- Identifier of the reference genome in the HAL archive
- Identifier of the ancestral genome in the HAL archive
- An effective population size estimate to use at imputation time
To generate the HAL alignment archive, we suggest to use cactus which automatically infers the ancestral sequence.
The workflow constitutes of five separate components, repesented in figure below:
In short, there are six separate units:
- VCF preprocessing: the vcf file gets imputed using either Beagle or SHAPEIT4, and then annotated using the VEP.
- Ancestral genome generation: the workflow generate a version of the reference genome carrying the ancestral allele estimated by cactus
- Exclude constrained elements: the workflow extract the constrained elements using halPhyloP from the HAL suite alignments.
- K-mer mutation spectra: compute the mutation spectra for each individual at different K-mers using mutyper
- Sequential dinucleotide mutation: compute the sequential dinucleotide mutation (SDM) spectra for the different individuals
- Compute the mutation spectra evolution: compute the trend of the different mutation classes in the different populations using relate
The different components can be switches off depending on the analytical requirements and public resources available.
The user can choose between two different imputation algorithms and three algorithms to compute the mutation spectra.
The user can choose to impute either with shapeit v4 or beagle v5 or newer.
Both are supported, but only shapeit is included in the anaconda environment, which is the reason why it is the default choice.
In case the user prefers to use beagle, this needs to be downloaded manually and the path provided with the flag --beagle
.
If --beagle
is not provided, the workflow will download the version 5.2 automatically.
The user can choose what software to use, but keep in mind that they are mutually exclusive, and shapeit will be prioritized over beagle.
The software currently supports three software to compute the mutation spectra:
--mutyper
: the default choice, it can compute the spectra for different K-mer size;--relate
: used to define the mutation spectra changes over time and an accurate estimate of the effective population size;--sdm
: method to define sequential dinucleotide polymorphisms, MNPs and adjacent SNPs from Prendergast et al., 2019.
Some steps of the workflow are very computationally intensive, and can benefit from a high degree of chunking. For large datasets (e.g. 1000GP size), we recommend lower the chunk size (e.g. to 50000), facilitating some compute intensive steps.
We recommend to pre-filter the vcf file to obtain samples with a reasonable coverage (>8-10 mean DP) and with variants with low call rate removed from the dataset (CCR > 90%) and minor allele count MAC >= 2.