6 Use cases
The following commands perform VIRIDIC-like analysis by calculating the total ANI (tANI) between complete virus genomes and classifying these viruses into species and genera based on 95% and 70% tANI, respectively.
# Create a pre-alignment filter for genome pairs with a minimum of 20 common k-mers
# and a minimum sequence identity of 70% (relative to the shortest sequence).
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.7
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
# Assign viruses into putative species (tANI ≥ 95%).
vclust cluster -i ani.tsv -o species.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.95
# Assign viruses into putative genera (tANI ≥ 70%).
vclust cluster -i ani.tsv -o genus.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.70
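The two thresholds can be read as a three-way decision on each genome pair. The sketch below illustrates that reading only; it is not how Vclust clusters (complete linkage considers all members of a cluster, not single pairs). It assumes a tab-separated table with a header row containing a column named `tani` on a 0-1 scale. The function name `label_pairs` and the toy values are hypothetical.

```shell
# Hypothetical helper: append a tANI-based label to every row of a table.
# Assumption: tab-separated input, header row with a column named "tani" (0-1 scale).
label_pairs() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "tani") col = i
      if (!col) { print "no tani column found" > "/dev/stderr"; exit 1 }
      next
    }
    {
      v = $col + 0
      if (v >= 0.95)      label = "species-level"
      else if (v >= 0.70) label = "genus-level"
      else                label = "below-genus"
      print $0 "\t" label
    }
  ' "$1"
}

# Toy input with made-up values (not real Vclust output).
toy=$(mktemp)
printf 'query\treference\ttani\n' > "$toy"
printf 'A\tB\t0.97\nA\tC\t0.81\nB\tD\t0.42\n' >> "$toy"
label_pairs "$toy"
```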
The following commands assign contigs into viral operational taxonomic units (vOTUs) based on the MIUViG thresholds (ANI ≥ 95% and aligned fraction ≥ 85%).
# Create a pre-alignment filter.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
# Cluster contigs into vOTUs using the MIUViG thresholds and the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
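After clustering, a quick size summary helps check whether the thresholds produced a sensible number of vOTUs. This is a minimal sketch that assumes the cluster file is tab-separated with one header row, a sequence identifier in column 1, and a cluster identifier in column 2. That layout, the function name `cluster_summary`, and the toy data are assumptions, so check them against your own output first.

```shell
# Hypothetical helper: count clusters and singletons.
# Assumption: header row, column 1 = sequence ID, column 2 = cluster ID.
cluster_summary() {
  awk -F'\t' '
    NR > 1 { size[$2]++ }
    END {
      for (c in size) { n++; if (size[c] == 1) singletons++ }
      printf "clusters=%d singletons=%d\n", n, singletons
    }
  ' "$1"
}

# Toy input (made-up identifiers).
toy_clusters=$(mktemp)
printf 'object\tcluster\nc1\t0\nc2\t0\nc3\t1\n' > "$toy_clusters"
cluster_summary "$toy_clusters"   # prints: clusters=2 singletons=1
```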
The following commands reduce the sequence dataset to representative genomes.
# Create a pre-alignment filter.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.95 --out-qcov 0.85
# Cluster contigs using the CD-HIT algorithm and report the representative genome of each cluster.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm cd-hit \
--metric ani --ani 0.95 --qcov 0.85 --out-repr
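To obtain the reduced dataset itself, the representative sequences still have to be pulled out of the original FASTA file. The sketch below shows one way to do that with awk. It assumes that, with --out-repr, the second column of the tab-separated cluster file holds the representative's sequence identifier, that the file has one header row, and that the FASTA file is uncompressed. The function name `extract_representatives` and the toy files are hypothetical.

```shell
# Hypothetical helper: print FASTA records whose ID appears as a representative.
# Assumption: cluster table has a header row and representative IDs in column 2.
extract_representatives() {
  # $1 = cluster table, $2 = uncompressed FASTA file
  awk -F'\t' '
    NR == FNR { if (FNR > 1) keep[$2] = 1; next }   # first file: collect representative IDs
    /^>/ {
      id = substr($1, 2)          # drop the leading ">"
      sub(/[ \t].*/, "", id)      # keep only the first word of the header
      printing = (id in keep)
    }
    printing                      # print header and sequence lines of kept records
  ' "$1" "$2"
}

# Toy input (made-up sequences).
toy_tbl=$(mktemp); toy_fa=$(mktemp)
printf 'object\tcluster\nc1\tc1\nc2\tc1\nc3\tc3\n' > "$toy_tbl"
printf '>c1 first contig\nACGT\n>c2\nACGA\n>c3\nTTTT\n' > "$toy_fa"
extract_representatives "$toy_tbl" "$toy_fa"
```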
The following command calculates ANI measures between all genome pairs in the dataset. For small datasets, using prefilter is optional. Without it, however, Vclust performs all-versus-all pairwise sequence alignments.
vclust align -i genomes.fna -o ani.tsv
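To judge whether a dataset is small enough to skip prefiltering, it helps to know how many genome pairs an all-versus-all comparison implies: n sequences give n(n-1)/2 unordered pairs. A minimal sketch, assuming an uncompressed FASTA file (the function name `count_pairs` is hypothetical):

```shell
# Hypothetical helper: number of unordered sequence pairs in a FASTA file.
count_pairs() {
  n=$(grep -c '^>' "$1")          # one header line per sequence
  echo $(( n * (n - 1) / 2 ))
}

# Toy input: 4 sequences give 6 pairs.
toy_fasta=$(mktemp)
printf '>s1\nACGT\n>s2\nACGA\n>s3\nTTTT\n>s4\nGGGG\n' > "$toy_fasta"
count_pairs "$toy_fasta"   # prints: 6
```

The count grows quadratically: 1,000 genomes already imply 499,500 pairs.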
This command combines three FASTA files (RefSeq, GenBank, IMG/VR) into a single non-redundant FASTA file. If identical sequences are found, RefSeq is prioritized over GenBank, and GenBank over IMG/VR, based on the input file order.
# Combine RefSeq, GenBank, and IMG/VR datasets into a non-redundant FASTA file,
# prioritizing RefSeq over GenBank and GenBank over IMG/VR for duplicates.
vclust deduplicate -i refseq.fna.gz genbank.fna.gz imgvr.fna.gz -o nr.fna.gz --gzip-output
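The priority rule amounts to "first occurrence wins" when files are read in the given order. The sketch below illustrates that idea on uncompressed FASTA files. It is not Vclust's implementation: it compares forward-strand sequences only (case-insensitively) and writes each sequence on a single line. The function name `dedup_first_wins` and the toy files are hypothetical.

```shell
# Hypothetical illustration of order-based deduplication (not Vclust's code).
dedup_first_wins() {
  # Files are read in the order given, so earlier files take priority.
  cat "$@" | awk '
    function emit() {
      if (header != "" && !(seq in seen)) { seen[seq] = 1; print header; print seq }
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # new record: emit the previous one
    { seq = seq toupper($0) }                      # join multi-line sequences
    END { emit() }
  '
}

# Toy input: g1 duplicates r1, so only r1 and g2 are kept.
toy_a=$(mktemp); toy_b=$(mktemp)
printf '>r1\nACGT\n' > "$toy_a"
printf '>g1\nacgt\n>g2\nGG\nGG\n' > "$toy_b"
dedup_first_wins "$toy_a" "$toy_b"
```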
Vclust is optimized for efficiently comparing large datasets, especially those containing diverse viruses across a broad range of sequence identities. Below is an example of processing over 15 million metagenomic contigs from the IMG/VR database.
# Create a pre-alignment filter by processing batches of 2 million genomes and
# analyzing 20% of k-mers in each genome sequence
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 4 --min-ident 0.95 \
--batch-size 2000000 --kmers-fraction 0.2
# Calculate ANI values for genome pairs specified in the filter. To keep the output
# file compact, use the `lite` format and only report genome pairs with ANI ≥ 95%
# and query coverage ≥ 85%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite \
--out-ani 0.95 --out-qcov 0.85
# Cluster contigs into vOTUs using the MIUViG standards and the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
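As a rough guide to what a given --batch-size implies, the number of batches is the sequence count divided by the batch size, rounded up. The sketch below is plain arithmetic and says nothing about Vclust internals. The function name `batches` is hypothetical, and 15,000,000 is a stand-in for the "over 15 million" contigs mentioned above.

```shell
# Hypothetical helper: ceiling division of sequence count by batch size.
batches() {
  # $1 = number of sequences, $2 = batch size
  echo $(( ($1 + $2 - 1) / $2 ))
}

batches 15000000 2000000   # prints: 8
```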
When working with large datasets containing highly redundant sequences (e.g., hundreds of thousands of nearly identical genomes), prefiltering may still pass a large number of genome pairs for alignment, even when using high thresholds for --min-kmers and --min-ident. Since most sequences in these datasets are almost identical, this can lead to increased memory usage and longer runtimes for all three Vclust commands (prefilter, align, cluster). To address this, Vclust offers three additional options in the prefilter step to reduce RAM usage and improve performance (see 5. Optimizing sensitivity and resource usage). The example below uses all three options together:
# Create a pre-alignment filter by processing batches of 100,000 genomes,
# analyzing only 20% of k-mers, and limiting each query genome to 1,000
# target sequences with the highest sequence identity.
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 10 --min-ident 0.95 \
--batch-size 100000 --kmers-fraction 0.2 --max-seqs 1000
# Calculate ANI values for genome pairs specified in the filter. To keep the output
# file compact, use the `lite` format and only report genome pairs with ANI ≥ 95%
# and query coverage ≥ 95%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite \
--out-ani 0.95 --out-qcov 0.95
# Cluster highly similar contigs using the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.97 --qcov 0.95
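In this example the alignment step reports pairs at ANI ≥ 95%, while clustering applies a stricter ANI ≥ 97%. Counting how many reported pairs survive the stricter cutoff shows how much the second threshold matters for a dataset. The sketch below assumes a tab-separated table with a header row containing columns named `ani` and `qcov` on a 0-1 scale. Those column names, the function name `count_passing`, and the toy values are assumptions.

```shell
# Hypothetical helper: count table rows meeting minimum ANI and query coverage.
# Assumption: header row with columns named "ani" and "qcov" (0-1 scale).
count_passing() {
  # $1 = table, $2 = minimum ANI, $3 = minimum query coverage
  awk -F'\t' -v min_ani="$2" -v min_qcov="$3" '
    NR == 1 {
      for (i = 1; i <= NF; i++) { if ($i == "ani") a = i; if ($i == "qcov") q = i }
      if (!a || !q) { print "missing ani/qcov column" > "/dev/stderr"; bad = 1; exit 1 }
      next
    }
    { total++; if ($a + 0 >= min_ani + 0 && $q + 0 >= min_qcov + 0) pass++ }
    END { if (bad) exit 1; printf "%d of %d reported pairs pass\n", pass, total }
  ' "$1"
}

# Toy input with made-up values (not real Vclust output).
toy_ani=$(mktemp)
printf 'qidx\tridx\tani\tqcov\n' > "$toy_ani"
printf '0\t1\t0.98\t0.99\n0\t2\t0.96\t0.97\n1\t2\t0.99\t0.95\n' >> "$toy_ani"
count_passing "$toy_ani" 0.97 0.95   # prints: 2 of 3 reported pairs pass
```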
The following commands cluster plasmid genomes into plasmid taxonomic units (PTUs).
# Create a pre-alignment filter that passes genome pairs sharing at least 20 k-mers
# and having a minimum sequence identity of 50%.
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 20 --min-ident 0.5
# Calculate ANI values for genome pairs specified in the filter. Report genome pairs
# that meet the IMG/PR thresholds (ANI ≥ 70% and query coverage ≥ 50%).
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.70 --out-qcov 0.50
# Cluster the genomes using the Leiden algorithm. The cluster edges are weighted by
# the gani metric (ANI × coverage), with a Leiden resolution parameter set to 0.9.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric gani --gani 0.35 --leiden-resolution 0.9
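The --gani 0.35 cutoff appears consistent with multiplying the two reporting thresholds, since gani is described above as ANI × coverage: 0.70 × 0.50 = 0.35. A small sketch of that arithmetic, useful when deriving a gani cutoff from a different ANI/coverage pair (the function name `gani_threshold` is hypothetical):

```shell
# Hypothetical helper: product of an ANI threshold and a coverage threshold.
gani_threshold() {
  # $1 = minimum ANI (0-1), $2 = minimum coverage (0-1)
  awk -v ani="$1" -v cov="$2" 'BEGIN { printf "%.2f\n", ani * cov }'
}

gani_threshold 0.70 0.50   # prints: 0.35
```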