6 Use cases

Andrzej Zielezinski edited this page Dec 21, 2024 · 10 revisions

6.1. Classify viruses into species and genera following ICTV standards

The following commands perform VIRIDIC-like analysis by calculating the total ANI (tANI) between complete virus genomes and classifying these viruses into species and genera based on 95% and 70% tANI, respectively.

# Create a pre-alignment filter for genome pairs with a minimum of 20 common k-mers
# and a minimum sequence identity of 70% (relative to the shorter sequence).
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.7
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
# Assign viruses into putative species (tANI ≥ 95%).
vclust cluster -i ani.tsv -o species.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.95

# Assign viruses into putative genera (tANI ≥ 70%).
vclust cluster -i ani.tsv -o genus.tsv --ids ani.ids.tsv --algorithm complete \
--metric tani --tani 0.70
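
To make the two thresholds concrete, the sketch below maps a single pairwise tANI value to the lowest rank at which two genomes could be grouped. It is an illustration only: `shared_rank` is a hypothetical helper, not part of Vclust, and it assumes tANI is expressed as a fraction between 0 and 1, as in the commands above. With complete-linkage clustering, a genome joins a cluster only if it meets the threshold against every member, so one qualifying pair is necessary but not sufficient.

```shell
# Hypothetical helper (not a Vclust command): report the lowest rank that two
# genomes could share, given their pairwise tANI as a fraction (0-1).
# The 0.95 and 0.70 cutoffs are the ones passed to --tani above.
shared_rank() {
    awk -v t="$1" 'BEGIN {
        if (t >= 0.95)      print "species"
        else if (t >= 0.70) print "genus"
        else                print "none"
    }'
}

shared_rank 0.97   # candidates for the same species (and genus)
shared_rank 0.82   # candidates for the same genus only
shared_rank 0.40   # below both thresholds
```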

6.2. Assign viral contigs into vOTUs following MIUViG standards

The following commands assign contigs into viral operational taxonomic units (vOTUs) based on the MIUViG thresholds (ANI ≥ 95% and aligned fraction ≥ 85%).

# Create a pre-alignment filter.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
# Cluster contigs into vOTUs using the MIUViG thresholds and the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
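
The MIUViG criterion is a conjunction: a pair qualifies only when both ANI and aligned fraction clear their cutoffs. A minimal sketch of that check follows; `passes_miuvig` is a hypothetical helper for illustration, and it assumes both values are given as fractions.

```shell
# Hypothetical helper (not a Vclust command): succeed (exit status 0) only if
# a genome pair meets both MIUViG thresholds used above.
#   $1 = ANI as a fraction, $2 = aligned fraction (query coverage) as a fraction
passes_miuvig() {
    awk -v ani="$1" -v qcov="$2" 'BEGIN { exit !(ani >= 0.95 && qcov >= 0.85) }'
}

if passes_miuvig 0.96 0.90; then echo "pair meets both thresholds"; fi
if ! passes_miuvig 0.96 0.80; then echo "coverage too low"; fi
```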

6.3. Dereplicate viral contigs into representative genomes

The following commands reduce the sequence dataset to representative genomes.

# Create a pre-alignment filter.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95
# Calculate ANI measures for genome pairs specified in the filter.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.95 --out-qcov 0.85
# Cluster contigs using the CD-HIT algorithm and report the representative genome of each cluster.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm cd-hit \
--metric ani --ani 0.95 --qcov 0.85 --out-repr

6.4. Calculate pairwise similarities between all-versus-all genomes

The following command calculates ANI measures between all genome pairs in the dataset. For small datasets, the prefilter step is optional; without it, Vclust aligns every genome against every other genome.

vclust align -i genomes.fna -o ani.tsv
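
The reason prefiltering matters for anything but small inputs is that the number of genome pairs grows quadratically. The arithmetic below is a rough sketch: `pair_count` is a hypothetical helper that counts unordered pairs, n(n - 1)/2, and says nothing about how Vclust itself enumerates or reports pairs. The 15 million figure is the approximate dataset size mentioned in section 6.6.

```shell
# Hypothetical helper: number of unordered pairs among n genomes, n * (n - 1) / 2.
pair_count() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

pair_count 1000       # 499500
pair_count 15000000   # 112499992500000
```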

6.5. Deduplicate (remove identical sequences) across multiple datasets

This command combines three FASTA files (RefSeq, GenBank, IMG/VR) into a single non-redundant FASTA file. If identical sequences are found, RefSeq is prioritized over GenBank, and GenBank over IMG/VR, based on the input file order.

# Combine RefSeq, GenBank, and IMG/VR datasets into a non-redundant FASTA file,
# prioritizing RefSeq over GenBank and GenBank over IMG/VR for duplicates.
vclust deduplicate -i refseq.fna.gz genbank.fna.gz imgvr.fna.gz -o nr.fna.gz --gzip-output
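
The "first file wins" behaviour can be pictured with a small awk sketch. This is not Vclust's implementation: `dedup_fasta` is a hypothetical helper that reads uncompressed FASTA, compares sequences as exact strings, and keeps the first record it sees for each sequence. Any additional rules Vclust may apply when deciding that two sequences are identical are outside what this sketch assumes.

```shell
# Hypothetical helper (not Vclust's implementation): print each distinct
# sequence once, keeping the record from the earliest input file.
dedup_fasta() {
    awk '
        # Print the pending record unless its sequence was already seen.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # join wrapped sequence lines
        END  { emit() }
    ' "$@"
}

# Toy input files; the names only mirror the example above.
tmp=$(mktemp -d)
printf '>ref1\nACGT\nACGT\n' > "$tmp/refseq.fna"
printf '>gb1\nACGTACGT\n>gb2\nTTTT\n' > "$tmp/genbank.fna"

# gb1 is dropped because its sequence already appeared as ref1.
dedup_fasta "$tmp/refseq.fna" "$tmp/genbank.fna"
```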

6.6. Process large dataset of diverse virus genomes (IMG/VR)

Vclust is optimized for efficiently comparing large datasets, especially those containing diverse viruses across a broad range of sequence identities. Below is an example of processing over 15 million metagenomic contigs from the IMG/VR database.

# Create a pre-alignment filter by processing batches of 2 million genomes and
# analyzing 20% of k-mers in each genome sequence
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 4 --min-ident 0.95 \
--batch-size 2000000 --kmers-fraction 0.2
# Calculate ANI values for genome pairs specified in the filter. To keep the output
# file compact, use the `lite` format and only report genome pairs with ANI ≥ 95%
# and query coverage ≥ 85%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite \
--out-ani 0.95 --out-qcov 0.85
# Cluster contigs into vOTUs using the MIUViG standards and the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85

6.7. Process large dataset of highly redundant virus genomes

When working with large datasets containing highly redundant sequences (e.g., hundreds of thousands of nearly identical genomes), prefiltering may still pass a large number of genome pairs for alignment, even when using high thresholds for --min-kmers and --min-ident. Since most sequences in these datasets are almost identical, this can lead to increased memory usage and longer runtimes for all three Vclust commands (prefilter, align, cluster). To address this, Vclust offers three additional options in the prefilter step to reduce RAM usage and improve performance (as detailed in: 5. Optimizing sensitivity and resource usage). The example below shows the use of all three options simultaneously:

# Create a pre-alignment filter by processing batches of 100,000 genomes,
# analyzing only 20% of k-mers, and limiting each query genome to 1,000
# target sequences with the highest sequence identity.
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 10 --min-ident 0.95 \
--batch-size 100000 --kmers-fraction 0.2 --max-seqs 1000
# Calculate ANI values for genome pairs specified in the filter. To keep the output
# file compact, use the `lite` format and only report genome pairs with ANI ≥ 95%
# and query coverage ≥ 95%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite \
--out-ani 0.95 --out-qcov 0.95
# Cluster highly similar contigs using the Leiden algorithm.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.97 --qcov 0.95
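
A back-of-the-envelope calculation shows how the batching and `--max-seqs` options bound the work. The dataset size of 300,000 genomes is a hypothetical figure chosen for illustration, and both helpers are assumptions rather than Vclust commands. The bound simply follows from each query keeping at most 1,000 targets.

```shell
# Hypothetical helpers for rough sizing; none of this calls Vclust.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }        # batches needed for n genomes
max_filtered_pairs() { echo $(( $1 * $2 )); }         # upper bound: n queries x max-seqs targets

n=300000   # hypothetical number of genomes

ceil_div "$n" 100000             # 3 batches with --batch-size 100000
max_filtered_pairs "$n" 1000     # at most 300000000 pairs with --max-seqs 1000
echo $(( n * (n - 1) / 2 ))      # 44999850000 unordered pairs without any cap
```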

6.8. Cluster plasmid genomes into PTUs

The following commands cluster plasmid genomes into plasmid taxonomic units (PTUs).

# Create a pre-alignment filter that passes genome pairs sharing at least 20 k-mers
# and having a minimum sequence identity of 50%.
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 20 --min-ident 0.5
# Calculate ANI values for genome pairs specified in the filter. Report genome pairs
# that meet the IMG/PR thresholds (ANI ≥ 70% and query coverage ≥ 50%)
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.70 --out-qcov 0.50
# Cluster the genomes using the Leiden algorithm. The cluster edges are weighted by
# the gani metric (ANI x coverage), with a Leiden resolution parameter set to 0.9.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric gani --gani 0.35 --leiden-resolution 0.9
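
The `--gani 0.35` cutoff appears to follow from the comment's description of gani as ANI × coverage: multiplying the two IMG/PR thresholds gives 0.35. The snippet below only reproduces that arithmetic; `gani_from` is a hypothetical helper, and the exact way Vclust computes gani is not restated here.

```shell
# Hypothetical helper: product of ANI and coverage, both given as fractions.
gani_from() {
    awk -v ani="$1" -v cov="$2" 'BEGIN { printf "%.2f\n", ani * cov }'
}

gani_from 0.70 0.50   # 0.35, the value passed to --gani above
```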