using sourmash to select best genome for mapping #2334

Open
@ctb

Description

This came up on the microbial bioinformatics Slack and I thought I'd share. The topic was selecting the best reference from many viral genomes.

It should be possible to use large-scale ANI-style analyses to select the closest genome for mapping. We’ve been doing this with sourmash and genome-grist for metagenomes, and I know tools like ganon and, I think, kmcp can do the same thing.
With sourmash, I would say the first thing to try is:

  • sketch all your genome references with sourmash sketch dna -p scaled=100 *.fna
  • do the same with your metagenome(s)/shotgun reads
  • run sourmash prefetch <metagenome>.sig genome*.fna.sig --threshold-bp=0 -o matches.csv
  • sort matches.csv on f_match_query and pick the highest value (this is "k-mer detection"; see #2170, "use 'detection' terminology for fraction-of-genome-kmers-found"), then use that row's match_name as the reference genome.
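The last step above can be sketched in a few lines of Python. This is only a minimal sketch, assuming matches.csv is the CSV written by the prefetch command above and that it has the f_match_query and match_name columns named in the list; the function name pick_best_reference is made up for illustration and is not part of sourmash.

```python
import csv


def pick_best_reference(csv_path, column="f_match_query"):
    """Return (match_name, value) for the row with the highest `column` value.

    `column` defaults to the column named in the steps above; pass another
    column name if you want to rank on something else.
    """
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        raise ValueError(f"no matches found in {csv_path}")
    # CSV values are strings, so convert to float before comparing.
    best = max(rows, key=lambda row: float(row[column]))
    return best["match_name"], float(best[column])
```

Calling pick_best_reference("matches.csv") returns the name of the top-ranked genome together with its value, which you can then use to pick the FASTA file to map against.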

Happy to help troubleshoot here or on the sourmash issue tracker if there is interest.

A more fun and “sophisticated” approach that could go horribly awry is to use sourmash gather after the prefetch to develop a minimum metagenome cover, but that’s only for people who are OK with expending some of their time and energy on a potential wild goose chase (which I am happy to support, but, you know, still).
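To give a feel for what a "minimum metagenome cover" means, here is a toy greedy set-cover sketch in Python. It is not sourmash's implementation: plain Python sets of integers stand in for sketch hashes, and the function name greedy_min_cover and the genome names are made up for illustration.

```python
def greedy_min_cover(query, references):
    """Greedily pick references that explain the most still-uncovered query hashes.

    query: set of hashes from the metagenome (toy integers here).
    references: dict mapping genome name -> set of hashes.
    Returns a list of (name, newly_covered_count) in the order chosen.
    """
    remaining = set(query)
    candidates = dict(references)
    chosen = []
    while remaining and candidates:
        # Find the candidate overlapping most with what is still unexplained.
        name, overlap = max(
            ((n, len(remaining & hashes)) for n, hashes in candidates.items()),
            key=lambda item: item[1],
        )
        if overlap == 0:
            break  # nothing left that any reference can explain
        chosen.append((name, overlap))
        remaining -= candidates.pop(name)
    return chosen


query = {1, 2, 3, 4, 5, 6, 7, 8}
refs = {"genomeA": {1, 2, 3, 4, 5}, "genomeB": {4, 5, 6, 7}, "genomeC": {1, 2, 3}}
print(greedy_min_cover(query, refs))  # [('genomeA', 5), ('genomeB', 2)]
```

In this toy example genomeC is never chosen because everything it shares with the query is already explained by genomeA, which is the kind of redundancy a cover-based approach is meant to remove.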
