using sourmash to select best genome for mapping

This came up on the microbial bioinformatics Slack and I thought I'd share. The topic was selecting a reference from among many viral genomes.

> it should be possible to use large-scale ANI-style analyses to select the closest genome for mapping. we’ve been doing this with sourmash and genome-grist for metagenomes, and I know tools like ganon and I think kmcp can do the same thing.
> with sourmash I would say the first thing to try is:
> * sketch all your genome references with `sourmash sketch dna -p scaled=100 *.fna`
> * do the same with your metagenome(s)/shotgun reads
> * run `sourmash prefetch <metagenome>.sig genome*.fna.sig --threshold-bp=0 -o matches.csv`
> * sort matches.csv on `f_match_query` and pick the highest value (this is "k-mer detection" per https://github.com/sourmash-bio/sourmash/issues/2170) and use that `match_name` as the reference genome.
> 
> happy to help troubleshoot here or on sourmash issue tracker if there is interest.
> 
> A more fun and “sophisticated” approach that could go horribly awry is to use sourmash gather after the prefetch to develop a minimum metagenome cover, but that’s only for people that are ok with expending some of their time and energy on a potential wild goose chase (which I am happy to support, but, you know, still)
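
For the last step in the list, here is a minimal Python sketch of picking the top match, assuming `matches.csv` is the prefetch output and that it has the `f_match_query` and `match_name` columns named above. The function name `pick_best_reference` is just something I made up for illustration; it is not part of sourmash.

```python
import csv


def pick_best_reference(csv_path, column="f_match_query"):
    """Return (match_name, value) for the row with the highest `column`.

    Hypothetical helper, not a sourmash API. Assumes the CSV has a
    `match_name` column plus the numeric column being ranked on.
    Returns None if the CSV has no data rows.
    """
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        return None
    # Highest f_match_query = largest fraction of the genome's k-mers
    # found in the metagenome ("k-mer detection").
    best = max(rows, key=lambda row: float(row[column]))
    return best["match_name"], float(best[column])
```

Calling `pick_best_reference("matches.csv")` should give you the name of the candidate reference and its detection value. A `None` result means prefetch found no overlaps at all, in which case it is probably worth checking that the sketches share the same k-mer size and scaled value.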

using sourmash to select best genome for mapping #2334