using sourmash to select best genome for mapping #2334
This came up on microbial bioinformatics slack, thought I'd share - topic was selecting from many viral genomes.
it should be possible to use large-scale ANI-style analyses to select the closest genome for mapping. we’ve been doing this with sourmash and genome-grist for metagenomes, and I know tools like ganon (and, I think, kmcp) can do the same thing.
with sourmash I would say the first thing to try is:
- sketch all your genome references with `sourmash sketch dna -p scaled=100 *.fna`
- do the same with your metagenome(s)/shotgun reads
- run `sourmash prefetch <metagenome>.sig genome*.fna.sig --threshold-bp=0 -o matches.csv`
- sort matches.csv on `f_match_query` and pick the highest value (this is "k-mer detection" per "use 'detection' terminology for fraction-of-genome-kmers-found" #2170), then use that `match_name` as the reference genome.

happy to help troubleshoot here or on the sourmash issue tracker if there is interest.
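
For the sort-and-pick step, here is a minimal sketch in Python using only the standard library. It assumes the prefetch CSV has `match_name` and `f_match_query` columns, as named in the steps above; the function name `pick_best_reference` is made up for illustration and is not part of sourmash.

```python
import csv


def pick_best_reference(csv_path):
    """Return (match_name, f_match_query) for the row with the highest
    f_match_query in a prefetch CSV, or None if the CSV has no rows.

    Hypothetical helper for illustration; not a sourmash API.
    """
    with open(csv_path, newline="") as fp:
        rows = list(csv.DictReader(fp))
    if not rows:
        return None
    # f_match_query is read as text, so convert to float before comparing
    best = max(rows, key=lambda row: float(row["f_match_query"]))
    return best["match_name"], float(best["f_match_query"])
```

Calling `pick_best_reference("matches.csv")` on the prefetch output would then give the candidate reference name and its detection value; it is probably worth eyeballing the next few rows too, in case several genomes have near-identical detection.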
A more fun and “sophisticated” approach that could go horribly awry is to use sourmash gather after the prefetch to develop a minimum metagenome cover, but that’s only for people that are ok with expending some of their time and energy on a potential wild goose chase (which I am happy to support, but, you know, still)
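
To give a rough idea of what a "minimum metagenome cover" means, here is a toy greedy set-cover sketch in Python. It is only an illustration of the general idea, with small integer sets standing in for k-mer hashes; it is not sourmash's implementation, and `greedy_cover` is a hypothetical name.

```python
def greedy_cover(query_hashes, references):
    """Greedily pick references that explain the most not-yet-covered
    query hashes, until nothing more can be explained.

    query_hashes: iterable of hashes from the metagenome (toy integers here)
    references:   dict mapping reference name -> iterable of hashes
    Returns a list of (name, newly_covered_count) in the order chosen.
    """
    remaining = set(query_hashes)
    candidates = {name: set(hashes) for name, hashes in references.items()}
    cover = []
    while remaining and candidates:
        # find the reference with the largest overlap with uncovered hashes
        name, overlap = max(
            ((n, len(remaining & h)) for n, h in candidates.items()),
            key=lambda item: item[1],
        )
        if overlap == 0:
            break  # no remaining reference explains anything new
        cover.append((name, overlap))
        remaining -= candidates.pop(name)
    return cover
```

The point of the greedy step is that a genome sharing most of its k-mers with an already-chosen genome contributes little on the second pass, so closely related references do not all get picked.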