sourmash lca vs sourmash gather/tax for taxonomy - why? #2760
Labels: doc (documentation content or issues)
Comments
This was referenced Sep 19, 2023
ctb added a commit that referenced this issue on Sep 25, 2023:

… analysis (#2777)

This PR adds cautionary notes to the command line docs, and updates the information on classifying signatures to suggest using tax instead of LCA, and even explains why :). There is more work to be done - we need to add more tutorials, and adjust the language in classifying-signatures around gather and LCA - but this is a nice standalone PR!

Fixes #2562
Fixes #2772
Fixes #2773
Adds information from #2760
Addresses #2535
ctb added a commit that referenced this issue on Oct 15, 2023:

This PR rearranges the docs according to the https://diataxis.fr/ structure, per #2054.

New pages:
* [A heavily revised index page](https://sourmash--2184.org.readthedocs.build/en/2184/index.html)
* [A guide to the internals of sourmash](https://sourmash--2184.org.readthedocs.build/en/2184/sourmash-internals.html)
* [Frequently asked questions](https://sourmash--2184.org.readthedocs.build/en/2184/faq.html)
* [Publications](https://sourmash--2184.org.readthedocs.build/en/2184/publications.html)
* [Funding](https://sourmash--2184.org.readthedocs.build/en/2184/funding.html)

Fixes #2054 (document restructuring)
Fixes #932 (add an FAQ)
Fixes #2760 (tax preferred to lca)
Tackles #1227 (what is gather)
Fixes #971 (funding acks)
Fixes #1289 (p_match and p_query)
Fixes #1531 (document memory tradeoffs in save formats)
Fixes #1532 (order of database load/reporting)
Fixes #1609 (better gather description, `f_unique_query`, etc.)
Fixes #2170 (use `detection`)
Fixes #1881 (correlation with read mapping)
Fixes #2566 (retrieving reads)
Fixes #2775 (vision & mission)

---------

Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>
@jaebeom-kim asked on Slack why we recommend gather+tax over lca.
My partial / quick answer:
lca classify uses single k-mers to make taxonomic assessments, and then aggregates those - it’s analogous to Kraken’s lca approach. gather uses combinations of k-mers to find the best matching genome, and then sourmash taxonomy assigns taxonomy using the taxa of the genomes. All our published benchmarking has been done with sourmash gather and it seems much, much, much more accurate than LCA. I’m working on a writeup of why and will update here!
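To make the contrast above concrete, here is a toy Python sketch. It is not sourmash's implementation: the genome names, hash values, and lineages are made up for illustration, and the greedy "largest remaining overlap" loop is only an assumed stand-in for what gather does. It shows how classifying each k-mer on its own pushes shared k-mers up to a common ancestor, while matching on the whole combination of k-mers picks out a single genome whose lineage can then be used directly.

```python
from collections import Counter

# Hypothetical toy data: each genome is a set of integer "k-mer hashes",
# and each genome has a lineage written as (genus, species).
GENOMES = {
    "genomeA": {1, 2, 3, 4, 5, 6},
    "genomeB": {4, 5, 6, 7, 8, 9},
}
LINEAGES = {
    "genomeA": ("Escherichia", "Escherichia coli"),
    "genomeB": ("Escherichia", "Escherichia fergusonii"),
}


def lca(lineages):
    """Return the longest shared prefix of the given lineage tuples."""
    lineages = list(lineages)
    prefix = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return tuple(prefix)


def classify_per_kmer(query):
    """LCA-style: assign a lineage to each k-mer separately, then tally."""
    counts = Counter()
    for h in query:
        owners = [name for name, hashes in GENOMES.items() if h in hashes]
        if owners:
            counts[lca(LINEAGES[name] for name in owners)] += 1
    return counts


def greedy_gather(query):
    """Gather-style: repeatedly take the genome that explains the most
    of the query hashes that are still unassigned."""
    remaining = set(query)
    results = []
    while remaining:
        best = max(GENOMES, key=lambda name: len(GENOMES[name] & remaining))
        overlap = GENOMES[best] & remaining
        if not overlap:
            break  # nothing in the database matches what is left
        results.append((best, len(overlap)))
        remaining -= overlap
    return results


# A query that is exactly genomeA.
query = {1, 2, 3, 4, 5, 6}
per_kmer = classify_per_kmer(query)
matches = greedy_gather(query)
```

In this toy case, half of the query's k-mers are shared with genomeB, so the per-k-mer tally can only place them at genus level, even though the query is exactly genomeA. The gather-style loop returns genomeA as the single match, and the species-level lineage comes from `LINEAGES["genomeA"]`.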
per #2758, our publications support this -
The other preprint that discusses this is here: Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers
Neither paper explores why it works so well - that's an upcoming one, hopefully ;). But we have strong results showing that you cannot use single k-mers or individual reads (short OR long) to distinguish between genomes, in general.