sourmash lca vs sourmash gather/tax for taxonomy - why? #2760

Closed
ctb opened this issue Sep 14, 2023 · 0 comments · Fixed by #2184
Labels
doc documentation content or issues

Comments

@ctb
Contributor

ctb commented Sep 14, 2023

@jaebeom-kim asked on slack why we recommend gather+tax over lca.

My partial / quick answer:

lca classify makes taxonomic assessments from single k-mers and then aggregates them - it’s analogous to Kraken’s LCA approach. gather uses combinations of k-mers to find the best matching genome, and sourmash taxonomy then assigns taxonomy using the taxa of those genomes.

All our published benchmarking has been done with sourmash gather, and it seems much, much, much more accurate than LCA. I’m working on a writeup of why and will update here!
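To make the difference concrete, here is a toy sketch of the two strategies. It is not sourmash's implementation: the genomes, lineages, and function names (`kmer_lca_votes`, `greedy_gather`) are made up for illustration, k-mers are plain integers, and real gather works on FracMinHash sketches with thresholds that are omitted here.

```python
from collections import Counter

# Hypothetical toy database: each genome is a set of hashed k-mers (ints).
GENOMES = {
    "genomeA": {1, 2, 3, 4, 5, 6},
    "genomeB": {4, 5, 6, 7, 8, 9},
    "genomeC": {10, 11, 12},
}
# Hypothetical lineages, from highest rank to lowest.
LINEAGES = {
    "genomeA": ("Bacteria", "GenusX", "species_a"),
    "genomeB": ("Bacteria", "GenusX", "species_b"),
    "genomeC": ("Bacteria", "GenusY", "species_c"),
}


def lca(lineages):
    """Return the longest shared prefix of a list of lineage tuples."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return tuple(common)


def kmer_lca_votes(query):
    """LCA style: every k-mer votes for the LCA of all genomes containing it."""
    votes = Counter()
    for kmer in query:
        hits = [LINEAGES[g] for g, kmers in GENOMES.items() if kmer in kmers]
        if hits:
            votes[lca(hits)] += 1
    return votes


def greedy_gather(query):
    """Gather style: repeatedly take the genome covering most of what is left."""
    remaining = set(query)
    results = []
    while remaining:
        best = max(GENOMES, key=lambda g: len(GENOMES[g] & remaining))
        overlap = GENOMES[best] & remaining
        if not overlap:
            break  # nothing in the database explains the rest of the query
        results.append((best, len(overlap)))
        remaining -= overlap
    return results


# A query made of all of genomeA plus all of genomeC.
query = {1, 2, 3, 4, 5, 6, 10, 11, 12}
votes = kmer_lca_votes(query)
matches = greedy_gather(query)
# Taxonomy then comes from the lineages of the matched genomes.
assigned = [(LINEAGES[name], n) for name, n in matches]
```

In this toy case the per-k-mer votes for k-mers 4, 5, and 6 can only be resolved to the genus level, because genomeA and genomeB share them, while the greedy cover assigns all six k-mers to genomeA as a whole and never reports genomeB.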

per #2758, our publications support this -

This is why (I strongly believe, with solid but not yet published receipts ;) sourmash performs so well in Portik et al, 2022.

The other preprint that discusses this is here: Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers

Neither paper explores why it works so well - that's an upcoming one, hopefully ;). But we have strong results showing that you cannot use single k-mers or individual reads (short OR long) to distinguish between genomes, in general.

ctb added the doc documentation content or issues label Sep 23, 2023
ctb added a commit that referenced this issue Sep 25, 2023
… analysis (#2777)

This PR adds cautionary notes to the command line docs, and updates the
information on classifying signatures to suggest using tax instead of
LCA, and even explains why :).

There is more work to be done - we need to add more tutorials, and
adjust the language in classifying-signatures around gather and LCA -
but this is a nice standalone PR!

Fixes #2562
Fixes #2772
Fixes #2773

Adds information from #2760
Addresses #2535
ctb closed this as completed in #2184 Oct 15, 2023
ctb added a commit that referenced this issue Oct 15, 2023
This PR rearranges the docs according to the https://diataxis.fr/
structure, per #2054.

New pages:
* [A heavily revised index page](https://sourmash--2184.org.readthedocs.build/en/2184/index.html)
* [A guide to the internals of sourmash](https://sourmash--2184.org.readthedocs.build/en/2184/sourmash-internals.html)
* [Frequently asked questions](https://sourmash--2184.org.readthedocs.build/en/2184/faq.html)
* [Publications](https://sourmash--2184.org.readthedocs.build/en/2184/publications.html)
* [Funding](https://sourmash--2184.org.readthedocs.build/en/2184/funding.html)

Fixes #2054 (document restructuring)
Fixes #932 (add an FAQ)
Fixes #2760 (tax preferred to lca)
Tackles #1227 (what is gather)
Fixes #971 (funding acks)
Fixes #1289 (p_match and p_query)
Fixes #1531 (document memory tradeoffs in save formats)
Fixes #1532 (order of database load/reporting)
Fixes #1609 (better gather description, `f_unique_query`, etc.)
Fixes #2170 (use `detection`)
Fixes #1881 (correlation with read mapping)
Fixes #2566 (retrieving reads)
Fixes #2775 (vision & mission)

---------

Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>