what does 'disagree' mean in output? #2772

Closed
sapuizait opened this issue Sep 19, 2023 · 2 comments · Fixed by #2777
Labels
doc documentation content or issues taxonomy

Comments

@sapuizait

Apologies if this is something trivial or has been asked before, but I could not find an answer in this forum or by googling it.

I use sourmash to classify genomes/MAGs. I understand that in the output table a successful classification is marked 'found' and an unsuccessful one 'nomatch', but what does 'disagree' mean? I assume it means that matches were found but did not meet the software's cutoff, i.e. the matches are not great. Is that it? If so, can I still use the result, and how reliable is it?

thanks

here is an example:

2023_1030076_1_MG_127_23112020_S0_L001bin.47.fa,disagree,d__Bacteria,p__Bacteroidota,c__Bacteroidia,o__Bacteroidales,f__Barnesiellaceae,g__Barnesiella_A,,

2023_1030076_1_MG_127_23112020_S0_L001bin.4.fa,nomatch,,,,,,,,

2023_1030076_1_MG_127_23112020_S0_L001bin.52.fa,found,d__Bacteria,p__Spirochaetota,c__Spirochaetia,o__Treponematales,f__Treponemataceae,g__Treponema_D,s__Treponema_D sp900767955,

2023_1030076_1_MG_127_23112020_S0_L001bin.57.fa,found,d__Bacteria,p__Verrucomicrobiota,c__Verrucomicrobiae,o__Verrucomicrobiales,f__Akkermansiaceae,g__Akkermansia,,
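
For readers with a table like this one, here is a minimal Python sketch that reports the deepest rank assigned in each row, so 'disagree' rows can be compared with 'found' rows. The column layout (name, status, then eight ranks from superkingdom to strain) is an assumption inferred from the rows above, and `deepest_rank` is a hypothetical helper, not part of sourmash.

```python
import csv
import io

# Assumed column order after name and status (inferred from the example rows).
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def deepest_rank(row):
    """Return (name, status, rank, taxon) for the lowest non-empty rank in a row."""
    name, status, *lineage = row
    assigned = [(rank, taxon) for rank, taxon in zip(RANKS, lineage) if taxon]
    rank, taxon = assigned[-1] if assigned else (None, None)
    return name, status, rank, taxon


# Three of the rows from the example above, copied verbatim.
example = """\
2023_1030076_1_MG_127_23112020_S0_L001bin.47.fa,disagree,d__Bacteria,p__Bacteroidota,c__Bacteroidia,o__Bacteroidales,f__Barnesiellaceae,g__Barnesiella_A,,
2023_1030076_1_MG_127_23112020_S0_L001bin.4.fa,nomatch,,,,,,,,
2023_1030076_1_MG_127_23112020_S0_L001bin.52.fa,found,d__Bacteria,p__Spirochaetota,c__Spirochaetia,o__Treponematales,f__Treponemataceae,g__Treponema_D,s__Treponema_D sp900767955,
"""

results = [deepest_rank(row) for row in csv.reader(io.StringIO(example))]
for name, status, rank, taxon in results:
    print(name, status, rank, taxon, sep="\t")
```

In this example the 'disagree' row still carries a lineage down to genus, while the 'nomatch' row has no assigned rank at all.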

@ctb
Contributor

ctb commented Sep 19, 2023

great question!

I vaguely remembered writing something about it somewhere, but had to go digging 😆 . And it was by no means easy to find!

Here's what I wrote in this blog post:

Interpreting the CSV file

The CSV file has five columns: name, taxid, status, rank_info, and lineage. name is the name in the signature file; taxid and lineage are the NCBI taxonomic ID assigned to that name, and the taxonomic lineage of that ID.

The status and rank_info require a bit more explanation.

There are three possible status values at present: nomatch, found, and disagree.

nomatch is hopefully obvious - no (or insufficient) taxonomic information was found for k-mers in that signature, so you get a taxid of 0, and an empty rank and lineage:

TARA_IOS_MAG_00005,0,nomatch,,

found is hopefully also pretty obvious: it says that a simple straightforward lineage was found across all of the databases, and gives you that lineage down to the most detailed rank found:

TARA_PSW_MAG_00129,28108,found,species,Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Alteromonas;Alteromonas macleodii

disagree is the least clear. Here, the assigned rank is the rank immediately above where there is a taxonomic disagreement, and the taxid & lineage refer to the name at that rank (the least-common-ancestor at which an assignment can be made).

For example, look at this line in the CSV file:

TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria

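TARA_ASW_MAG_00029 has k-mers that are shared between different orders: 'Pseudomonadales' and 'Rhodobacterales'. These orders also belong to different classes (Gammaproteobacteria and Alphaproteobacteria), so the lowest rank they share is phylum. Therefore, the classifier status is disagree, and the classified taxid is at rank phylum.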
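
The three status values described above can be illustrated with a small Python sketch of a least-common-ancestor rule. This is not the sourmash implementation: `classify` is a hypothetical function, lineages are modelled as plain tuples of names, and any minimum k-mer count thresholds the real tool may apply are ignored.

```python
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def classify(lineages):
    """Return (status, rank, lineage) for the lineages hit by one signature.

    Each lineage is a tuple of names ordered as in RANKS; it may stop early.
    """
    unique = sorted(set(lineages), key=len)
    if not unique:
        # No taxonomic information at all.
        return "nomatch", "", ()

    longest = unique[-1]
    if all(longest[:len(lin)] == lin for lin in unique):
        # Every lineage lies on one path: report the most detailed one.
        status, lineage = "found", longest
    else:
        # Lineages branch: keep only the prefix that all of them share.
        common = []
        for names in zip(*unique):
            if len(set(names)) > 1:
                break
            common.append(names[0])
        status, lineage = "disagree", tuple(common)

    rank = RANKS[len(lineage) - 1] if lineage else ""
    return status, rank, lineage


pseudomonadales = ("Bacteria", "Proteobacteria",
                   "Gammaproteobacteria", "Pseudomonadales")
rhodobacterales = ("Bacteria", "Proteobacteria",
                   "Alphaproteobacteria", "Rhodobacterales")

print(classify([pseudomonadales, rhodobacterales]))
print(classify([pseudomonadales, ("Bacteria", "Proteobacteria")]))
print(classify([]))
```

With the two orders from the example, the shared prefix ends at Proteobacteria, so the sketch reports disagree at rank phylum, matching the CSV line above.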


make sense?

thanks again for asking!

Since you're using the LCA module, I also wanted to point you at this issue: #2760

Nowadays we suggest using sourmash gather to produce a CSV file containing a list of overlapping genomes, followed by sourmash tax to classify the genome or metagenome contents - it's proven to be much more accurate (specific) than our LCA approaches. See #2699 for a tutorial demonstrating it on both metagenomes and genomes!

If you have any more questions, ask away! And please do leave this issue open so I can update the docs appropriately.

@ctb ctb added doc documentation content or issues taxonomy labels Sep 19, 2023
@sapuizait
Author

WOW, thanks a lot for the detailed answer! And thanks for pointing me to the gather > tax approach. For the record, I am very happy with the classification: when I BLAST the 'nomatch' bins, for example, they only have some very distant hits to uncultured/unclassified isolates that nobody knows anything about :D
Again, thanks for the great software!

ctb added a commit that referenced this issue Sep 25, 2023
… analysis (#2777)

This PR adds cautionary notes to the command line docs, and updates the
information on classifying signatures to suggest using tax instead of
LCA, and even explains why :).

There is more work to be done - we need to add more tutorials, and
adjust the language in classifying-signatures around gather and LCA -
but this is a nice standalone PR!

Fixes #2562
Fixes #2772
Fixes #2773

Adds information from
#2760
Addresses #2535