Application of f_unique_to_query and threshold_bp #3154
great question! The statement is in contrast to [...]
Suppose that you have a metagenome, and you find two matches, with the second match being to [...]
This means that if you were only considering the Lokiarchaeota match, it would account for 10% of the unique k-mers in the metagenome. But since it's the second match, the first match may have also matched to some of those k-mers - and in this case, it did: the first match "consumed" half of the k-mers that would have matched to Lokiarchaeota, resulting in [...]
This is analogous to saying, "how many reads would map to this genome?" ([...])
Coming back around to your original question: it depends on the analysis, but I hope that's useful. It's complicated to explain and I'm afraid I didn't do a great job!
(I've also noticed a mistake in the docs - it says, "The fraction of matching hashes (unweighted) that are unique to this query; rank dependent." It's actually the fraction unique to this match. I'll fix.)
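To make the "consumed k-mers" idea concrete, here is a toy sketch in plain Python. It is not sourmash's implementation: the hash sets, the match names, and the helper `greedy_unique_fractions` are all made up for illustration, and matches are simply processed in the order given rather than chosen by best overlap. Each query hash is credited to the first match that contains it, so the rank-dependent fraction (in the spirit of `f_unique_to_query`) can be smaller than the raw overlap.

```python
# Toy illustration with made-up hash sets; NOT sourmash's actual code.
def greedy_unique_fractions(query, matches):
    """Credit each query hash to the first match (in rank order) containing it.

    Returns {name: (overlap_fraction, unique_fraction)}, both relative to the
    query size. overlap_fraction ignores rank; unique_fraction is rank dependent.
    """
    remaining = set(query)  # query hashes not yet claimed by an earlier match
    results = {}
    for name, hashes in matches:
        overlap_all = query & hashes      # everything this match shares with the query
        overlap_new = remaining & hashes  # only what earlier matches left unclaimed
        results[name] = (len(overlap_all) / len(query),
                         len(overlap_new) / len(query))
        remaining -= overlap_new
    return results

# Hypothetical data: a 100-hash query and two matches that share 5 hashes.
query = set(range(100))
first_match = set(range(0, 45))    # overlaps 45 query hashes
second_match = set(range(40, 50))  # overlaps 10 query hashes, 5 already claimed

fractions = greedy_unique_fractions(
    query, [("first", first_match), ("second", second_match)]
)
total_overlap = sum(f[0] for f in fractions.values())
total_unique = sum(f[1] for f in fractions.values())
print(fractions)
print(total_overlap, total_unique)
```

In this made-up case the second match overlaps 10% of the query but only 5% is left for it once the first match is accounted for. Summing the rank-independent overlaps counts the 5 shared hashes twice, while summing the unique fractions counts each query hash at most once.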
this is complicated - there's lots of discussion elsewhere, see #2360 for example. My rule of thumb is that k=31, s=1000, and threshold-bp=3*s is good. So I'd recommend using 3000. There's lots more to say here, but rather than thinking too hard about it, I'd suggest following up with mapping-based validation of a subset of your matches. That is, take your top 10 matches, and map reads to them. You should see good correspondence between what sourmash reports and what read mapping shows. Please feel free to ask for more details! I have lots! I just don't want to overwhelm ;)
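As a rough sketch of the two suggestions above: with a scaled value of s, each retained hash stands in for roughly s bp, so a bp threshold corresponds to about threshold_bp / s shared hashes. The helper names and the CSV contents below are hypothetical; the only thing assumed about the gather output is that it has `name` and `f_unique_to_query` columns.

```python
import csv
import io

def approx_min_shared_hashes(threshold_bp, scaled):
    """Roughly how many shared hashes a bp threshold corresponds to.

    Assumes each hash represents about `scaled` bp; exact rounding in
    sourmash may differ.
    """
    return threshold_bp / scaled

def top_matches(gather_csv_text, n=10):
    """Return the names of the n matches with the largest f_unique_to_query."""
    rows = list(csv.DictReader(io.StringIO(gather_csv_text)))
    rows.sort(key=lambda row: float(row["f_unique_to_query"]), reverse=True)
    return [row["name"] for row in rows[:n]]

# Made-up gather-style CSV for illustration only.
example_csv = """name,f_unique_to_query
genome_A,0.30
genome_B,0.12
genome_C,0.004
"""

print(approx_min_shared_hashes(3000, 1000))  # rule of thumb: 3 * s
print(approx_min_shared_hashes(2000, 1000))  # the setting asked about
print(top_matches(example_csv, n=2))
```

The names returned by `top_matches` would be the genomes to fetch and map reads against when checking that mapping agrees with what sourmash reports.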
Such a clear explanation of [...]
Is the abundance metric calculated such that matches aren't double counted (i.e. abundance relates to [...])?
On a slightly different note, can you expand on the [...]
Thanks again @ctb :)
hah! You picked out an issue I didn't want to cover for fear of confusing you more ;).
tl;dr use
Ahh! This is just about ANI estimates - from these docs,
So it's about the internal estimation of ANI, not anything else about the match, if that makes sense. Although, intuitively, if the match is too small to estimate ANI robustly, maybe that suggests the match itself isn't that robust... hmm. @bluegenes @dkoslicki thoughts?
thanks! @bluegenes @dkoslicki it would be good to get your thoughts on dropping matches that may be false positives. Also, the column name is actually [...]
hmm, I ... don't know. Are we doing double negatives here? 😭 @bluegenes your thoughts welcome!
Addresses #3154 (comment): "I've also noticed a mistake in the docs - it says, "The fraction of matching hashes (unweighted) that are unique to this query; rank dependent." It's actually the fraction unique to this match. I'll fix."
Hi,
I am using Sourmash to profile bacterial composition and abundance in shotgun WGS stool samples, and have two questions:
Could you expand on what you mean by this statement with regard to the f_unique_to_query column?
'This column should be used in any analysis that needs to avoid double-counting matches.'
Currently, I am using all the rows in the output table. Am I double counting by not 'using' f_unique_to_query?
My current parameters are k=31, s=1000, threshold_bp=2000
In your experience will this low threshold return a very high number of false positives?
Many thanks