update documentation re k-mer trimming and gather vs search #2122
Description
a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggest doing k-mer trimming of metagenomes, e.g. with `trim-low-abund`.
but, for `gather` in particular, this is not especially important. we should update the documentation with a few points:
- erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
- erroneous k-mers from sequencing errors will also generally be low-abundance
- Jaccard similarity and containment are both fractions that take into account the total number of k-mers, including erroneous k-mers. so both similarity and containment fractions (as produced by `search`, `search --containment`, and `prefetch`) will be decreased by the presence of erroneous k-mers. If this is a problem for you, you should trim.
- however, `sourmash gather` uses shared k-mers, and is robust to erroneous k-mers.
  - in particular, `p_match` to genomes will be robust b/c the denominator uses a genome size, and the numerator is shared k-mers and hence will ignore errors. ref explain p_match and p_query in sourmash documentation #1289
  - similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
- also, if you're worried about erroneous k-mers, `sourmash gather` with an abundance-weighted query will perform better because low-abundance k-mers will have less impact on the final coverage estimate. ref provide both abundance-weighted coverage & flattened coverage in `sourmash gather` output? #1818
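
To make the abundance-weighting point concrete, here is a minimal toy sketch in plain Python. It is not sourmash code and does not reproduce how `sourmash gather` computes its output columns; the function names `flat_fraction` and `weighted_fraction`, the integer stand-ins for k-mers, and all sizes and abundances are assumptions made up for illustration.

```python
from collections import Counter


def flat_fraction(query, match):
    """Fraction of distinct query k-mers that are found in the match."""
    return len(set(query) & match) / len(query)


def weighted_fraction(query, match):
    """Fraction of the query's total k-mer abundance that is found in the match."""
    hit = sum(n for k, n in query.items() if k in match)
    return hit / sum(query.values())


# Toy data (hypothetical): integers stand in for hashed k-mers.
genome = set(range(500))                    # 500 true k-mers
query = Counter({k: 20 for k in genome})    # each true k-mer seen 20 times
# 1500 erroneous k-mers, each seen once and matching nothing.
query.update({k: 1 for k in range(10_000, 11_500)})

flat = flat_fraction(query, genome)          # 500 / 2000
weighted = weighted_fraction(query, genome)  # 10000 / 11500

print(f"flat fraction matched:     {flat:.3f}")
print(f"weighted fraction matched: {weighted:.3f}")
```

In this toy case the errors are three quarters of the distinct k-mers but only a small share of the total abundance, so the weighted fraction stays high while the flat fraction drops.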
@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.
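
A small set-based sketch of why this holds, assuming toy data only: integers stand in for k-mers, the sizes are made up, and the helper names (`jaccard`, `containment`, `summarize`) are hypothetical rather than sourmash API. It compares Jaccard, containment of the sample in the genome, and containment of the genome in the sample, with and without erroneous k-mers in the sample.

```python
def jaccard(a, b):
    """Shared k-mers divided by all k-mers in either set."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the k-mers in a that are also in b."""
    return len(a & b) / len(a)


def summarize(sample, genome):
    """Return (Jaccard, sample-in-genome, genome-in-sample)."""
    return (
        jaccard(sample, genome),
        containment(sample, genome),
        containment(genome, sample),
    )


# Toy data (hypothetical sizes).
genome = set(range(1000))             # reference genome: 1000 k-mers
true_sample = set(range(800))         # sample truly covers 800 of them
errors = set(range(10_000, 12_000))   # 2000 erroneous k-mers, matching nothing
noisy_sample = true_sample | errors

clean = summarize(true_sample, genome)
noisy = summarize(noisy_sample, genome)

for name, (j, c_sample, c_genome) in [("clean", clean), ("noisy", noisy)]:
    print(f"{name}: jaccard={j:.3f} sample-in-genome={c_sample:.3f} "
          f"genome-in-sample={c_genome:.3f}")
```

Errors inflate the union and the sample size, so Jaccard and the sample-side containment both fall. The genome-side containment has the genome size as its denominator and only shared k-mers in its numerator, so it is unchanged, and so is the larger of the two containments.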
ref trimming paragraph in #1135:
> Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance.
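
The scenario in that paragraph can be sketched with toy counts. This is an illustration under stated assumptions, not the algorithm used by `trim-low-abund` (which works on reads): here a hard abundance cutoff is applied directly to k-mer counts, `drop_low_abundance` is a hypothetical helper, and the two samples are constructed so that they share every true k-mer while their errors never coincide.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity over the distinct k-mers of two count tables."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def drop_low_abundance(counts, cutoff):
    """Keep only k-mers seen at least `cutoff` times (simple hard cutoff)."""
    return Counter({k: n for k, n in counts.items() if n >= cutoff})


# Two deeply sequenced samples of the same community (toy data):
# 500 shared true k-mers at abundance 20 in each sample.
shared_true = range(500)
sample_a = Counter({k: 20 for k in shared_true})
sample_b = Counter({k: 20 for k in shared_true})

# 1500 erroneous k-mers per sample, abundance 1, different in each sample.
sample_a.update({k: 1 for k in range(10_000, 11_500)})
sample_b.update({k: 1 for k in range(20_000, 21_500)})

raw = jaccard(sample_a, sample_b)  # 500 / 3500
trimmed = jaccard(
    drop_low_abundance(sample_a, cutoff=2),
    drop_low_abundance(sample_b, cutoff=2),
)

print(f"Jaccard before trimming: {raw:.3f}")
print(f"Jaccard after trimming:  {trimmed:.3f}")
```

With identical true content, the untrimmed Jaccard is low only because of the unshared, low-abundance errors; removing k-mers below the cutoff restores it in this constructed case.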