update documentation re k-mer trimming and gather vs search

a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggest doing k-mer trimming of metagenomes, e.g. with `trim-low-abund`.

but for `gather` in particular, trimming is not very important. we should update the documentation with a few points:
* erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
* erroneous k-mers from sequencing errors will also generally be low-abundance
* Jaccard similarity and containment are both fractions whose denominators count the total number of k-mers, including erroneous k-mers. so both similarity and containment fractions (as produced by `search`, `search --containment`, and `prefetch`) will be decreased by the presence of erroneous k-mers. if this is a problem for you, you should trim.
* however, `sourmash gather` uses shared k-mers, and is robust to erroneous k-mers. in particular, `p_match` to genomes will be robust because the denominator is the genome size and the numerator is the number of shared k-mers, which erroneous k-mers do not contribute to. ref https://github.com/sourmash-bio/sourmash/issues/1289
* similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
* also, if you're worried about erroneous k-mers, `sourmash gather` with an abundance-weighted query will perform better, because low-abundance k-mers will have less impact on the final coverage estimate. ref https://github.com/sourmash-bio/sourmash/issues/1818
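
a toy illustration of the points above, using plain Python sets of integers as stand-ins for hashed k-mers. this is only a sketch of the arithmetic, not sourmash code: the set sizes, the names (`genome`, `clean`, `noisy`), and the assumption that erroneous k-mers never collide with real ones are all made up for the example.

```python
# Integers stand in for k-mer hashes (hypothetical data, not real sketches).
genome = set(range(0, 1000))          # k-mers in a reference genome
clean = set(range(0, 5000))           # true metagenome k-mers (includes the genome)
errors = set(range(10_000, 15_000))   # erroneous k-mers; assumed to match nothing
noisy = clean | errors                # untrimmed metagenome


def jaccard(a, b):
    """Shared k-mers divided by the size of the union."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of a's k-mers that are also present in b."""
    return len(a & b) / len(a)


# Jaccard and query containment both shrink when errors are added,
# because the erroneous k-mers inflate the denominator.
print("jaccard:          ", jaccard(clean, genome), "->", jaccard(noisy, genome))
print("query containment:", containment(clean, genome), "->", containment(noisy, genome))

# The fraction of the genome found in the metagenome divides by the
# genome size, so the erroneous k-mers do not change it.
print("genome fraction:  ", containment(genome, clean), "->", containment(genome, noisy))
```

in this toy case the first two fractions halve when the errors are added, while the genome fraction stays the same. that is the sense in which a genome-size denominator is robust to untrimmed data.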

@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.

ref trimming paragraph in https://github.com/sourmash-bio/sourmash/issues/1135:
>Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance
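
the abundance argument can be sketched the same way. the example below assumes, purely for illustration, that every true k-mer has abundance 20 and every erroneous k-mer has abundance 1; the cutoff of 2 and all the names are hypothetical. the trimming step is a crude per-k-mer filter and is not how `trim-low-abund` actually works, and the weighted fraction is a simplified stand-in for abundance weighting, not sourmash's implementation.

```python
from collections import Counter

# Hypothetical abundances: true k-mers are deeply sequenced, errors are singletons.
genome = set(range(0, 1000))
abund = Counter({kmer: 20 for kmer in range(0, 5000)})          # true k-mers
abund.update({kmer: 1 for kmer in range(10_000, 15_000)})       # erroneous k-mers

# Unweighted: every distinct k-mer counts once, so errors are half the query.
unweighted = len(genome & set(abund)) / len(abund)

# Abundance-weighted: each k-mer counts by its abundance, so singletons
# contribute very little to the denominator.
shared_weight = sum(count for kmer, count in abund.items() if kmer in genome)
weighted = shared_weight / sum(abund.values())

# Crude abundance trimming: drop k-mers below an (assumed) cutoff of 2.
trimmed = {kmer for kmer, count in abund.items() if count >= 2}
after_trim = len(genome & trimmed) / len(trimmed)

print(f"unweighted fraction of query matching genome: {unweighted:.3f}")
print(f"abundance-weighted fraction:                  {weighted:.3f}")
print(f"unweighted fraction after trimming:           {after_trim:.3f}")
```

with these made-up numbers the weighted fraction lands close to the trimmed value without any trimming, which is the intuition behind preferring an abundance-weighted query when errors are a concern.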

update documentation re k-mer trimming and gather vs search #2122
