
update documentation re k-mer trimming and gather vs search #2122

Open
@ctb

Description

a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggest doing k-mer trimming of metagenomes, e.g. with trim-low-abund.

but for gather in particular, trimming matters much less. we should update the documentation with a few points:

  • erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
  • erroneous k-mers from sequencing errors will also generally be low-abundance
  • Jaccard similarity and containment are both fractions whose denominators count the total number of k-mers, including erroneous k-mers. so both similarity and containment fractions (as produced by search, search --containment, and prefetch) will be decreased by the presence of erroneous k-mers. if this is a problem for you, you should trim.
  • however, sourmash gather uses shared k-mers, and is robust to erroneous k-mers. in particular, p_match to genomes will be robust because the denominator uses the genome size, and the numerator is the number of shared k-mers and hence ignores errors. ref "explain p_match and p_query in sourmash documentation" (#1289)
  • similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
  • also, if you're worried about erroneous k-mers, sourmash gather with an abundance-weighted query will perform better because low-abundance k-mers will have less impact on the final coverage estimate. ref "provide both abundance-weighted coverage & flattened coverage in sourmash gather output?" (#1818)
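
a toy illustration of the points above may help when writing the docs. this is only a sketch using plain Python sets of integers as stand-in "k-mers"; it does not call sourmash, and the sizes (1000 genome k-mers, 500 other true k-mers, 3000 erroneous k-mers) are made-up assumptions chosen for readability.

```python
def jaccard(a, b):
    """Jaccard similarity: shared k-mers over the union of both sets."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the k-mers in `a` that are also present in `b`."""
    return len(a & b) / len(a)


# Hypothetical toy data: integers stand in for hashed k-mers.
genome = set(range(0, 1000))            # reference genome, 1000 true k-mers
clean_sample = set(range(0, 1500))      # metagenome: the genome plus 500 other true k-mers
errors = set(range(10_000, 13_000))     # 3000 erroneous k-mers seen only in this sample
noisy_sample = clean_sample | errors    # the untrimmed metagenome

for name, sample in [("clean", clean_sample), ("noisy", noisy_sample)]:
    print(
        name,
        "jaccard=%.3f" % jaccard(sample, genome),
        "sample_in_genome=%.3f" % containment(sample, genome),
        "genome_in_sample=%.3f" % containment(genome, sample),
    )
```

in this toy case the two fractions whose denominators include the sample's k-mers drop from about 0.667 to about 0.222 once the errors are added, while the fraction of the genome found in the sample stays at 1.0, since its denominator is the genome size and its numerator only counts shared k-mers.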

@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.

ref trimming paragraph in #1135:

Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance
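
the abundance argument can be sketched the same way. the example below is a hedged toy model, not sourmash's implementation: it assumes every true k-mer has abundance 20 and every erroneous k-mer has abundance 1, and the helper names (`flat_fraction`, `weighted_fraction`, `trim_low_abundance`) are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: integers stand in for hashed k-mers.
genome = set(range(1000))                 # 1000 true genome k-mers

sample = Counter()                        # k-mer -> abundance in the metagenome
for kmer in range(1500):                  # true k-mers, deeply sequenced (assumed abundance 20)
    sample[kmer] = 20
for kmer in range(10_000, 13_000):        # erroneous k-mers (assumed abundance 1)
    sample[kmer] = 1


def flat_fraction(sample, genome):
    """Fraction of distinct sample k-mers found in the genome (abundance ignored)."""
    return sum(1 for k in sample if k in genome) / len(sample)


def weighted_fraction(sample, genome):
    """Fraction of the sample's total k-mer abundance that matches the genome."""
    total = sum(sample.values())
    matched = sum(n for k, n in sample.items() if k in genome)
    return matched / total


def trim_low_abundance(sample, min_abund=2):
    """Drop k-mers below `min_abund`, a crude stand-in for k-mer abundance trimming."""
    return Counter({k: n for k, n in sample.items() if n >= min_abund})


trimmed = trim_low_abundance(sample)
print("flat, untrimmed:     %.3f" % flat_fraction(sample, genome))
print("weighted, untrimmed: %.3f" % weighted_fraction(sample, genome))
print("flat, trimmed:       %.3f" % flat_fraction(trimmed, genome))
```

under these assumptions the flat fraction is about 0.222 before trimming and about 0.667 after, while the abundance-weighted fraction is about 0.606 even without trimming, because the low-abundance erroneous k-mers contribute little to the weighted denominator.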


Labels: doc (documentation content or issues)
