
leftover text from discussion #9

Open

ctb opened this issue Oct 28, 2020 · 0 comments

ctb commented Oct 28, 2020

leftover text discussion:

Other containment estimation methods, such as
CMash [@koslicki_improving_2019] and mash screen [@ondov_mash_2019],
can also implement gather.

Running a search requires access to the original dataset for the query
(mash screen) or a Bloom Filter derived from the original dataset
(CMash). When the collection of reference sketches is updated, the
Bloom Filter from CMash can be reused, but mash screen needs access to
the original dataset again.


(Maybe already covered above, or maybe should be moved to "gather can
be applied to all the data"?)
Since Scaled MinHash sketches can be used directly for gather, are a
fraction of the original data in size, and allow enumerating all of
their elements (an operation not possible with Bloom Filters), they can
be stored and reused for large collections of sequencing datasets,
including public databases like the Sequence Read Archive
[@leinonen_sequence_2011]. A service that calculates these Scaled
MinHash sketches and makes them available can improve discoverability
of these large collections, as well as support future use cases derived
from other Scaled MinHash features.
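
For illustration, here is a minimal sketch of how a Scaled MinHash sketch can be built and compared. It is a simplification under stated assumptions, not the sourmash implementation: sourmash uses MurmurHash and canonical $k$-mers, while this example uses a truncated MD5 digest as a stand-in hash, and the function names are made up.

```python
# A minimal, illustrative Scaled MinHash ("FracMinHash") sketch.
# Not the sourmash implementation; the hash function is a stand-in.
import hashlib

MAX_HASH = 2**64

def hash_kmer(kmer):
    # Deterministic 64-bit hash (stand-in for MurmurHash).
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def scaled_minhash(sequence, ksize=31, scaled=1000):
    """Keep every k-mer whose hash falls below MAX_HASH / scaled.

    On average 1/scaled of the distinct k-mers are retained, so the sketch
    is a fixed fraction of the data and its elements can be enumerated
    directly (unlike a Bloom Filter)."""
    threshold = MAX_HASH // scaled
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = hash_kmer(sequence[i : i + ksize])
        if h < threshold:
            hashes.add(h)
    return hashes

def containment(query, subject):
    # Fraction of the query sketch found in the subject sketch.
    return len(query & subject) / len(query) if query else 0.0
```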


Compared to previous taxonomic profiling methods, Scaled MinHash can
also be seen as a mix of two other approaches: it uses exact $k$-mer
matching and assignment, and the $k$-mers selected by the MinHashing
process are equivalent to implicitly-defined markers. It differs from
previous approaches because only a subset of the $k$-mer composition
is used for matching, and because traditional gene markers are
explicitly chosen for their sequence conservation and low mutation
rates, while MinHashing generates a randomized set of marker $k$-mers
that is nonetheless consistent across datasets.

Despite improvements to the standardization and reproducibility of
previous analyses, benchmarking taxonomic profiling tools is still
challenging, since tools can generate their reference databases from
multiple sources, and choosing only one source can bias the evaluation
or make it impossible to evaluate them properly. This is especially
true for real metagenomic datasets derived from samples collected from
soil and marine environments, where the number of unknown organisms is
frequently larger than the number contained in reference databases. With
the advent of metagenome-assembled genomes (MAGs) there are more
resources available for use as reference datasets, even if they are
usually incomplete or of draft quality. sourmash is well positioned to
include these new references in taxonomic profiling given its minimal
requirements (a Scaled MinHash sketch of the original dataset) and
support for indexing hundreds of thousands of datasets.

In this chapter gather is described in terms of taxonomic profiling
of metagenomes. That is one application of the algorithm, but it can
be applied to other biological problems too. If the query is a genome
instead of a metagenome, gather can be used to detect possible
contamination in the assembled genome by searching a collection of
genomes with the query genome removed from it (if it is present). This
allows finding matches that contain the query genome and evaluating
whether they agree at a specific taxonomic rank; a large divergence
(two different phyla, for example) is likely indicative that the query
genome contains sequences from different organisms. This is especially
useful for quality control and validation of metagenome-assembled
genomes (MAGs), genomes assembled from reads binned and clustered from
metagenomes, as well as for verification during submission of newly
assembled genomes to public genomic databases like GenBank.
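
As a hedged illustration of this contamination check, the sketch below searches a reference collection (with the query removed) for genomes that contain the query and compares their lineages at a chosen rank. The `references` and `lineages` inputs, the 5% containment threshold, and all function names are assumptions made for this example, not sourmash's API.

```python
def containment(query, subject):
    # Fraction of the query sketch found in the subject sketch.
    return len(query & subject) / len(query) if query else 0.0

def check_contamination(query_name, query_sketch, references, lineages,
                        rank="phylum", threshold=0.05):
    """Flag a genome as possibly contaminated if the reference genomes
    that contain it disagree at the given taxonomic rank.

    `references` maps genome name -> hash set (Scaled MinHash sketch);
    `lineages` maps genome name -> {rank: taxon}. Both are assumed inputs."""
    # Remove the query genome from the collection if it is present.
    refs = {name: sk for name, sk in references.items() if name != query_name}

    # Find reference genomes that contain a meaningful fraction of the query.
    matches = [name for name, sk in refs.items()
               if containment(query_sketch, sk) >= threshold]

    # Disagreement at `rank` suggests sequences from different organisms.
    taxa = {lineages[name][rank] for name in matches}
    return len(taxa) > 1, taxa
```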

Improvements to the calculation of Scaled MinHash sketches can also
improve the taxonomic profiling use case. Exact $k$-mer matching is
limited for phylogenetically distant organisms, since small nucleotide
differences lead to distinct $k$-mers, breaking homology assumptions.
Approaches other than the nucleotide $k$-mer composition for converting
a dataset into a set to be hashed (shingling), such as spaced $k$-mers
[@leimeister_fast_2014] and minimizers [@roberts_reducing_2004], as well
as alternative encodings of the nucleotides using 6-frame translation to
amino acids [@gish_identification_1993] or other reduced alphabets
[@peterson_reduced_2009], can allow comparisons across longer
evolutionary distances and so improve taxonomic profiling by increasing
the sensitivity of the containment estimation. These improvements don't
fundamentally change the gather method, since it would still be based
on the same containment and element removal operations, but they show
how gather works as a more general method that can leverage
characteristics of different building blocks and explore new or
improved use cases.
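
As one hedged example of such an alternative encoding, the sketch below collapses amino acids into Dayhoff-like groups before extracting $k$-mers, so that some substitutions still produce identical $k$-mers. The grouping and function names are illustrative, not necessarily the exact tables used by any particular tool.

```python
# Illustrative reduced-alphabet encoding: collapse amino acids into
# Dayhoff-like groups before extracting k-mers, so exact matching tolerates
# larger evolutionary distances.
DAYHOFF_GROUPS = {
    "C": "a",
    "A": "b", "G": "b", "P": "b", "S": "b", "T": "b",
    "D": "c", "E": "c", "N": "c", "Q": "c",
    "H": "d", "K": "d", "R": "d",
    "I": "e", "L": "e", "M": "e", "V": "e",
    "F": "f", "W": "f", "Y": "f",
}

def reduce_alphabet(protein):
    # Unknown residues (e.g. X) are kept as-is.
    return "".join(DAYHOFF_GROUPS.get(aa, aa) for aa in protein.upper())

def protein_kmers(protein, ksize=10):
    # Yield reduced-alphabet k-mers ready to be hashed into a sketch.
    reduced = reduce_alphabet(protein)
    for i in range(len(reduced) - ksize + 1):
        yield reduced[i : i + ksize]
```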

gather is a new method for decomposing datasets into their components,
with applications in biological sequencing data analysis (taxonomic
profiling), that can scale to hundreds of thousands of reference
datasets with computational resource requirements accessible to a large
number of users when used in conjunction with Scaled MinHash sketches
and efficient indices such as the LCA index and MHBT. It outperforms
current methods in community-developed benchmarks, and opens the way
for new methods that explore a top-down approach for profiling
microbial communities, including further refinements that can resolve
larger evolutionary distances and also speed up the method
computationally.
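
The core decomposition can be summarized in a few lines; the following is a minimal, hedged sketch of the greedy loop (best containment match, then removal of matched hashes), ignoring the indices and statistics that a real implementation would use. Names and the threshold semantics are illustrative.

```python
# Minimal sketch of the gather decomposition: repeatedly pick the reference
# with the largest overlap with the remaining query hashes, then remove the
# matched hashes and continue.
def greedy_gather(query, references, threshold=0.0):
    """`query` is a set of Scaled MinHash hashes; `references` maps
    dataset name -> hash set. Returns (name, overlap) pairs in match order."""
    remaining = set(query)
    matches = []
    while remaining:
        # Best remaining match by number of shared hashes.
        name, overlap = max(
            ((n, len(remaining & sk)) for n, sk in references.items()),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if name is None or overlap == 0 or overlap / len(query) < threshold:
            break
        matches.append((name, overlap))
        remaining -= references[name]   # remove elements already explained
    return matches
```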

Scaled MinHash sketches are fundamentally a subset of the $k$-mer
composition of a dataset, and so any of the techniques described in
[@marchet_data_2019] are potential candidates for improving current
indices or implementing new ones. The MHBT index can be improved by
using more efficient representations for the internal nodes
[@solomon_improved_2017] and constructing the MHBT by clustering
[@harris_improved_2018], and the LCA index can use more efficient
storage of the list of signature IDs by representing the list as
colors [@pandey_mantis:_2018]. The memory consumption of the LCA
index can also be tackled by implementing it in external memory using
memory-mapped files, letting the operating system cache and unload
pages as needed.

Current indices are also single-threaded, and don't benefit from
multicore systems. Both indices can be used in parallel by loading them
as read-only and sharing them across multiple searches, but it is also
possible to explore parallelization for single queries by partitioning the LCA
and assigning each partition to a thread, as well as using a
work-stealing thread pool for expanding the search frontier in the
MHBT in parallel. In any case, the current implementations serve as a
baseline for future scalability and can be used to guide optimization
and avoid extraneous overhead and common failings of such projects
[@mcsherry_scalability_2015].

"online" approaches

sourmash gather, the command-line interface that adds further user
experience improvements on top of the API, also allows passing multiple
indices to be searched, providing incremental support for rerunning
with additional data without having to recompute, merge, or update the
original databases.
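
A hedged sketch of the idea: each database is just another collection of reference sketches, so adding one later only extends the set that the greedy loop (see the `greedy_gather` sketch above) considers, without rebuilding the existing databases. The function name and data layout are assumptions for illustration.

```python
# Illustrative only: run gather over several collections at once by treating
# them as one combined reference set.
def gather_across_databases(query, databases):
    """`databases` is a list of dicts mapping dataset name -> hash set;
    a newly computed database can simply be appended to the list."""
    combined = {}
    for db in databases:
        combined.update(db)   # later databases add new references
    return greedy_gather(query, combined)
```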

Some limitations of gather and database types (equal results can be
hard to detect efficiently with the current SBT implementation)

The Linear index is appropriate for operations that must check every
signature, since it doesn't have any indexing overhead. An example is
building a distance matrix for comparing signatures all-against-all.
Search operations greatly benefit from extra indexing structure. The
MHBT index and $k$-mer aggregative methods in general are appropriate
for searches with query thresholds, like searching for similarity or
containment of a query in a collection of datasets. The LCA index and
color aggregative methods are appropriate for querying which datasets
contain a specific query $k$-mer.
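
The sketch below illustrates this index abstraction with its simplest implementation, a Linear index that scans every signature. The class and method names are illustrative, not the sourmash API.

```python
# Hedged sketch of an index abstraction: an Index exposes a threshold-based
# search, and a Linear index simply scans every signature.
from abc import ABC, abstractmethod

class Index(ABC):
    @abstractmethod
    def signatures(self):
        """Yield (name, hash_set) pairs for every dataset in the index."""

    def search(self, query, threshold):
        """Return datasets whose containment of the query meets the threshold."""
        results = []
        for name, hashes in self.signatures():
            c = len(query & hashes) / len(query) if query else 0.0
            if c >= threshold:
                results.append((c, name))
        return sorted(results, reverse=True)

class LinearIndex(Index):
    # No indexing overhead: appropriate when every signature must be checked,
    # e.g. building an all-against-all distance matrix.
    def __init__(self, datasets):
        self._datasets = datasets

    def signatures(self):
        yield from self._datasets.items()
```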

As implemented in sourmash, the MHBT index is more memory efficient
because the data can stay in external memory and only the tree
structure for the index needs to be loaded into main memory, while data
for the datasets and internal nodes can be loaded and unloaded on demand.
The LCA index must be loaded in main memory before it can be used, but
once it is loaded it is faster, especially for operations that need to
summarize $k$-mer assignments or require repeated searches.

Given these characteristics, if memory usage is not a concern then the
LCA index is the most appropriate choice, since it is faster.
The MHBT index is currently recommended for situations where memory is
limited, such as with smaller scaled values ($s\le2000$) that increase
the size of signatures, or when there are a large number (hundreds of
thousands or more) of datasets to index.

Converting between indices

Both the MHBT and the LCA index can recover the original sketch
collection: the MHBT outputs all of its leaf nodes, while the LCA index
reconstructs each sketch from the hash-to-dataset-ID mapping. This
allows trade-offs between storage efficiency, distribution, updating,
and query performance.

Because both are able to return the original sketch collection, it is
also possible to convert one index into the other.
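
For example, recovering the collection from an LCA-style index amounts to inverting the hash-to-dataset-ID mapping. A hedged sketch, with the data layout assumed for illustration:

```python
# Sketch of recovering the original sketches from an LCA-style inverted index:
# invert the hash -> dataset-ID mapping back into one hash set per dataset.
from collections import defaultdict

def sketches_from_lca(hash_to_ids, id_to_name):
    """`hash_to_ids` maps hash -> list of dataset IDs containing that hash;
    `id_to_name` maps dataset ID -> dataset name."""
    sketches = defaultdict(set)
    for hashval, dataset_ids in hash_to_ids.items():
        for dataset_id in dataset_ids:
            sketches[id_to_name[dataset_id]].add(hashval)
    return dict(sketches)
```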

gather Conclusion

Scaled MinHash sketches allow scaling analysis to thousands of
datasets, but efficiently searching and sharing them can benefit from
data structures that index and optimize these use cases. This chapter
introduces an index abstraction that can be trivially implemented using
a list of sketches (Linear index), as well as more advanced
implementations based on inverted indices (LCA index) and hierarchical
indices (MHBT), providing options for fast and memory-efficient
operations and making it easier to share and analyze collections of
sketches. All these functionalities are implemented in sourmash.

These indices also serve as another set of building blocks for
constructing more advanced methods for solving other relevant
biological problems like taxonomic profiling, described in Chapter
3, and approaches for increasing the resilience and
shareability of biological sequencing data, described in Chapter
5.

