
some more hot take ideas on making search and gather faster, for interactivity purposes #1578

ctb opened this issue Jun 9, 2021 · 2 comments



ctb commented Jun 9, 2021

while I'm thinking about it... it would be nice to move towards genuinely interactive search, gather, and MAGsearch.

greyhound #1226 and greyhound.sourmash.bio/ are super cool, of course, but they're "only" searching GTDB r95 and aren't well integrated with the core sourmash code base. I think the oxidized LinearIndex #1526 will probably help a great deal with core sourmash, and I think that's planned for merge fairly soon (?). So what other options are there?


over in the MAGsearch blog post, I had a few suggestions for improving MAGsearch that I think apply pretty well here -

  • First, the MAGsearch code doesn't do anything special in terms of loading; it uses the default sourmash signature format, which is JSON. A binary encoding would shrink the collection a lot while also speeding up search, by decreasing load time - see the sketch after this list.

  • Second, the signatures are searched linearly, using Rust to do so in parallel. It uses the same Rust code that underlies sourmash, but is several versions behind the latest release. Making use of recent improvements in the sourmash Rust code would probably speed this up severalfold.
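
To make the binary-encoding point concrete, here's a rough sketch (not sourmash's actual on-disk format) comparing JSON against raw little-endian uint64 packing of a sketch's hash values:

    import json
    import random
    import struct

    # hypothetical stand-in for a scaled signature's hash values
    hashes = sorted(random.getrandbits(64) for _ in range(10_000))

    json_blob = json.dumps(hashes).encode("utf-8")          # roughly how .sig JSON stores the mins
    binary_blob = struct.pack(f"<{len(hashes)}Q", *hashes)  # 8 bytes per hash, little-endian uint64
    print(f"JSON: {len(json_blob)} bytes, raw uint64: {len(binary_blob)} bytes")

    # loading back is a single unpack, with no JSON parsing overhead
    loaded = struct.unpack(f"<{len(binary_blob) // 8}Q", binary_blob)
    assert list(loaded) == hashes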


A few more thoughts, in no particular order, mixing optimization tricks, sourmash internals, and whatever else.

better/more functional inverted indices, in rust

something like the LCA database, in Rust, would be cool - see #1131, for example. I think @luizirber had some code somewhere, too, but I'm not sure whether it was being integrated into sourmash.
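
For concreteness, the heart of a revindex/inverted index is just a map from hash value to the signatures that contain it - something like this minimal sketch (names are invented; sourmash's actual LCA database is more involved):

    from collections import defaultdict

    class MiniRevIndex:
        "toy inverted index: hash value -> set of signature ids"
        def __init__(self):
            self.hash_to_sigs = defaultdict(set)
            self.sig_names = {}

        def insert(self, sig_id, name, hashes):
            self.sig_names[sig_id] = name
            for h in hashes:
                self.hash_to_sigs[h].add(sig_id)

        def containment_counts(self, query_hashes):
            "count how many query hashes each indexed signature shares"
            counts = defaultdict(int)
            for h in query_hashes:
                for sig_id in self.hash_to_sigs.get(h, ()):
                    counts[sig_id] += 1
            return sorted(counts.items(), key=lambda kv: -kv[1])

The win over a linear scan is that only the query's own hashes are ever touched, no matter how many signatures are indexed.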

techniques to ameliorate load time of LCA databases

LCA databases should be pretty fast for most searches, but they're slow to load.

#1484 talks about client/server search, where we could preload SBT or LCA indices, and even LinearIndexes, and query them.

we could also try out a SQLite backend for the revindex/LCA database (#821); see also #1808.
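
For illustration, a SQLite-backed revindex could be as simple as a hash table spread across two tables - the schema and names below are invented, not what #821/#1808 actually implement:

    import sqlite3

    conn = sqlite3.connect("revindex.sqlite")  # hypothetical filename
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS sketches (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE IF NOT EXISTS hashes (hashval INTEGER, sketch_id INTEGER);
        CREATE INDEX IF NOT EXISTS idx_hashval ON hashes(hashval);
    """)
    # note: SQLite integers are signed 64-bit, so hash values >= 2**63 would
    # need an offset or cast in a real implementation.

    def matches_for(query_hashes):
        "return (sketch_id, shared hash count) pairs, hitting only the index"
        if not query_hashes:
            return []
        qmarks = ",".join("?" * len(query_hashes))
        cur = conn.execute(
            f"SELECT sketch_id, COUNT(*) FROM hashes "
            f"WHERE hashval IN ({qmarks}) GROUP BY sketch_id",
            list(query_hashes),
        )
        return cur.fetchall()

The appeal is that there's no big load step at all - the database stays on disk and queries only page in the index pages they need.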

taking advantage of banding with scaled signatures

one theoretically solid approach that remains to be implemented would be to parallelize signature search across different "bands" of hash space.

e.g. take a scaled=1000 signature, produce N=10 (or 100 :) non-overlapping scaled=N*1000 signatures from it.

search those (smaller, faster) signatures against a database in parallel, and do some minimally clever sorting foo to identify signatures that collectively go over the threshold.

voila, massively parallel scaled signature search.

this is also potentially a way to make higher resolution signature search work, of course.

there are many variations on this approach that could be better, but I haven't thought through them yet.
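
Here's a minimal sketch of the banding idea on plain hash sets, relying only on the fact that a scaled=S sketch keeps hashes below 2**64/S; this isn't an existing sourmash API:

    # split a scaled=1000 sketch's hash space into N non-overlapping bands,
    # each as sparse as a scaled=N*1000 sketch
    MAX_HASH = 2**64
    SCALED = 1000
    N = 10
    band_width = MAX_HASH // (SCALED * N)

    def split_into_bands(hashes, n=N):
        "partition a scaled=SCALED hash set into n non-overlapping bands"
        bands = [set() for _ in range(n)]
        for h in hashes:
            # min() guards against rounding at the very top of the kept range
            bands[min(h // band_width, n - 1)].add(h)
        return bands

Each band covers its own slice of hash space, so all N bands can be searched against the database independently and their per-match overlaps summed to find signatures that collectively clear the threshold.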

cheap and hacky but maybe effective trick: raise the threshold for gather as quickly as possible

because of the way gather works with indexed databases, we might be able to get significant mileage out of providing smaller databases first on the command line, even if there's redundancy with later databases. this is because gather will raise the threshold for its searches quite quickly.

for example, if you put GTDB-genomic-reps (or, more generally, any smaller dereplicated database) on the command line before GTDB-entire, then the search threshold for matches from GTDB-entire would be raised based on the matches to GTDB-genomic-reps.
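
Concretely, a (hypothetical) invocation might look like this, with the small dereplicated database listed first so its matches raise the gather threshold before the big database is searched (filenames made up):

    sourmash gather metagenome.sig gtdb-reps.zip gtdb-entire.zip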

(in thinking about this, I'm actually pretty sure it will not work properly with prefetch. but it might be possible to change the code to make it work better across multiple databases, or something :)


ctb commented Apr 4, 2022

This is kind of included in the banded discussion above, but is a simpler concept to grok:

We could also start out with high scaled values that detect big overlaps robustly; those searches go very fast. Then remove the big overlaps, lower the scaled value, and repeat.

The challenge that has prevented me from doing this so far is that I don't know exactly how to decide what size match is small enough that we should abandon the current scaled and increase the resolution. For example, if you get 5 hash matches at a scaled of 100,000, is that still good enough? Or should you then bump down to (say) 50,000 and repeat the search?

A simpler (dumber) version of this would be to use picklists and/or multiple databases:

  • run a search at scaled=100000, extract the matching signatures at full resolution, and run them against the database to remove matches
  • run at scaled=50000, extract at full resolution, remove matches, and continue

Anyway, I think we have lots of mechanisms to do the iteration; I'm just not sure when to trigger it.
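
For illustration, here's a rough sketch of that iteration on plain hash sets - the loop structure isn't gather's actual greedy algorithm, and MIN_HASHES is exactly the arbitrary cutoff I don't know how to choose:

    MAX_HASH = 2**64
    MIN_HASHES = 5   # "good enough" cutoff per resolution -- the open question

    def downsample(hashes, scaled):
        "keep only the hashes a scaled=<scaled> sketch would retain"
        return {h for h in hashes if h < MAX_HASH // scaled}

    def iterative_search(query_hashes, database,
                         scaled_schedule=(100_000, 50_000, 10_000, 1_000)):
        # database: list of (name, full-resolution hash set) pairs
        found = []
        remaining = set(query_hashes)
        for scaled in scaled_schedule:
            q = downsample(remaining, scaled)
            for name, match_hashes in database:
                overlap = q & downsample(match_hashes, scaled)
                if len(overlap) >= MIN_HASHES:
                    found.append((name, scaled, len(overlap)))
                    remaining -= match_hashes              # remove the match at full resolution
                    q -= downsample(match_hashes, scaled)  # keep the downsampled query in sync
        return found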


ctb commented Jun 29, 2024

this issue has aged well :)
