[MRG] add sqlite3 implementations for Index, CollectionManifest, and LCA_Database #1808

Merged
merged 281 commits into from
Apr 26, 2022
4ef1f48
switch to get_matching_sketches
ctb Feb 6, 2022
cc3ddde
change default cache size
ctb Feb 6, 2022
bdc83af
count overlaps in SQL?
ctb Feb 6, 2022
49af6f2
initial addition of 'sig fileinfo'
ctb Feb 12, 2022
f3b399a
finish first-draft implementation of fileinfo and get_manifest
ctb Feb 12, 2022
ca7630b
cleanup and move over to sourmash_args
ctb Feb 12, 2022
190d53f
add manifest and length support to LCA_Database
ctb Feb 12, 2022
f814e01
add rebuild/no-rebuild args
ctb Feb 12, 2022
cca74e2
Merge branch 'latest' of https://github.com/dib-lab/sourmash into add…
ctb Feb 12, 2022
9464118
Merge branch 'add/sig_fileinfo' into add/sqlite_index_bitflip
ctb Feb 13, 2022
4b34471
use BitArray to convert uint to int
ctb Feb 13, 2022
34b9cc5
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Feb 13, 2022
3b7c612
cleanup
ctb Feb 13, 2022
e8d8276
fix the things?
ctb Feb 13, 2022
8a4518d
Merge branch 'add/sqlite_index_bitflip2' into add/sqlite_index
ctb Feb 13, 2022
e07cb3e
cleanup
ctb Feb 13, 2022
0d7e96a
more cleanup
ctb Feb 13, 2022
98ff7cb
flag when scores are diff
ctb Feb 13, 2022
323651b
fix __len__ for zipfiles, __bool__ interpretation
ctb Feb 13, 2022
5f2fd1e
Merge branch 'add/sig_fileinfo' into add/sqlite_index
ctb Feb 13, 2022
3825981
add more index, etc
ctb Feb 13, 2022
6d5d8d3
more cleanup
ctb Feb 13, 2022
00a3a73
correct for rust panic a la zip
ctb Feb 13, 2022
4795efb
commit every so often...
ctb Feb 14, 2022
aabd459
add some comments
ctb Feb 14, 2022
de9c9be
Merge branch 'add/sig_fileinfo' of https://github.com/sourmash-bio/so…
ctb Feb 14, 2022
3f21fdb
get basic manifest-generating machinery working
ctb Feb 14, 2022
30b0905
update manifest stuff
ctb Feb 15, 2022
40c146b
add bitstring in support of SqliteIndex
ctb Feb 15, 2022
4e1e82d
more cleanup
ctb Feb 15, 2022
68ad08c
add more tests
ctb Feb 15, 2022
bdd7e8c
add conditions to _get_matching_sketches
ctb Feb 15, 2022
c1df0c9
remove conditions
ctb Feb 16, 2022
79494ac
Merge branch 'add/sqlite_index' of https://github.com/dib-lab/sourmas…
ctb Feb 16, 2022
4539a30
Merge branch 'add/sqlite_index' of https://github.com/sourmash-bio/so…
ctb Feb 25, 2022
d578cbb
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Feb 26, 2022
27bb661
remove errant raise
ctb Feb 26, 2022
24f54c5
update structure
ctb Feb 26, 2022
35cd73a
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2022
4c95d10
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 3, 2022
ba00a77
some commentary
ctb Apr 3, 2022
a67d2ca
switch over to debug_literal
ctb Apr 3, 2022
a00eb96
switch to debug_literal; test tricky ordering
ctb Apr 3, 2022
21bc7bd
add LCA database test for tricky ordering
ctb Apr 3, 2022
2675741
add test for jaccard ordering to SBTs
ctb Apr 3, 2022
a3389bf
add LCA database test for tricky ordering
ctb Apr 3, 2022
628d722
add test for jaccard ordering to SBTs
ctb Apr 3, 2022
271bf4e
Merge branch 'add/test_jaccard_ordering' into add/sqlite_index
ctb Apr 3, 2022
31d8f93
add bitstring to setup
ctb Apr 3, 2022
0356a72
factor out CollectionManifest_Sqlite
ctb Apr 3, 2022
15e15ab
some basic manifests
ctb Apr 3, 2022
bf8effb
add sqlite manifest rows interface
ctb Apr 3, 2022
9a7d653
minor refactor
ctb Apr 3, 2022
f48c403
support sig manifest / test it
ctb Apr 4, 2022
e01a545
move row insert into manifest class
ctb Apr 4, 2022
76e9d89
test creation of sqlite mf
ctb Apr 4, 2022
15f91fe
switch to explicit moltype
ctb Apr 4, 2022
3f360a9
cleanup and refactoring
ctb Apr 4, 2022
f62efda
cleanup
ctb Apr 4, 2022
3627bb4
SQLite manifests are now first class
ctb Apr 4, 2022
8aec72b
pip cache should be looking at setup.cfg I think?
ctb Apr 4, 2022
31003c8
and tox cache should be looking at setup.cfg, too
ctb Apr 4, 2022
27ef5fe
try again/invalidate cache
ctb Apr 4, 2022
d2d115c
try again
ctb Apr 4, 2022
ac368a9
remove print
ctb Apr 4, 2022
c470b29
fix some stuff
ctb Apr 4, 2022
62f6b70
even more
ctb Apr 4, 2022
7b0efc8
add 'sourmash_versions' table
ctb Apr 4, 2022
153aaf3
test direct sqlmf creation & loading
ctb Apr 4, 2022
6c5e888
improve version checkingc
ctb Apr 4, 2022
7f96494
test various insertion errors
ctb Apr 4, 2022
e2296a3
fix num support in sqlite manifests (but not index)
ctb Apr 4, 2022
be04e0e
add explicit validation code, to be removed later
ctb Apr 4, 2022
29d4c8b
explicit check of 'num'
ctb Apr 4, 2022
a6351b5
add more docs/notes/annotations for work
ctb Apr 4, 2022
7611afc
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 5, 2022
cb42a4d
rename CollectionManifest_Sqlite to SqliteCollectionManifest
ctb Apr 5, 2022
84c2552
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 6, 2022
b87b432
preliminary victory over rankinfo
ctb Apr 6, 2022
72bafc9
provide generic LCA Database functionality via sqlite
ctb Apr 6, 2022
d44b21e
refactor and comment
ctb Apr 6, 2022
2342bc0
refactor and document
ctb Apr 6, 2022
8551384
add sqlite_utils
ctb Apr 7, 2022
1196554
cleanup
ctb Apr 7, 2022
37ae598
parse out SqliteIndex.create
ctb Apr 7, 2022
91f4649
rm comment
ctb Apr 7, 2022
efcb36e
add database_format to lca index
ctb Apr 7, 2022
e8819b1
get sql database output working for LCA index
ctb Apr 7, 2022
1e745f3
get all lca tests working on SQL version of LCA_Database
ctb Apr 7, 2022
2607c82
add test_index_protocol
ctb Apr 8, 2022
74b7022
add tests of indices after save/load
ctb Apr 8, 2022
baf88b0
match Index definition of __len__ in sbt
ctb Apr 8, 2022
f4f8bb9
Merge branch 'add/test_jaccard_ordering' into add/index_tests
ctb Apr 8, 2022
65fab4e
more index tests
ctb Apr 8, 2022
d243992
add some generic manifest tests
ctb Apr 8, 2022
7739afc
define abstract base class for CollectionManifest
ctb Apr 8, 2022
741f260
fix GTDB example, sigh
ctb Apr 9, 2022
f605cba
test hashval_to_idx
ctb Apr 9, 2022
106de97
add actual test for min num in rankinfo
ctb Apr 9, 2022
2378aa0
update 'get_lineage_assignments' in lca_db
ctb Apr 9, 2022
af565f7
update comment
ctb Apr 9, 2022
8dc859b
make lid_to_idx and idx_to_ident private
ctb Apr 9, 2022
6789150
moar comment
ctb Apr 9, 2022
a5bb822
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 9, 2022
e043862
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 9, 2022
a6a6523
add sqlite clases to protocol tests
ctb Apr 9, 2022
ce4b467
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 9, 2022
7fd3a94
adjust protocol
ctb Apr 9, 2022
36cfc4b
update to match protocol
ctb Apr 9, 2022
16caa54
add, then hide, RevIndex test
ctb Apr 9, 2022
0338657
update the LCA_Database protocol
ctb Apr 9, 2022
fb1209e
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 9, 2022
7e0e9a1
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 9, 2022
1ec70a8
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 9, 2022
fee10b0
SqliteCollectionManifest now passes all the tests
ctb Apr 9, 2022
214dcaf
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 9, 2022
e3ff9f0
update row check to ignore _ prefixes
ctb Apr 9, 2022
b7191de
implement remaining lca_db protocol for sqlite
ctb Apr 9, 2022
3139e4c
fix up rankinfo for sqlite LCA_Database
ctb Apr 9, 2022
110c5ea
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 10, 2022
7e2e033
finish testing the rest of the Index classes
ctb Apr 10, 2022
de8b5fb
cleanup
ctb Apr 10, 2022
d1b259e
upd
ctb Apr 10, 2022
f490354
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 10, 2022
dace619
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 10, 2022
08ac110
cleanup LCA_Database creation
ctb Apr 10, 2022
7735cee
backport 08ac110dfad4afb76
ctb Apr 10, 2022
7df30d6
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 10, 2022
fe5ce83
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 10, 2022
a3fea8a
add sqlite loading to CollectionManifest
ctb Apr 10, 2022
10b0ff3
update manifest writing to support SQL, too
ctb Apr 10, 2022
00c98f5
switch to using generic manifest.write_to_filename
ctb Apr 10, 2022
cd84ca6
catch pre-existing sqlite DBs
ctb Apr 10, 2022
7af8555
remove test for now-implemented func
ctb Apr 10, 2022
338eed3
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 10, 2022
a36b197
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 10, 2022
4f8d069
work through various merge implications
ctb Apr 10, 2022
b8da770
switch away from a row tuple in CollectionManifest
ctb Apr 10, 2022
11ef719
more clearly separate internals of LCA_Database from public API
ctb Apr 10, 2022
e8535e4
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 10, 2022
d4259f9
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 10, 2022
c297edd
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 10, 2022
8ab82ee
add saved/loaded manifest
ctb Apr 10, 2022
c422f39
add test coverage for exceptions in LazyLoadedIndex
ctb Apr 11, 2022
daf93d4
add docstrings to manifest code
ctb Apr 11, 2022
11f0add
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 11, 2022
88f8c78
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 11, 2022
2e5bc5d
add docstrings / comments
ctb Apr 11, 2022
b311d36
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 11, 2022
79903b4
Merge branch 'add/sqlite_index' into add/sqlite_index_lca
ctb Apr 11, 2022
d1e67a2
fix sig check reliance on internal manifest mechanism
ctb Apr 11, 2022
32c2f0a
fix picklist stuff when using Sqlite manifests
ctb Apr 11, 2022
1bafd43
add lots of debug stmts
ctb Apr 12, 2022
a279e84
remove SQLite pickset as impractical
ctb Apr 12, 2022
6c176d3
remove some expensive debugs
ctb Apr 12, 2022
66b2c8c
remove sql picklist code as too slow
ctb Apr 12, 2022
ba8928f
comments and cleanup
ctb Apr 12, 2022
50976f7
much cleanup
ctb Apr 12, 2022
e2ff0d7
re-add debug_literal
ctb Apr 12, 2022
745379c
more cleanup
ctb Apr 12, 2022
fcac173
comment
ctb Apr 12, 2022
59dbdf0
fix 'num' select
ctb Apr 12, 2022
5956e11
test and document locations()
ctb Apr 13, 2022
7b39253
use names in namedtuple; add containment test
ctb Apr 13, 2022
5de7d67
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 13, 2022
043e4cb
add numerical values to jaccard order tests
ctb Apr 13, 2022
65920c0
cleanup
ctb Apr 13, 2022
1228f1c
remove redundant tests
ctb Apr 13, 2022
4697cd4
test scaled=1 stuff pretty explicitly
ctb Apr 13, 2022
cfcf6cf
rename 'create_from_manifest' method
ctb Apr 13, 2022
57a65b1
cleanup
ctb Apr 13, 2022
1cb8773
add required_keys check
ctb Apr 13, 2022
12cbb82
check manifest equality only on required keys
ctb Apr 13, 2022
0be189c
add required_keys check
ctb Apr 13, 2022
cfddea8
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 13, 2022
06a9194
add index tests for LCA_SqliteDatabase
ctb Apr 13, 2022
6b55aba
constructor/etc refactoring
ctb Apr 13, 2022
c4e0a93
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 13, 2022
2fc0ca3
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 13, 2022
f4c0207
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 13, 2022
e97387f
add scaled/dowsample test
ctb Apr 14, 2022
f4824b0
add downsample_scaled etc
ctb Apr 14, 2022
8101157
remove unused code
ctb Apr 14, 2022
f07e394
cleanup
ctb Apr 14, 2022
6432315
Merge branch 'latest' into add/index_tests
ctb Apr 14, 2022
c4bd1ac
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 14, 2022
a968749
update comment
ctb Apr 14, 2022
1f531d3
rename tables to have prefix sourmash_
ctb Apr 15, 2022
3e9ed68
update with many a test
ctb Apr 15, 2022
de1417a
fix diagnostic output during sourmash index #1949
ctb Apr 15, 2022
3c109ba
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 15, 2022
ef9a7d9
handle bad versions of stuff
ctb Apr 15, 2022
66b4a2f
update/simplify version checking
ctb Apr 15, 2022
3ef88de
add append test
ctb Apr 15, 2022
8141f63
add notes about further tests
ctb Apr 15, 2022
bfc25e0
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 15, 2022
ceea282
Merge branch 'add/index_tests' into add/sqlite_index
ctb Apr 15, 2022
7a0ceb8
minor comment update
ctb Apr 15, 2022
9b3f72a
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 15, 2022
8004f5d
fix after merge
ctb Apr 15, 2022
06261f5
update table name for lineage db
ctb Apr 16, 2022
780cb30
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 16, 2022
7b171c7
more docs
ctb Apr 16, 2022
0b3f1af
implement loading of LCA_SqliteDatabases at command line
ctb Apr 16, 2022
5411bf0
cleanup and testing
ctb Apr 16, 2022
ea04f8f
start adding some documentation
ctb Apr 16, 2022
22117c4
add location and manifest properties to LCA_SqliteDatabase
ctb Apr 16, 2022
00ee8ba
update
ctb Apr 16, 2022
8973f2f
Merge farm:sourmash into add/sqlite_index
ctb Apr 16, 2022
b36fea7
update index protocol tests to check location, manifest
ctb Apr 16, 2022
c3b6477
add tests for fileinfo on all sql db variants
ctb Apr 17, 2022
b788ac1
add test for signatures_with_location
ctb Apr 17, 2022
b9bab62
upd
ctb Apr 17, 2022
9379355
add test of new-style lineage db file
ctb Apr 18, 2022
6cfe86a
upd/cleanup
ctb Apr 18, 2022
2fd18cd
try out inheritance instead of composition
ctb Apr 18, 2022
4f3ba01
comment
ctb Apr 18, 2022
8086443
more cleanup
ctb Apr 18, 2022
34c0c1e
clean up LCA_SqliteDatabase
ctb Apr 18, 2022
5b181b9
create some more tests...
ctb Apr 18, 2022
6103781
update checklist
ctb Apr 18, 2022
8c6c23c
refactor and cleanup
ctb Apr 19, 2022
0786569
round out the tests a bit
ctb Apr 19, 2022
8806cf0
allow append
ctb Apr 19, 2022
7444858
cleanup, doc
ctb Apr 19, 2022
c35f569
cleanup/simplify
ctb Apr 19, 2022
ee48e08
support picklists in LCA_Database.signatures
ctb Apr 19, 2022
47132ca
fix up @CTB in LCA tests
ctb Apr 19, 2022
7845949
cleanup @CTB in test_cmd_signature
ctb Apr 19, 2022
85cfec5
add tests for picklist support in LCA_database.signatures()
ctb Apr 19, 2022
19e775f
many minor updates
ctb Apr 19, 2022
320e1ed
more tests
ctb Apr 19, 2022
0096c42
add more manifest tests
ctb Apr 19, 2022
9e1ea60
add some final? tests
ctb Apr 19, 2022
f27a059
one final test
ctb Apr 19, 2022
41e5eac
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 20, 2022
8fa6a34
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 20, 2022
c05d044
fix typo via @mr-eyes
ctb Apr 21, 2022
ade801c
remove unnecessary PARSE_DECLTYPES
ctb Apr 22, 2022
14e460a
Merge branch 'latest' into add/sqlite_index
ctb Apr 22, 2022
659a720
Merge branch 'latest' into add/sqlite_index
mr-eyes Apr 23, 2022
192ed1b
Merge branch 'add/sqlite_index' of https://github.com/sourmash-bio/so…
ctb Apr 24, 2022
16b8b9b
add docs for creating sqldb
ctb Apr 24, 2022
c443ae1
do not allow overwrite/append to xisting lca database
ctb Apr 24, 2022
ee3749e
Update src/sourmash/lca/lca_db.py
ctb Apr 24, 2022
23fb8bd
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 25, 2022
52bc90b
fix bug with duplicate lineages in LCA_SqliteDatabase
ctb Apr 25, 2022
edf959b
fix test broken by duplicate lineage fix
ctb Apr 25, 2022
d5ae718
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 25, 2022
9e616dd
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 26, 2022
9 changes: 5 additions & 4 deletions .github/workflows/python.yml
@@ -1,3 +1,4 @@
# note: to invalidate caches, adjust the pip-v? and tox-v? numbers below.
name: Python tests

on:
@@ -35,9 +36,9 @@ jobs:
uses: actions/cache@v3
with:
path: ${{ steps.pip-cache.outputs.dir }}
key: ${{ runner.os }}-pip-${{ hashFiles('**/setup.py') }}
key: ${{ runner.os }}-pip-v2-${{ hashFiles('**/setup.cfg') }}
restore-keys: |
${{ runner.os }}-pip-
${{ runner.os }}-pip-v2-

- name: Install dependencies
run: |
@@ -64,9 +65,9 @@ jobs:
uses: actions/cache@v3
with:
path: .tox/
key: ${{ runner.os }}-tox-${{ hashFiles('**/setup.py') }}
key: ${{ runner.os }}-tox-v2-${{ hashFiles('**/setup.cfg') }}
restore-keys: |
${{ runner.os }}-tox-
${{ runner.os }}-tox-v2-

- name: Test with tox
run: tox
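
The cache-key change above does two things: it adds a `v2` segment to the key, and it hashes `setup.cfg` rather than `setup.py`. A minimal sketch of why either change invalidates old caches, assuming a simplified stand-in for the `hashFiles` expression (the `cache_key` helper and the sample file contents are hypothetical; the real GitHub Actions hashing differs in detail):

```python
import hashlib


def cache_key(os_name: str, tool: str, version: str, file_contents: bytes) -> str:
    """Build a key shaped like 'Linux-pip-v2-<digest>' (hypothetical helper)."""
    digest = hashlib.sha256(file_contents).hexdigest()
    return f"{os_name}-{tool}-{version}-{digest}"


# Same dependency file, different version segment: the keys differ.
old = cache_key("Linux", "pip", "v1", b"install_requires = scipy\n")
new = cache_key("Linux", "pip", "v2", b"install_requires = scipy\n")

# Same version segment, edited dependency file: the keys differ again.
changed = cache_key("Linux", "pip", "v2", b"install_requires = scipy bitstring\n")

# A restore-keys prefix that includes the version segment no longer
# matches caches that were saved under the old prefix.
restore_prefix = "Linux-pip-v2-"
```

Because the version segment is also part of the `restore-keys` prefix in the diff, bumping it should prevent a fallback to caches stored under the previous prefix.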
55 changes: 44 additions & 11 deletions doc/command-line.md
@@ -727,7 +727,7 @@ database. It can be used to combine multiple taxonomies into a single file,
as well as change formats between CSV and sqlite3.

The following command will take in two taxonomy files and combine them into
a single taxonomy sqlite database.
a single taxonomy SQLite database.

```
sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
@@ -931,6 +931,15 @@ As of sourmash 4.2.0, `lca index` supports `--picklist`, to
can be used to index a subset of a large collection, or to
exclude a few signatures from an index being built from a large collection.

As of sourmash 4.4.0, `lca index` can produce an _on disk_ LCA
database using SQLite. To prepare such a database, use
`sourmash lca index ... -F sql`.

All sourmash commands work with either type of LCA database (the
default JSON database, and the SQLite version). SQLite databases are
larger than JSON databases on disk but are typically much faster
to load and search, and use much less memory.

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -1399,6 +1408,14 @@ iterating over the signatures in the input file. This can be slow for
large collections. Use `--no-rebuild-manifest` to load an existing
manifest if it is available.

As of sourmash 4.4.0, `sig manifest` can produce a manifest in a fast
on-disk format (a SQLite database). SQLite manifests can be _much_
faster when working with very large collections of signatures.
To produce a SQLite manifest, use `sourmash sig manifest ... -F sql`.

All sourmash commands that work with manifests will accept both
CSV and SQLite manifest files.

### `sourmash signature check` - compare picklists and manifests

Compare picklists and manifests across databases, and optionally output matches
@@ -1452,7 +1469,7 @@ Briefly,

None of these commands currently support searching, comparing, or indexing
signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
to pick the ksize and moltype to use for your query. Where possible,
scaled values will be made compatible.

### Selecting signatures
@@ -1549,9 +1566,10 @@ In addition to `sig extract`, the following commands support
### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
signatures. `sourmash` supports storing and loading signatures from JSON
files, directories, lists of files, Zip files, and indexed databases.
These can all be used interchangeably for sourmash operations.
signatures. `sourmash` supports storing and loading signatures from
JSON files, directories, lists of files, Zip files, custom indexed
databases, and SQLite databases. These can all be used
interchangeably for most sourmash operations.

The simplest is one signature in a single JSON file. You can also put
many signatures in a single JSON file, either by building them that
@@ -1567,7 +1585,7 @@ signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collection.zip` on the command line.

### Saving signatures, more generally
### Choosing signature output formats

(sourmash v4.1 and later)

@@ -1583,6 +1601,7 @@ This behavior is triggered by the requested output filename --
* to save to gzipped JSON signature files, use `.sig.gz`;
* to save to a Zip file collection, use `.zip`;
* to save signature files to a directory, use a name ending in `/`; the directory will be created if it doesn't exist;
* to save to a SQLite database, use `.sqldb` (as of sourmash v4.4.0).
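
The filename rules listed above can be sketched as a small dispatch function. This is an illustration only, not sourmash's actual implementation: `guess_output_format` and its return labels are hypothetical, and it covers only the rules visible in this hunk plus the JSON fallback.

```python
def guess_output_format(filename: str) -> str:
    """Map an output filename to a format label (hypothetical helper)."""
    if filename.endswith("/"):
        return "directory"      # save signature files under a directory
    if filename.endswith(".sqldb"):
        return "sqlite"         # SQLite database (sourmash v4.4.0 and later)
    if filename.endswith(".zip"):
        return "zip"            # Zip file collection
    if filename.endswith(".sig.gz"):
        return "gzipped-json"   # gzipped JSON signature file
    return "json"               # fallback: plain JSON .sig output


formats = {
    name: guess_output_format(name)
    for name in ["out/", "db.sqldb", "coll.zip", "a.sig.gz", "a.sig", "results.txt"]
}
```

Checking the most specific suffixes first keeps the fallback simple: anything unrecognized is treated as JSON.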

If none of these file extensions is detected, output will be written
in the JSON `.sig` format, either to the provided output filename or
@@ -1614,22 +1633,36 @@ Indexed databases can make searching signatures much faster. SBT
databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.
SQLite databases (new in sourmash v4.4.0) are typically larger on disk
than SBTs and LCAs, but in turn are fast to load and support very low
memory search.
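
To illustrate how a SQLite-backed index can search without loading every sketch into memory, here is a minimal sketch that counts hash overlaps inside the database. The table layout (`hashes`, `hashval`, `sketch_id`) and the `count_overlaps` helper are assumptions made for this example; they are not the schema this PR introduces.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a filename would give an on-disk DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hashes (hashval INTEGER NOT NULL, sketch_id INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX hashes_idx ON hashes (hashval, sketch_id)")

# Three toy sketches, each a list of hash values.
sketches = {1: [10, 20, 30], 2: [20, 30, 40, 50], 3: [60]}
conn.executemany(
    "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
    [(h, sid) for sid, hs in sketches.items() for h in hs],
)


def count_overlaps(conn, query_hashes):
    """Return {sketch_id: number of query hashes present in that sketch}."""
    placeholders = ",".join("?" * len(query_hashes))
    sql = (
        "SELECT sketch_id, COUNT(*) FROM hashes "
        f"WHERE hashval IN ({placeholders}) "
        "GROUP BY sketch_id ORDER BY sketch_id"
    )
    return dict(conn.execute(sql, list(query_hashes)).fetchall())


overlaps = count_overlaps(conn, [20, 30, 40])
```

Only the matching rows are read, so memory use depends on the query rather than on the size of the database. SQLite limits the number of bound parameters per statement, so a large query would need batching or a temporary table.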

(LCA databases also directly permit taxonomic searches using `sourmash lca`
functions.)

Commands that take multiple signatures or collections of signatures
will also work with databases.
will also work with indexed databases.

One limitation of indexed databases is that both SBT and LCA database
can only contain one "type" of signature (one ksize/one moltype at one
scaled value). If the database signature type is incompatible with the
other signatures, sourmash will complain appropriately.
One limitation of indexed databases is that they are all restricted
to certain kinds of signatures. Both SBT and LCA databases can only
contain one "type" of signature (one ksize/one moltype at one scaled
value). SQLite databases can contain multiple ksizes and moltypes, but
only at one scaled value. If the database signature type is
incompatible with the other signatures, sourmash will complain
appropriately.

In contrast, signature files, zip collections, and directory
hierarchies can contain many different types of signatures, and
compatible ones will be selected automatically.

Use the `sourmash index` command to create an SBT.

Use the `sourmash lca index` command to create an LCA database; the
database can be saved in JSON or SQL format with `-F json` or `-F sql`.

Use `sourmash sig cat <list of signatures> -o <output>.sqldb` to create
a SQLite indexed database.

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
1 change: 1 addition & 0 deletions setup.cfg
@@ -43,6 +43,7 @@ install_requires =
scipy
deprecation>=2.0.6
cachetools>=4,<6
bitstring>=3.1.9,<4
python_requires = >=3.8

[bdist_wheel]
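
The commit log mentions using `BitArray` to convert uint to int, and this hunk adds `bitstring` as a dependency. SQLite's INTEGER type is a signed 64-bit value at most, so an unsigned 64-bit hash has to be reinterpreted before it is stored. A minimal sketch of that reinterpretation, assuming hash values are unsigned 64-bit integers and using the standard-library `struct` module in place of `bitstring` (both helper names are hypothetical):

```python
import struct


def to_signed64(u: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed, keeping the same bits."""
    return struct.unpack("<q", struct.pack("<Q", u))[0]


def to_unsigned64(s: int) -> int:
    """Inverse of to_signed64: recover the original unsigned value."""
    return struct.unpack("<Q", struct.pack("<q", s))[0]


largest = 2**64 - 1
stored = to_signed64(largest)       # fits in a SQLite INTEGER column
restored = to_unsigned64(stored)    # round-trips to the original hash
```

The conversion is lossless, but it does not preserve ordering across the sign boundary: values at or above 2**63 become negative, so range comparisons on the stored column need care.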
6 changes: 6 additions & 0 deletions src/sourmash/cli/lca/index.py
@@ -59,6 +59,12 @@ def subparser(subparsers):
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)
subparser.add_argument(
'-F', '--database-format',
help="format of output database; default is 'json'",
default='json',
choices=['json', 'sql'],
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
4 changes: 4 additions & 0 deletions src/sourmash/cli/search.py
@@ -56,6 +56,10 @@ def subparser(subparsers):
'-q', '--quiet', action='store_true',
help='suppress non-error output'
)
subparser.add_argument(
'-d', '--debug', action='store_true',
help='output debug information'
)
subparser.add_argument(
'--threshold', metavar='T', default=0.08, type=float,
help='minimum threshold for reporting matches; default=0.08'
7 changes: 7 additions & 0 deletions src/sourmash/cli/sig/check.py
@@ -55,6 +55,13 @@ def subparser(subparsers):
help='do not require a manifest; generate dynamically if needed',
action='store_true'
)
subparser.add_argument(
'-F', '--manifest-format',
help="format of manifest output file; default is 'csv'",
default='csv',
choices=['csv', 'sql'],
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
7 changes: 6 additions & 1 deletion src/sourmash/cli/sig/manifest.py
@@ -40,7 +40,12 @@ def subparser(subparsers):
'--no-rebuild-manifest', help='use existing manifest if available',
action='store_true'
)

subparser.add_argument(
'-F', '--manifest-format',
help="format of manifest output file; default is 'csv'",
default='csv',
choices=['csv', 'sql'],
)

def main(args):
import sourmash
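
The `-F/--manifest-format` option added in this hunk relies on argparse's `choices` to reject unknown formats. A standalone sketch of that behavior, using a throwaway parser (`manifest-sketch` is a made-up program name) rather than sourmash's real subparser setup:

```python
import argparse

parser = argparse.ArgumentParser(prog="manifest-sketch")
parser.add_argument(
    "-F", "--manifest-format",
    help="format of manifest output file; default is 'csv'",
    default="csv",
    choices=["csv", "sql"],
)

# No flag given: the default applies.
default_args = parser.parse_args([])

# Explicitly request the SQLite format.
sql_args = parser.parse_args(["-F", "sql"])

# Any value outside `choices` makes argparse print an error and
# raise SystemExit, so the command never sees an unsupported format.
```

Validating at the parser level means the command body only has to handle the two known values.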
2 changes: 1 addition & 1 deletion src/sourmash/commands.py
@@ -441,7 +441,7 @@ def search(args):
from .search import (search_databases_with_flat_query,
search_databases_with_abund_query)

set_quiet(args.quiet)
set_quiet(args.quiet, args.debug)
moltype = sourmash_args.calculate_moltype(args)
picklist = sourmash_args.load_picklist(args)
pattern_search = sourmash_args.load_include_exclude_db_patterns(args)
26 changes: 13 additions & 13 deletions src/sourmash/index/__init__.py
@@ -868,6 +868,15 @@ class MultiIndex(Index):
Note: this is an in-memory collection, and does not do lazy loading:
all signatures are loaded upon instantiation and kept in memory.

There are a variety of loading functions:
* `load` takes a list of already-loaded Index objects,
together with a list of their locations.
* `load_from_directory` traverses a directory to load files within.
* `load_from_path` takes an arbitrary pathname and tries to load it
as a directory, or as a .sig file.
* `load_from_pathlist` takes a text file full of pathnames and tries
to load them all.

Concrete class; signatures held in memory; builds and uses manifests.
"""
def __init__(self, manifest, parent, *, prepend_location=False):
@@ -1212,8 +1221,7 @@ def load(cls, location, *, prefix=None):
if not os.path.isfile(location):
raise ValueError(f"provided manifest location '{location}' is not a file")

with open(location, newline='') as fp:
m = CollectionManifest.load_from_csv(fp)
m = CollectionManifest.load_from_filename(location)

if prefix is None:
prefix = os.path.dirname(location)
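
This hunk swaps the explicit CSV loader for `CollectionManifest.load_from_filename(location)`, which suggests the manifest format is now decided from the file itself. How sourmash makes that decision is not shown in this diff. One plausible approach, sketched here as an assumption, is to check for the 16-byte header that every SQLite database file starts with (`looks_like_sqlite` and `manifest_kind` are hypothetical helpers):

```python
# Every SQLite 3 database file begins with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path: str) -> bool:
    """Return True if the file starts with the SQLite 3 header."""
    with open(path, "rb") as fp:
        return fp.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


def manifest_kind(path: str) -> str:
    """Classify a manifest file as 'sql' or 'csv' (hypothetical helper)."""
    return "sql" if looks_like_sqlite(path) else "csv"
```

Sniffing the header avoids depending on the file extension, so a SQLite manifest would still be recognized if it were renamed.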
@@ -1245,20 +1253,12 @@ def _signatures_with_internal(self):
manifest in this class.
"""
# collect all internal locations
iloc_to_rows = defaultdict(list)
for row in self.manifest.rows:
iloc = row['internal_location']
iloc_to_rows[iloc].append(row)

# iterate over internal locations, selecting relevant sigs
for iloc, iloc_rows in iloc_to_rows.items():
# prepend with prefix?
picklist = self.manifest.to_picklist()
for iloc in self.manifest.locations():
# prepend location with prefix?
if not iloc.startswith('/') and self.prefix:
iloc = os.path.join(self.prefix, iloc)

sub_mf = CollectionManifest(iloc_rows)
picklist = sub_mf.to_picklist()

idx = sourmash.load_file_as_index(iloc)
idx = idx.select(picklist=picklist)
for ss in idx.signatures():