-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] add sig fileinfo
and standard manifest-generating functionality
#1837
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1837 +/- ##
==========================================
+ Coverage 81.92% 82.25% +0.33%
==========================================
Files 118 119 +1
Lines 12783 12918 +135
Branches 1705 1727 +22
==========================================
+ Hits 10472 10626 +154
+ Misses 2047 2026 -21
- Partials 264 266 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
note from chatting with @bluegenes : the combinatorial information would be good (this ksize,scaled,moltype => this many sketches) |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
…o add/sig_fileinfo
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Note to all: this adds yaml as a sourmash requirement. I like yaml output, but @luizirber has made the point that YAML is ill-specified elsewhere. Maybe we should make this optional - if you have it, you can use @sourmash-bio/devs ready for initial review! But not merge yet. |
removed pyyaml altogether, as an unnecessary complication. Ready for review! |
sig fileinfo
and standard manifest-generating functionalitysig fileinfo
and standard manifest-generating functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Left a few minor comments/suggestions.
Quite excited about fileinfo, more unified manifest functionality, and manifest-based db sig extraction 🎉
thanks @bluegenes! Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Inspired by the discussion in #1830, this PR adds
sourmash sig fileinfo
for summary information on a path, and uses that functionality as inspiration to add genericget_manifest(...)
internal functionality.Specifically, this PR:
sourmash sig fileinfo
which provides a useful human-readable summary of a path; see below for example output.get_manifest(idx, *, require=True, rebuild=False)
that can be used to get a manifest for any index object.extract
to use manifests, which should be (much) faster for databasesIndex
base class to have__len__
In addition, this PR:
__len__()
and manifest-generating support toLCA_Database
sourmash_args
a bit in the module docstringMultiIndex.location
to work properly in all circumstancesZipFileLinearIndex.__len__
, which speeds it up tremendouslywith_abundance
from a CSVMultiIndex
class with manifest support, instead of just aLinearIndex
; this means that all sourmash CLI functions now use Index classes with manifests__eq__
to manifest objects so thatm1 == m2
is based on content equality, not object equalityFixes #1190
Fixes #1439
Example output
Limitations
sourmash sig extract
will not work when used with--picklist
on certain database types (LCA DBs, SBTs, and zipfiles w/o manifests). This is because we now convert manifests into picklists to do the extraction and these database types do not support multiple picklists. See #1848 for one possible resolution.IMO this is a pretty minor limitation so I decided not to complicate this PR by fixing it. Let me know if you disagree :).
See test
test_sig_extract_8_picklist_md5_lca
for a test that demonstrates this behavior.Questions
sourmash sig summarize
instead of, or in addition to,fileinfo
? I'm not surefileinfo
is the first place I'd look for this functionality, but it is very descriptive 😆TODO
fileinfo
get_manifest(...)
on various class typesfileinfo
on all the index typesMultiIndex.location
fixLCA_Database
use of__len__
and new manifest capabilityZipFileLinearIndex
proper use of manifests + selectextract
on database types that don't support multiple rounds of picklists