-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Oxidize ZipStorage #1909
[MRG] Oxidize ZipStorage #1909
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1909 +/- ##
==========================================
- Coverage 83.23% 82.93% -0.30%
==========================================
Files 124 125 +1
Lines 13606 13755 +149
Branches 1863 1877 +14
==========================================
+ Hits 11325 11408 +83
- Misses 2014 2075 +61
- Partials 267 272 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
698559c
to
a803f6f
Compare
I was going to start adding more changes to this PR, but probably better to merge this now and improve later, especially because:
Before merging:
ready for review @sourmash-bio/devs |
did a I will see if I can do some benchmarking tomorrow. But overall this PR looks very good, at least on the Python side. |
Notes to maybe add to the top description:
Notes from a review of the Python code:
Everything looks good to me, other than that! I'd appreciate a review of the Rust code tho, @mr-eyes :) One other thought: I'm pretty sure that with manifests this:
will not be a problem :). But in any case the benchmark suggests that it's already not a problem even when there are no manifests 🎉 |
Working on it =] |
No hurry! And please wait to approve until the issues I noted are addressed ;) |
Indeed. I will make a fixture for
So... it is unrelated to manifests. For gory details: piz, the library I'm using for reading zip files, has some lifetime annotations that make it pretty hard to add to a struct ( pub struct ZipStorage<'a> {
mapping: Option<memmap2::Mmap>,
archive: Option<piz::ZipArchive<'a>>,
}
In the For The consequence of that is that you can see a lot of condition checking to see if pub fn list_sbts(&self) -> Result<Vec<String>, Error> {
let sbts_in_archive = |archive: &'_ piz::ZipArchive<'_>| {
// <snip> do something interesting with archive
};
if let Some(archive) = &self.archive {
sbts_in_archive(archive)
} else {
let archive = piz::ZipArchive::new((&self.mapping.as_ref()).unwrap())
.map_err(|_| StorageError::EmptyPathError)?;
sbts_in_archive(&archive)
}
} and that To make things worse, again due to lifetime shenanigans, it is hard to keep the highly useful file tree that What I'm saying: the current implementation is INCREDIBLY inefficient, but it is fixable (probably with (why choose |
I benchmarked Latest
PR
|
self-referential crimes successfully committed with
Can you please rerun this, @mr-eyes ? A note on memory: since this is using |
@luizirber That's quite an improvement in run time!
The memory here is the maximum actual physical memory assigned to the whole process. |
Programming Language :: Python :: 3.8 | ||
Programming Language :: Python :: 3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean sourmash is installable on a py3.10/py3.9 env but sourmash itself does not support >py3.8 in the dev code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tags are inclusive, so it means it is being advertised as supporting 3.8, 3.9, and 3.10.
|
||
let files = storage.filenames()?; | ||
|
||
// FIXME: use the ForeignObject trait, maybe define new method there... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would really appreciate elaborating more on the FIXME/TODO messages, like why? or a little hint :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is pretty much copy-and-pasted from /~https://github.com/sourmash-bio/sourmash/blob/1229dc1d6664004e580e3b80d4c13a8ae12a19be/src/core/src/ffi/signature.rs#L284L292
ForeignObject
is defined here, and collect some common functionality for working with the FFI layer (going from a Rust object to a raw pointer, or from a raw pointer to a Rust object). What is missing there is working with arrays of objects (like this function here, which returns a vector of SourmashStr
).
Most of the decisions/designs in the FFI layer in sourmash comes from symbolic, and you can see other ways to use ForeignObject
over there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely looks so great and amazing to me!
With Wall time: 2:16.25 does that mean that the main performance-related outcome of this PR is to keep wall time the same and increase memory usage by a factor of 7? (That's fine 'cause I understand this PR enables other things, just want to make sure I'm getting it right.) |
Yes. Sort of. (This is also something to keep in mind any time you evaluate benchmarks between software that uses I checked this behavior in Linux, curious how it works on macOS (especially on M1).
There is another approach that keeps the current |
I tested this hypothesis by creating a docker image out of this branch, and running with a limited memory container:
and it finished in about the same time (58s) as other previous tests. RSS outside the container was 1.7GB. Meaning: yes, the OS will avoid purging pages from memory in case they are needed, but if you have the memory it doesn't matter much. Looming question is the interaction of this with job schedulers in HPC clusters, because they might try to kill the job. |
7b55c52
to
1784ffb
Compare
cool - all sounds good to me, if you want to merge ;) |
Expose the (read-only for now) Rust ZipStorage and use it instead of the regular (read-write) ZipStorage.
if writing is needed, fall back to current ZipStorage (as a stopgap while the Rust one doesn't support writing)
This PR also removes Python 3.7 support (which we actually dropped in #1839), and adds Python 3.9 and Python 3.10 to the setup.cfg classifiers.