-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Export all headers from MultiGzDecoder #348
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing!
I am far from qualified to evaluate if headers()
is the best way for accessing all headers encountered when decoding the gz
stream, but definitely see why one would have a need to collect them.
As a summary, I think it's valuable is to retain backwards compatibility, and to make header-collection opt-in entirely to prevent regression in memory consumption.
Thanks
@@ -237,9 +237,9 @@ impl<R: Read> MultiGzDecoder<R> { | |||
} | |||
|
|||
impl<R> MultiGzDecoder<R> { | |||
/// Returns the current header associated with this stream, if it's valid. | |||
pub fn header(&self) -> Option<&GzHeader> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers()
method.
@@ -449,9 +453,13 @@ impl<R: BufRead> MultiGzDecoder<R> { | |||
} | |||
|
|||
impl<R> MultiGzDecoder<R> { | |||
/// Returns the current header associated with this stream, if it's valid | |||
pub fn header(&self) -> Option<&GzHeader> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers()
method.
pub fn header(&self) -> Option<&GzHeader> { | ||
self.0.header() | ||
/// Returns the headers processed so far | ||
pub fn headers(&self) -> Vec<&GzHeader> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be allocation-free by returning impl Iterator<Item = &GzHeader>
with this implementation:
self.0.headers.iter().chain(self.0.header())
@@ -366,9 +369,10 @@ impl<R: BufRead> Read for GzDecoder<R> { | |||
return Err(err); | |||
} | |||
}; | |||
headers.push(header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if performance wouldn't be an issue, these headers contain a few allocations themselves and I can imagine that archives with millions of members would now see a problem with memory consumption where previously they would not have an issue.
In order to prevent regression in that regard, I think this feature must be opt-in, controllable from the MutliGzDecoder
which already sets the multi
flag.
Hi Byron,
I agree on the change to Iterator for headers(). Thanks |
Thanks for explaining. However, I do believe that |
Do you have a use case for getting multiple headers from a multi-gzip file? The use cases I've seen for multi-gzip files use it to represent a single stream broken into chunks for the convenience of compressing separately. In these cases, the metadata from different chunks isn't particularly useful. Of course it is needed for decompressing, but the filename metadata, for example, is usually not meaningfully different between chunks. |
After having seen #301 and after having understood what multi-stream Gzip files are usually about, I see how one could use this format to either encode multiple streams of the same file or different files as individual streams. This is the case this PR addresses by allowing to extract the original filenames. However, what I don't understand is how one would extract and separate the decoded data in this case to turn it into individual files. This makes me think that merely allowing access to all encountered headers is not enough to handle such a case. |
The main uses of multi-gzip files that I know are:
In both cases, the data represents one final document. For the Wikipedia case, there's no reason to extract the sections separately. The use of sections is solely for the convenience of the creating server. For BGZF, you do need to examine the header and then read the data in the section. But the index gets you to the start of the correct section and then you can use the simple GzDecoder to decode the section you need. Even if there was a case of someone using multi-gzip for multiple files, I think repeatedly using the simple GzDecoder would be the way to get the filenames and data in a synchronized manner. |
I think there is a general sentiment here that keeping the current API is already suitable for handling the common scenarios of how multi-stream zip files are typically used, even though I think it wouldn't be super trivial to use The reason I am bringing this up is that an outcome of this PR could be that the documentation is improved or an example is added that shows how to handle the case that this PR attempts to address. Your help with this is definitely appreciated. |
An example code demonstrating how to extract multiple distinct files from a multi-gzip file using |
Unfortunately I have to close this PR as by now it's sufficiently clear that it can't be merged in its current form. In #324 it's made clear that in order to reliably decode a GZ encoded file, one would use the To decode multiple streams where each stream represents a single file with additional meta-data to name the file (or set the path), I agree that the current primitives are hard to use or maybe not usable at all. For that, like @jongiddy suggests, we should start with an example to show how this would be written today, and make improvements from there. I am very much looking forward to such a contribution. Let me state that for me this was an incredible journey of discovery - having recently joined the team of maintainers I can't claim to know too much about the intricacies of the GZ format or the implementation (all I needed this crate for is quite a narrow use-case, after all). However, by now and thanks to this PR, I have a much better idea on what to do with |
Hi and thanks for your crate.
This patch replaces the
header
interface with aheaders
interface in MultiGzDecoder as it's otherwise very hard if not impossible to inspect all the gzip "members" that are processed.It has no impact on GzDecoder itself except for an extra empty vector added to it.
I've kept the diff as compact as possible.