Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dependency on ZipFile and remove GC call from read #273

Conversation

TimG1964
Copy link
Contributor

@TimG1964 TimG1964 commented Oct 4, 2024

Replacing ZipFile with ZipArchive has also removed the need for the GC call in read.

Copy link

codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 95.06%. Comparing base (f4767c4) to head (7600fd5).
Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
src/read.jl 90.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #273      +/-   ##
==========================================
+ Coverage   95.02%   95.06%   +0.04%     
==========================================
  Files          15       15              
  Lines        2009     1985      -24     
==========================================
- Hits         1909     1887      -22     
+ Misses        100       98       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TimG1964
Copy link
Contributor Author

TimG1964 commented Oct 4, 2024

I'm not sure I understand the failed test on codecov. If I look at the details, 100% of my new code is covered by the tests. All the code that is not covered was there before this PR. Despite this, overall coverage has gone down and the test has failed!

Copy link
Contributor

@nhz2 nhz2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on how performance can be improved.

src/read.jl Outdated Show resolved Hide resolved
src/read.jl Outdated Show resolved Hide resolved
src/read.jl Outdated
xf.files[filename] = true # set file as read

try
xf.data[filename] = EzXML.readxml(f)
xf.data[filename] = EzXML.parsexml(ZipArchives.zip_readentry(xf.io, f, String))
catch err
@error("Failed to parse internal XML file `$filename`")
rethrow()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole for loop and file_not_found check can be simplified to:

try
    xf.data[filename] = EzXML.parsexml(ZipArchives.zip_readentry(xf.io, filename))
catch err
    @error("Failed to parse internal XML file `$filename`")
    rethrow()
end

Because zip_readentry will take care of looping through the entry names, reading a matching entry, or erroring if the filename isn't found.

Copy link
Contributor Author

@TimG1964 TimG1964 Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for all these suggestions. Where possible, I've accepted them here, as you can see.

Your last suggestion also seems to need the line

xf.files[filename] = true # set file as read

in the try block, otherwise the file is recorded as not present.
I've added this to my fork and will figure out how to include it here.

src/stream.jl Outdated Show resolved Hide resolved
TimG1964 and others added 3 commits October 9, 2024 15:53
This should be faster because it avoids creating a String and there is a check of the uncompressed_size in /~https://github.com/JuliaIO/ZipArchives.jl/blob/f955785e237a0a8b3607cf651eaebc1eb1037b8c/src/reader.jl#L344

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>
zip_openentry can be used here to avoid decompressing the entire entry into memory.

Also, the error on the line after this can be removed with this change.

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>
This should be faster because it avoids allocating all of the entry names at once.

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>
@TimG1964
Copy link
Contributor Author

TimG1964 commented Oct 9, 2024

One reflection here is that my proposed changes have removed the mmap option to read in place. I don't know enough to be sure but I guess this would be a breaking change for some use cases (large files). It wasn't a deliberate omission!

I am working on adding it back in but, unfortunately, It has caused the original problem to arise, namely, cannot write after read:

SystemError: opening file "output_tables.xlsx": Invalid argument

I've spent some time on this now but - for me at least - it seems pretty intractable! :-(

@nhz2
Copy link
Contributor

nhz2 commented Oct 9, 2024

The segfaults are very strange, it seems like there is a bug in EzXML.jl.

I wonder if it would be possible to switch to /~https://github.com/JuliaComputing/XML.jl. This would help fix the multi-threading issues as well.

@TimG1964
Copy link
Contributor Author

TimG1964 commented Oct 9, 2024

Gulp!
Day job permitting I might have a go at this but, so far, the more I dig the more of a pickle I seem to get into!
Don't hold your breath!

@TimG1964
Copy link
Contributor Author

Just found this on the hdf5.jl docs

Note: if you use readmmap on a dataset and subsequently close the file, the array data are still available---and file continues to be in use---until all of the arrays are garbage-collected. 

Might this be something that is going on here?

@TimG1964 TimG1964 closed this Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants