Update pyhdf-based arrs to be manually tokenized #2886
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main    #2886      +/-   ##
==========================================
+ Coverage   95.99%   96.05%   +0.06%
==========================================
  Files         368      370       +2
  Lines       53985    54320     +335
==========================================
+ Hits        51821    52177     +356
+ Misses       2164     2143      -21

Flags with carried-forward coverage won't be shown. View the full report in Codecov.
Pull Request Test Coverage Report for Build 10527846171 (Coveralls)
Looks good, thanks a lot for sorting this out. I have a suggestion inline.
LGTM
This avoids a bug in dask or cloudpickle that alters the state of the pyhdf SDS object in some way, making it unusable. The dask PR that started triggering this was dask/dask#11320. My guess is that the repeated pickling/serialization loses some hidden state in the pyhdf SDS object, after which pyhdf or the HDF C library no longer knows how to work with it.
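The suspected failure mode can be sketched with a stand-in class rather than a real pyhdf SDS (the FakeSDS name and its "handle" are illustrative assumptions, not pyhdf API): an object whose usable state lives in an unpicklable handle survives serialization structurally but comes back unusable.

```python
import pickle


class FakeSDS:
    """Stand-in for pyhdf's SDS: wraps a low-level handle (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._handle = object()  # pretend C-level handle; not picklable

    def __getstate__(self):
        # Hidden state (the handle) is silently dropped on serialization
        state = self.__dict__.copy()
        del state["_handle"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # note: _handle never comes back

    def read(self):
        if not hasattr(self, "_handle"):
            raise RuntimeError("object is unusable after pickle round-trip")
        return "data"


sds = FakeSDS("var1")
print(sds.read())  # works before serialization

restored = pickle.loads(pickle.dumps(sds))
try:
    restored.read()
except RuntimeError as err:
    print(err)  # the round-tripped object has lost its handle
```

If dask tokenizes or serializes such an object repeatedly, any later read through the round-tripped copy fails even though the object itself "looks" intact.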
We could register SDS with normalize_token in dask or we could just do what I do in this PR and come up with the token/name ourselves. Note this is the name for the dask array/task. This is similar to work done in the past by @mraspaud for the HDF5 utility package to make sure things are consistent across file variable loads.
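The manual-token idea can be sketched with stdlib hashing (the helper name manual_sds_token and the exact metadata hashed are assumptions for illustration; dask's own dask.base.tokenize serves the same purpose): derive a deterministic dask name from stable metadata like the file path and variable name, instead of letting dask tokenize the SDS object itself.

```python
import hashlib


def manual_sds_token(filename, var_name, shape):
    # Hypothetical helper: build a deterministic token from stable metadata
    # (file path, variable name, shape) rather than from the SDS object.
    key = f"{filename}:{var_name}:{shape}".encode()
    return hashlib.md5(key).hexdigest()


token = manual_sds_token("MOD021KM.A2024.hdf", "EV_1KM_RefSB", (15, 2030, 1354))
name = f"from_sds-{token}"
print(name)
# This name could then be handed to dask, e.g.:
#   arr = da.from_array(wrapped_sds, chunks=..., name=name)
# so repeated loads of the same variable get identical task names.
```

Because the token depends only on metadata, loading the same variable twice yields the same dask name, which is what keeps graphs consistent across file variable loads.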
One alternative to the PR as it is right now would be to copy from_sds as a method on the HDF utility class so it knows how to use self.filename automatically for the tokenizing. Also, I'm realizing maybe src_path shouldn't be required, as it is only used if the name kwarg isn't provided. Thoughts @mraspaud @gerritholl ?
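The optional-src_path idea might look roughly like this (a sketch only: the from_sds signature and the fallback-name scheme are assumptions about how the helper could behave, not satpy's actual code): src_path is consulted only when no explicit name is given.

```python
import hashlib


def _default_name(var_name, src_path):
    # Hypothetical fallback: derive the dask name from src_path + variable name
    token = hashlib.md5(f"{src_path}:{var_name}".encode()).hexdigest()
    return f"from_sds-{token}"


def from_sds(sds_var, src_path=None, name=None):
    """Sketch of the proposed signature: src_path optional, used only
    when no explicit dask name is provided (assumed behavior)."""
    if name is None:
        if src_path is None:
            raise ValueError("src_path is required when name is not given")
        name = _default_name(getattr(sds_var, "name", "unknown"), src_path)
    # Real code would wrap the SDS in a dask array here, e.g.
    #   return da.from_array(sds_var, chunks=..., name=name)
    return name
```

With this shape, callers that already supply a name never need src_path, which matches the observation that src_path only matters for the fallback tokenization.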