Add functionality that directly writes variables to a temporary zarr store #774
Conversation
…ys directly into the EchoData object, and the temp zarr directory persists until the EchoData object is completely destroyed
…begin documenting parsed_to_zarr, and add the padding of range_sample in get_np_chunk
… modify the inputs of these functions, and begin working on set_groups_ek80 for straight to zarr
…imensional arrays
…cally determines if large variables should be written to a temporary zarr store
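The padding mentioned in the commits above suggests that arrays with ragged lengths along range_sample are equalized before being stacked into a chunk. A hypothetical numpy sketch of that idea (the helper name and NaN fill value are assumptions, not the PR's actual get_np_chunk code):

```python
import numpy as np

def pad_range_sample(arrays, fill_value=np.nan):
    """Pad float arrays along their last (range_sample) dimension so they
    all match the longest one, then stack them into a single block that
    can be written as one zarr chunk. Assumes all other dimensions
    already agree across arrays."""
    max_len = max(a.shape[-1] for a in arrays)
    padded = [
        np.pad(
            a,
            [(0, 0)] * (a.ndim - 1) + [(0, max_len - a.shape[-1])],
            constant_values=fill_value,  # NaN marks the padded samples
        )
        for a in arrays
    ]
    return np.stack(padded)
```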
@lsetiawan please note that this PR is currently in a draft state. Two tests are failing at the moment, and I have not yet implemented the unit tests for the above work.
@b-reyes could you resolve the branch conflict that's happening? Thanks!
…ucture for direct to zarr unit tests, run pre-commit on all files
@lsetiawan this PR is ready for review. Offline, @leewujung and I discussed this and agreed that we should delay unit tests until after this release. We will consider the functionality introduced in this PR to be in beta until we implement those unit tests.
Codecov Report
@@            Coverage Diff             @@
##              dev     #774      +/-   ##
===========================================
- Coverage   82.16%   64.13%   -18.04%
===========================================
  Files          48       52        +4
  Lines        4244     4731      +487
===========================================
- Hits         3487     3034      -453
- Misses        757     1697      +940
Flags with carried forward coverage won't be shown.
Alright @b-reyes, I've run through the testing. You can find my line profiling results here: https://gist.github.com/lsetiawan/de779bf1e48c22c55b3f0616623d5225. Your offloading to zarr definitely works great, and I see that temp files don't stick around, at least when restarting after each run, so we should probably add another test for parallel reads later on. I don't think the 'auto' option worked for me; it seems to always choose to offload to zarr. Other than that, I really enjoyed reviewing this and I think it's probably good to go. Seems like all checks passed. Since the memory consumption is low enough for the OOI and NOAA files, and you said they're in Google Drive, could you create a quick test for these?
Yes, I thought that there might be issues with this, as it is "heuristically" determined based only on my computer. As it stands right now, the choice of whether to offload to zarr depends on the percentage of memory that would be consumed if the variables were expanded. The other idea I had (which may be more appropriate) is to write out those variables that, when expanded, exceed the user input value.
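As a rough illustration of this kind of memory-based heuristic (the helper name and threshold below are hypothetical, not the PR's actual implementation):

```python
import psutil

def should_offload_to_zarr(expanded_nbytes: int, threshold: float = 0.4) -> bool:
    """Decide whether large variables should be written to a temporary
    zarr store, based on the fraction of currently available system
    memory the fully expanded variables would consume.

    The 0.4 threshold is an illustrative placeholder, not a value
    taken from the PR.
    """
    available = psutil.virtual_memory().available  # bytes currently free
    return expanded_nbytes / available > threshold
```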
@leewujung asked that I hold off on this; please see the comment above. Right now the CI will not crash because of the default value for this option.
Hmm... I thought this was only for a chunk of the data and not the whole dataset. But I guess that would work. I think at this point, let's hold off on improving this so we can get this beta feature out. Can you turn off the 'auto' option for now, and not give anyone that option for this release, until we work it out more? That would probably be better than someone getting confused about why it's not working. Thanks!
Okay. And you're not using any of the memory test files in the current tests, right? I guess for now we've manually tested it and we know it can work, so it's probably okay to hold off.
I agree. For now, it is probably better to turn this option off.
Correct, I did put the 95 MB file in the Google Drive, but I am never interacting with it.
Add simple test for noaa file
Yes, I did ask @b-reyes to hold off on the tests. IMHO the 95 MB file is too large; what we need are tests that unit test the functions the convert mechanisms use, instead of actually converting the large files.
I added a simple test anyway, since Brandon put it in the Google Drive 😛 haha, hope it works!
Co-authored-by: Don Setiawan <landungs@uw.edu>
All looks good to me. Thanks @b-reyes for working on this and implementing changes as I reviewed. I have tested out the functionality manually and added a small CI test with the problematic NOAA data. The result is great! With `offload_to_zarr=True`, memory expansion only happened minimally! Everything seems to work as expected at this point, and it's ready for merging with a beta stamp. 😄
I'll wait for CI to finish to merge this to dev.
This PR addresses issues #489, #407, and #408. The primary concern with these issues is that large memory expansion occurs when data is converted and organized as an EchoData object. To combat this, this PR implements the following items (a usage sketch follows the list):

- Added `offload_to_zarr` to `open_raw`, which can be one of: True, False, 'auto'.
- Added `max_zarr_mb` to `open_raw`, which is the maximum MB that each zarr chunk should hold when offloading variables with a large memory footprint to a temporary zarr store.
- Supporting changes in `set_groups`.

Note: the above items were only implemented for the echosounders EK60, ES70, EK80, ES80, and EA640.
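A minimal usage sketch of the new keywords, assuming a typical `open_raw` call (the file path, sonar model, and `max_zarr_mb` value here are placeholders):

```python
import echopype as ep

# Offload large variables to a temporary zarr store while parsing,
# capping each temporary zarr chunk at roughly 100 MB.
ed = ep.open_raw(
    "path/to/file.raw",    # placeholder path
    sonar_model="EK60",    # one of the supported echosounders
    offload_to_zarr=True,  # True, False, or 'auto'
    max_zarr_mb=100,       # assumed value; maximum MB per zarr chunk
)

# Write the resulting EchoData object out as a zarr file.
ed.to_zarr("file_converted.zarr")
```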
In #489, a 725 MB OOI file and a 95 MB file (shared by @oftfrfbf) were provided. Below we provide the new memory profiling results for opening/parsing the files using `open_raw` and then writing the EchoData object to a zarr file using `ed.to_zarr()`. These results were obtained on a MacBook Pro with an Apple M1 Pro chip and 16 GB of RAM.

[memory profiling plot]

From the results above, we see that we are able to successfully open the raw file and then write the EchoData object to a zarr file. The dip in memory usage around 28 seconds signifies that `open_raw` has completed, and the subsequent spike in memory usage is caused by `ed.to_zarr()`.

[memory profiling plot]

In comparison to @oftfrfbf's comment, we see from the above results that we are no longer getting a substantial increase in memory usage when creating the EchoData object, that writing the data to zarr around 8 seconds maintains a reasonable memory usage, and that the process is significantly faster.
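The profiling harness itself is not shown in the thread; a minimal sketch along these lines, using the memory_profiler package's `mprof` command and placeholder file paths, would produce the kind of memory-over-time trace described above:

```python
# profile_convert.py
# Run with:  mprof run profile_convert.py  &&  mprof plot
import echopype as ep

# Parse the raw file, offloading large variables to a temporary zarr store.
ed = ep.open_raw(
    "path/to/ooi_file.raw",  # placeholder for the 725 MB OOI file
    sonar_model="EK60",
    offload_to_zarr=True,
)

# Write the EchoData object to zarr; this is the second memory spike.
ed.to_zarr("ooi_file_converted.zarr")
```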