Replace Numpy files with zarr for resampling LUT caching #880
Conversation
And naturally I forgot to select "Draft", and it seems that already-created PRs can't be converted to a draft. Blargh.
Also, fix flake8 warnings
Having the resampling LUTs in the same … Some performance tests (HRIT SEVIRI, times in seconds) are below. The bilinear performance sky-rockets for the later runs when the resampling data are computed to Numpy arrays; this is due to how the arrays are used in … To run this PR, a fix for an issue in output index tiling is also required: pytroll/pyresample#207

Tested configurations:
- npz (current master)
- zarr, multiple files
- zarr, multiple files, no persist
- zarr, one file
- zarr, one file, no persist
- zarr, bilinear, no env variables
- zarr, bilinear, OMP=1 DASK=4, CHUNKS=1024
- zarr, bilinear, OMP=1 DASK=4, CHUNKS=1024, cache.compute()
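To make the setup concrete, here is a rough sketch of how one of the configurations above could be run. The reader, channel, area, cache directory, and the mapping of the OMP/DASK/CHUNKS labels to environment variables are my assumptions, not taken from this PR.

```python
# Hypothetical benchmark run; the env variable names are a guess at what the
# "OMP=1 DASK=4, CHUNKS=1024" labels above refer to.
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["DASK_NUM_WORKERS"] = "4"
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"

import time
from glob import glob

from satpy import Scene

scn = Scene(reader="seviri_l1b_hrit", filenames=glob("/data/hrit/*"))
scn.load(["IR_108"])

tic = time.time()
# cache_dir is where the resampling LUTs are written/read on repeated runs
local = scn.resample("euro4", resampler="bilinear", cache_dir="/tmp/sat_cache")
local.save_datasets(base_dir="/tmp")
print("Total time:", time.time() - tic)
```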
Correct me if I'm wrong, but since you are running …
Indeed. I didn't find an example of writing several dask arrays in one go, only the way I used. But now that you mention it, I remember seeing something that used …
When using …

- nearest, xr.Dataset.to_zarr
- nearest, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024
- bilinear, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024
- bilinear, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024, cache.compute()

Note that the …
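For reference, a minimal sketch (with made-up variable names and shapes) of the difference between writing each dask array to its own Zarr store and collecting them into an `xarray.Dataset` so that a single `to_zarr` call writes everything in one go:

```python
import dask.array as da
import xarray as xr

# Made-up stand-ins for the cached resampling LUTs.
valid_input_index = da.random.random((1000,), chunks=500) > 0.5
index_array = da.random.randint(0, 1000, size=(3712, 3712), chunks=1024)

# One store per array: each call computes and writes separately.
da.to_zarr(valid_input_index, "valid_input_index.zarr")
da.to_zarr(index_array, "index_array.zarr")

# One Dataset, one store: a single to_zarr() call writes (and computes)
# all of the arrays in one go.
cache = xr.Dataset(
    {
        "valid_input_index": (("y1",), valid_input_index),
        "index_array": (("y2", "x2"), index_array),
    }
)
cache.to_zarr("resample_lut_cache.zarr")
```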
I'm now forcing the resampling LUTs to be loaded into memory for the bilinear resampler to get usable performance.
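A sketch of what forcing the LUTs into memory can look like when reading the cache back, assuming the cache is opened as dask arrays; the store and variable names follow the illustrative example above, not the PR's actual code.

```python
import dask.array as da
import numpy as np

# Open the cached LUT lazily from the Zarr store...
index_array = da.from_zarr("resample_lut_cache.zarr", component="index_array")
# ...and then load it into memory as a plain Numpy array, because repeatedly
# indexing with a lazy dask array is much slower in the bilinear resampler.
index_array = np.asarray(index_array)
```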
Was the … Those bilinear times are extremely concerning. How is it possible for the 2nd and 3rd run to take that long just from the cache? The cache should be loading from the zarr, shouldn't it? Your numbers make it seem like the cache is being computed multiple times, and it is doing it from the original inputs, not the cached arrays. Does that sound possible?
Yes, for … Multiple …
Yeah, that could be helpful. I'm not a fan of setting/getting the attributes on the pyresample Resampler object, so if you can find a cleaner way to do it 👍.
Hmm. Caching the indices doesn't seem to help much. If the dask arrays are persisted, the cached runs are still slow (343.4 s). So I was thinking it's the indexing done with the dask arrays instead of Numpy arrays that is slow, and computed only … I think I need to do some testing with a custom dask scheduler to count the computations for different runs.
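Counting the computations can be done with a thin wrapper around the scheduler; a sketch assuming the default threaded scheduler is used (this is a general dask trick, not code from this PR):

```python
import dask
import dask.threaded

compute_count = 0

def counting_scheduler(dsk, keys, **kwargs):
    """Delegate to the threaded scheduler while counting compute calls."""
    global compute_count
    compute_count += 1
    return dask.threaded.get(dsk, keys, **kwargs)

with dask.config.set(scheduler=counting_scheduler):
    # run the resampling / saving code here
    ...

print("Number of compute calls:", compute_count)
```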
And now some results with a custom scheduler that keeps count of how many … All the runs were made with … Things I see from these results: …
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #880      +/-   ##
==========================================
+ Coverage    84.6%   84.63%   +0.03%
==========================================
  Files         169      171       +2
  Lines       24946    25086     +140
==========================================
+ Hits        21105    21231     +126
- Misses       3841     3855      +14
```
```python
zarr_out[idx_name] = (coord, fid[idx_name])

# Write indices to Zarr file
zarr_out.to_zarr(fname_zarr)
```
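For context, the hunk above is part of converting an existing `.npz` cache into Zarr; here is a hedged, self-contained sketch of what such a conversion could look like. The file names and the dimension-naming scheme are illustrative, not the PR's actual code.

```python
import numpy as np
import xarray as xr

fname_npz = "resample_lut_nearest.npz"    # old cache written with np.savez
fname_zarr = "resample_lut_nearest.zarr"  # new cache location

zarr_out = xr.Dataset()
with np.load(fname_npz) as fid:
    for idx_name in fid.files:
        # Give each cached index array its own dimension names so arrays of
        # different shapes can coexist in the same Dataset.
        coord = tuple("%s_dim%d" % (idx_name, dim)
                      for dim in range(fid[idx_name].ndim))
        zarr_out[idx_name] = (coord, fid[idx_name])

# Write indices to Zarr file
zarr_out.to_zarr(fname_zarr)
```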
Should the npz cache files be deleted after the conversion?
I was thinking about that, but was hesitant to do it (automatically) in case the user happens to have other streams still using them.
ok, fine by me
What do you mean by "stream"? If there is an active process using the file then deleting it shouldn't be a problem (it should still be held in memory). I suppose we are assuming that two different versions of satpy may be using the same resampling cache? I guess it would be a problem then.
I'm fine with leaving them there, but I don't think it is necessary.
"Stream" = "processing chain". Someone might have another process running an older Satpy for whatever reason, e.g. for comparison reasons before upgrading an operational chain to the latest versions.
LGTM
@djhoese do you want to review this one?
Short answer: I'm ok merging this as is.

Long answer: @pnuu could you re-summarize the performance you're seeing? Is it as expected (1 overall compute for …)? If you make your chunk size really large, how many computations occur? You mentioned ~100 compute calls, but that is because it includes each chunk as a separate compute, right? I was thinking one chunk might make things easier to debug. Is the computation only during …?

Random thought: Is it possible to make …
Short summary:
Ok, so the speed increase isn't the big deal here, it's the memory usage, right?
Indeed. And …
Looks good. I suppose the main thing is that this isn't any slower than before.
Another go at making a draft PR for using Zarr instead of Numpy for storing resampling LUTs. The first try, from the wrong branch, is here for reference: #879
`flake8 satpy`