
Replace Numpy files with zarr for resampling LUT caching #880

Merged · 18 commits · Sep 6, 2019

Conversation

@pnuu (Member) commented Aug 16, 2019

Another go at making a draft PR for using zarr instead of NumPy for storing resampling LUTs. The first try, made from the wrong branch, is kept for reference in #879.

  • Tests added and test suite added to parent suite
  • Tests passed
  • Passes flake8 satpy
  • Fully documented

@pnuu pnuu requested review from djhoese and mraspaud as code owners August 16, 2019 05:49
@pnuu pnuu changed the title Feature zarr resample luts Replace Numpy files with zarr for resampling LUT caching Aug 16, 2019
@pnuu (Member, Author) commented Aug 16, 2019

And naturally I forgot to select "Draft", and it seems already-opened PRs can't be converted to drafts. Blargh.

@pnuu pnuu self-assigned this Aug 16, 2019
Also, fix flake8 warnings
@pnuu (Member, Author) commented Aug 16, 2019

Having all the resampling LUTs in the same zarr store doesn't affect performance at all, as they end up in separate files on disk in any case. I combined the separate arrays into one zarr store anyway to keep cache_dir cleaner.

Some performance tests (HRIT SEVIRI, times in seconds) are below. The bilinear performance skyrockets in the later runs when the resampling data are computed to NumPy arrays. This is due to how the arrays are used in get_sample_from_bil_info() in pyresample, and needs to be fixed in a separate pyresample PR.

Running this PR also requires pytroll/pyresample#207, which fixes an issue in output index tiling.

| Configuration | first run (s) | cache run (s) | third run (s) |
| --- | --- | --- | --- |
| npz (current master) | 61.2 | 43.7 | 43.8 |
| zarr, multiple files | 60.6 | 41.1 | 41.4 |
| zarr, multiple files, no persist | 95.7 | 41.6 | 41.5 |
| zarr, one file | 61.9 | 41.1 | 41.2 |
| zarr, one file, no persist | 100 | 41.9 | 41.9 |
| zarr, bilinear, no env variables | 134 | 141.4 | 141.2 |
| zarr, bilinear, OMP=1 DASK=4, CHUNKS=1024 | 110.1 | 429.4 | - |
| zarr, bilinear, OMP=1 DASK=4, CHUNKS=1024, cache.compute() | - | 28.7 | 28.2 |
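For reference, a minimal sketch of what caching the LUTs as separate components of a single zarr store can look like with dask. The helper names and the `luts` dict of dask arrays are made up for illustration, not the actual satpy code:

```python
import dask.array as da


def store_luts(luts, zarr_path):
    """Write a dict of dask arrays into one zarr store, one component per LUT."""
    for name, arr in luts.items():
        # Note: each da.to_zarr() call computes its array separately (compute=True
        # by default), so intermediate results shared between the LUTs may be
        # evaluated several times.
        da.to_zarr(arr, zarr_path, component=name, overwrite=True)


def load_luts(names, zarr_path):
    """Read the cached LUTs back lazily as dask arrays."""
    return {name: da.from_zarr(zarr_path, component=name) for name in names}
```

On disk every component still gets its own chunk files inside the one zarr directory, which matches the observation above that a single store mostly just keeps `cache_dir` tidy.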

@djhoese (Member) commented Aug 16, 2019

Correct me if I'm wrong, but since you are running to_zarr three separate times it is still computing them three separate times. If they share any calculations, those calculations will be (unnecessarily) repeated each time to_zarr is executed. Not sure if an xarray Dataset would be the simplest alternative.

@pnuu (Member, Author) commented Aug 16, 2019

Indeed. I didn't find an example of writing several dask arrays in one go, only the way I used. But now that you mention it, I remember seeing something that used xarray.Dataset when I first looked at zarr. I'll have another look at this on Monday.
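For reference, a hedged sketch of the xarray.Dataset route mentioned above, so that all arrays share one dask graph and are written with a single to_zarr() call. Variable and dimension names are illustrative, not the ones used in satpy:

```python
import xarray as xr


def store_luts_as_dataset(luts, zarr_path):
    """Write a dict of dask arrays in one pass via a single Dataset.to_zarr()."""
    data_vars = {
        # Give every array its own dimension names so differing shapes don't clash.
        name: (tuple(f"{name}_dim{i}" for i in range(arr.ndim)), arr)
        for name, arr in luts.items()
    }
    # One compute pass: intermediate results shared between the LUTs are
    # evaluated only once instead of once per to_zarr() call.
    xr.Dataset(data_vars).to_zarr(zarr_path)


def load_luts_lazily(zarr_path):
    """Open the store lazily; the variables come back as dask arrays."""
    ds = xr.open_zarr(zarr_path)
    return {name: ds[name].data for name in ds.data_vars}
```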

@pnuu (Member, Author) commented Aug 19, 2019

When using xr.Dataset.to_zarr() the timings are the following (note that the target area for bilinear is one quarter of the nearest-neighbour one in each dimension):

| Configuration | first run (s) | cache run (s) | third run (s) |
| --- | --- | --- | --- |
| nearest, xr.Dataset.to_zarr | 60.3 | 38.3 | 37.3 |
| nearest, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024 | 49.3 | 32.7 | 33.9 |
| bilinear, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024 | 115.0 | 456.1 | 486.1 |
| bilinear, xr.Dataset.to_zarr, OMP=1 DASK=4, CHUNKS=1024, cache.compute() | 115.0 | 27.8 | 27.1 |

Note that the compute() call in the last test run does not affect the memory footprint.

@pnuu (Member, Author) commented Aug 19, 2019

I'm now forcing the resampling LUTs to be loaded into memory for the bilinear resampler to get usable performance.
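Roughly, forcing the cached LUTs into memory can look like the sketch below; the `cache` dict and the helper name are hypothetical, not the exact change in this PR:

```python
import dask


def load_cache_to_memory(cache):
    """Turn a dict of cached dask arrays into NumPy arrays in one compute pass."""
    names, lazy_arrays = zip(*cache.items())
    # dask.compute() evaluates all arrays together; the repeated indexing in the
    # bilinear resampling then runs on plain NumPy arrays.
    return dict(zip(names, dask.compute(*lazy_arrays)))
```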

@djhoese (Member) commented Aug 19, 2019

Was the nearest test case the same as the previous timings, i.e. those with master? If so, does this mean the zarr implementation is equal or at least similar in timing to the original NumPy one? Any idea what environment variables were set for the original master branch tests?

Those bilinear times are extremely concerning. How is it possible for the 2nd and 3rd runs to take that long just from the cache? The cache should be loading from the zarr store, shouldn't it? Your numbers make it seem like the cache is being computed multiple times, and from the original inputs rather than the cached arrays. Does that sound possible?

@pnuu (Member, Author) commented Aug 19, 2019

Yes, for nearest the baseline tests were run without any environment variables set, like the first one in this new set. So with the same settings zarr seems to be a bit faster.

Multiple compute calls could indeed be the explanation. And now that I look at the code, the BilinearResampler class doesn't implement any in-memory caching. I think I should add that in a similar way to how KDTreeResampler does it with setattr(resampler, idx_name, cache).
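A minimal sketch of that setattr-style in-memory caching, mirroring the KDTreeResampler pattern mentioned above (the helper names are illustrative, not actual satpy functions):

```python
def cache_on_resampler(resampler, luts):
    """Stash the loaded LUT arrays as attributes on the pyresample resampler."""
    for idx_name, cache in luts.items():
        setattr(resampler, idx_name, cache)


def get_cached_luts(resampler, idx_names):
    """Fetch previously stashed LUTs; missing ones come back as None."""
    return {name: getattr(resampler, name, None) for name in idx_names}
```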

@djhoese (Member) commented Aug 19, 2019

Yeah, that could be helpful. I'm not a fan of setting/getting attributes on the pyresample Resampler object, so if you can find a cleaner way to do it 👍.

@pnuu (Member, Author) commented Aug 19, 2019

Hmm. Caching the indices doesn't seem to help much. If the dask arrays are persisted, the cached runs are still slow (343.4 s). So I thought it was the indexing done with dask arrays instead of NumPy arrays that was slow, and computed only valid_input_index and index_array while keeping bilinear_s and bilinear_t as dask arrays. That was still slow. But doing the opposite and computing bilinear_* takes the cached runs down below 30 s. And those two arrays are only used in a multiplication when getting the output pixel values.

I think I need to do some testing with a custom dask scheduler to count the computations for different runs.

@pnuu (Member, Author) commented Aug 20, 2019

And now some results with a custom scheduler that keeps count of how many compute() calls are made in bilinear resampling (a sketch of such a counting scheduler is at the end of this comment):

  • precompute bilinear_?: 97 compute calls, 29.0 s
  • precompute all indices: 97 compute calls, 29.3 s
  • no precomputation of indices: 95 compute calls, 194.3 s

All the runs were made with OMP_NUM_THREADS=1 DASK_NUM_WORKERS=4 PYTROLL_CHUNK_SIZE=1024 using zarr as cache storage.

Things I see from these results:

  1. compared to the earlier similar test, and without any code changes, running without the explicit compute when loading the resampling LUTs is now much faster (see point 4)
  2. the version without any precomputation has the fewest computations, but is still by far the slowest
  3. it doesn't matter for the computation count whether only bilinear_{s,t} or all of the indices are precomputed, even if that is done one-by-one
  4. using a custom scheduler, even with the scheduler='threads' kwarg, ends up using the single-threaded scheduler, which seems to be faster for the "don't precompute anything" case
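For reference, one possible shape for such a compute-counting scheduler; this is an assumption on my part, not necessarily the exact scheduler used for the numbers above:

```python
import dask
import dask.array as da
import dask.threaded

compute_calls = 0


def counting_get(dsk, keys, **kwargs):
    """Count scheduler invocations, then delegate to dask's threaded scheduler."""
    global compute_calls
    compute_calls += 1
    return dask.threaded.get(dsk, keys, **kwargs)


# Route every compute()/persist() through the counting scheduler for a test run.
with dask.config.set(scheduler=counting_get):
    # Stand-in for the actual resampling run.
    da.ones((4, 4), chunks=2).sum().compute()

print("compute calls made:", compute_calls)
```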

@coveralls commented Aug 22, 2019

Coverage Status

Coverage increased (+0.03%) to 84.63% when pulling 365d761 on pnuu:feature-zarr-resample-luts into d90e7f5 on pytroll:master.

@codecov bot commented Sep 3, 2019

Codecov Report

Merging #880 into master will increase coverage by 0.03%.
The diff coverage is 87.96%.


@@            Coverage Diff             @@
##           master     #880      +/-   ##
==========================================
+ Coverage    84.6%   84.63%   +0.03%     
==========================================
  Files         169      171       +2     
  Lines       24946    25086     +140     
==========================================
+ Hits        21105    21231     +126     
- Misses       3841     3855      +14
| Impacted Files | Coverage Δ |
| --- | --- |
| satpy/tests/test_resample.py | 98.21% <100%> (-0.48%) ⬇️ |
| satpy/resample.py | 89.86% <80.3%> (-1.48%) ⬇️ |
| satpy/writers/cf_writer.py | 91.43% <0%> (-0.55%) ⬇️ |
| satpy/scene.py | 90.47% <0%> (-0.18%) ⬇️ |
| satpy/readers/mersi2_l1b.py | 96.38% <0%> (-0.09%) ⬇️ |
| satpy/tests/writer_tests/test_mitiff.py | 97.9% <0%> (-0.07%) ⬇️ |
| satpy/composites/abi.py | 100% <0%> (ø) ⬆️ |
| satpy/readers/hdf4_utils.py | 92.72% <0%> (ø) ⬆️ |
| satpy/tests/enhancement_tests/test_abi.py | 100% <0%> (ø) |
| satpy/enhancements/abi.py | 100% <0%> (ø) |
| ... and 6 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d90e7f5...365d761.

Review thread on this part of the diff:

```python
zarr_out[idx_name] = (coord, fid[idx_name])

# Write indices to Zarr file
zarr_out.to_zarr(fname_zarr)
```
Member:
Should the npz cache files be deleted after the conversion?

Member Author:
I was thinking about that, but was hesitant to do it (automatically) in case the user happens to have other streams still using them.

Member:
OK, fine by me.

Member:
What do you mean by "stream"? If there is an active process using the file then deleting it shouldn't be a problem (it should still be held in memory). I suppose we are assuming that two different versions of satpy may be using the same resampling cache? I guess it would be a problem then.

I'm fine with leaving them there, but I don't think it is necessary.

Member Author:
"Stream" = "processing chain". Someone might have another process running an older Satpy for whatever reason, e.g. for comparison before upgrading an operational chain to the latest versions.

@mraspaud (Member) left a review comment:

LGTM

@mraspaud (Member) commented Sep 3, 2019

@djhoese do you want to review this one?

@djhoese (Member) commented Sep 4, 2019

Short answer: I'm ok merging this as is.

Long answer: @pnuu Could you re-summarize the performance you're seeing? Is it as expected (one overall compute for to_zarr, future resamples much faster)? If the numbers don't make sense, could you provide updated information on timings and on when computations are occurring? At least for nearest neighbour resampling, since it is the default (and I'm not sure I understand everything you're doing in bilinear).

If you make your chunk size really large, how many computations occur? You mentioned ~100 compute calls, but that is because each chunk counts as a separate compute, right? I was thinking one chunk might make things easier to debug. Is the computation only during to_zarr? Is it possible for the user to force a persist of the cached arrays?

Random thought: Is it possible to make to_zarr a task in the dask chain so it isn't saved to disk until it is needed/computed?

@pnuu (Member, Author) commented Sep 4, 2019

Short summary:

  • one to_zarr call (61.9 s) vs. several calls (60.6 s) didn't matter much with nearest
  • the chunk size doesn't affect the number of compute() calls made. Not sure, but this might be due to dask forcing the single-threaded scheduler when things are wrapped in a custom scheduler; this is seen when dask profilers are used: only one thread is in use
  • for bilinear, using zarr makes it possible to use much larger areas without running out of memory
  • for bilinear, the resampling indices need to be computed; slicing with dask arrays produces huge result arrays with shapes of (target_area.size, target_area.size). The data also need to be computed (via accessing .values), as dask doesn't support fancy indexing and vindex is just slow (see the small example below)
  • for bilinear, this PR doesn't include caching the resampler, but that doesn't matter, as the same resampler (and the indices) will be used for all the channels with the same resolution, so no re-computations are needed
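To illustrate the fancy-indexing point in the list above, a toy example with made-up array names (not satpy code):

```python
import dask.array as da
import numpy as np

data = da.random.random((100, 100), chunks=50)
rows = np.array([3, 42, 7])
cols = np.array([0, 99, 17])

# dask: pointwise indexing with several index arrays has to go through .vindex,
# which is comparatively slow.
picked_lazy = data.vindex[rows, cols]

# NumPy: compute once (e.g. via .values on a DataArray), then repeated plain
# fancy indexing is cheap.
data_np = np.asarray(data)
picked_fast = data_np[rows, cols]
```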

@mraspaud (Member) commented Sep 4, 2019

OK, so the speed increase isn't the big deal here, it's the memory usage, right?

@pnuu (Member, Author) commented Sep 4, 2019

Indeed. And zarr should be more cluster-friendly. Not sure if the resamplers themselves are cluster-friendly, but anyway 😁

@djhoese (Member) left a review comment:

Looks good. I suppose the main thing is that this isn't any slower than before.

@pnuu pnuu merged commit ff5ed76 into pytroll:master Sep 6, 2019
@pnuu pnuu deleted the feature-zarr-resample-luts branch September 6, 2019 10:18