AssertionError: Found chunk size mismatch #218
@peterm790 , you looked at era5, did you use these specific files? This is probably explicitly in one of your tutorials already. @agrigoriev , it would be useful to show the xarray text view of the two datasets opened with chunks={}, so we can see the native chunk size. |
Hi @agrigoriev I have a tutorial for the AWS ERA5-pds available here: /~https://github.com/peterm790/ERA5_Kerchunk_tutorial/blob/master/ERA5_tutorial.ipynb — it considers hourly data in monthly files. I assume your dataset consists of monthly files of daily means, which would result in mismatched chunks? If that is the case, I am not certain there is a solution at present, unless the data were chunked as one chunk per time step. |
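To see why monthly files of daily means produce mismatched chunks, compare the month lengths: if each file is stored as one chunk along time, the time-axis chunk size differs from file to file. A minimal pure-Python sketch of that reasoning (the chunking layout is assumed, not read from real files):

```python
import calendar

# Assumed layout: one NetCDF file per month of daily means, stored as a
# single chunk along time, so the chunk size equals the days in that month.
chunk_sizes = [calendar.monthrange(2020, m)[1] for m in range(1, 13)]
print(chunk_sizes)  # 2020 is a leap year, so February has 29 days

# Zarr needs all chunks along an axis to be equal (except the last),
# so these per-file chunks cannot be combined as-is...
uniform = len(set(chunk_sizes[:-1])) == 1
print(uniform)

# ...whereas one chunk per time step (size 1) is trivially uniform.
per_step = [1] * sum(chunk_sizes)
print(len(set(per_step)) == 1)
```

The trade-off is that one chunk per time step means one small read per day, which is usually slow over object storage.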
It is certainly correct that kerchunk cannot change the fundamental chunk sizes in the underlying files, and also that zarr requires all the chunks along a given axis to be the same size (except the last). I am hoping to remove this requirement, but that needs agreement within the zarr project itself. |
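The zarr requirement described above can be stated as a small predicate: along each axis, every chunk must have the common size except possibly the final one. A hedged sketch of such a check (illustrative only, not kerchunk's or zarr's actual code):

```python
def chunks_are_zarr_valid(chunk_sizes):
    """True if all chunks along one axis share a common size,
    except that the final chunk may be smaller (the Zarr v2 model)."""
    if len(chunk_sizes) <= 1:
        return True
    common = chunk_sizes[0]
    body_ok = all(c == common for c in chunk_sizes[:-1])
    return body_ok and chunk_sizes[-1] <= common

print(chunks_are_zarr_valid([24, 24, 24, 7]))   # True: only the last is short
print(chunks_are_zarr_valid([31, 28, 31, 30]))  # False: an interior chunk differs
```

The second case is exactly the monthly-files situation: a 28-day February sitting between 31-day months violates the uniformity rule.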
Using scripts (see draft work building a CLI tool) I found out that, indeed, some source NetCDF files from DOI:10.5676/EUM_SAF_CM/SARAH/V003 differ in their Block definition (as reported by GDAL), for example between year 2020 and later years:

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU |grep Block # First day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds202012310000004231000101MA.nc":SDU |grep Block # Last day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds2021010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined

Beyond the block size, I see nothing different in their structure, and wouldn't expect to:

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds202001010000004231000101MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left ( -64.9999985, 64.9999985)
Lower Left ( -64.9999985, -64.9999985)
Upper Right ( 64.9999985, 64.9999985)
Lower Right ( 64.9999985, -64.9999985)
Center ( 0.0000000, 0.0000000)
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
Computed Min/Max=0.000,16826.000
0...10...20...30...40...50...60...70...80...90...100 - done.
256 buckets from -31.2686 to 15978.3:
467342 40261 52663 52011 46506 31090 27445 25339 23633 23290 22746 23262 24742 21057 19571 19080 18612 18956 18431 19037 19048 17618 17125 16782 17034 16769 16696 16770 16809 16232 15834 15799 15891 15655 15670 15314 14966 15214 15086 15190 14874 15212 14919 15050 14733 14800 15037 14869 15200 14833 15006 15151 15183 14797 15200 14978 15383 15272 15696 15825 15989 15718 15478 15979 15849 16251 16030 16396 16280 16676 16418 16662 16616 17024 17220 16949 17277 17163 17758 17499 17794 17684 18057 17866 18335 18069 18194 18512 18235 19162 18763 19239 19074 19205 18863 19366 19594 20183 19899 20923 20775 21282 21249 21207 21959 22143 22932 22462 23100 23505 24785 24325 24773 24628 25823 26099 26319 26651 27288 27981 27884 29528 29708 31114 31418 30804 31030 32053 32397 32498 34588 34812 34814 33896 35101 36339 37499 36549 37926 36866 38944 40005 40957 44034 42018 43687 44113 46579 49330 50137 50086 53450 54112 54046 59402 60320 60950 73065 78197 72032 66601 56165 53708 53309 57130 48266 49690 45168 37245 30702 26818 26691 25001 25085 24415 23307 22653 22691 23935 23741 22637 21416 20936 20156 19663 18889 18150 16446 16723 15264 14665 13640 13660 12832 13392 12443 10820 10301 9992 10353 10469 9712 9943 10564 8366 7821 8051 7962 8917 7671 8276 7946 8500 7299 7896 7307 3670 2651 2183 1818 1975 1310 701 531 477 457 186 167 132 127 91 73 53 47 47 36 25 24 26 22 22 23 17 21 22 19 29 17 17 22 14 13 12 12 6 87
NoData Value=-999
Unit Type: h
Offset: 0, Scale:0.00100000004749745
❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds2022010100000042310001I1MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left ( -64.9999985, 64.9999985)
Lower Left ( -64.9999985, -64.9999985)
Upper Right ( 64.9999985, 64.9999985)
Lower Right ( 64.9999985, -64.9999985)
Center ( 0.0000000, 0.0000000)
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined
Min=0.000 Max=15505.000 Computed Min/Max=0.000,15505.000
Minimum=0.000, Maximum=15505.000, Mean=6665.472, StdDev=3824.519
256 buckets from -30.402 to 15535.4:
365805 32375 43754 45164 42195 28573 24135 22574 22238 21668 21330 21519 22036 19163 18429 17775 17603 17744 16924 17314 18372 16608 16132 15651 15718 15743 15933 16066 15822 16018 15788 15365 15355 15815 15929 16048 16074 15990 15818 16080 15929 16530 16024 16410 16702 16793 16532 16513 16387 16808 16640 16593 16711 16792 16490 16786 16713 16474 16764 16464 17015 17352 17459 17262 17124 17443 16848 17002 17323 17267 17453 17535 17202 17159 17086 17746 17811 17848 17900 17868 17964 18059 17997 18280 18319 18990 18958 19047 18927 18870 19053 19376 19300 19806 19350 19957 20073 20264 20233 20158 20845 21035 21287 21409 21570 21172 21817 22212 22583 22778 22399 23381 23449 23921 24259 24480 25141 25204 25929 26322 26352 27833 27606 27874 28121 28486 29185 30227 29895 30842 30670 31508 31794 31616 32215 32724 32061 33218 34789 36574 36879 38181 37302 37101 38380 38667 41836 44951 45601 44806 44356 45319 47977 48065 50990 52268 51439 54150 57542 61182 67320 74175 70339 76261 65646 51593 54918 50178 50036 47310 44524 39475 32437 27083 28290 28231 26635 26760 27043 27792 26086 26277 26241 26888 26928 24498 22093 22386 21467 21560 21561 21469 20890 18655 17181 17099 16900 16476 15936 15243 13440 11387 10791 11907 12058 11483 10037 9511 9708 9506 9312 9565 8932 6838 5345 4980 4635 3914 4277 3674 3298 2110 1116 939 837 955 1106 1785 1267 948 665 107 82 57 53 52 52 48 21 19 23 17 21 13 14 15 11 3 0 4 4 2 2 0 0 1
NoData Value=-999
Unit Type: h
Offset: 0, Scale:0.00100000004749745

My understanding is that kerchunk cannot handle this, i.e. it cannot edit this attribute. What can or should be done in this case? Edit the NetCDF attributes before kerchunking? I am also curious why this difference exists, but I guess I should ask the producer. |
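One pragmatic workaround in a situation like this is to detect each file's native block shape up front and only combine homogeneous groups. A sketch of the grouping step, assuming the block shapes have already been extracted (e.g. parsed from the gdalinfo output above); the (file, shape) pairs mirror the files listed earlier:

```python
from collections import defaultdict

# Block shapes as reported by gdalinfo for the files discussed above.
blocks = [
    ("SDUds202001010000004231000101MA.nc", (2600, 1)),
    ("SDUds202012310000004231000101MA.nc", (2600, 1)),
    ("SDUds2021010100000042310001I1MA.nc", (2600, 2600)),
    ("SDUds2022010100000042310001I1MA.nc", (2600, 2600)),
]

# Partition files by native chunking; each group is internally consistent
# and could be passed to a separate kerchunk combine step.
groups = defaultdict(list)
for name, shape in blocks:
    groups[shape].append(name)

for shape, names in sorted(groups.items()):
    print(shape, len(names))
```

This does not merge the two eras into one reference set, but it at least yields per-era datasets that satisfy the uniform-chunk rule.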
@NikosAlexandris , if you would like the challenge, you could try to follow the example in #374 to create a "var-chunk" zarr array, which is based on my PR zarr-developers/zarr-python#1483 and allows working around the current limitations of zarr. |
Some communication from the producer:

[attachments not preserved]

For the records:

[attachments not preserved] |
I imagine this happens a lot in practice |
We rechunked ERA5 using rechunker into 18 zarr datasets. Each dataset covers 360244 hours of data and is 20x20 in lon/lat, except the last dataset, which is shorter. So every chunk is uniform except the last one in the time dimension, which should be fine by Zarr v2, right? Yet we still get this error when using
|
If the last chunk of each input along a concat dimension is truncated, then you have a problem - these would become partial chunks not-at-the-end in the combined dataset. That was why we needed variable chunks. I think vzarr does cope with this situation, though ( @TomNicholas ). Were you saying that only the last chunk of the last dataset is shorter? Then this ought to be workable, but the logic to get it right may need to be written. |
Yes, only the last chunk (the 18th) is shorter! The other 17 are identical. I would think this would be a very common case -- it's surprising it hasn't been hit before! |
No it doesn't, because that would be outside the (current) zarr model. zarr-developers/VirtualiZarr#12 tracks this. VirtualiZarr will also currently error if you try to concatenate e.g. 3 arrays with chunks |
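The constraint being discussed can be sketched as a predicate over the inputs' chunks along the concatenation axis: every chunk except the very last one in the combined result must have the common size, so only the final input may end with a short chunk. A rough illustration (my own sketch, not VirtualiZarr's or MZZ's actual logic):

```python
def concat_chunks_valid(per_input_chunks):
    """per_input_chunks: one list of chunk sizes per input array,
    in concatenation order along the shared axis. Valid under the
    Zarr v2 model iff only the overall final chunk may be short."""
    flat = [c for chunks in per_input_chunks for c in chunks]
    if len(flat) <= 1:
        return True
    common = flat[0]
    return all(c == common for c in flat[:-1]) and flat[-1] <= common

# 17 full datasets plus one shorter final one: valid.
print(concat_chunks_valid([[100] * 5] * 17 + [[100, 100, 40]]))  # True

# A truncated last chunk on a non-final input becomes a short chunk
# in the middle of the combined array: invalid.
print(concat_chunks_valid([[100, 100, 60], [100, 100, 100]]))  # False
```

The 18-dataset case above falls in the first, valid category, which is why it ought to be workable once the combine logic distinguishes the two situations.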
OK, I'll look to make this work within MZZ. I think it's probably only the check that's incorrect. |
Hello @rsignell : I am restarting work on data re-chunking these days, including global ERA5 datasets (temperature and wind speed). I paused, midway, an effort to identify the "optimal" chunking shape for my use case (on-the-fly reading and processing of location-specific time series) due to other priorities. Sooner or later I will come back to this or other similar threads. One observation : only |
We picked that chunk size and shape to facilitate climate model bias correction workflows that require having the whole time series in memory. The whole time series can be loaded with 18 of those chunks, which only requires about a gig of RAM. |
@rsignell How long does it take to load the whole time series into memory? |
On my 2-core laptop on my home network, it takes 6s. |
Hello guys!
I'm trying to build a kerchunk index for the full ERA5 dataset.
Everything goes well when I'm merging daily data, but when I try to build a combined index for months with different numbers of days, like January and February for example:
e5.oper.an.sfc.128_134_sp.ll025sc.2019010100_2019013123.nc
e5.oper.an.sfc.128_134_sp.ll025sc.2019020100_2019022823.nc
I get the AssertionError: Found chunk size mismatch.
Probably I'm doing something wrong?
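Nothing is necessarily wrong: with hourly ERA5 stored one chunk per monthly file, the time-axis chunk sizes necessarily differ between months, which is exactly what the assertion rejects. A small illustration for the two files named above:

```python
import calendar

# Hours covered by each monthly file (2019, so February has 28 days).
hours_jan = calendar.monthrange(2019, 1)[1] * 24
hours_feb = calendar.monthrange(2019, 2)[1] * 24
print(hours_jan, hours_feb)  # 744 672

# If each monthly file is one chunk along time, an interior chunk
# (January) differs from the next (February): the combined chunks
# are not uniform, hence "Found chunk size mismatch".
print(hours_jan == hours_feb)  # False
```

Whether this applies to your files depends on their internal HDF5 chunking; opening each one with `chunks={}` in xarray, as suggested earlier in this thread, will show the native chunk sizes.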