AssertionError: Found chunk size mismatch #218

Closed
agrigoriev opened this issue Sep 9, 2022 · 16 comments

@agrigoriev

agrigoriev commented Sep 9, 2022

Hello guys!

I'm trying to build a kerchunk index for the full ERA5 dataset.

Everything goes well when I'm merging daily data, but when I try to build a combined index for months with different numbers of days, for example January and February:

e5.oper.an.sfc.128_134_sp.ll025sc.2019010100_2019013123.nc
e5.oper.an.sfc.128_134_sp.ll025sc.2019020100_2019022823.nc

from kerchunk.combine import MultiZarrToZarr
mzz = MultiZarrToZarr(
    singles,
    remote_protocol="s3",
    remote_options={'anon': False},
    concat_dims=["time"],
    identical_dims = ['latitude', 'longitude'],
)

I get the AssertionError: Found chunk size mismatch.

Probably I'm doing something wrong?

@martindurant
Member

@peterm790 , you looked at era5, did you use these specific files? This is probably explicitly in one of your tutorials already.

@agrigoriev , it would be useful to show the xarray text view of the two datasets opened with chunks={}, so we can see the native chunk size.

@peterm790
Collaborator

Hi @agrigoriev, I have a tutorial for the AWS ERA5-pds available here: https://github.com/peterm790/ERA5_Kerchunk_tutorial/blob/master/ERA5_tutorial.ipynb. It covers hourly data in monthly files.

I assume your dataset consists of monthly files of daily means, which would result in mismatched chunks. If that is the case, I am not certain that there is a solution at present, unless the data were chunked as one chunk per time step.

@martindurant
Member

It is certainly correct that kerchunk cannot change the fundamental chunk sizes in the underlying files, and also that zarr requires all the chunks along a given axis to be the same size (except the last). I am hoping to remove this requirement, but that needs agreement within the zarr project itself.
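
To make the constraint concrete, here is a minimal sketch in plain Python. It assumes, hypothetically, that each monthly file stores its whole hourly time axis as a single chunk (the thread does not confirm the actual chunking). `hours_in_month` and `fits_regular_grid` are illustrative names, not kerchunk or zarr functions.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Number of hourly time steps in a calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def fits_regular_grid(chunk_lengths: list[int]) -> bool:
    """True if the chunks could sit on one regular grid along an axis:
    every chunk equals the first, except the last, which may be shorter."""
    if not chunk_lengths:
        return True
    size = chunk_lengths[0]
    *body, last = chunk_lengths
    return all(c == size for c in body) and 0 < last <= size


# ASSUMPTION: one time chunk per monthly file (hypothetical layout).
monthly_chunks = [hours_in_month(2019, m) for m in (1, 2, 3)]
print(monthly_chunks)                     # [744, 672, 744]
print(fits_regular_grid(monthly_chunks))  # False
```

Under that assumption, the shorter February chunk ends up in the middle of the combined time axis, which a regular chunk grid cannot represent.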

@NikosAlexandris

NikosAlexandris commented Oct 27, 2023

Using scripts (see draft work building a CLI tool) I found out that, indeed, some source NetCDF files from DOI:10.5676/EUM_SAF_CM/SARAH/V003, differ in their Block definition (as per GDAL) for example between year 2020 (2600 x 1) and years 2021, 2022 (2600 x 2600) :

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU |grep Block  # First day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined

❯ gdalinfo NETCDF:"SDUds202012310000004231000101MA.nc":SDU |grep Block  # Last day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined

❯ gdalinfo NETCDF:"SDUds2021010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined

❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined

I see nothing different in their structure, and wouldn't expect to:

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds202001010000004231000101MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left  ( -64.9999985,  64.9999985)
Lower Left  ( -64.9999985, -64.9999985)
Upper Right (  64.9999985,  64.9999985)
Lower Right (  64.9999985, -64.9999985)
Center      (   0.0000000,   0.0000000)
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
    Computed Min/Max=0.000,16826.000
0...10...20...30...40...50...60...70...80...90...100 - done.
  256 buckets from -31.2686 to 15978.3:
  467342 40261 52663 52011 46506 31090 27445 25339 23633 23290 22746 23262 24742 21057 19571 19080 18612 18956 18431 19037 19048 17618 17125 16782 17034 16769 16696 16770 16809 16232 15834 15799 15891 15655 15670 15314 14966 15214 15086 15190 14874 15212 14919 15050 14733 14800 15037 14869 15200 14833 15006 15151 15183 14797 15200 14978 15383 15272 15696 15825 15989 15718 15478 15979 15849 16251 16030 16396 16280 16676 16418 16662 16616 17024 17220 16949 17277 17163 17758 17499 17794 17684 18057 17866 18335 18069 18194 18512 18235 19162 18763 19239 19074 19205 18863 19366 19594 20183 19899 20923 20775 21282 21249 21207 21959 22143 22932 22462 23100 23505 24785 24325 24773 24628 25823 26099 26319 26651 27288 27981 27884 29528 29708 31114 31418 30804 31030 32053 32397 32498 34588 34812 34814 33896 35101 36339 37499 36549 37926 36866 38944 40005 40957 44034 42018 43687 44113 46579 49330 50137 50086 53450 54112 54046 59402 60320 60950 73065 78197 72032 66601 56165 53708 53309 57130 48266 49690 45168 37245 30702 26818 26691 25001 25085 24415 23307 22653 22691 23935 23741 22637 21416 20936 20156 19663 18889 18150 16446 16723 15264 14665 13640 13660 12832 13392 12443 10820 10301 9992 10353 10469 9712 9943 10564 8366 7821 8051 7962 8917 7671 8276 7946 8500 7299 7896 7307 3670 2651 2183 1818 1975 1310 701 531 477 457 186 167 132 127 91 73 53 47 47 36 25 24 26 22 22 23 17 21 22 19 29 17 17 22 14 13 12 12 6 87
  NoData Value=-999
  Unit Type: h
  Offset: 0,   Scale:0.00100000004749745

❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds2022010100000042310001I1MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left  ( -64.9999985,  64.9999985)
Lower Left  ( -64.9999985, -64.9999985)
Upper Right (  64.9999985,  64.9999985)
Lower Right (  64.9999985, -64.9999985)
Center      (   0.0000000,   0.0000000)
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined
  Min=0.000 Max=15505.000   Computed Min/Max=0.000,15505.000
  Minimum=0.000, Maximum=15505.000, Mean=6665.472, StdDev=3824.519
  256 buckets from -30.402 to 15535.4:
  365805 32375 43754 45164 42195 28573 24135 22574 22238 21668 21330 21519 22036 19163 18429 17775 17603 17744 16924 17314 18372 16608 16132 15651 15718 15743 15933 16066 15822 16018 15788 15365 15355 15815 15929 16048 16074 15990 15818 16080 15929 16530 16024 16410 16702 16793 16532 16513 16387 16808 16640 16593 16711 16792 16490 16786 16713 16474 16764 16464 17015 17352 17459 17262 17124 17443 16848 17002 17323 17267 17453 17535 17202 17159 17086 17746 17811 17848 17900 17868 17964 18059 17997 18280 18319 18990 18958 19047 18927 18870 19053 19376 19300 19806 19350 19957 20073 20264 20233 20158 20845 21035 21287 21409 21570 21172 21817 22212 22583 22778 22399 23381 23449 23921 24259 24480 25141 25204 25929 26322 26352 27833 27606 27874 28121 28486 29185 30227 29895 30842 30670 31508 31794 31616 32215 32724 32061 33218 34789 36574 36879 38181 37302 37101 38380 38667 41836 44951 45601 44806 44356 45319 47977 48065 50990 52268 51439 54150 57542 61182 67320 74175 70339 76261 65646 51593 54918 50178 50036 47310 44524 39475 32437 27083 28290 28231 26635 26760 27043 27792 26086 26277 26241 26888 26928 24498 22093 22386 21467 21560 21561 21469 20890 18655 17181 17099 16900 16476 15936 15243 13440 11387 10791 11907 12058 11483 10037 9511 9708 9506 9312 9565 8932 6838 5345 4980 4635 3914 4277 3674 3298 2110 1116 939 837 955 1106 1785 1267 948 665 107 82 57 53 52 52 48 21 19 23 17 21 13 14 15 11 3 0 4 4 2 2 0 0 1
  NoData Value=-999
  Unit Type: h
  Offset: 0,   Scale:0.00100000004749745

My understanding is that kerchunk cannot handle this, i.e. cannot edit this attribute. What can/should be done in this case? Edit the NetCDF attributes before kerchunking ? Also curious on why this difference, but I guess I should ask the producer.
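
As a rough illustration of how the block sizes reported above could be compared across many files, the sketch below parses the `Block=` entry from `gdalinfo` text output and groups file names by block shape. It assumes the output has already been captured as strings. `block_shape` and `group_by_block` are hypothetical helper names, not part of GDAL or kerchunk.

```python
import re
from collections import defaultdict

# Matches lines such as "Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined"
BLOCK_RE = re.compile(r"Block=(\d+)x(\d+)")


def block_shape(gdalinfo_text: str) -> tuple[int, int]:
    """Return the block size from the first Block= entry found."""
    match = BLOCK_RE.search(gdalinfo_text)
    if match is None:
        raise ValueError("no Block= entry in gdalinfo output")
    return int(match.group(1)), int(match.group(2))


def group_by_block(reports: dict[str, str]) -> dict[tuple[int, int], list[str]]:
    """Group file names by the block shape found in their gdalinfo output."""
    groups = defaultdict(list)
    for name, text in reports.items():
        groups[block_shape(text)].append(name)
    return dict(groups)


# Output lines copied from the gdalinfo calls in this comment.
reports = {
    "SDUds202001010000004231000101MA.nc": "Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined",
    "SDUds2021010100000042310001I1MA.nc": "Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined",
}
for shape, names in group_by_block(reports).items():
    print(shape, names)
```

More than one key in the result means the files do not share a block layout and would likely trigger the chunk size mismatch when combined.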

@martindurant
Member

@NikosAlexandris , if you would like the challenge, you could try to follow the example in #374 to create a "var-chunk" zarr array, which is based on my PR zarr-developers/zarr-python#1483 and allows working around the current limitations of zarr.

@NikosAlexandris

Also curious on why this difference, but I guess I should ask the producer.

Some communication from the producer :

From: CMSAF Contact <Contact.Cmsaf@dwd.de>
Sent: Tuesday, October 31, 2023 9:27 AM
To: ALEXANDRIS Nikos
Subject: RE: Different band block sizes in SDU, SARAH3 between years

Dear Nikos Alexandris,

Thank you for your valuable hints.

Our way forward would likely be, to first check the status of the chunk-sizes throughout the data records and then to inform the users.

If you need any assistance to work with the data, please let us know.

Best regards,

The CM SAF Operation Team

and

From: CMSAF Contact <Contact.Cmsaf@dwd.de>
Sent: Monday, October 30, 2023 2:08 PM
..

Dear Nikos Alexandris,

Indeed the chunk-sizes of the NetCDF files between the CDR (until 2020) and the ICDR (2021 onwards) are different. Thank you for pointing us towards this. The difference is not intentional. With the tools and tests we ran on the data, this has not shown up. The probable reason for the difference is that the slightly different CDR and ICDR processing environments must have used different default values for the chunk-size at some point.

We apologize for any inconvenience and have noted this to be checked additionally in future releases. We hope that working with the data is still feasible and that the data are useful for your applications.

..

For the record:

From: ALEXANDRIS Nikos <Nikos.Alexandris@ec.europa.eu>
Sent: Friday, October 27, 2023 2:50 PM
To: CMSAF Contact <Contact.Cmsaf@dwd.de>
Subject: Different band block sizes in SDU, SARAH3 between years

Dear DWD

I found out that, indeed, some source NetCDF files from DOI:10.5676/EUM_SAF_CM/SARAH/V003, differ in their Block definition (as per GDAL) for example between year 2020 (2600 x 1) and years 2021, 2022 (2600 x 2600) :

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU |grep Block  # First day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds202012310000004231000101MA.nc":SDU |grep Block  # Last day
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds2021010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined
❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU |grep Block
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined

I see nothing different in their structure, and wouldn't expect to:

❯ gdalinfo NETCDF:"SDUds202001010000004231000101MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds202001010000004231000101MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left  ( -64.9999985,  64.9999985)
Lower Left  ( -64.9999985, -64.9999985)
Upper Right (  64.9999985,  64.9999985)
Lower Right (  64.9999985, -64.9999985)
Center      (   0.0000000,   0.0000000)
Band 1 Block=2600x1 Type=Int16, ColorInterp=Undefined
   Computed Min/Max=0.000,16826.000
0...10...20...30...40...50...60...70...80...90...100 - done.
 256 buckets from -31.2686 to 15978.3:
 467342 40261 52663 .. 12 6 87
 NoData Value=-999
 Unit Type: h
 Offset: 0,   Scale:0.00100000004749745

❯ gdalinfo NETCDF:"SDUds2022010100000042310001I1MA.nc":SDU -nomd -mm -hist
Driver: netCDF/Network Common Data Format
Files: SDUds2022010100000042310001I1MA.nc
Size is 2600, 2600
Origin = (-64.999998473533992,64.999998473533992)
Pixel Size = (0.049999998825795,-0.049999998825795)
Corner Coordinates:
Upper Left  ( -64.9999985,  64.9999985)
Lower Left  ( -64.9999985, -64.9999985)
Upper Right (  64.9999985,  64.9999985)
Lower Right (  64.9999985, -64.9999985)
Center      (   0.0000000,   0.0000000)
Band 1 Block=2600x2600 Type=Int16, ColorInterp=Undefined
 Min=0.000 Max=15505.000   Computed Min/Max=0.000,15505.000
 Minimum=0.000, Maximum=15505.000, Mean=6665.472, StdDev=3824.519
 256 buckets from -30.402 to 15535.4:
 365805 32375 43754 .. 0 0 1
 NoData Value=-999
 Unit Type: h
 Offset: 0,   Scale:0.00100000004749745

Curious about why this difference exists. It poses problems for tools that scan the metadata and rely on the defined “chunk sizes” to then build a continuous time series.
..

and

From: ALEXANDRIS Nikos
Sent: Monday, October 30, 2023 4:57 PM
..

Please have a look at other products too. There is a [1, 2600, 2600] layout (which seems to be a reasonable default) alternating with [1, 1300, 1300] in SIS/SID products.
The data need some re-chunking, which is a costly process, as you well know.

Question : may I share your reply publicly, i.e. in relevant discussions for software that deal with this kind of data?
..

@martindurant
Member

The difference is not intentional

I imagine this happens a lot in practice

@rsignell

rsignell commented Nov 19, 2024

We rechunked ERA5 using rechunker into 18 zarr datasets. Each dataset has 360*24*4 hours of data and 20x20 in lon/lat, except the last dataset, which is shorter. So every chunk is uniform except the last one in the time dimension, which should be fine by Zarr v2, right? Yet we still get this error when using MultiZarrToZarr to combine them. This isn't expected behavior, is it?

_ = MultiZarrToZarr(
        ref_list,
        remote_protocol="abfs",
        remote_options=opts,
        concat_dims=["time"],
        coo_map={"time": "cf:time"},
        identical_dims=identical_dims,
        out=out).translate()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:8

File /anaconda/envs/pangeo-local/lib/python3.12/site-packages/kerchunk/combine.py:604, in MultiZarrToZarr.translate(self, filename, storage_options)
    602     self.store_coords()
    603 if 3 not in self.done:
--> 604     self.second_pass()
    605 if 4 not in self.done:
    606     if self.postprocess is not None:

File /anaconda/envs/pangeo-local/lib/python3.12/site-packages/kerchunk/combine.py:503, in MultiZarrToZarr.second_pass(self)
    501     chunk_sizes[v] = zarray["chunks"]
    502 elif chunk_sizes[v] != zarray["chunks"]:
--> 503     raise ValueError(
    504         f"""Found chunk size mismatch:
    505         at prefix {v} in iteration {i} (file {self._paths[i]})
    506         new chunk: {chunk_sizes[v]}
    507         chunks so far: {zarray["chunks"]}"""
    508     )
    509 chunks = chunk_sizes[v]
    510 zattrs = ujson.loads(m.get(f"{v}/.zattrs", "{}"))

ValueError: Found chunk size mismatch:
                        at prefix 10m_u_component_of_wind in iteration 18 (file None)
                        new chunk: [34560, 20, 20]
                        chunks so far: [17832, 20, 20]

@martindurant
Member

If the last chunk of each input along a concat dimension is truncated, then you have a problem - these would become partial chunks not-at-the-end in the combined dataset. That was why we needed variable chunks. I think vzarr does cope with this situation, though ( @TomNicholas ).

Were you saying that only the last chunk of the last dataset is shorter? Then this ought to be workable, but the logic to get it right may need to be written.
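
The two situations described here can be told apart with a small check. The following is only a sketch of one possible relaxed rule, not kerchunk's actual logic. `can_concatenate` is a hypothetical name, and each input is reduced to its length and chunk size along the concat axis.

```python
def can_concatenate(inputs: list[tuple[int, int]]) -> bool:
    """inputs: (array_length, chunk_size) along the concat axis, in order.

    Accepts the inputs if every array except the last is made of whole
    chunks of one common size, and the last array either uses that same
    chunk size or is a single, shorter chunk.
    """
    if not inputs:
        return True
    common = inputs[0][1]
    *body, (last_len, last_chunk) = inputs
    for length, chunk in body:
        # A truncated chunk here would land in the middle of the result.
        if chunk != common or length % common != 0:
            return False
    if last_chunk == common:
        return True
    # Last array held in one chunk no larger than the common chunk size.
    return last_len <= last_chunk <= common


# 17 full datasets plus one shorter final dataset, as in the traceback.
era5 = [(34560, 34560)] * 17 + [(17832, 17832)]
print(can_concatenate(era5))                                   # True

# A shorter input in the middle cannot be placed on a regular grid.
print(can_concatenate([(744, 744), (672, 672), (744, 744)]))   # False
```

A strict equality test on the declared chunk sizes rejects the first case even though the combined array would still have only one short chunk, at the very end.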

@rsignell

rsignell commented Nov 19, 2024

Yes, only the last chunk (the 18th) is shorter! The other 17 are identical. I would think this would be a very common case -- it's surprising it hasn't been hit before!

@TomNicholas

I think vzarr does cope with this situation, though

No it doesn't, because that would be outside the (current) zarr model. zarr-developers/VirtualiZarr#12 tracks this.

VirtualiZarr will also currently error if you try to concatenate e.g. 3 arrays with chunks (2,), (2,), (1,), even though that would be allowed in zarr's current model. This could be relaxed to accommodate @rsignell 's case though. See the check here

@martindurant
Member

OK, I'll look to make this work within MZZ. I think it's probably only the check that's incorrect.

@NikosAlexandris

We rechunked ERA5 using rechunker into 18 zarr datasets. Each dataset has 360*24*4 hours of data and 20x20 in lon/lat, except the last dataset, which is shorter. So every chunk is uniform except the last one in the time dimension, which should be fine by Zarr v2, right? Yet we still get this error when using MultiZarrToZarr to combine them. This isn't expected behavior, is it?

Hello @rsignell: I am getting back to data re-chunking these days, including global ERA5 datasets (temperature and wind speed). I had paused an effort to identify the "optimal" chunk shape for my use case (on-the-fly reading and processing of location-specific time series) because of other priorities. Sooner or later I will come back to this or other similar threads.

One observation: only nccopy really worked for me to rechunk data in NetCDF files.
One question: why 20 x 20 in space? What is the use case behind this shape?

@rsignell

We picked that chunk size and shape to facilitate climate model bias correction workflows that require having the whole time series in memory. The whole time series can be loaded with 18 of those chunks, which only requires about a gig of RAM.
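
A quick back-of-the-envelope check of that figure, using the chunk shapes from the error message earlier in the thread. The 4-byte value size is an assumption (float32), since the thread does not state the dtype, and the result refers to uncompressed data in memory.

```python
# Time steps across all 18 datasets, taken from the chunk shapes in the
# error message: 17 full chunks of 34560 hours plus a final one of 17832.
time_steps = 17 * 34560 + 17832

# One 20 x 20 spatial tile over the whole time axis.
values = time_steps * 20 * 20

# ASSUMPTION: values are stored as 4-byte floats (float32).
bytes_per_value = 4
total_bytes = values * bytes_per_value

print(time_steps)         # 605352
print(total_bytes / 1e9)  # roughly 0.97 (GB)
```

With 8-byte values the same tile would need about twice as much memory.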

@NikosAlexandris

We picked that chunk size and shape to facilitate climate model bias correction workflows that require having the whole time series in memory. The whole time series can be loaded with 18 of those chunks, which only requires about a gig of RAM.

@rsignell How long does it take to load the whole time series in memory ?

@rsignell

On my 2-core laptop on my home network, it takes 6s.
