Variable-length chunks #386

Open
TomNicholas opened this issue Nov 12, 2024 · 5 comments
Labels: enhancement ✨ New feature or request · spec 🗒️

Comments

@TomNicholas (Contributor) commented Nov 12, 2024:

I would like to be able to use zarr + virtualizarr + icechunk with variable-length chunks - see
zarr-developers/zarr-specs#138.

I'm thinking about what changes in the stack would be required to get this to work: definitely in zarr-python, but presumably icechunk also needs to be able to return chunks of variable size, and the icechunk spec has to accommodate that generality. And if it's in the icechunk spec, does it also need to be in the zarr spec?
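For concreteness, here is one shape such metadata could take, loosely following the rectilinear-grid idea discussed in zarr-developers/zarr-specs#138. The `chunk_grid` key exists in zarr v3, but the `rectilinear` grid name and `chunk_shapes` field below are hypothetical illustrations, not part of any current spec:

```python
# Hypothetical zarr v3 array metadata for a 2-D array with variable-length
# chunks. The "rectilinear" grid name and "chunk_shapes" key are
# illustrative only -- they are NOT part of the current zarr v3 spec.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 10],
    "data_type": "float64",
    "chunk_grid": {
        "name": "rectilinear",  # hypothetical grid type
        "configuration": {
            # Explicit chunk lengths per axis; each list must sum to the
            # corresponding entry of "shape".
            "chunk_shapes": [
                [10, 20, 30, 40],  # axis 0: four uneven chunks
                [5, 5],            # axis 1: two even chunks
            ]
        },
    },
}

# Sanity check: per-axis chunk lengths must tile the array exactly.
assert all(
    sum(sizes) == extent
    for sizes, extent in zip(
        array_metadata["chunk_grid"]["configuration"]["chunk_shapes"],
        array_metadata["shape"],
    )
)
```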

cc @abarciauskas-bgse @sharkinsspatial

@TomNicholas TomNicholas added enhancement ✨ New feature or request spec 🗒️ labels Nov 12, 2024
@rabernat (Contributor) commented:
I agree that this is a super important topic that we should figure out how to solve. It has been discussed for a long time, and there are various prototypes and ideas out there, but it hasn't really advanced.

I'd encourage us to design variable chunking from the ground up, starting not from the various existing specs and tools but from a set of requirements and first principles. Then we can think about the right way to implement it.

For example, the existing spec conversations about this stalled on the question of scale. Is it feasible to store the chunk sizes in the metadata? That depends: how many chunks do we expect to store? It's fine for 100 chunks, probably not for 100_000_000. Does the solution need to scale to accommodate arbitrarily large arrays? What tradeoffs are we willing to accept? E.g. can we accept increased latency in exchange for variable-length chunks? What about writing? What's the process for updating existing variable-length-chunked datasets? There are many, many more questions we could ask. (@paraseba is very good at enumerating these types of design questions.)
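As a rough illustration of the scale question (my own back-of-envelope numbers, not anything from a spec): if chunk lengths are stored per axis, as in a rectilinear grid, the metadata stays small even for enormous chunk counts, because the total chunk count is the product of the per-axis counts. Fully irregular per-chunk sizes are a different story:

```python
# Back-of-envelope metadata cost, assuming 8 bytes per stored integer.
BYTES_PER_INT = 8

# Rectilinear grid: store one list of chunk lengths per axis.
# A 3-D array with 1000 x 1000 x 100 chunks has 10^8 chunks in total,
# but only 1000 + 1000 + 100 = 2100 integers of metadata.
per_axis_counts = [1000, 1000, 100]
total_chunks = 1
for n in per_axis_counts:
    total_chunks *= n
rectilinear_bytes = sum(per_axis_counts) * BYTES_PER_INT

print(f"{total_chunks:_} chunks, rectilinear metadata: {rectilinear_bytes:_} bytes")
# -> 100_000_000 chunks, rectilinear metadata: 16_800 bytes

# A fully irregular scheme (one size entry per chunk) scales with the
# chunk count instead:
irregular_bytes = total_chunks * BYTES_PER_INT
print(f"fully irregular metadata: {irregular_bytes / 1e9:.1f} GB")
# -> fully irregular metadata: 0.8 GB
```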

In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

@TomNicholas (Contributor, Author) commented:

> Does the solution need to scale to accommodate arbitrarily large arrays?

Yeah, this is the key question, from which everything else should follow.

> In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

I'll start: zarr-developers/VirtualiZarr#217 describes a use case that's representative of the output of many HPC fluid-simulation codes.
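On the read side, variable chunks needn't cost much at lookup time: given per-axis chunk lengths, a reader can map any array index to its chunk in O(log n) with a cumulative-offsets search. A minimal sketch (my own, not taken from any of the linked codebases):

```python
import numpy as np

# Uneven chunk lengths along one axis, e.g. simulation output written at
# irregular checkpoint intervals.
chunk_lengths = np.array([10, 20, 30, 40])

# Cumulative end offset of each chunk: [10, 30, 60, 100].
boundaries = np.cumsum(chunk_lengths)

def chunk_index(i: int) -> int:
    """Return the index of the chunk containing array index i."""
    # side="right": index 10 falls in chunk 1, since chunk 0 covers [0, 10).
    return int(np.searchsorted(boundaries, i, side="right"))

assert chunk_index(0) == 0
assert chunk_index(9) == 0
assert chunk_index(10) == 1
assert chunk_index(99) == 3
```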

@dcherian (Contributor) commented:

Another extremely common one is virtual references to netCDF files containing daily-frequency data, with a single month of data per file: because months have different lengths, the chunks concatenated along the time axis are uneven.
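To make that concrete, here is what the time-axis chunk lengths look like for one year of daily data stored one month per file (standard library only; the specific year is just an example):

```python
import calendar

year = 2024  # a leap year, chosen only as an example

# Days per month = length of each time chunk when concatenating twelve
# one-month netCDF files of daily data.
chunk_lengths = [calendar.monthrange(year, month)[1] for month in range(1, 13)]

print(chunk_lengths)
# -> [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# No single fixed chunk size tiles the year evenly, so a regular chunk
# grid cannot represent this concatenation losslessly.
assert len(set(chunk_lengths)) > 1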

@abarciauskas-bgse (Contributor) commented:

@dcherian I've heard this example before, but I'm not sure which datasets are structured this way - can you provide an example?

@rabernat (Contributor) commented:

> can you provide an example?

Most of these files fit the bill: https://nsf-ncar-era5.s3.amazonaws.com/index.html#e5.oper.an.sfc/

The files are partitioned by month and chunked in such a way that the time-concatenated chunks would be uneven.
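Note that virtual references already record a byte range per chunk, so variable physical chunk sizes are representable today; what's missing is a way to express the uneven logical chunk grid. A hypothetical kerchunk-style manifest fragment (filenames, offsets, and lengths all invented for illustration):

```python
# Hypothetical manifest: each logical time chunk maps to (path, offset,
# length) within a monthly file. The entries naturally have different
# lengths along time because the months differ in size.
manifest = {
    "time/0": ("e5.oper.an.sfc.1979-01.nc", 4096, 31 * 8),  # 31 days
    "time/1": ("e5.oper.an.sfc.1979-02.nc", 4096, 28 * 8),  # 28 days
    "time/2": ("e5.oper.an.sfc.1979-03.nc", 4096, 31 * 8),  # 31 days
}
# A regular zarr chunk grid would force all of these logical chunks to
# share one length along time, which they cannot.
```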
