Variable-length chunks #386

Open
TomNicholas opened this issue Nov 12, 2024 · 5 comments
Labels: enhancement ✨ New feature or request · spec 🗒️

Comments

@TomNicholas (Contributor) commented Nov 12, 2024:

I would like to be able to use zarr + virtualizarr + icechunk with variable-length chunks - see
zarr-developers/zarr-specs#138.

I'm thinking about what changes in the stack would be required to get this to work: definitely in zarr-python, but presumably icechunk also needs to be able to return chunks of variable size, and the icechunk spec has to accommodate that generality. And if it's in the icechunk spec, does it also need to be in the zarr spec?
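For concreteness, here is one shape such metadata could take, loosely following the rectilinear-grid idea discussed in zarr-developers/zarr-specs#138. The `chunk_grid` key exists in zarr v3, but the `rectilinear` grid name and `chunk_shapes` field below are hypothetical illustrations, not part of any current spec:

```python
# Hypothetical zarr v3 array metadata for a 2-D array with variable-length
# chunks. The "rectilinear" grid name and "chunk_shapes" key are
# illustrative only -- they are NOT part of the current zarr v3 spec.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 10],
    "data_type": "float64",
    "chunk_grid": {
        "name": "rectilinear",  # hypothetical grid type
        "configuration": {
            # Explicit chunk lengths per axis; each list must sum to the
            # corresponding entry of "shape".
            "chunk_shapes": [
                [10, 20, 30, 40],  # axis 0: four uneven chunks
                [5, 5],            # axis 1: two even chunks
            ]
        },
    },
}

# Sanity check: per-axis chunk lengths must tile the array exactly.
assert all(
    sum(sizes) == extent
    for sizes, extent in zip(
        array_metadata["chunk_grid"]["configuration"]["chunk_shapes"],
        array_metadata["shape"],
    )
)
```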

cc @abarciauskas-bgse @sharkinsspatial

@TomNicholas TomNicholas added enhancement ✨ New feature or request spec 🗒️ labels Nov 12, 2024
@rabernat (Contributor) commented:
I agree that this is a super important topic that we should figure out how to solve. It has been discussed for a long time, and there are various prototypes and ideas out there, but it hasn't really advanced.

I'd encourage us to design variable chunking from the ground up, starting not from the various existing specs and tools but from a set of requirements and first principles. Then we can think about the right way to implement it.

For example, the existing spec conversations about this stalled on the question of scale. Is it feasible to store the chunk sizes in the metadata? That depends: how many chunks do we expect to store? It's fine for 100 chunks, probably not for 100_000_000. Does the solution need to scale to accommodate arbitrarily large arrays? What tradeoffs are we willing to accept? E.g. can we accept increased latency in exchange for variable-length chunks? What about writing? What's the process for updating existing variable-length-chunked datasets? There are many, many more questions we could ask. (@paraseba is very good at enumerating these types of design questions.)
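As a rough illustration of the scale question (my own back-of-envelope numbers, not anything from a spec): if chunk lengths are stored per axis, as in a rectilinear grid, the metadata stays small even for enormous chunk counts, because the total chunk count is the product of the per-axis counts. Fully irregular per-chunk sizes are a different story:

```python
# Back-of-envelope metadata cost, assuming 8 bytes per stored integer.
BYTES_PER_INT = 8

# Rectilinear grid: store one list of chunk lengths per axis.
# A 3-D array with 1000 x 1000 x 100 chunks has 10^8 chunks in total,
# but only 1000 + 1000 + 100 = 2100 integers of metadata.
per_axis_counts = [1000, 1000, 100]
total_chunks = 1
for n in per_axis_counts:
    total_chunks *= n
rectilinear_bytes = sum(per_axis_counts) * BYTES_PER_INT

print(f"{total_chunks:_} chunks, rectilinear metadata: {rectilinear_bytes:_} bytes")
# -> 100_000_000 chunks, rectilinear metadata: 16_800 bytes

# A fully irregular scheme (one size entry per chunk) scales with the
# chunk count instead:
irregular_bytes = total_chunks * BYTES_PER_INT
print(f"fully irregular metadata: {irregular_bytes / 1e9:.1f} GB")
# -> fully irregular metadata: 0.8 GB
```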

In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

@TomNicholas (Contributor, Author) commented:

> Does the solution need to scale to accommodate arbitrarily large arrays?

Yeah, this is the key question, from which everything else should follow.

> In summary, what I think is needed to move forward is a set of meticulously documented use cases and technical requirements.

I'll start: zarr-developers/VirtualiZarr#217 describes a use case that's representative of the output of many HPC fluid-simulation codes.
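On the read side, variable chunks needn't cost much at lookup time: given per-axis chunk lengths, a reader can map any array index to its chunk in O(log n) with a cumulative-offsets search. A minimal sketch (my own, not taken from any of the linked codebases):

```python
import numpy as np

# Uneven chunk lengths along one axis, e.g. simulation output written at
# irregular checkpoint intervals.
chunk_lengths = np.array([10, 20, 30, 40])

# Cumulative end offset of each chunk: [10, 30, 60, 100].
boundaries = np.cumsum(chunk_lengths)

def chunk_index(i: int) -> int:
    """Return the index of the chunk containing array index i."""
    # side="right": index 10 falls in chunk 1, since chunk 0 covers [0, 10).
    return int(np.searchsorted(boundaries, i, side="right"))

assert chunk_index(0) == 0
assert chunk_index(9) == 0
assert chunk_index(10) == 1
assert chunk_index(99) == 3
```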

@dcherian (Contributor) commented:

Another extremely common one is virtual references to netCDF files containing daily-frequency data, with a single month of data per file: because months have different lengths, the chunks concatenated along the time axis are uneven.
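To make that concrete, here is what the time-axis chunk lengths look like for one year of daily data stored one month per file (standard library only; the specific year is just an example):

```python
import calendar

year = 2024  # a leap year, chosen only as an example

# Days per month = length of each time chunk when concatenating twelve
# one-month netCDF files of daily data.
chunk_lengths = [calendar.monthrange(year, month)[1] for month in range(1, 13)]

print(chunk_lengths)
# -> [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# No single fixed chunk size tiles the year evenly, so a regular chunk
# grid cannot represent this concatenation losslessly.
assert len(set(chunk_lengths)) > 1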

@abarciauskas-bgse (Contributor) commented:

@dcherian I've heard this example before, but I'm not sure which datasets are structured this way - can you provide an example?

@rabernat (Contributor) commented:

> can you provide an example?

Most of these files fit the bill: https://nsf-ncar-era5.s3.amazonaws.com/index.html#e5.oper.an.sfc/

The files are partitioned by month and chunked in such a way that the time-concatenated chunks would be uneven.
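Note that virtual references already record a byte range per chunk, so variable physical chunk sizes are representable today; what's missing is a way to express the uneven logical chunk grid. A hypothetical kerchunk-style manifest fragment (filenames, offsets, and lengths all invented for illustration):

```python
# Hypothetical manifest: each logical time chunk maps to (path, offset,
# length) within a monthly file. The entries naturally have different
# lengths along time because the months differ in size.
manifest = {
    "time/0": ("e5.oper.an.sfc.1979-01.nc", 4096, 31 * 8),  # 31 days
    "time/1": ("e5.oper.an.sfc.1979-02.nc", 4096, 28 * 8),  # 28 days
    "time/2": ("e5.oper.an.sfc.1979-03.nc", 4096, 31 * 8),  # 31 days
}
# A regular zarr chunk grid would force all of these logical chunks to
# share one length along time, which they cannot.
```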
