-
@martindurant, thinking about construction API: Can we start with an array of regular chunks, and then append new chunks of different sizes? What's the API here?

Potential usage:

    a = zarr.ones(5, chunks=3)
    a.chunks
    # (3,)

Current behavior:

    a.append(zarr.ones(3, chunks=2))
    a.chunks
    # (3,)

Instead, we could have:

    a.append(zarr.ones(3, chunks=2), variable_chunks=True)
    a.chunks
    # ((3, 2, 2, 1),)

^ This gets a little weird due to the misalignment of regular chunk boundaries in each input array. We could also say this is out of scope for the ZEP, which will only address creation.
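To make the chunk arithmetic in that last example concrete, here is a minimal pure-Python sketch. It does not call zarr at all; `regular_chunks` and `appended_chunks` are hypothetical helper names, and `variable_chunks=True` above is a proposed keyword, not an existing zarr argument.

```python
def regular_chunks(length, chunk_size):
    """Chunk lengths of a regularly chunked 1-D array (the last chunk may be short)."""
    full, remainder = divmod(length, chunk_size)
    return (chunk_size,) * full + ((remainder,) if remainder else ())


def appended_chunks(a_chunks, b_chunks):
    """Hypothetical result of appending b to a while keeping both chunkings as-is."""
    return tuple(a_chunks) + tuple(b_chunks)


a_chunks = regular_chunks(5, 3)  # (3, 2): the trailing partial chunk holds 2 elements
b_chunks = regular_chunks(3, 2)  # (2, 1)
result = appended_chunks(a_chunks, b_chunks)
print(result)  # (3, 2, 2, 1)
```

The misalignment mentioned above shows up as the short chunks in the middle of the result: the partial chunk of the first array is kept rather than being topped up with data from the second.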
-
Thanks for providing this ZEP. I think support for variable-length chunks would make Zarr more useful for many new applications. I wonder about the scalability of the proposed design. If you have a large number of chunks (in the variable-sized dimension), they would all need to be listed in the metadata, correct? Upon creation the metadata would need to be synchronized, which could limit parallel writes.
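A rough way to see the scaling concern is to measure how large an explicit chunk listing becomes as the number of chunks grows. The sketch below assumes a JSON layout like the `"rectangular"` example quoted elsewhere in this thread; `chunk_metadata_size` is a hypothetical helper, and the chunk length of 1000 is an arbitrary illustration.

```python
import json


def chunk_metadata_size(n_chunks, chunk_len=1000):
    """Bytes of JSON needed to list n_chunks explicit chunk lengths along one dimension."""
    grid = {
        "type": "rectangular",
        # Assumed layout: one explicit list for the variable dimension, one regular size.
        "chunk_shape": [[chunk_len] * n_chunks, 10],
        "separator": "/",
    }
    return len(json.dumps(grid).encode("utf-8"))


for n in (10, 1_000, 100_000):
    print(n, chunk_metadata_size(n))
```

The size grows linearly with the number of chunks, so an array with millions of chunks along the variable dimension would carry a metadata document of several megabytes that every writer has to agree on.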
-
Another use case from @normanrz: ome/ngff#33 (comment)
Looking at the documentation of the format, it overlaps quite a bit with the to-be-written octree use case (cc: @kevinyamauchi)
-
POC: zarr-developers/zarr-python#1483 (it works, ish)
-
Another use case for variable-length chunks, in Cubed. Cubed uses Zarr arrays to store intermediate results between steps of serverless distributed computations. Any computation step that changes the chunks to be irregular (e.g. certain groupby operations) is currently not possible, because Zarr does not support irregular chunks. (xref cubed-dev/cubed#312)
-
I think variable-length chunks would allow incrementally rechunking an array in-place, which could be very convenient.
-
When implementing domain decomposition for MPI applications, it's common to use 'optimized' variable chunk sizes. For instance, in an Earth System Model, when applying domain decomposition to an x-dimension of length 100 with 7 MPI workers, the standard MPI domain decomposition might be 15, 15, 14, 14, 14, 14, 14. This allocation ensures that the worker with the largest workload (15) and the one with the smallest (14) differ by only 1. A regular zarr chunk distribution would instead be 15, 15, 15, 15, 15, 15, 10, where the most loaded worker (15) and the least loaded worker (10) differ by 5. The optimized variable chunk sizing aims to minimize differences in execution time among MPI tasks, which matters because each MPI task needs to exchange data periodically. I'm unable to use kerchunk for these kinds of MPI Earth System Model outputs, since kerchunk stopped using xarray (dask) chunks as a backend and now uses zarr chunks.
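The two decompositions compared above can be reproduced with a few lines of Python. This is only an illustrative sketch; `balanced_chunks` and `regular_chunks` are hypothetical helper names, not functions from MPI, zarr, or kerchunk.

```python
def balanced_chunks(length, n_workers):
    """Split `length` into n_workers parts whose sizes differ by at most 1."""
    base, extra = divmod(length, n_workers)
    # The first `extra` workers take one additional element each.
    return (base + 1,) * extra + (base,) * (n_workers - extra)


def regular_chunks(length, chunk_size):
    """Fixed-size chunks with a possibly shorter final chunk."""
    full, remainder = divmod(length, chunk_size)
    return (chunk_size,) * full + ((remainder,) if remainder else ())


mpi = balanced_chunks(100, 7)
reg = regular_chunks(100, 15)
print(mpi, max(mpi) - min(mpi))  # (15, 15, 14, 14, 14, 14, 14) 1
print(reg, max(reg) - min(reg))  # (15, 15, 15, 15, 15, 15, 10) 5
```

Storing the first layout directly requires per-chunk lengths along the decomposed dimension, which is what variable chunking would allow.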
-
I was thinking through this ZEP the other day, and I had a thought about a possible change that I think could make things a lot simpler. Right now, the chunks are encoded explicitly in the Array metadata:

    {
        "type": "rectangular",
        "chunk_shape": [[5, 5, 5, 15, 15, 20, 35], 10],
        "separator": "/"
    }

The chunks here will still be named the same way in the chunk store, i.e.

But what if we did this instead?

    {
        "type": "rectangular",
        "chunk_shape": ["*", 10],
        "separator": "/"
    }

where the
So we would essentially be moving information about the chunk shapes from the metadata doc to the store. To discover the chunk shape, an implementation would have to list the relevant directory of the chunk store. Why do this? Here are some reasons:
Downsides:
Thoughts?
-
Another potential use case: Kerchunkifying/Virtualizing the Open-Meteo data format “OM”: fsspec/kerchunk#464
-
Posting ZEP 0003 here for discussion (as part of the ZEP workflow) (@MSanKeys963)
Still very drafty, but could use some discussion. Up to date copy at: https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md
ZEP 3 — Variable chunking
Authors:
Status: Draft
Type: Specification
Created: 2022-10-17
Abstract
To allow the chunks of a zarr array to form a rectangular grid rather than a regular grid, with the chunk lengths along each dimension given as a list of integers rather than a single chunk size.
Motivation and Scope
Two specific use cases have motivated this, given below. However, this generalisation of Zarr's storage
model can be seen as an optional enhancement, and it matches the data model currently used by dask.array.
Usage and Impact
Creation
Backward Compatibility
This change is fully backward compatible - all old data will remain usable. However, data written with variable chunks will not be readable by older versions of Zarr. It would be reasonable to wish to backport the feature to v2.
Related Work
Dask
dask.array uses rectangular chunking internally, and is one of the major consumers of zarr data. Much of the code translating logical slices into slices on the individual chunks should be reusable.
Parquet / Arrow
Arrow describes tables as a collection of record batches. There is no restriction on the size of these batches. This is not only very flexible, but can be used as an indexing strategy for low cardinality columns within parquet.
This feature was cited as one of the reasons parquet was chosen over zarr for dask dataframes: dask/dask#1599
awkward array
zarr-developers/zarr-specs#62
Implementation
It is to be hoped that much code can be adapted from dask.array, which already allows variable chunk sizes on each dimension.
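As a rough illustration of the core translation step, the sketch below maps a 1-D index range onto per-chunk slices when the chunk lengths vary. It is not taken from dask.array or zarr; `chunk_offsets` and `slices_per_chunk` are hypothetical names, and the chunk lengths are borrowed from the metadata example discussed in this thread.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_offsets(chunks):
    """Start offset of each chunk, with the total length as a final entry."""
    return [0, *accumulate(chunks)]


def slices_per_chunk(chunks, start, stop):
    """Map the half-open range [start, stop) onto (chunk_index, local_slice) pairs."""
    offsets = chunk_offsets(chunks)
    if not 0 <= start <= stop <= offsets[-1]:
        raise IndexError("range out of bounds")
    out = []
    if start == stop:
        return out
    # Binary search for the chunk containing `start`.
    i = bisect_right(offsets, start) - 1
    while i < len(chunks) and offsets[i] < stop:
        lo = max(start, offsets[i]) - offsets[i]
        hi = min(stop, offsets[i + 1]) - offsets[i]
        out.append((i, slice(lo, hi)))
        i += 1
    return out


chunks = (5, 5, 5, 15, 15, 20, 35)
print(slices_per_chunk(chunks, 12, 32))
# [(2, slice(2, 5, None)), (3, slice(0, 15, None)), (4, slice(0, 2, None))]
```

With regular chunks the containing chunk is found by integer division; with variable chunks that becomes a binary search over cumulative offsets, which is the main extra cost on the read path.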
Alternatives
Just tune chunk sizes
zarr-developers/zarr-specs#62 (comment)
Discussion
References and Footnotes
Copyright
This document has been placed in the public domain.