Is your feature request related to a problem?
My data has grown large enough that it no longer fits in memory, so I switched to storing it in zarr using xarray's built-in functions.
This works great, until I want to loop over small parts of that data as part of my calculations. Xarray is used a lot in other places in the same codebase, but for low-level algorithms I find it easier to extract the numpy data.
I initially expected .data to return the zarr array, but it loads the entire dataset into a numpy array and returns that, which of course doesn't work for a large dataset.
Describe the solution you'd like
The .data attribute should return the zarr stored array instead of the loaded numpy array.
Describe alternatives you've considered
Using xarray's indexing methods is somewhat slow for this kind of looping, and offers me no advantage since I only need the numpy data for further processing anyway.
I tried installing dask, which gives me a dask array instead of a numpy array from .data. However, looping through this dask array is very slow.
Additional context
I did some small-scale testing to time the different indexing methods:
Looping through the dataset opened directly with zarr: about 300 ms.
Looping through the dataset as an xarray variable: about 3 seconds.
Looping through the dataset as a dask array: about 1.5 minutes!
Can you write out an example of the looping you're trying to achieve, please? It might help us find a hotspot to fix.
In general, it's quite hard to extract the Zarr Store from the variable (it's hidden under potentially many layers of wrapper classes). You might consider tracking the path to the store, and the path to the array in .attrs and using that to reopen the Zarr array where you need it.
dcherian changed the title from "Direct access to underlying zarr store" to "Slow looping & indexing: need direct access to underlying zarr store" on Jan 22, 2025