Slow looping & indexing: need direct access to underlying zarr store #9973

Open
CarlAndersson opened this issue Jan 22, 2025 · 1 comment

Comments

@CarlAndersson
Contributor

Is your feature request related to a problem?

My data has grown too large to fit in memory, so I started storing it in zarr using xarray's built-in functions.
This works great until I need to loop over small parts of the data as part of my calculations. Xarray is used in many other places in the same codebase, but for low-level algorithms I find it easier to extract the numpy data.
My initial expectation was that .data would return the zarr array, but it loads the entire dataset into a numpy array and returns that, which of course doesn't work for a large dataset.

Describe the solution you'd like

The .data attribute should return the zarr stored array instead of the loaded numpy array.

Describe alternatives you've considered

Using xarray's indexing methods is somewhat slow for this kind of looping and offers me no advantage, since I only need the numpy data for further processing anyway.

I tried installing dask, which gives me a dask array instead of a numpy array from .data. However, looping through this dask array is very slow.

Additional context

I did some small scale testing to time the different indexing methods:

  • Looping through the dataset opened directly with zarr: about 300 ms.
  • Looping through the dataset as an xarray variable: about 3 seconds.
  • Looping through the dataset as a dask array: about 1.5 minutes!
@dcherian
Contributor

Can you write out an example of the looping you're trying to achieve, please? It might help us find a hotspot to fix.

In general, it's quite hard to extract the Zarr Store from the variable (it's hidden under potentially many layers of wrapper classes). You might consider tracking the path to the store, and the path to the array in .attrs and using that to reopen the Zarr array where you need it.

@dcherian dcherian changed the title Direct access to underlying zarr store Slow looping & indexing: need direct access to underlying zarr store Jan 22, 2025