Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting xarray.apply_ufunc #67

Closed
TomNicholas opened this issue Jul 18, 2022 · 6 comments
Closed

Supporting xarray.apply_ufunc #67

TomNicholas opened this issue Jul 18, 2022 · 6 comments

Comments

@TomNicholas
Copy link
Member

A very large proportion of xarray's code goes through xarray.core.computation.apply_ufunc, which is also exposed publicly for advanced users to wrap their own ufuncs.

Internally if a dask array is present then we call dask.array.apply_gufunc - perhaps cubed should expose a similar function in the same way that it exposes map_blocks etc.?

xref pydata/xarray#6807

@TomNicholas
Copy link
Member Author

TomNicholas commented Jul 18, 2022

Another question I had is about the fact that xarray.apply_ufunc allows users to execute arbitrary code inside their custom ufunc (we have one in a group project that calls out to some fortran!). Does that break the assumptions behind the whole memory predictability thing?

@tomwhite
Copy link
Member

perhaps cubed should expose a similar function in the same way that it exposes map_blocks

Yes, I think it would be possible to write a apply_gufunc in cubed. It's not a part of the array API so it would live somewhere else in the namespace (probably the top-level like map_blocks).

Another question I had is about the fact that xarray.apply_ufunc allows users to execute arbitrary code inside their custom ufunc (we have one in a group project that calls out to some fortran!). Does that break the assumptions behind the whole memory predictability thing?

To some extent it does, but there are things we could do.

We could ask users to provide an (estimate) of the extra memory required, like map_direct already does (although that function is not publicly exposed at the moment).

I've created some notebooks that help judge if the memory estimates are working in practice, see /~https://github.com/tomwhite/cubed/blob/main/examples/lithops-add-random-local.ipynb for example. So there's some tooling like this we could provide to make it easier to provide and check memory estimates.

@TomNicholas
Copy link
Member Author

Sounds good.

I would be interested in trying to implement apply_gufunc in cubed so I can learn more about the internals, but I probably won't have any time to work on it for the next 2 weeks, so if you have added it already by then then no worries!

@tomwhite
Copy link
Member

I would be interested in trying to implement apply_gufunc in cubed so I can learn more about the internals, but I probably won't have any time to work on it for the next 2 weeks, so if you have added it already by then then no worries!

That would be great!

@tomwhite
Copy link
Member

A couple of thoughts on implementation.

It should be possible to follow Dask's implementation, at least for the case where a single output array is returned, which can use blockwise. That's probably the thing to get working first.

For the multiple output array case, we could use Zarr structured arrays, like we do already for the implementation of mean. Related: #69

This should live in the core package. It really belongs in core.ops, but that is getting very big, so we should probably break it into smaller modules. But don't worry about that too much - we can reorganise as needed once something is working.

@tomwhite
Copy link
Member

This has been implemented in #149 and #151, which should be enough to enable support in xarray, so I'm going to close this issue now. The multiple outputs case is being tracked by #152.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants