Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added xarray dataset accessor to record axis validation #136

Merged
merged 12 commits into from
Jan 16, 2023

Conversation

davidorme
Copy link
Collaborator

@davidorme davidorme commented Dec 19, 2022

Description

This PR extends xarray.DataArray to add a core_axes property that is used to record the axis validation applied to that data array. It uses the xarray Accessor framework.

The core_axis property is basically a dictionary of which AxisValidator was applied to each core axis (or None if no validator was applied. The property is also made callable as a simple shorthand boolean test of whether a named axis has been validated. For example, a model __init__ might confirm:

if not elevation.core_axis('spatial'):
    raise ValueError("Elevation data not validated on spatial axis")

I worry a little that making the dict callable is a bit too clever. The main reason is to keep a short API - it would otherwise have to be something like elevation.core_axis.has_axis('spatial'), which admittedly isn't exactly verbose.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@davidorme davidorme requested a review from dalonsoa December 19, 2022 14:32
@davidorme
Copy link
Collaborator Author

davidorme commented Dec 19, 2022

I have in the back of my mind that we might provide a require_axes function that might take a dict of variable names and the axes they should be validated against as a utility function for model startups:

require_axes({"elevation": ('spatial',), "temperature": ('spatial', 'temporal')})

or even a decorator?

@require_axes({"elevation": ('spatial',), "temperature": ('spatial', 'temporal')})

But I think worth seeing what the common use patterns are first!

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is, it is OK, but I'm not sure I see the need of it - even from the functionality point of view. What's wrong with:

value.attrs["core_axis"][axis_name] = ...

It is reusing an existing interface, one that will be known by xarray users, one adequate to store the information you want, it is simple... I feel you are over-engineering something that is pretty simple, in nature.

@davidorme
Copy link
Collaborator Author

That's true, but attrib is an interface that forms part of the NetCDF data structure. So, if we export objects from the Data instance using the xarray writer functions, the user will get data files with noisy attributes. This approach isolates these internal structures from the data structures. But that's quite a niche argument.

@dalonsoa
Copy link
Collaborator

I still don't see the need, but it is you call. I'd suggest you gather the opinion from other members of the team.

@davidorme
Copy link
Collaborator Author

davidorme commented Dec 20, 2022

I was just playing around to check if I really object to noisy attributes in output NetCDF files or if I just like playing with the Accessor framework 😄.

It turns out that you although you can create dict attributes in DataArray.attrs, you cannot then export that to NetCDF files. We'd need to explicitly scrub the core_axes key before writing data.

In [42]: ds = xr.DataArray([1,2,3,4], name='test')
In [43]: ds.attrs['core_axes'] = {'spatial': "Spat_CellId_Coord_Any", "temporal": "Temporal_Date", "soil_layers": None}
In [44]: ds.to_dataset().to_netcdf('Attrs_test.nc')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: Invalid value for attr 'core_axes': {'spatial': 'Spat_CellId_Coord_Any', 'temporal': 'Temporal_Date', 'soil_layers': None}. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple

Changing to having a Dataset as the central data store and being able to use its export methods is a real advantage, so having to wrap those to get rid of core_axes attributes seems like a hack. I'm sure there are ways we could work around it (two lists: core_axis_name and core_axis_validator, although we'd then need a placeholder for None, which is also not a valid attribute type) but that then introduces more complexity than this PR.

@dalonsoa
Copy link
Collaborator

Ok, that is a strong reason for the accessor you created. So all good 👍

@davidorme
Copy link
Collaborator Author

davidorme commented Dec 20, 2022

Damn. Updating the docs revealed that this is broken! The current tests only check that the output of validate_dataarray has the correct core_axes values but in use those values are then immediately stripped out as the DataArray object is inserted into the Dataset instance and becomes a mere Variable!

pydata/xarray#3205

I think the solution is to instead use a Dataset accessor providing a dictionary of the variables each with a dictionary of axis validation. So, validate_dataarray would return type tuple[DataArray, dict] and then Data.__setitem__ would update Data.data.core_axes and then we add a method to Data Data.has_core_axis('temperature', 'spatial').

Actually, at that point, there is no need to store it in the Dataset at all. The core_axes property could just be part of the Data class and the code should keep the list of variable names in that dict in sync with the Dataset variables. Embedding it in the Dataset doesn't seem to have any obvious advantages?

@dalonsoa
Copy link
Collaborator

That's annoying! Given that is no longer linked to the DataArray of interest, I would make all of these methods/attributes of the Data object, so no using accessors anymore.

@davidorme davidorme requested a review from dalonsoa December 20, 2022 16:11
@davidorme
Copy link
Collaborator Author

Sorry @dalonsoa - yet another iteration...

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a couple of comments on doc strings.

Sorry it took me so long for so little input...

Comment on lines 165 to 166
The function returns the validated data array and a dictionary recording which
AxisValidator classes were applied to each of the core axes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in the Returns part of the docstring, right?

Args:
var_name: The name of a variable
axis_name: The core axis name
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need Return and Raises explanations in docstring.

@davidorme davidorme merged commit bbe82a6 into feature/data_system Jan 16, 2023
@davidorme davidorme deleted the subfeature/data_coreaxis_accessor branch January 16, 2023 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants