Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convenience selection function for sub-parts of NCDataset #47

Closed
Datseris opened this issue Jan 5, 2020 · 14 comments
Closed

Convenience selection function for sub-parts of NCDataset #47

Datseris opened this issue Jan 5, 2020 · 14 comments

Comments

@Datseris
Copy link
Contributor

Datseris commented Jan 5, 2020

Hi there, I have a feature request which could be helpful, albeit it is a small convenience syntax.

Let's say I have loaded an NCDataset. In Python this is done with xarray, but here we have something like

ebaf41_toa = NCDataset(datadir("CERES", "CERES_EBAF-TOA_Ed4.1_Subset_200003-201902.nc"))

Now, let's say we have a field of this dataset, fld = ebaf41_toa["toa_sw_all_mon"]:

toa_sw_all_mon (360 × 180 × 228)
  Datatype:    Float32
  Dimensions:  lon × lat × time
  Attributes:
   long_name            = Top of The Atmosphere Shortwave Flux, All-Sky conditions, Monthly Means     
   standard_name        = TOA Shortwave Flux - All-Sky
   CF_name              = toa_outgoing_shortwave_flux
   comment              = none
   units                = W m-2
   valid_min            =       0.00000
   valid_max            =       600.000
   _FillValue           = -999.0

In Python, you would do something like

ebaf41_toa = xr.open_dataset(path)
fld = ebaf41_toa.toa_sw_all_mon

This xarray offers a convenience syntax like

fld.sel(time=slice('2000-03-01','2019-03-01'))

which means that you can select sub-parts of the field by specifying which ranges of the dependent variables to keep.

This could be implemented here as well, but one has to somehow map the given keywords to "which" dependent variables they represent. I am happy to make this contribution if you are willing to guide me.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 5, 2020

Notice that this xarray syntax retains the structural form of the data. So they are still an xarray, which means that even after selecting sub-parts, you still keep the information regarding which fields are which.

@Alexander-Barth
Copy link
Member

It is great to see your enthusiasm to improve this package! Your help is very welcomed.
I am not too familiar with xarray but I heard many good thinks about this package. In your example time is this the dimension name or variable name (or do both need to exist with the same name)?

Maybe we could have a select functions which also only "virtually" subset the data without actually loading it, with the functionality that you propose (maybe extent it even for NCDataset):

using IntervalSets, NCDatasets
ds = NCDataset("ECMWF_ERA-40_subset.nc")
ncv = ds["tp"]
select(ncv, time = Date(2000,1,1)..Date(2001,12,31))
# returns a CFVariable
select(ds, time = Date(2000,1,1)..Date(2001,12,31))
# returns a NCDataset with all variable with a time dimension sliced, and variable without a time dimensions are not sliced

I played a bit with xarray, and I noticed that even indexing the xarray.DataArray type does not result in a numpy array:

import xarray as xr
# file from https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc
ds = xr.open_dataset("ECMWF_ERA-40_subset.nc")
a = ds["tp"][:,:,1] # a is still a xarray.core.dataarray.DataArray
b = ds["tp"][:,:,1].values # now we have the actual data

Maybe it would also be useful to have a function which works on indices and only virtually subsets the data, a bit like the julia function view:

view(ncv,:,:,1)
# also returns a CFVariable with the first time instance
view(ds,time = 1:3)
# returns a NCDataset with the first 3 time instances

While the arguments for view would always be indices, the arguments of select would be the corresponding values of the netCDF variable. Maybe it would be useful to build the functionality of select ontop of the functionality of view. It seems that xarrayhas the methods sel and isel for this.

view(ncv,:,:,1) does actually work already, but view(ncv,:,:,1)[:,:] loads every scalar element individually which is quite slow.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 6, 2020

I played a bit with xarray, and I noticed that even indexing the xarray.DataArray type does not result in a numpy array:

Exactly, and I find this a neat feature. It helps a lot to not have to constantly remember which dimension of the dataset is which. Also, the pretty printing for CFVariable is just too cool to pass!

Maybe it would also be useful to have a function which works on indices and only virtually subsets the data, a bit like the julia function view:

Oh wow, this is really a good suggestion! So to get it right, you are suggesting that two syntaxes should exist when accessing a CFVariable:

  1. Accessing it like an array, e.g. v[:, :, 1:5], which would return an Array type.
  2. Acessing it similarlry as is done in xarray, where here you suggest the syntax view(v, :, :, 1:5), where the result would still be a CFvariable.

I think this is a good idea! Here is the thing to really consider: which of these two things do you think is better to have the "short" syntax, i.e. v[:, :,1:5] ? In general here is how I think, regarding Julia and types: if I write v[:, :, 1:5], I would expect to get the same type as v, so in this case a CFVariable. Therefore I would vote that the brackets syntax still returns what you suggest here with the view.

I now also realize, that having special syntax for getting only the numerical values as an Array type is not really necessary, because this is currently both valid and also very clear syntax: Array(v)[:, :, 1:5].

@Alexander-Barth
Copy link
Member

Maybe it would also be useful to have a function which works on indices and only virtually subsets the data, a bit like the julia function view:

Oh wow, this is really a good suggestion! So to get it right, you are suggesting that two syntaxes should exist when accessing a CFVariable:

1. Accessing it like an array, e.g. `v[:, :, 1:5]`, which would return an `Array` type.

2. Acessing it similarlry as is done in `xarray`, where here you suggest the syntax  `view(v, :, :, 1:5)`, where the result would still be a `CFvariable`.

Yes, this is exactly right.

I think this is a good idea! Here is the thing to really consider: which of these two things do you think is better to have the "short" syntax, i.e. v[:, :,1:5] ? In general here is how I think, regarding Julia and types: if I write v[:, :, 1:5], I would expect to get the same type as v, so in this case a CFVariable. Therefore I would vote that the brackets syntax still returns what you suggest here with the view.

I now also realize, that having special syntax for getting only the numerical values as an Array type is not really necessary, because this is currently both valid and also very clear syntax: Array(v)[:, :, 1:5].

I see your point of not changing the type by indexing, but changing the behavior of getindex is a quite a huge change which will break everybodies code (and I have a lot of code using NCDatasets :-))

But in fact, that are some difference between numpy and julia that, to me, justify a different approach of NCDatasets and xarray. In fact, in numpy indexing creates a view:

import numpy as np                                                                                                                                                                                  
a = np.array([1,2,3,4])                                                                                                                                                                             
b = a[:]                                                                                                                                                                                            
b[1] = 10                                                                                                                                                                                           
# returns array([ 1, 10,  3,  4])

While in Julia, indexing copies the data; for NCDatasets you can think of copying the data from the disk to memory. In Julia, b = view(a,:,1); b[1] = 1 changes b while b = a[:,1]; b[1] = 1 does not.
Likewise, in NCDatasets for a file open in read-write-mode, b = view(a,:,1); b[1] = 1 could/should actually change the data in the file (if a is a CFVariable) while b = a[:,1]; b[1] = 1 would not.
There is thus some risk that users inadvertently modify their NetCDF files if we would make this change.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 7, 2020

While in Julia, indexing copies the data; for NCDatasets you can think of copying the data from the disk to memory.

Okay, this is the convincing sentence for me!

Therefore let's move forward with your approach and define view (or we can even call it a different function, we'll see). Do you think you can give me so "getting started" tips on how to do this? I am still through the processing of reading the source of NCDatasets (which is surprisingly large!).

@Datseris
Copy link
Contributor Author

Datseris commented Jan 7, 2020

Woohooo I got a "brilliant" idea for convenience syntax that retains the CFVariable structure:

v[:, :, 1:5] # copies the data, returns Array
v(:, :, 1:5) # views the data, returns CFVariable

This would allow short syntax for both operations! (I know how to implement the parenthesis syntax once you guide me on how to actually do it with respect to the source code. The first steps would be for you to tell me which parts of the source to read in detail and I'll try to take it from there.

@Alexander-Barth
Copy link
Member

Therefore let's move forward with your approach and define view (or we can even call it a different function, we'll see). Do you think you can give me so "getting started" tips on how to do this? I am still through the processing of reading the source of NCDatasets (which is surprisingly large!).

I think I would start to implement a view array type for a sub-array which does not fetch the element individually. Maybe the SubArray can be used for this purpose?

Then a view(ncvar,:,:,1) would just wrap the ncvar in such a view. The view and select methods are new and maybe it is easier to have them in a separate file. I am not not sure if it is necessary to modify the existing code (at least for subsetting a variable, for subsetting a entire dataset it might probably be the case).

As you have seen, CFVariable array type wraps the Variable array type (or the multi-file variant). Similarly, the view type could wrap the CFVariable type. If you want to read the existing code, it would be this part.

A significant addition to the code was the support of multi-files (and in particular avoiding the opening of all files at the same time). These are the types Defer* and MF* they are not important here.

Concerning the v(:, :, 1:5): I am wondering if you can have this alternative syntax as an extension package, but it is a clever idea.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 7, 2020

Okay! I'll start working on view. (I started already working on select, but I now realize view must be done first, as select will just propagate the result to view). Thanks for pointing where to read.

Concerning the v(:, :, 1:5): I am wondering if you can have this alternative syntax as an extension package, but it is a clever idea.

Why would you have this at an extension package and not here? (as far as Julia is concerned, yes you can, this is just a method definition, but why tho)


Kind of unrelated with this issue, but the above question reminded me to ask you again, cf. discussion in JuliaClimate/ClimateTools.jl#65 , but such things are good to be clear early on: are you willing to move NCDatasets.jl in an organization (in that discussion JuliaClimate) and allow other people to also have maintainer status over it? I have been involved in long discussions with many people in Geo/Climate fields in Julia, discussing the many benefits of having packages in organizations (along with the difficulties), if you are really interested to read about it, these two discourse posts are relevant: https://discourse.julialang.org/t/newcomer-contributor-in-juliageo-and-co-help-me-get-started/32480 and https://discourse.julialang.org/t/how-can-we-create-a-leaner-ecosystem-for-julia/32904/25?u=datseris onwards. I've tried to point out the many benefits of why this is a good thing to do.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 7, 2020

I am a bit confused about the separation of Variable and CFVariable. I think the latter is the first one but with transformations. The functionality we are discussing here could be applied to both of them or not?

Regarding #51 a solution could be to make both Variable, CFVariable and company a subtype of an abstract type AbstractVariable. Whether that type should be an subtype of AbstractArray is a different discussion.

@Datseris
Copy link
Contributor Author

Datseris commented Jan 7, 2020

Actually, is there even a way to create / get a Variable in actual real scenarios of loading real data? Because I can see that Base.getindex(ds::AbstractDataset,varname::AbstractString) always returns an CFVariable... What's the purpose of Variable ?

@Alexander-Barth
Copy link
Member

You can get a Variable from the variable function (if for some reasons you do not want to have scaling):

typeof(variable(NCDataset("WOD-Salinity-Provencal.nc"),"Salinity"))
# returns NCDatasets.Variable{Float32,1}

It is a good idea to have an abstract type AbstractVariable as you propose.
Thanks!

@Alexander-Barth
Copy link
Member

Concerning moving NCDatasets to an organization: I prefer to keep it under my name, but I might reconsider should I have less time to work on it.

@Datseris
Copy link
Contributor Author

Views are working fine for AxisArrays, so the best way forward here is to just simply convert an CFVariable into an axis array:

julia> B = AxisArray(randn(10,10), :x, :z)
2-dimensional AxisArray{Float64,2,...} with axes:
    :x, Base.OneTo(10)
    :z, Base.OneTo(10)
And data, a 10×10 Array{Float64,2}:
 -2.07336   -0.716528  -1.19063   …   2.35417    -0.523806   -0.759501
 -0.331434   0.82659   -1.57111      -1.44891    -0.255659    1.72754
  0.648818   0.368015   1.21215       0.0430943  -0.0960963   1.28175
  0.694397   0.701518  -0.834854     -0.164019   -1.35059    -0.288781
  0.28872    0.847034  -0.270278      0.196656    0.180989    0.516272
 -0.286382  -0.341013  -1.66179   …  -2.4435     -1.04169     0.148246
  0.456696   1.39454   -0.102087      1.75234    -0.280957   -0.436426
  0.86267    0.10408    0.195798      1.39754    -0.398613   -0.047977
  1.05371   -1.53931   -1.52603      -0.278274    0.995427   -2.7661
  0.643634  -0.332181  -1.11134      -0.176029    0.699264   -0.0990593

julia> @view B[1:5, :]
2-dimensional AxisArray{Float64,2,...} with axes:
    :x, 1:5
    :z, Base.OneTo(10)
And data, a 5×10 view(::Array{Float64,2}, 1:5, :) with eltype Float64:
 -2.07336   -0.716528  -1.19063   0.729718  …   2.35417    -0.523806   -0.759501
 -0.331434   0.82659   -1.57111   0.164159     -1.44891    -0.255659    1.72754
  0.648818   0.368015   1.21215   0.709711      0.0430943  -0.0960963   1.28175
  0.694397   0.701518  -0.834854  1.22847      -0.164019   -1.35059    -0.288781
  0.28872    0.847034  -0.270278  0.394992      0.196656    0.180989    0.516272

@Datseris
Copy link
Contributor Author

This is solved in e.g. ClimateBase.jl where an automatic convertion to a DimensionalArray from DimensionalData.jl is happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants