-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add grib_tree method #399
Add grib_tree method #399
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand that:
- each branch of the tree is combined as usual
- the branches have their keys copied into the right place in the overall output
- special renaming of grib keys is going on
# Hard code the filters in the correct order for the group hierarchy | ||
filters = ["stepType", "typeOfLevel"] | ||
|
||
zarr_store = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow for parquet/bring-your-own output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we need to figure out how parquet works at all for deep nested data like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the hierarchy in zarr is just a path embedded in the key... it might must work.
If I want to add more tests, should I checkin some json output from scan_grib for a few variables?
kerchunk/grib2.py
Outdated
|
||
unknown_counter = 0 | ||
for msg_ind, group in enumerate(message_groups): | ||
if "version" not in result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested with zarr v3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to run scan_grib using zarr v3? As implemented, grib_tree is pretty tightly coupled on the output of scan_grib. I see you can set an ENV var to allow access to the experimental API...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so version should always be 2? Or is this not the zarr version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - you are right - this is the kerchunk version, not the zarr version anyway.
# https://confluence.ecmwf.int/display/UDOC/Creating+your+own+local+definitions+-+ecCodes+GRIB+FAQ | ||
# If you process the groups from a single file in order, you can use the msg# to compare with the | ||
# IDX file. | ||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not include the message about what to do in the logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to put more of the answer in the log message - though I have not been successful in doing it myself... yet.
So far I have used the idx file to confirm that I am unlikely to need these unknown variables in my use case with HRRR and GFS grib2 files.
Example: gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t01z.wrfsfcf05.grib2
Msg # indexed from zero
Found unknown variable in msg# 1... it will be dropped
Found unknown variable in msg# 2... it will be dropped
Found unknown variable in msg# 37... it will be dropped
Found unknown variable in msg# 38... it will be dropped
Found unknown variable in msg# 42... it will be dropped
Found unknown variable in msg# 44... it will be dropped
...
IDX file entries (messages indexed from 1)
Example: gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t01z.wrfsfcf05.grib2.idx
1:0:d=2023092801:REFC:entire atmosphere:5 hour fcst:
2:375155:d=2023092801:RETOP:cloud top:5 hour fcst:
3:517041:d=2023092801:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:5 hour fcst:
4:889615:d=2023092801:VIL:entire atmosphere:5 hour fcst:
5:1157550:d=2023092801:VIS:surface:5 hour fcst:
...
37:24333077:d=2023092801:VGRD:1000 mb:5 hour fcst:
38:24956531:d=2023092801:MAXUVV:100-1000 mb above ground:4-5 hour max fcst:
39:25687250:d=2023092801:MAXDVV:100-1000 mb above ground:4-5 hour max fcst:
40:26410531:d=2023092801:DZDT:0.5-0.8 sigma layer:4-5 hour ave fcst:
41:27042519:d=2023092801:MSLMA:mean sea level:5 hour fcst:
42:27650201:d=2023092801:HGT:1000 mb:5 hour fcst:
43:28342907:d=2023092801:MAXREF:1000 m above ground:4-5 hour max fcst:
44:28588455:d=2023092801:REFD:263 K level:4-5 hour max fcst:
45:28736498:d=2023092801:MXUPHL:5000-2000 m above ground:4-5 hour max fcst:
46:28789333:d=2023092801:MNUPHL:5000-2000 m above ground:4-5 hour min fcst:
...
The metadata in the NOAA EMC generated idx files suggests we are missing local definitions for a number of variables that represent min, max or avg values over a layer. I have asked NOAA-EMC for help with decoding these properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can leave this for future improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crickets over at NOAA-EMC... I think we might have a working rust gribberish reader before we get a clear answer on how to build the ecCodes tables for these unknown properties.
🤞 @mpiannucci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the tables are annoying, need to figure that out in rust too to be fully comprehensive
vname, | ||
msg_ind, | ||
) | ||
# Use unknown as a group or drop it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any cases like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the same HRRR SFCF grib2 files discussed above re: unknown variables, here are two examples of unknown levels:
Example logs
Found 'unknown' attribute value for key typeOfLevel in var gh of msg# 161
Found 'unknown' attribute value for key typeOfLevel in var layth of msg# 164
Corresponding idx file entries.
162:129396944:d=2023092801:HGT:level of free convection:5 hour fcst:
165:135154615:d=2023092801:LAYTH:261 K level - 256 K level:5 hour fcst:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that we should be able to decode them? The grib object has many many possble attributes, so this seems solvable. Not needed for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I believe these are just more custom codes, similar to the unknown variables above.
I think they are all documented and verifiable by reading the idx files, but implementing the tables for ecCodes to read them is more than I can take on right now...
Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for diving in - I hope this can be useful.
- each branch of the tree is combined as usual
Yes - each variable can have different level values so only the branches can be combined in general
- the branches have their keys copied into the right place in the overall output
Yes - similar to what you do in scan_grib, I am just adding a path for the zarr hierarchy.
- special renaming of grib keys is going on
I did make copies of some attributes (name, stepType and typeOfLevel) to help users understand the hierarchy. I don't think I actually renamed any existing metadata. The implementation is opinionated about the grib level and grib step attributes as well as the coordinate names.
I would love to have more ECMWF data to test generality with.
# https://confluence.ecmwf.int/display/UDOC/Creating+your+own+local+definitions+-+ecCodes+GRIB+FAQ | ||
# If you process the groups from a single file in order, you can use the msg# to compare with the | ||
# IDX file. | ||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to put more of the answer in the log message - though I have not been successful in doing it myself... yet.
So far I have used the idx file to confirm that I am unlikely to need these unknown variables in my use case with HRRR and GFS grib2 files.
Example: gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t01z.wrfsfcf05.grib2
Msg # indexed from zero
Found unknown variable in msg# 1... it will be dropped
Found unknown variable in msg# 2... it will be dropped
Found unknown variable in msg# 37... it will be dropped
Found unknown variable in msg# 38... it will be dropped
Found unknown variable in msg# 42... it will be dropped
Found unknown variable in msg# 44... it will be dropped
...
IDX file entries (messages indexed from 1)
Example: gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t01z.wrfsfcf05.grib2.idx
1:0:d=2023092801:REFC:entire atmosphere:5 hour fcst:
2:375155:d=2023092801:RETOP:cloud top:5 hour fcst:
3:517041:d=2023092801:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:5 hour fcst:
4:889615:d=2023092801:VIL:entire atmosphere:5 hour fcst:
5:1157550:d=2023092801:VIS:surface:5 hour fcst:
...
37:24333077:d=2023092801:VGRD:1000 mb:5 hour fcst:
38:24956531:d=2023092801:MAXUVV:100-1000 mb above ground:4-5 hour max fcst:
39:25687250:d=2023092801:MAXDVV:100-1000 mb above ground:4-5 hour max fcst:
40:26410531:d=2023092801:DZDT:0.5-0.8 sigma layer:4-5 hour ave fcst:
41:27042519:d=2023092801:MSLMA:mean sea level:5 hour fcst:
42:27650201:d=2023092801:HGT:1000 mb:5 hour fcst:
43:28342907:d=2023092801:MAXREF:1000 m above ground:4-5 hour max fcst:
44:28588455:d=2023092801:REFD:263 K level:4-5 hour max fcst:
45:28736498:d=2023092801:MXUPHL:5000-2000 m above ground:4-5 hour max fcst:
46:28789333:d=2023092801:MNUPHL:5000-2000 m above ground:4-5 hour min fcst:
...
The metadata in the NOAA EMC generated idx files suggests we are missing local definitions for a number of variables that represent min, max or avg values over a layer. I have asked NOAA-EMC for help with decoding these properly.
vname, | ||
msg_ind, | ||
) | ||
# Use unknown as a group or drop it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the same HRRR SFCF grib2 files discussed above re: unknown variables, here are two examples of unknown levels:
Example logs
Found 'unknown' attribute value for key typeOfLevel in var gh of msg# 161
Found 'unknown' attribute value for key typeOfLevel in var layth of msg# 164
Corresponding idx file entries.
162:129396944:d=2023092801:HGT:level of free convection:5 hour fcst:
165:135154615:d=2023092801:LAYTH:261 K level - 256 K level:5 hour fcst:
kerchunk/grib2.py
Outdated
|
||
unknown_counter = 0 | ||
for msg_ind, group in enumerate(message_groups): | ||
if "version" not in result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to run scan_grib using zarr v3? As implemented, grib_tree is pretty tightly coupled on the output of scan_grib. I see you can set an ENV var to allow access to the experimental API...
# Hard code the filters in the correct order for the group hierarchy | ||
filters = ["stepType", "typeOfLevel"] | ||
|
||
zarr_store = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the hierarchy in zarr is just a path embedded in the key... it might must work.
If I want to add more tests, should I checkin some json output from scan_grib for a few variables?
kerchunk/grib2.py
Outdated
if "step/.zarray" not in group["refs"]: | ||
group = copy.deepcopy(group) | ||
logger.warning("Adding missing step variable to group path %s", path) | ||
group["refs"]["step/.zarray"] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I put this in scan_grib where the cfgrib WrongStepUnitError is handled?
I don't like doing this here. Another alternative would be to use the preprocess hook in MultiZarrToZarr?
We should be able to calculate the correct step value by subtracting the valid_time from the time (reference time).
That requires decoding those values though and then encoding the step correctly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the HRRR SubHF dataset, even the step values that are decoded (don't raise an exception) are wrong.
scans = scan_grib("gs://high-resolution-rapid-refresh/hrrr.20210928/conus/hrrr.t00z.wrfsubhf01.grib2")
dsets = []
zsets = []
vname = "dswrf"
for gg in scans:
if f"{vname}/.zattrs" in gg['refs']:
fs = fsspec.filesystem("reference", fo=gg)
dsets.append(xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False))
zsets.append(zarr.open_group(fs.get_mapper("")))
for ds, zs in zip(dsets, zsets):
print(
f"{ds[vname].attrs['GRIB_stepType']}, zs units {zs.step.attrs['units'] if 'step' in ds else None}: "
f"ds step - {ds.step.values[()].astype('timedelta64[h]') if 'step' in ds else None}, zs step - {zs.step[()] if 'step' in zs else None};"
f" valid_time - {ds.valid_time.values[()]}; time - {ds.time.values[()]}"
)
avg, zs units None: ds step - None, zs step - None; valid_time - 2021-09-28T00:15:00.000000000; time - 2021-09-28T00:00:00.000000000
instant, zs units hours: ds step - 15 hours, zs step - 15.0; valid_time - 2021-09-28T00:15:00.000000000; time - 2021-09-28T00:00:00.000000000
avg, zs units hours: ds step - 30 hours, zs step - 30.0; valid_time - 2021-09-28T00:30:00.000000000; time - 2021-09-28T00:00:00.000000000
instant, zs units hours: ds step - 30 hours, zs step - 30.0; valid_time - 2021-09-28T00:30:00.000000000; time - 2021-09-28T00:00:00.000000000
avg, zs units hours: ds step - 45 hours, zs step - 45.0; valid_time - 2021-09-28T00:45:00.000000000; time - 2021-09-28T00:00:00.000000000
instant, zs units hours: ds step - 45 hours, zs step - 45.0; valid_time - 2021-09-28T00:45:00.000000000; time - 2021-09-28T00:00:00.000000000
avg, zs units hours: ds step - 60 hours, zs step - 60.0; valid_time - 2021-09-28T01:00:00.000000000; time - 2021-09-28T00:00:00.000000000
instant, zs units hours: ds step - 1 hours, zs step - 1.0; valid_time - 2021-09-28T01:00:00.000000000; time - 2021-09-28T00:00:00.000000000
The values are in floating point minute but the units are specified as hours.
Only the stepType instant for the end of the forecast is correct - 1 hour.
Not sure how to fix the Docs check but I think the rest of this is in pretty good shape! |
@jthielen you might be interested in this |
if "fill_value" not in kw: | ||
if data.dtype.kind == "i": | ||
kw["fill_value"] = None | ||
elif k in z: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this be true when we create the dataset in a couple of lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting the fill_value
attribute from the coordinate zarr array instance.
Without the if clause I got errors on some existing tests I think... so I copied the if clause used for copying the attributes (new line 329).
The change is important because timedeltas stored as float are given a default fill_value of 0.0 without this change which becomes NaT when read with xarray. Meaning you can't have a zero timedelta.
kerchunk/grib2.py
Outdated
|
||
unknown_counter = 0 | ||
for msg_ind, group in enumerate(message_groups): | ||
if "version" not in result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so version should always be 2? Or is this not the zarr version?
# https://confluence.ecmwf.int/display/UDOC/Creating+your+own+local+definitions+-+ecCodes+GRIB+FAQ | ||
# If you process the groups from a single file in order, you can use the msg# to compare with the | ||
# IDX file. | ||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can leave this for future improvement
vname, | ||
msg_ind, | ||
) | ||
# Use unknown as a group or drop it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that we should be able to decode them? The grib object has many many possble attributes, so this seems solvable. Not needed for this PR.
Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martindurant Thanks for taking a pass on this.
Do you want to add xarray_datatree as a dev dependency for kerchunk so we can make some additional tests?
Thank you for your work on datatree @TomNicholas any thoughts on how I am using or abusing datatree here?
if "fill_value" not in kw: | ||
if data.dtype.kind == "i": | ||
kw["fill_value"] = None | ||
elif k in z: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting the fill_value
attribute from the coordinate zarr array instance.
Without the if clause I got errors on some existing tests I think... so I copied the if clause used for copying the attributes (new line 329).
The change is important because timedeltas stored as float are given a default fill_value of 0.0 without this change which becomes NaT when read with xarray. Meaning you can't have a zero timedelta.
kerchunk/grib2.py
Outdated
|
||
unknown_counter = 0 | ||
for msg_ind, group in enumerate(message_groups): | ||
if "version" not in result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - you are right - this is the kerchunk version, not the zarr version anyway.
# https://confluence.ecmwf.int/display/UDOC/Creating+your+own+local+definitions+-+ecCodes+GRIB+FAQ | ||
# If you process the groups from a single file in order, you can use the msg# to compare with the | ||
# IDX file. | ||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crickets over at NOAA-EMC... I think we might have a working rust gribberish reader before we get a clear answer on how to build the ecCodes tables for these unknown properties.
🤞 @mpiannucci
vname, | ||
msg_ind, | ||
) | ||
# Use unknown as a group or drop it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I believe these are just more custom codes, similar to the unknown variables above.
I think they are all documented and verifiable by reading the idx files, but implementing the tables for ecCodes to read them is more than I can take on right now...
That's fine, add it to the CI envs. Having it in the optional package requirements is fine too, but not necessary (this is experimental, after all). |
Isn't it all experimental, after all? |
Some bits are more experimental than others :) |
@martindurant Still not sure what do about the docs build failure here? |
I don't know either, except that the install raises a number of incompatibility warnings from pip. ammaraskar/sphinx-action@master doesn't appear to have changed, but it is somewhat dated - perhaps we need a concrete environment and run sphinx ourselves. |
Hi! I'm just getting started in the kerchunk and fsspec world and wanted to do fsspec on our grib files from WRF model. I was having a look at the colab notebook but I'm not sure how to implement it in my files...do I need to implement the correct_hrrr_subhf_step for my structure? Is there a docs I am missing? Thanks!!! |
@andreall hopefully that is only required for the quirky bad step values in the HRRR Subhourly 2D Surface data. I don't think you will have this problem. I suggest you try it for one grib file and see what you get - look at the valid_time, step and time coordinates as and see if it is working properly. something like this... scans = scan_grib("myfile.grib2")
ref_store = grib_tree(scans)
fs = fsspec.filesystem("reference", fo=ref_store)
zg = zarr.open_group(fs.get_mapper(""))
dt = datatree.open_datatree(fs.get_mapper(""), engine="zarr", consolidated=False) The colab and the code comments are all I have for docs right now, but if this PR is merged I will help support it. |
Question: xarray now supports |
Looks like not yet? |
I'm sure we could suggest it |
@martindurant pushed one more commit |
In JSON, the values are indeed strings on-disk, which become bytes when read by the filesystem: either as ascii, or base64 encoded. In memory (dicts), bytes, ascii strings or base64 strings should all work. Parquet, by contrast, stores bytes only. |
See pydata/xarray#7437 (on the |
Merging this now, let's see what use it gets. We can always fix things for cases of gribs that are just a bit different from those trialed. |
Add an method to map grib2 data model to zarr hierarchy & xarray datatree
The method is opinionated about how to do this - there are many possible ways. I hope the is generally useful enough that it is worth adding to the kerchunk library.
See examples in Colab
Here is an example of the hierarchical structure for the eastward wind velocity variable which has been aggregated by step, time and level (isobaricInhPa).