Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor available datasets logic to be more flexible #739

Merged
merged 13 commits into from
May 6, 2019

Conversation

djhoese
Copy link
Member

@djhoese djhoese commented Apr 28, 2019

This is an attempt at resolving the discussion brought up in #434.

Problem Summary

Currently in Satpy we have a couple difficult-to-code use cases with readers:

  1. Some readers don't have their datasets pre-configured, like generic readers or readers that parse the output of algorithm software (clavrx, geocat, etc). So far we've gotten by with an available_datasets method on file handlers that provides the dataset info for new "dynamic" datasets that the file reader knows about but the YAML didn't.
  2. Certain datasets have information that is only knowable by the file contents, like resolution. Since it can't be configured in YAML we get around this by file handlers having a resolution property that the base yaml class is checking for and using to update DatasetIDs configured in the YAML.
  3. Some datasets are configured in YAML but may or may not be in the input file depending on how it was created. There is therefore no way to know if it is "available" or not until it is actually requested by the user.
  4. If a dataset is available in multiple resolutions but the highest ("best") resolution is not available due to some files not being provided, we see that the dataset is not available via available_dataset_ids but even asking for the generic "name" of the dataset will still try to load the highest resolution version and fail (see MODIS L1B reader).

Solution Summary

I've updated the available_datasets method of the file handlers to take a series of (is_available, dataset_info) pairs via a generator. I chain these calls together for the first file handler of each loaded file type. The is_available "boolean" can be:

  1. True: There is a file handler and it has this dataset and can load it
  2. False: There is a file handler and it knows it should be able to load this, but doesn't have it.
  3. None: There isn't a file handler that knows how to load this dataset.

This method can do three things:

  1. Yield the dataset info provided it, as is.
  2. Update the dataset info with new identifying information (resolution, etc) and possibly other metadata that would effect loading (coordinates, etc). Then yield the dataset info.
  3. Generate new dataset info for dynamic datasets that haven't been configured in the YAML, but that the file handler knows about.

I've changed the base class to have an idea of "all datasets" and "available datasets", but there are some flaws. Right now, as of this first commit, this only reproduces the current behaviors (maybe with a little extra flair).

Doubts and Questions

  1. I used the "chaining" idea when calling available_datasets because it seemed the most flexible, ended up being kind of elegant, should be performant because of the way the generators are used, and allows for file handlers to "communicate" between themselves with what is available. However, it may make the actual use of the generator inside available_datasets overly complex and hard to use. The simple cases are still relatively simple, but the semi-complex ones become very hard to walk through.

  2. The dependency tree used by the Scene and .load of the readers have a strange way of handling a "known" versus "available" dataset. It is very obvious now that I've been able to take a step back and work on this.

    The dependency tree needs to know what's possible regardless of whether or not a dataset exists. If we know that it exists then we know what is possible. However, we also use the dependency tree when we actually load datasets to get the DatasetID that we need to load. The "best" dataset to load (highest resolution) may not be the "best available", we currently only load "best". We would probably be ok building the dependency tree with only available datasets, but I'm a little nervous about the lack of future-proofing this would cause. Maybe building two dependency trees would be best/most flexible; one of all known and one of available.

    The main thing to decide is what should the base reader's get_dataset_key return. Should we split it in to two or more methods that do similar things? Is a keyword argument as I've laid out in this PR so far good enough? This is compounded by the problem in the previous paragraph, which components need "known" datasets and which ones need "available"? Right now I have it possible to do "prefer available but give me known if it isn't available".

I'm not sure what's best here and wouldn't mind some opinions. I've been thinking about this too much so I'm taking a break for a little while.

TODO

  1. Move the "resolution" property logic to available_datasets functionality.
  2. Determine best way to handle all/available differences in current dependency tree creation and use.

@djhoese djhoese added enhancement code enhancements, features, improvements component:readers component:dep_tree Dependency tree and dataset loading refactor work in progress labels Apr 28, 2019
@djhoese djhoese requested review from mraspaud and sfinkens April 28, 2019 02:01
@djhoese djhoese self-assigned this Apr 28, 2019
Copy link
Member

@sfinkens sfinkens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, this is excellent so far!

  1. Originally, I was preferring two separate methods for adding new and complementing existing datasets. But the chaining approach is (once I got it) much more elegant and flexible. I also like that there is now one well-documented baseclass method to look up (not the hardly documented mix of available_datasets and resolution). I found the comments in the clavrx reader even more instructive than the rather abstract docstring in BaseFileHandler.available_datasets. Maybe those could be added as an implementation guide:

    # some other file handler knows how to load this
    if is_avail is not None:
        yield is_avail, ds_info
    
    # we can confidently say that we can provide this dataset and can
    # provide more info
    yield True, new_info
    
    # if we didn't know how to handle this dataset and no one else did
    # then we should keep it going down the chain
    yield is_avail, ds_info

    What do you think of replacing (True, False, None) with constants (AVAILABLE, KNOWN_BUT_UNAVAILABLE, UNKNOWN)? That would make the logic even more clear. Probably makes the docstrings less readable, though.

  2. I know too little about the dependency tree to give an advice here. But you can't do much if the dataset isn't available, right? So "prefer available but give me known if it isn't available" seems like a good default to me. Can you give an example of why you are worried about the lack of future-proofing when building the dependency tree with only the available datasets?

@djhoese
Copy link
Member Author

djhoese commented Apr 30, 2019

  1. Good idea on the implementation guide. I actually had an example in the base classes docstring when I first wrote it, back when I wanted users to use the base classes method regardless.

    I'm not completely against the constants, but I would almost rather wait for python 2 support to end so we could use an Enum.

  2. Right now, it probably doesn't matter much. My thought was something like a composite being listed as "X" resolution in the future even though there are other resolutions available if the user provided that resolution. So I'm worried that all_dataset_ids would become less accurate or less useful. Maybe I do the preference for available and see what breaks (and what gets fixed).

@djhoese
Copy link
Member Author

djhoese commented May 1, 2019

@sfinkens So I've changed the default behavior to do as discussed; return the best available dataset before falling back to best known dataset. From my basic test this looks to have fixed a lot of the issues with the MODIS L1B reader that @TAlonglong was running in to.

I need to add tests and more documentation but I think this may work. One thing I realized I was worried about @sfinkens is what the Scene.all/available_composite_ids methods would return. Looking at the code it may actually work as expected because of how I had it coded originally. Turns out past Dave was smart this time.

@sfinkens
Copy link
Member

sfinkens commented May 2, 2019

@djhoese Thanks for the explanation. Looks good to me! And agree to wait for Enums in Python3. We'll see, maybe the current approach is already clear enough and there will be no need to change it.

@djhoese djhoese changed the title [WIP] Refactor available datasets logic to be more flexible Refactor available datasets logic to be more flexible May 2, 2019
@djhoese djhoese marked this pull request as ready for review May 2, 2019 20:19
@djhoese
Copy link
Member Author

djhoese commented May 2, 2019

I think this is ready to go. @sfinkens I did what you suggested and added a couple examples to the available_datasets docstring. It is nice and long now.

@mraspaud
Copy link
Member

mraspaud commented May 2, 2019

Let me have a look :)

@coveralls
Copy link

coveralls commented May 2, 2019

Coverage Status

Coverage increased (+0.008%) to 80.715% when pulling e256d2c on djhoese:feature-available-datasets-update into 9f7dced on pytroll:master.

@codecov
Copy link

codecov bot commented May 2, 2019

Codecov Report

Merging #739 into master will increase coverage by <.01%.
The diff coverage is 93.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #739      +/-   ##
==========================================
+ Coverage   80.71%   80.71%   +<.01%     
==========================================
  Files         149      149              
  Lines       21677    21836     +159     
==========================================
+ Hits        17496    17626     +130     
- Misses       4181     4210      +29
Impacted Files Coverage Δ
satpy/writers/__init__.py 85.21% <ø> (ø) ⬆️
satpy/readers/clavrx.py 92.61% <100%> (+0.84%) ⬆️
satpy/readers/file_handlers.py 93.22% <100%> (+1.38%) ⬆️
satpy/tests/test_scene.py 99.44% <100%> (+0.01%) ⬆️
satpy/readers/nucaps.py 94.11% <100%> (ø) ⬆️
satpy/tests/utils.py 97.29% <100%> (+0.32%) ⬆️
satpy/readers/viirs_sdr.py 85.46% <100%> (+0.43%) ⬆️
satpy/tests/reader_tests/test_viirs_sdr.py 92.33% <100%> (+0.02%) ⬆️
satpy/tests/reader_tests/test_clavrx.py 98.37% <100%> (+0.28%) ⬆️
satpy/readers/geocat.py 87.5% <63.63%> (-4.23%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f7dced...e256d2c. Read the comment docs.

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First quick review.

"file type"; the first file handler for each type.

This method should **not** update values of the dataset information
dictionary **unless* this file handler has a matching file type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dictionary **unless* this file handler has a matching file type
dictionary **unless** this file handler has a matching file type

and should yield its new dataset in addition to the previous one.

Args:
configured_datasets (list): Series of (bool, dict) in the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be mentioned already here that the fist element can be None ? eg (bool or None, dict)

available datasets. This argument could be the result of a
previous file handler's implementation of this method.

Returns: Iterator of (bool, dict) pairs where dict is the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

satpy/readers/file_handlers.py Show resolved Hide resolved
satpy/readers/file_handlers.py Show resolved Hide resolved
satpy/readers/geocat.py Show resolved Hide resolved
satpy/readers/goes_imager_nc.py Show resolved Hide resolved
satpy/readers/yaml_reader.py Show resolved Hide resolved
satpy/tests/test_yaml_reader.py Show resolved Hide resolved
satpy/readers/clavrx.py Show resolved Hide resolved
satpy/readers/file_handlers.py Show resolved Hide resolved
Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it, but it looks good I think.

@sfinkens
Copy link
Member

sfinkens commented May 3, 2019

I'll quickly test the goes_imager_nc reader

@sfinkens
Copy link
Member

sfinkens commented May 3, 2019

Works!

@mraspaud mraspaud merged commit 5bfcda8 into pytroll:master May 6, 2019
@djhoese djhoese deleted the feature-available-datasets-update branch May 6, 2019 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:dep_tree Dependency tree and dataset loading component:readers enhancement code enhancements, features, improvements refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow readers to filter the available datasets configured in YAML
4 participants