-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor available datasets logic to be more flexible #739
Refactor available datasets logic to be more flexible #739
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, this is excellent so far!
-
Originally, I was preferring two separate methods for adding new and complementing existing datasets. But the chaining approach is (once I got it) much more elegant and flexible. I also like that there is now one well-documented baseclass method to look up (not the hardly documented mix of
available_datasets
andresolution
). I found the comments in the clavrx reader even more instructive than the rather abstract docstring inBaseFileHandler.available_datasets
. Maybe those could be added as an implementation guide:# some other file handler knows how to load this if is_avail is not None: yield is_avail, ds_info # we can confidently say that we can provide this dataset and can # provide more info yield True, new_info # if we didn't know how to handle this dataset and no one else did # then we should keep it going down the chain yield is_avail, ds_info
What do you think of replacing
(True, False, None)
with constants(AVAILABLE, KNOWN_BUT_UNAVAILABLE, UNKNOWN)
? That would make the logic even more clear. Probably makes the docstrings less readable, though. -
I know too little about the dependency tree to give an advice here. But you can't do much if the dataset isn't available, right? So "prefer available but give me known if it isn't available" seems like a good default to me. Can you give an example of why you are worried about the lack of future-proofing when building the dependency tree with only the available datasets?
|
@sfinkens So I've changed the default behavior to do as discussed; return the best available dataset before falling back to best known dataset. From my basic test this looks to have fixed a lot of the issues with the MODIS L1B reader that @TAlonglong was running in to. I need to add tests and more documentation but I think this may work. One thing I realized I was worried about @sfinkens is what the |
@djhoese Thanks for the explanation. Looks good to me! And agree to wait for Enums in Python3. We'll see, maybe the current approach is already clear enough and there will be no need to change it. |
I think this is ready to go. @sfinkens I did what you suggested and added a couple examples to the |
Let me have a look :) |
Codecov Report
@@ Coverage Diff @@
## master #739 +/- ##
==========================================
+ Coverage 80.71% 80.71% +<.01%
==========================================
Files 149 149
Lines 21677 21836 +159
==========================================
+ Hits 17496 17626 +130
- Misses 4181 4210 +29
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First quick review.
satpy/readers/file_handlers.py
Outdated
"file type"; the first file handler for each type. | ||
|
||
This method should **not** update values of the dataset information | ||
dictionary **unless* this file handler has a matching file type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dictionary **unless* this file handler has a matching file type | |
dictionary **unless** this file handler has a matching file type |
satpy/readers/file_handlers.py
Outdated
and should yield its new dataset in addition to the previous one. | ||
|
||
Args: | ||
configured_datasets (list): Series of (bool, dict) in the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could it be mentioned already here that the fist element can be None ? eg (bool or None, dict)
satpy/readers/file_handlers.py
Outdated
available datasets. This argument could be the result of a | ||
previous file handler's implementation of this method. | ||
|
||
Returns: Iterator of (bool, dict) pairs where dict is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested it, but it looks good I think.
I'll quickly test the |
Works! |
This is an attempt at resolving the discussion brought up in #434.
Problem Summary
Currently in Satpy we have a couple difficult-to-code use cases with readers:
available_datasets
method on file handlers that provides the dataset info for new "dynamic" datasets that the file reader knows about but the YAML didn't.resolution
property that the base yaml class is checking for and using to update DatasetIDs configured in the YAML.available_dataset_ids
but even asking for the generic "name" of the dataset will still try to load the highest resolution version and fail (see MODIS L1B reader).Solution Summary
I've updated the
available_datasets
method of the file handlers to take a series of(is_available, dataset_info)
pairs via a generator. I chain these calls together for the first file handler of each loaded file type. Theis_available
"boolean" can be:True
: There is a file handler and it has this dataset and can load itFalse
: There is a file handler and it knows it should be able to load this, but doesn't have it.None
: There isn't a file handler that knows how to load this dataset.This method can do three things:
I've changed the base class to have an idea of "all datasets" and "available datasets", but there are some flaws. Right now, as of this first commit, this only reproduces the current behaviors (maybe with a little extra flair).
Doubts and Questions
I used the "chaining" idea when calling
available_datasets
because it seemed the most flexible, ended up being kind of elegant, should be performant because of the way the generators are used, and allows for file handlers to "communicate" between themselves with what is available. However, it may make the actual use of the generator insideavailable_datasets
overly complex and hard to use. The simple cases are still relatively simple, but the semi-complex ones become very hard to walk through.The dependency tree used by the Scene and
.load
of the readers have a strange way of handling a "known" versus "available" dataset. It is very obvious now that I've been able to take a step back and work on this.The dependency tree needs to know what's possible regardless of whether or not a dataset exists. If we know that it exists then we know what is possible. However, we also use the dependency tree when we actually load datasets to get the
DatasetID
that we need to load. The "best" dataset to load (highest resolution) may not be the "best available", we currently only load "best". We would probably be ok building the dependency tree with only available datasets, but I'm a little nervous about the lack of future-proofing this would cause. Maybe building two dependency trees would be best/most flexible; one of all known and one of available.The main thing to decide is what should the base reader's
get_dataset_key
return. Should we split it in to two or more methods that do similar things? Is a keyword argument as I've laid out in this PR so far good enough? This is compounded by the problem in the previous paragraph, which components need "known" datasets and which ones need "available"? Right now I have it possible to do "prefer available but give me known if it isn't available".I'm not sure what's best here and wouldn't mind some opinions. I've been thinking about this too much so I'm taking a break for a little while.
TODO
available_datasets
functionality.git diff origin/master -- "*py" | flake8 --diff