-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Offline loading #1726
Offline loading #1726
Conversation
It's maybe a bit annoying to add but could we maybe have as well a version of the local data loading scripts in the package? Maybe we can add them to package_data or something? |
Yes I mentioned this in #824 as well. I'm looking into it |
Alright now from datasets import load_dataset
d = load_dataset("csv", data_files=["path/to/data.csv"]) Instead of loading the dataset script from the module cache, it's loaded from inside the I updated the test to still be able to fetch the dummy data files for those datasets from |
Alright now all test pass :) |
LGTM! Since you're getting the local script's last modification date anyways do you think it might be a good idea to show it in the warning? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool!
from .utils import offline | ||
|
||
|
||
class LoadTest(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Yep good idea. I added the date in the warning. For example |
* minor * add prepare module test * fix windows path scheme check * cached_path raises requests error if no internet * look for cached modules if there's no internet * wip tests * add warning message * update tests * style * remove test modules if already exist * style * add init_dynamic_modules function for testing purposes * fix importlib cache * move csv, json, text and pandas to inside the package * add packaged datasets handling in prepare_module * update tests * minor fix * add missing __init__.py * fix test * style * fix test * fix tests * show last modification date in the warning
module_name_for_dynamic_modules = os.path.basename(dynamic_modules_path) | ||
datasets_modules_path = os.path.join(dynamic_modules_path, "datasets") | ||
datasets_modules_name = module_name_for_dynamic_modules + ".datasets" | ||
metrics_modules_path = os.path.join(dynamic_modules_path, "metric") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhoestq small typo here which breaks metrics loading, submitting a fix now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was "metrics" with an 's' !! Good catch
As discussed in #824 it would be cool to make the library work in offline mode.
Currently if there's not internet connection then modules (datasets or metrics) that have already been loaded in the past can't be loaded and it raises a ConnectionError.
This is because
prepare_module
fetches online for the latest version of the module.To make it work in offline mode one suggestion was to reload the latest local version of the module.
I implemented that and I also raise a warning saying that the module that is loaded is the latest local version.
I added tests to make sure it works as expected and I needed to do a few changes in the code to be able to test things properly. In particular I added a parameter
hf_modules_cache
toinit_dynamic_modules
for testing purposes. It makes it possible to have temporary modules caches for testing.I also added a
offline
context utility that allows to test part of the code by making all the requests fail as if there was no internet.Close #824, close #761.