Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offline loading #1726

Merged
merged 24 commits into from
Jan 19, 2021
Merged

Offline loading #1726

merged 24 commits into from
Jan 19, 2021

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Jan 12, 2021

As discussed in #824 it would be cool to make the library work in offline mode.
Currently if there's not internet connection then modules (datasets or metrics) that have already been loaded in the past can't be loaded and it raises a ConnectionError.
This is because prepare_module fetches online for the latest version of the module.

To make it work in offline mode one suggestion was to reload the latest local version of the module.
I implemented that and I also raise a warning saying that the module that is loaded is the latest local version.

logger.warning(
    f"Using the latest cached version of the module from {cached_module_path} since it "
    f"couldn't be found locally at {input_path} or remotely ({error_type_that_prevented_reaching_out_remote_stuff})."
)

I added tests to make sure it works as expected and I needed to do a few changes in the code to be able to test things properly. In particular I added a parameter hf_modules_cache to init_dynamic_modules for testing purposes. It makes it possible to have temporary modules caches for testing.

I also added a offline context utility that allows to test part of the code by making all the requests fail as if there was no internet.

Close #824, close #761.

@lhoestq lhoestq requested a review from thomwolf January 12, 2021 15:21
@thomwolf
Copy link
Member

It's maybe a bit annoying to add but could we maybe have as well a version of the local data loading scripts in the package?
The text, json, csv. Thinking about people like in #1725 who are expecting to be able to work with local data without downloading anything.

Maybe we can add them to package_data or something?

@lhoestq
Copy link
Member Author

lhoestq commented Jan 12, 2021

Yes I mentioned this in #824 as well. I'm looking into it

@lhoestq
Copy link
Member Author

lhoestq commented Jan 14, 2021

Alright now csv, json, text and pandas are "packaged datasets", i.e. they're part of the datasets package, which makes them available in offline mode without any change in terms of API:

from datasets import load_dataset

d = load_dataset("csv", data_files=["path/to/data.csv"])

Instead of loading the dataset script from the module cache, it's loaded from inside the datasets package.

I updated the test to still be able to fetch the dummy data files for those datasets from datasets/{text|csv|pandas|json}/dummy in the repo.

@lhoestq
Copy link
Member Author

lhoestq commented Jan 15, 2021

Alright now all test pass :)
(I don't thank you windows)

@lhoestq lhoestq requested a review from yjernite January 18, 2021 14:24
@yjernite
Copy link
Member

LGTM! Since you're getting the local script's last modification date anyways do you think it might be a good idea to show it in the warning?

Copy link
Member

@thomwolf thomwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool!

from .utils import offline


class LoadTest(TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@lhoestq
Copy link
Member Author

lhoestq commented Jan 19, 2021

LGTM! Since you're getting the local script's last modification date anyways do you think it might be a good idea to show it in the warning?

Yep good idea. I added the date in the warning. For example (last modified on Mon Nov 30 11:01:56 2020)

@lhoestq lhoestq merged commit 60fa3a1 into master Jan 19, 2021
@lhoestq lhoestq deleted the offline-loading branch January 19, 2021 16:42
eusip pushed a commit to eusip/datasets that referenced this pull request Jan 21, 2021
* minor

* add prepare module test

* fix windows path scheme check

* cached_path raises requests error if no internet

* look for cached modules if there's no internet

* wip tests

* add warning message

* update tests

* style

* remove test modules if already exist

* style

* add init_dynamic_modules function for testing purposes

* fix importlib cache

* move csv, json, text and pandas to inside the package

* add packaged datasets handling in prepare_module

* update tests

* minor fix

* add missing __init__.py

* fix test

* style

* fix test

* fix tests

* show last modification date in the warning
module_name_for_dynamic_modules = os.path.basename(dynamic_modules_path)
datasets_modules_path = os.path.join(dynamic_modules_path, "datasets")
datasets_modules_name = module_name_for_dynamic_modules + ".datasets"
metrics_modules_path = os.path.join(dynamic_modules_path, "metric")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq small typo here which breaks metrics loading, submitting a fix now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was "metrics" with an 's' !! Good catch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discussion using datasets in offline mode Downloaded datasets are not usable offline
4 participants