Offline loading #1726

lhoestq · 2021-01-12T15:21:57Z

As discussed in #824, it would be cool to make the library work in offline mode.
Currently, if there's no internet connection, modules (datasets or metrics) that have already been loaded in the past can't be loaded again, and a ConnectionError is raised.
This is because prepare_module goes online to fetch the latest version of the module.

To make it work in offline mode, one suggestion was to reload the latest local version of the module.
I implemented that, and I also raise a warning saying that the loaded module is the latest local version.

logger.warning(
    f"Using the latest cached version of the module from {cached_module_path} since it "
    f"couldn't be found locally at {input_path} or remotely ({error_type_that_prevented_reaching_out_remote_stuff})."
)
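The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolve_script`, `latest_cached_script`, the `fetch_remote` callable and the `<cache_dir>/<hash>/<name>.py` layout are all hypothetical names and assumptions chosen for the example.

```python
import logging
import os
import time
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def latest_cached_script(cache_dir: Path, name: str) -> Optional[Path]:
    """Return the most recently modified cached copy of `<name>.py`, if any.

    Assumes (hypothetically) that each cached version lives in its own
    sub-directory: <cache_dir>/<some_hash>/<name>.py
    """
    candidates = list(cache_dir.glob(f"*/{name}.py"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def resolve_script(name: str, cache_dir: Path, fetch_remote: Callable[[str], Path]) -> Path:
    """Try the remote source first; fall back to the newest cached copy when offline."""
    try:
        return fetch_remote(name)
    except ConnectionError as err:
        cached = latest_cached_script(cache_dir, name)
        if cached is None:
            raise  # nothing cached locally, so surface the original error
        modified = time.ctime(cached.stat().st_mtime)
        logger.warning(
            f"Using the latest cached version of the module from {cached} "
            f"(last modified on {modified}) since it couldn't be reached remotely "
            f"({type(err).__name__})."
        )
        return cached
```

Picking the newest file by modification time is only one way to define "latest"; the important part is that the network error is caught and turned into a warning only when a local copy exists.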

I added tests to make sure it works as expected, and I needed to make a few changes in the code to be able to test things properly. In particular, I added a parameter hf_modules_cache to init_dynamic_modules for testing purposes. It makes it possible to have temporary modules caches for testing.
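To illustrate why a configurable modules cache helps in tests, here is a rough sketch of the idea. It is not the actual init_dynamic_modules implementation: `init_modules_cache` and the `demo_modules` package name are hypothetical, and the real function may do more.

```python
import importlib
import os
import sys


def init_modules_cache(modules_cache: str, package_name: str = "demo_modules") -> str:
    """Create an importable package directory inside `modules_cache` and return its path."""
    package_path = os.path.join(modules_cache, package_name)
    os.makedirs(package_path, exist_ok=True)

    # An __init__.py makes the directory a regular package.
    init_file = os.path.join(package_path, "__init__.py")
    if not os.path.exists(init_file):
        with open(init_file, "w"):
            pass

    # Make the cache importable and drop any stale finder caches.
    if modules_cache not in sys.path:
        sys.path.append(modules_cache)
    importlib.invalidate_caches()
    return package_path
```

In a test, `modules_cache` can point at a `tempfile.TemporaryDirectory()`, so each test gets a fresh cache that is thrown away afterwards.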

I also added an offline context utility that makes it possible to test parts of the code by making all requests fail as if there were no internet.

Close #824, close #761.

thomwolf · 2021-01-12T16:01:58Z

It's maybe a bit annoying to add, but could we also ship a version of the local data loading scripts (text, json, csv) in the package?
I'm thinking of people like in #1725 who expect to be able to work with local data without downloading anything.

Maybe we can add them to package_data or something?

lhoestq · 2021-01-12T16:39:39Z

Yes, I mentioned this in #824 as well. I'm looking into it.

lhoestq · 2021-01-14T16:17:24Z

Alright, now csv, json, text and pandas are "packaged datasets", i.e. they're part of the datasets package, which makes them available in offline mode without any change to the API:

from datasets import load_dataset

d = load_dataset("csv", data_files=["path/to/data.csv"])

Instead of loading the dataset script from the module cache, it's loaded from inside the datasets package.
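Conceptually, the lookup now has two branches: packaged names resolve to a module inside the installed package, and everything else goes through the module cache as before. The sketch below only illustrates that branching; `PACKAGED_MODULES`, `locate_module` and the `my_package...` paths are hypothetical and do not reflect the actual layout or the prepare_module code.

```python
from typing import Dict, Tuple

# Hypothetical registry: short name -> importable module path inside the package.
PACKAGED_MODULES: Dict[str, str] = {
    name: f"my_package.packaged_modules.{name}.{name}"
    for name in ("csv", "json", "text", "pandas")
}


def locate_module(name: str) -> Tuple[str, str]:
    """Return (source, location), where source is 'packaged' or 'dynamic'."""
    if name in PACKAGED_MODULES:
        # Shipped with the package: no download and no module cache needed.
        return "packaged", PACKAGED_MODULES[name]
    # Anything else still goes through the (possibly cached) dynamic modules.
    return "dynamic", name
```

Because the packaged branch never touches the network, loading local csv, json, text or pandas files keeps working without an internet connection.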

I updated the test to still be able to fetch the dummy data files for those datasets from datasets/{text|csv|pandas|json}/dummy in the repo.

lhoestq · 2021-01-15T14:41:57Z

Alright, now all tests pass :)
(I don't thank you, Windows)

yjernite · 2021-01-18T18:59:33Z

LGTM! Since you're getting the local script's last modification date anyways do you think it might be a good idea to show it in the warning?

thomwolf

This is really cool!

thomwolf · 2021-01-18T22:10:01Z

tests/test_load.py

+from .utils import offline
+
+
+class LoadTest(TestCase):


lhoestq · 2021-01-19T15:21:44Z

> LGTM! Since you're getting the local script's last modification date anyways do you think it might be a good idea to show it in the warning?

Yep, good idea. I added the date to the warning, for example: (last modified on Mon Nov 30 11:01:56 2020)

* minor
* add prepare module test
* fix windows path scheme check
* cached_path raises requests error if no internet
* look for cached modules if there's no internet
* wip tests
* add warning message
* update tests
* style
* remove test modules if already exist
* style
* add init_dynamic_modules function for testing purposes
* fix importlib cache
* move csv, json, text and pandas to inside the package
* add packaged datasets handling in prepare_module
* update tests
* minor fix
* add missing __init__.py
* fix test
* style
* fix test
* fix tests
* show last modification date in the warning

yjernite · 2021-01-28T18:00:27Z

src/datasets/load.py

+    module_name_for_dynamic_modules = os.path.basename(dynamic_modules_path)
+    datasets_modules_path = os.path.join(dynamic_modules_path, "datasets")
+    datasets_modules_name = module_name_for_dynamic_modules + ".datasets"
+    metrics_modules_path = os.path.join(dynamic_modules_path, "metric")


@lhoestq small typo here which breaks metrics loading, submitting a fix now

lhoestq · 2021-01-28

It was "metrics" with an 's'! Good catch.

lhoestq added 13 commits January 7, 2021 19:03

minor

c2a7cab

add prepare module test

93c6134

fix windows path scheme check

6946abd

cached_path raises requests error if no internet

bbc1132

look for cached modules if there's no internet

23f766b

wip tests

afdecdd

add warning message

5ab0108

update tests

af473d8

style

fc85400

remove test modules if already exist

e394196

style

100cca4

add init_dynamic_modules function for testing purposes

1a7425e

fix importlib cache

2e3efee

lhoestq requested a review from thomwolf January 12, 2021 15:21

Merge branch 'master' into offline-loading

247ea0f

lhoestq mentioned this pull request Jan 14, 2021

Use passed --cache_dir for modules cache #1369

Open

lhoestq added 5 commits January 14, 2021 17:05

move csv, json, text and pandas to inside the package

aebb4d3

add packaged datasets handling in prepare_module

c56a765

update tests

76238f6

minor fix

7e69c14

add missing __init__.py

a567e8f

lhoestq added 3 commits January 14, 2021 17:33

fix test

78d2607

style

235380c

fix test

92fce40

scissors881 approved these changes Jan 14, 2021

fix tests

75215c6

lhoestq requested a review from yjernite January 18, 2021 14:24

yjernite approved these changes Jan 18, 2021

thomwolf approved these changes Jan 18, 2021

tests/test_load.py

from .utils import offline

class LoadTest(TestCase):

thomwolf Jan 18, 2021

Nice!

show last modification date in the warning

7080102

lhoestq merged commit 60fa3a1 into master Jan 19, 2021

lhoestq deleted the offline-loading branch January 19, 2021 16:42

This was referenced Jan 20, 2021

could not run models on a offline server successfully #1724

Closed

Discussion using datasets in offline mode #824

Closed

mozharovsky mentioned this pull request Jan 21, 2021

Support offline datasets formermagic/formerbox#27

Merged

lhoestq mentioned this pull request Jan 24, 2021

Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.2.1/datasets/csv/csv.py #1771

Closed

yjernite reviewed Jan 28, 2021

yjernite mentioned this pull request Jan 28, 2021

[BUG FIX] typo in the import path for metrics #1789

Merged

albertvillanova mentioned this pull request Feb 15, 2022

Downloaded datasets are not usable offline #761

Closed
