
Fix load_dataset for data_files with protocols other than HF #6862

Merged

Conversation


@matstrand matstrand commented May 3, 2024

Fixes #6598

I've added a new test case and a fix. Before the fix, the test case failed with the same error described in the linked issue.

MRE:

pip install "datasets[s3]"
python -c "from datasets import load_dataset; load_dataset('csv', data_files={'train': 's3://noaa-gsod-pds/2024/A5125600451.csv'})"
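To see why this fails, note that fsspec forwards `storage_options` verbatim as keyword arguments down to the filesystem's constructor chain, so the HF-specific `"hf"` entry eventually reaches botocore, which rejects unknown kwargs. A minimal sketch of the failure mode (`make_session` is a hypothetical stand-in, not the botocore API):

```python
def make_session(**kwargs):
    """Stand-in for botocore.session.Session: rejects unknown keyword
    arguments, just like the real constructor chain does."""
    allowed = {"profile", "region_name"}
    unexpected = sorted(set(kwargs) - allowed)
    if unexpected:
        raise TypeError(f"unexpected keyword argument(s): {unexpected}")
    return {"profile": kwargs.get("profile")}

# DownloadConfig used to inject an "hf" entry for every protocol, and
# fsspec forwarded it verbatim to the S3 filesystem constructor:
storage_options = {"hf": {"endpoint": "https://huggingface.co"}}
try:
    make_session(**storage_options)
    failed = False
except TypeError as exc:
    failed = True
    message = str(exc)
```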

@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch 4 times, most recently from bf3c8c2 to 88cc5a3 Compare May 3, 2024 02:06
@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch from c3c0c06 to ded2cac Compare May 3, 2024 02:54
Comment on lines 571 to 574
if scheme in ("s3", "s3a") and storage_options is not None and "hf" in storage_options:
# Issue 6598: **storage_options is passed to botocore.session.Session()
# and must not contain keys that become invalid kwargs.
del storage_options["hf"]
Member

it would be great to have a more general solution for any filesystem than hardcoding something for S3

maybe in _prepare_single_hop_path_and_storage_options() in file_utils.py ?
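One shape such a general solution could take (purely illustrative; `select_storage_options` is a hypothetical helper, not the datasets API): group options per protocol and pass along only the group matching the URL's scheme, so no filesystem ever sees another protocol's options:

```python
def select_storage_options(urlpath: str, storage_options: dict) -> dict:
    """Return only the options relevant to urlpath's protocol.

    Assumes options are grouped per protocol, e.g.
    {"hf": {...}, "s3": {...}}; paths without a scheme map to "file".
    """
    scheme = urlpath.split("://", 1)[0] if "://" in urlpath else "file"
    return dict(storage_options.get(scheme, {}))
```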


Thanks! Can you explain a bit more what a better general solution could be? Is this an option: move this logic for setting the hf token into this protocol check?

Contributor Author

Thanks both. It seems the root issue is setting `DownloadConfig.storage_options['hf']` in all situations. We may be able to remove that, as test_load.py passes without it. It was added in #6028 but seems possibly obsoleted by #6784.

I've pushed a change. @lhoestq can you please trigger the CI so we can see if this introduces any regressions?

Member

Sure, I triggered the CI.

@matstrand matstrand requested review from stoicio and lhoestq June 18, 2024 01:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stoicio stoicio left a comment

Looks good to me, thanks for fixing!

tests/test_load.py (outdated review thread, resolved)
@@ -93,7 +93,7 @@ def __post_init__(self, use_auth_token):
                 FutureWarning,
             )
             self.token = use_auth_token
-        if "hf" not in self.storage_options:
+        if self.token is not None and "hf" not in self.storage_options:
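The effect of this change can be sketched as follows (a simplified, assumed version of the class; the real `DownloadConfig` has more fields and its "hf" entry also carries an endpoint):

```python
class DownloadConfig:
    def __init__(self, token=None, storage_options=None):
        self.token = token
        self.storage_options = dict(storage_options or {})
        # After the change: only add the default "hf" entry when a token
        # was actually provided, instead of for every config.
        if self.token is not None and "hf" not in self.storage_options:
            self.storage_options["hf"] = {"token": self.token}
```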
Contributor Author

I was worried that removing this altogether might break functionality for someone. This still might break CI but let's see before spending more time on it.

Member

you might have to revert this, since we need the endpoint to be in the storage options even if token is None

Contributor Author

I'm trying to account for that here. This way _prepare_path_and_storage_options ensures that the endpoint is in the storage options without needing to populate it for all protocols in __post_init__.
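A sketch of that idea (illustrative only; the endpoint constant and function body are assumptions, not the actual `_prepare_path_and_storage_options` code): inject the HF endpoint and token only when the path actually uses the hf:// protocol, leaving other protocols' options untouched:

```python
HF_ENDPOINT = "https://huggingface.co"  # assumed default endpoint

def prepare_storage_options(urlpath: str, storage_options: dict, token=None) -> dict:
    # Only hf:// paths get the HF endpoint/token injected; other
    # protocols keep their options as-is.
    scheme = urlpath.split("://", 1)[0] if "://" in urlpath else "file"
    opts = dict(storage_options)
    if scheme == "hf":
        opts.setdefault("endpoint", HF_ENDPOINT)
        if token is not None:
            opts.setdefault("token", token)
    return opts
```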

Member

I think we can completely remove the default "hf" storage option from DownloadConfig, now that it is properly handled by the call to _prepare_path_and_storage_options made in cached_path.

Member

Do you agree, @lhoestq?

Member

yes sounds good :)

@albertvillanova albertvillanova changed the title Issue 6598: load_dataset broken for data_files on s3 Fix load_dataset for data_files on S3 Jul 23, 2024
Member

@albertvillanova albertvillanova left a comment

Thank you, @matstrand!! 🤗

I have merged the main branch to fix the CI, and added a new test for cached_path with different protocols, so we are sure that the storage_options passed (to fsspec_head and fsspec_get) contain only the options matching the protocol in the URL.

I would need this PR merged for a subsequent PR I am working on.

@albertvillanova albertvillanova changed the title Fix load_dataset for data_files on S3 Fix load_dataset for data_files with protocols other than HF Jul 23, 2024
@albertvillanova albertvillanova merged commit 5142a8c into huggingface:main Jul 23, 2024
14 checks passed
Show benchmarks

[Automated benchmark bot comment: new/old timing diffs for PyArrow==8.0.0 and PyArrow==latest across benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json.]

albertvillanova added a commit that referenced this pull request Aug 13, 2024
* Issue 6598: load_dataset broken for data_files on s3

* Issue 6598: do not set DownloadConfig.storage_options[hf] for all protocols

* Issue 6598: ruff format

* Issue 6598: mock creds for new test, keep __post_init__ when token spec'd

* Issue 6598: call _prepare_path_and_storage_options for remote URLs via cached_path

* Do not set default 'hf' in DownloadConfig.storage_options

* Test cached_path for different protocols

* Mark test with require_moto

* Refactor test using fixture

* Make import local

---------

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
albertvillanova added a commit that referenced this pull request Aug 13, 2024
albertvillanova added a commit that referenced this pull request Aug 14, 2024
Successfully merging this pull request may close these issues.

Unexpected keyword argument 'hf' when downloading CSV dataset from S3