
Fix load_dataset for data_files with protocols other than HF #6862

Merged

Conversation


@matstrand matstrand commented May 3, 2024

Fixes #6598

I've added a new test case and a fix. Before the fix, the test case failed with the same error described in the linked issue.

MRE:

pip install "datasets[s3]"
python -c "from datasets import load_dataset; load_dataset('csv', data_files={'train': 's3://noaa-gsod-pds/2024/A5125600451.csv'})"
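To see why this fails, note that fsspec forwards `storage_options` verbatim as keyword arguments down to the filesystem's constructor chain, so the HF-specific `"hf"` entry eventually reaches botocore, which rejects unknown kwargs. A minimal sketch of the failure mode (`make_session` is a hypothetical stand-in, not the botocore API):

```python
def make_session(**kwargs):
    """Stand-in for botocore.session.Session: rejects unknown keyword
    arguments, just like the real constructor chain does."""
    allowed = {"profile", "region_name"}
    unexpected = sorted(set(kwargs) - allowed)
    if unexpected:
        raise TypeError(f"unexpected keyword argument(s): {unexpected}")
    return {"profile": kwargs.get("profile")}

# DownloadConfig used to inject an "hf" entry for every protocol, and
# fsspec forwarded it verbatim to the S3 filesystem constructor:
storage_options = {"hf": {"endpoint": "https://huggingface.co"}}
try:
    make_session(**storage_options)
    failed = False
except TypeError as exc:
    failed = True
    message = str(exc)
```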

@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch 4 times, most recently from bf3c8c2 to 88cc5a3 Compare May 3, 2024 02:06
@matstrand matstrand force-pushed the issue-6598-load-dataset-broken-s3 branch from c3c0c06 to ded2cac Compare May 3, 2024 02:54
Comment on lines 571 to 574
if scheme in ("s3", "s3a") and storage_options is not None and "hf" in storage_options:
# Issue 6598: **storage_options is passed to botocore.session.Session()
# and must not contain keys that become invalid kwargs.
del storage_options["hf"]
Member

it would be great to have a more general solution for any filesystem than hardcoding something for S3

maybe in _prepare_single_hop_path_and_storage_options() in file_utils.py ?
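One shape such a general solution could take (purely illustrative; `select_storage_options` is a hypothetical helper, not the datasets API): group options per protocol and pass along only the group matching the URL's scheme, so no filesystem ever sees another protocol's options:

```python
def select_storage_options(urlpath: str, storage_options: dict) -> dict:
    """Return only the options relevant to urlpath's protocol.

    Assumes options are grouped per protocol, e.g.
    {"hf": {...}, "s3": {...}}; paths without a scheme map to "file".
    """
    scheme = urlpath.split("://", 1)[0] if "://" in urlpath else "file"
    return dict(storage_options.get(scheme, {}))
```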


Thanks! Can you explain a bit more what a better general solution could be? Is this an option: move this logic for setting the hf token into this protocol check?

Contributor Author

Thanks both. It seems the root issue is setting `DownloadConfig.storage_options['hf']` in all situations. We may be able to remove that, as test_load.py passes without it. It was added in #6028 but seems possibly obsoleted by #6784.

I've pushed a change. @lhoestq can you please trigger the CI so we can see if this introduces any regressions?

Member

Sure, I triggered the CI.

@matstrand matstrand requested review from stoicio and lhoestq June 18, 2024 01:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stoicio stoicio left a comment

Looks good to me, thanks for fixing!

tests/test_load.py (outdated review thread, resolved)
@@ -93,7 +93,7 @@ def __post_init__(self, use_auth_token):
                 FutureWarning,
             )
             self.token = use_auth_token
-        if "hf" not in self.storage_options:
+        if self.token is not None and "hf" not in self.storage_options:
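The effect of this change can be sketched as follows (a simplified, assumed version of the class; the real `DownloadConfig` has more fields and its "hf" entry also carries an endpoint):

```python
class DownloadConfig:
    def __init__(self, token=None, storage_options=None):
        self.token = token
        self.storage_options = dict(storage_options or {})
        # After the change: only add the default "hf" entry when a token
        # was actually provided, instead of for every config.
        if self.token is not None and "hf" not in self.storage_options:
            self.storage_options["hf"] = {"token": self.token}
```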
Contributor Author

I was worried that removing this altogether might break functionality for someone. This still might break CI but let's see before spending more time on it.

Member

you might have to revert this, since we need the endpoint to be in the storage options even if token is None

Contributor Author

I'm trying to account for that here. This way _prepare_path_and_storage_options ensures that the endpoint is in the storage options without needing to populate it for all protocols in __post_init__.
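A sketch of that idea (illustrative only; the endpoint constant and function body are assumptions, not the actual `_prepare_path_and_storage_options` code): inject the HF endpoint and token only when the path actually uses the hf:// protocol, leaving other protocols' options untouched:

```python
HF_ENDPOINT = "https://huggingface.co"  # assumed default endpoint

def prepare_storage_options(urlpath: str, storage_options: dict, token=None) -> dict:
    # Only hf:// paths get the HF endpoint/token injected; other
    # protocols keep their options as-is.
    scheme = urlpath.split("://", 1)[0] if "://" in urlpath else "file"
    opts = dict(storage_options)
    if scheme == "hf":
        opts.setdefault("endpoint", HF_ENDPOINT)
        if token is not None:
            opts.setdefault("token", token)
    return opts
```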

Member

I think we can completely remove the default "hf" storage option from DownloadConfig, now that it is properly handled by the call to _prepare_path_and_storage_options made in cached_path.

Member

Do you agree, @lhoestq?

Member

yes sounds good :)

@albertvillanova albertvillanova changed the title Issue 6598: load_dataset broken for data_files on s3 Fix load_dataset for data_files on S3 Jul 23, 2024
Member

@albertvillanova albertvillanova left a comment

Thank you, @matstrand!! 🤗

I have merged the main branch to fix the CI, and added a new test for cached_path with different protocols, so we are sure that the storage_options passed (to fsspec_head and fsspec_get) contain only the options matching the protocol in the URL.

I would need this PR merged for a subsequent PR I am working on.

@albertvillanova albertvillanova changed the title Fix load_dataset for data_files on S3 Fix load_dataset for data_files with protocols other than HF Jul 23, 2024
@albertvillanova albertvillanova merged commit 5142a8c into huggingface:main Jul 23, 2024
14 checks passed
Show benchmarks

[Automated benchmark bot comment: new/old timing diffs for PyArrow==8.0.0 and PyArrow==latest across benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json.]

albertvillanova added a commit that referenced this pull request Aug 13, 2024
* Issue 6598: load_dataset broken for data_files on s3

* Issue 6598: do not set DownloadConfig.storage_options[hf] for all protocols

* Issue 6598: ruff format

* Issue 6598: mock creds for new test, keep __post_init__ when token spec'd

* Issue 6598: call _prepare_path_and_storage_options for remote URLs via cached_path

* Do not set default 'hf' in DownloadConfig.storage_options

* Test cached_path for different protocols

* Mark test with require_moto

* Refactor test using fixture

* Make import local

---------

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
albertvillanova added a commit that referenced this pull request Aug 13, 2024
albertvillanova added a commit that referenced this pull request Aug 14, 2024
Successfully merging this pull request may close these issues.

Unexpected keyword argument 'hf' when downloading CSV dataset from S3