-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix load_dataset for data_files with protocols other than HF #6862
Fix load_dataset for data_files with protocols other than HF #6862
Conversation
bf3c8c2
to
88cc5a3
Compare
c3c0c06
to
ded2cac
Compare
src/datasets/utils/file_utils.py
Outdated
if scheme in ("s3", "s3a") and storage_options is not None and "hf" in storage_options: | ||
# Issue 6598: **storage_options is passed to botocore.session.Session() | ||
# and must not contain keys that become invalid kwargs. | ||
del storage_options["hf"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great to have a more general solution for any filesystem than hardcoding something for S3
maybe in _prepare_single_hop_path_and_storage_options()
in file_utils.py ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! Can you please explain a bit more what a better general solution could be? Is this an option ? Move this logic to set the hf token to this protocol check Link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both, seems the root issue is settingDownloadConfig.storage_options['hf']
in all situations. We may be able to remove that as test_load.py
passes without it. It was added at #6028 but seems possibly obsoleted by #6784.
I've pushed a change. @lhoestq can you please trigger the CI so we can see if this introduces any regressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I triggered the ci
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, Thanks for fixing !
@@ -93,7 +93,7 @@ def __post_init__(self, use_auth_token): | |||
FutureWarning, | |||
) | |||
self.token = use_auth_token | |||
if "hf" not in self.storage_options: | |||
if self.token is not None and "hf" not in self.storage_options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried that removing this altogether might break functionality for someone. This still might break CI but let's see before spending more time on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might have to revert this, since we need the endpoint to be in the storage options even if token is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to account for that here. This way _prepare_path_and_storage_options
ensures that the endpoint is in the storage options without needing to populate it for all protocols in __post_init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can completely remove the default "hf" storage option from DownloadConfig
, now that it is properly handled by the call to _prepare_path_and_storage_options
made in cached_path
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you agree, @lhoestq?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes sounds good :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @matstrand!! 🤗
I have merged the main branch to fix the CI, and added a new test for cached_path
with different protocols, so we are sure that the passed storage_options
(to fsspec_head
and fsspec_get
) is only the one that matches the protocol in the URL.
I would need this PR merged for a subsequent PR I am working on.
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
* Issue 6598: load_dataset broken for data_files on s3 * Issue 6598: do not set DownloadConfig.storage_options[hf] for all protocols * Issue 6598: ruff format * Issue 6598: mock creds for new test, keep __post_init__ when token spec'd * Issue 6598: call _prepare_path_and_storage_options for remote URLs via cached_path * Do not set default 'hf' in DownloadConfig.storage_options * Test cached_path for different protocols * Mark test with require_moto * Refactor test using fixture * Make import local --------- Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
* Issue 6598: load_dataset broken for data_files on s3 * Issue 6598: do not set DownloadConfig.storage_options[hf] for all protocols * Issue 6598: ruff format * Issue 6598: mock creds for new test, keep __post_init__ when token spec'd * Issue 6598: call _prepare_path_and_storage_options for remote URLs via cached_path * Do not set default 'hf' in DownloadConfig.storage_options * Test cached_path for different protocols * Mark test with require_moto * Refactor test using fixture * Make import local --------- Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
* Issue 6598: load_dataset broken for data_files on s3 * Issue 6598: do not set DownloadConfig.storage_options[hf] for all protocols * Issue 6598: ruff format * Issue 6598: mock creds for new test, keep __post_init__ when token spec'd * Issue 6598: call _prepare_path_and_storage_options for remote URLs via cached_path * Do not set default 'hf' in DownloadConfig.storage_options * Test cached_path for different protocols * Mark test with require_moto * Refactor test using fixture * Make import local --------- Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Fixes /issues/6598
I've added a new test case and a solution. Before applying the solution the test case was failing with the same error described in the linked issue.
MRE: