Add url prefix convention for many compression formats #2822
this is really elegant @lhoestq ! from the user perspective, do i need to know about the url prefix convention when using load_dataset? if yes, perhaps we can add some details in the documentation to explain how this works - what do you think?
extension: str = None  # extension of the filename to strip. ex: ".gz" to get file.txt from file.txt.gz

def __init__(
    self, fo: str = "", target_protocol: Optional[str] = None, target_options: Optional[dict] = None, **kwargs
would it make sense to use a more descriptive name for fo? e.g. filepath, since we explain in the docstring that we use fsspec.open?
fo is the variable name in fsspec to wrap any file to open. I went with the same convention. It is used when unchaining a chained URL. See the note at the end of https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining:

"For developers: this “chaining” method works by formatting the arguments passed to open_* into target_protocol (a simple string) and target_options (a dict) and also optionally fo (target path, if a specific file is required). In order for an implementation to chain successfully like this, it must look for exactly those named arguments."
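To make the quoted convention concrete, here is a minimal stdlib-only sketch (the helper name is mine, not fsspec's actual parser) of how a chained URL decomposes into the protocol, the inner target path (fo), and the target URL after the "::":

```python
# Illustrative sketch (not fsspec's implementation): split a chained URL
# like "gz://file.txt::https://foo.bar/file.txt.gz" into the compression
# protocol, the inner path passed as `fo`, and the target URL.
def split_chained_url(url: str):
    outer, _, target = url.partition("::")       # separate the two chained links
    protocol, _, fo = outer.partition("://")     # split protocol from inner path
    return protocol, fo, target

print(split_chained_url("gz://file.txt::https://foo.bar/file.txt.gz"))
# → ('gz', 'file.txt', 'https://foo.bar/file.txt.gz')
```

Note that for `zip://::https://foo.bar/archive.zip`, fo comes out empty, matching the "if a specific file is required" wording above.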
ok sounds good! since we're sticking close to fsspec, let's keep it as fo

Thanks for the feedback :) I will also complete the documentation to explain this convention
the new docs are great and very instructive - thanks for adding them! i left a few minor nits
.. code-block::

    >>> from datasets.utils.streaming_download_manager import xopen
    >>> chained_url = "zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip"
in a separate pr / location in the docs, it would be great to see this syntax explained a little bit.
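As a rough illustration of what that syntax means (a stdlib analogy, not how fsspec or datasets actually fetch the archive): the part after the "::" is the archive to fetch, and the path before the "::" is the member to open inside it. The member content below is made up for illustration:

```python
import io
import zipfile

# Build a small in-memory archive standing in for aqa_v1.0.zip
# (the member content here is invented for illustration).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("combined/train.json", '{"split": "train"}')

# "zip://combined/train.json::<archive url>" then means: fetch the archive
# named after the "::", and open the member path named before it.
with zipfile.ZipFile(buf) as zf, zf.open("combined/train.json") as f:
    data = f.read().decode()

print(data)
```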
Merging this one now, next step is to resolve the conflicts in #2662 and update the docs for URL chaining :)

There is also the glob feature of zip files that I need to add, to be able to do this for example: load_dataset("json", data_files="zip://*::https://foo.bar/archive.zip")
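The glob feature mentioned here can be sketched with stdlib fnmatch (a hedged sketch of only the matching step, with invented member names; datasets' eventual implementation may differ):

```python
import fnmatch
import io
import zipfile

# In-memory stand-in for https://foo.bar/archive.zip (member names invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ["train.json", "test.json", "README.txt"]:
        zf.writestr(name, "{}")

# A glob such as "zip://*.json::<archive url>" would be resolved against
# the member names of the fetched archive before opening them.
with zipfile.ZipFile(buf) as zf:
    matches = [n for n in zf.namelist() if fnmatch.fnmatch(n, "*.json")]

print(matches)
```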
Intro

When doing dataset streaming, the uncompression of compressed files is done on the fly using fsspec.

In particular, the download manager method download_and_extract doesn't return a path to the locally downloaded and extracted file, but instead a chained URL, so that the uncompression can be done when the file is opened. A few examples of chained URLs:

gz://file.txt::https://foo.bar/file.txt.gz
bz2://file.txt::https://foo.bar/file.txt.bz2
zip://::https://foo.bar/archive.zip
tar://::https://foo.bar/archive.tar.gz (in fsspec, the TAR uncompression includes gz, bz2 etc. uncompression)

This syntax is highly inspired by the fsspec URL chaining syntax from https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining

This URL prefixing lets open know what kind of uncompression to do in a dataset script.

What it changes
This changes the previous behavior from #2786, in which open was trying to infer the compression automatically. Inferring the compression made it impossible to know whether the user wanted open to return the compressed data (as the builtin open does by default) or the uncompressed data. By adding uncompression prefixes to the URL, open knows directly whether it has to uncompress or not, and also which protocol to use.

Additional notes
This PR should close #2813.

It should also close PR #2811, since the oscar dataset script won't try to uncompress twice anymore.

Note that I had to temporarily remove the support for passing tar and zip files to data_files for streaming, since it is ambiguous whether a zip file passed as data_files should be uncompressed or not. IMO we can make it work again by changing the syntax to make the glob explicit (e.g. data_files="zip://*::https://foo.bar/archive.zip"). This is the exact same convention as fsspec and it removes all ambiguities.
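Putting the prefix convention together, here is a hypothetical sketch, assuming a simple extension-to-protocol mapping (the helper name and mapping are illustrative, not datasets' actual implementation), of how a download URL could map to a prefixed chained URL:

```python
# Hypothetical sketch of the convention described above: map a download
# URL to a chained URL based on its compression extension.
COMPRESSION_PROTOCOLS = {".gz": "gz", ".bz2": "bz2"}  # illustrative subset

def to_chained_url(url: str) -> str:
    for ext, protocol in COMPRESSION_PROTOCOLS.items():
        if url.endswith(ext):
            # Inner path: the file name with the compression extension
            # stripped, e.g. "file.txt" from "file.txt.gz".
            inner = url.rsplit("/", 1)[-1][: -len(ext)]
            return f"{protocol}://{inner}::{url}"
    return url  # not compressed: leave the URL as-is

print(to_chained_url("https://foo.bar/file.txt.gz"))
# → gz://file.txt::https://foo.bar/file.txt.gz
```

The output matches the first chained-URL example in the Intro; open can then see from the prefix which uncompression protocol to use.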
cc @albertvillanova @lewtun