
Add url prefix convention for many compression formats #2822

Merged
10 commits merged on Aug 23, 2021

Conversation

@lhoestq lhoestq commented Aug 20, 2021

Intro

When streaming a dataset, the uncompression of compressed files is done on the fly using fsspec.

In particular, the download manager method download_and_extract doesn't return a path to a locally downloaded and extracted file, but instead a chained URL, so that the uncompression can be done when the file is opened. A few examples of chained URLs:

  • gz://file.txt::https://foo.bar/file.txt.gz
  • bz2://file.txt::https://foo.bar/file.txt.bz2
  • zip://::https://foo.bar/archive.zip
  • tar://::https://foo.bar/archive.tar.gz (the TAR uncompression includes gz, bz2 etc. uncompression in fsspec)

This syntax is highly inspired by the fsspec URL chaining syntax from https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining
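For illustration, a chained URL splits on `::` and reads right to left: the rightmost segment is fetched first, and each segment to its left adds a layer such as uncompression. A toy parser (not fsspec's own logic) sketching the convention:

```python
def split_chained_url(url: str):
    """Split an fsspec-style chained URL into (protocol, path) layers.

    The rightmost segment is fetched first; each segment to the left
    is applied on top (e.g. uncompression). Toy illustration only.
    """
    layers = []
    for segment in url.split("::"):
        protocol, _, path = segment.partition("://")
        layers.append((protocol, path))
    return layers

# "gz://file.txt::https://foo.bar/file.txt.gz" decomposes into:
#   outer layer: ("gz", "file.txt")                 -> gzip uncompression
#   inner layer: ("https", "foo.bar/file.txt.gz")   -> the download
```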

This URL prefixing lets open know what kind of uncompression to apply in a dataset script when doing

def _generate_examples(self, urlpath):
    with open(urlpath) as f:
        ...

What it changes

This changes the previous behavior from #2786, in which open tried to infer the compression automatically. Inferring the compression made it impossible to know whether the user wanted open to return the compressed data (the default behavior of the builtin open) or the uncompressed data. With uncompression prefixes in the URL, open knows directly whether it has to uncompress or not, and also which protocol to use.
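As a minimal sketch of this dispatch (the name `prefixed_open` is hypothetical; the real implementation goes through fsspec filesystems), a prefixed path is uncompressed, while an unprefixed one falls back to the builtin open:

```python
import bz2
import gzip

# Hypothetical sketch: dispatch on the compression prefix of a
# chained URL of the form "<prefix>://<name>::<target>".
_DECOMPRESSORS = {"gz": gzip.open, "bz2": bz2.open}

def prefixed_open(urlpath: str, mode: str = "rt"):
    protocol, sep, rest = urlpath.partition("://")
    if sep and protocol in _DECOMPRESSORS:
        # e.g. "gz://file.txt::/local/file.txt.gz" -> uncompress /local/file.txt.gz
        _member, _, target = rest.partition("::")
        return _DECOMPRESSORS[protocol](target, mode)
    # No known prefix: behave like the builtin open (compressed bytes stay compressed)
    return open(urlpath, mode)
```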

Additional notes

This PR should close #2813

It should also close PR #2811, since the oscar dataset script won't try to uncompress twice anymore.

Note that I had to temporarily remove support for passing TAR and ZIP files to data_files for streaming to make this work, since it is otherwise ambiguous whether a zip file passed as data_files should be uncompressed or not. IMO we can make it work again by changing the syntax to make the glob explicit:

load_dataset("json", data_files="zip://*.jsonl::https://foo.bar/archive.zip")

This is exactly the same convention as fsspec, and it removes all ambiguity.
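A rough sketch of how such an explicit member glob could be resolved, using only the stdlib here (the real resolution would go through fsspec's ZipFileSystem):

```python
import fnmatch
import io
import zipfile

def glob_zip_members(zip_bytes: bytes, pattern: str) -> list:
    """Toy resolution of the member pattern in "zip://<pattern>::<url>":
    list the archive entries whose names match the glob."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if fnmatch.fnmatch(name, pattern)]
```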

cc @albertvillanova @lewtun


@lewtun lewtun left a comment


this is really elegant @lhoestq !

from the user perspective, do i need to know about the url prefix convention when using load_dataset? if yes, perhaps we can add some details in the documentation to explain how this works - what do you think?

extension: str = None  # extension of the filename to strip, e.g. ".gz" to get file.txt from file.txt.gz

def __init__(
    self, fo: str = "", target_protocol: Optional[str] = None, target_options: Optional[dict] = None, **kwargs
would it make sense to use a more descriptive name for fo? e.g. filepath, since we explain in the docstring that we use fsspec.open?


@lhoestq lhoestq Aug 23, 2021


fo is the variable name in fsspec to wrap any file to open. I went with the same convention.
It is used when unchaining a chained URL. See the note at the end of https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining

For developers: this “chaining” methods works by formatting
the arguments passed to open_* into target_protocol (a simple
string) and target_options (a dict) and also optionally fo
(target path, if a specific file is required). In order for an
implementation to chain successfully like this, it must look
for exactly those named arguments.
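Following that note, one plausible (purely illustrative) decomposition of the outermost layer of a chained URL into those named arguments:

```python
def unchain_outer(url: str) -> dict:
    # Toy decomposition of the outermost layer of a chained URL into
    # the named arguments mentioned in the fsspec note above.
    # Not fsspec's actual parser; for illustration only.
    outer, _, inner = url.partition("::")
    protocol, _, fo = outer.partition("://")
    return {
        "protocol": protocol,                          # e.g. "gz"
        "fo": fo,                                      # target path, e.g. "file.txt" (may be empty)
        "target_protocol": inner.partition("://")[0],  # e.g. "https"
        "target_options": {},                          # options for the inner filesystem
    }
```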


ok sounds good! since we're sticking close to fsspec let's keep it as fo


lhoestq commented Aug 23, 2021

Thanks for the feedback :) I will also complete the documentation to explain this convention


lhoestq commented Aug 23, 2021

I just added some documentation about how streaming works with chained URLs.

I will also add some docs about how to use chained URLs directly in load_dataset in #2662, since #2662 does change the documentation already and to avoid having to resolve conflicts.


@lewtun lewtun left a comment


the new docs are great and very instructive - thanks for adding them! i left a few minor nits

.. code-block::

    >>> from datasets.utils.streaming_download_manager import xopen
    >>> chained_url = "zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip"

in a separate pr / location in the docs, it would be great to see this syntax explained a little bit.



lhoestq commented Aug 23, 2021

Merging this one now, next step is to resolve the conflicts in #2662 and update the docs for URL chaining :)

There is also the glob feature of zip files that I need to add, to be able to do this for example:

load_dataset("json", data_files="zip://*::https://foo.bar/archive.zip")

@lhoestq lhoestq merged commit 9adc7db into master Aug 23, 2021
@lhoestq lhoestq deleted the add-url-prefix-convention-for-many-compression-formats branch August 23, 2021 15:59
@lhoestq lhoestq mentioned this pull request Aug 26, 2021
Successfully merging this pull request may close these issues.

Remove compression from xopen