
Add url prefix convention for many compression formats #2822

Merged
10 commits merged on Aug 23, 2021

Conversation

@lhoestq lhoestq commented Aug 20, 2021

Intro

When streaming a dataset, the uncompression of compressed files is done on the fly using fsspec.

In particular, the download manager method download_and_extract doesn't return a path to a locally downloaded and extracted file, but instead a chained URL, so that the uncompression can be done when the file is opened. A few examples of chained URLs:

  • gz://file.txt::https://foo.bar/file.txt.gz
  • bz2://file.txt::https://foo.bar/file.txt.bz2
  • zip://::https://foo.bar/archive.zip
  • tar://::https://foo.bar/archive.tar.gz (the TAR uncompression includes gz, bz2 etc. uncompression in fsspec)

This syntax is highly inspired by the fsspec URL chaining syntax from https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining
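For illustration, a chained URL splits on `::` and reads right to left: the rightmost segment is fetched first, and each segment to its left adds a layer such as uncompression. A toy parser (not fsspec's own logic) sketching the convention:

```python
def split_chained_url(url: str):
    """Split an fsspec-style chained URL into (protocol, path) layers.

    The rightmost segment is fetched first; each segment to the left
    is applied on top (e.g. uncompression). Toy illustration only.
    """
    layers = []
    for segment in url.split("::"):
        protocol, _, path = segment.partition("://")
        layers.append((protocol, path))
    return layers

# "gz://file.txt::https://foo.bar/file.txt.gz" decomposes into:
#   outer layer: ("gz", "file.txt")                 -> gzip uncompression
#   inner layer: ("https", "foo.bar/file.txt.gz")   -> the download
```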

This URL prefixing lets open know what kind of uncompression to apply in a dataset script when doing

def _generate_examples(self, urlpath):
    with open(urlpath) as f:
        ...

What it changes

This changes the previous behavior from #2786, in which open tried to infer the compression automatically. Inferring the compression made it impossible to know whether the user wanted open to return the compressed data (the default behavior of the builtin open) or the uncompressed data. With uncompression prefixes in the URL, open knows directly whether it has to uncompress or not, and also which protocol to use.
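As a minimal sketch of this dispatch (the name `prefixed_open` is hypothetical; the real implementation goes through fsspec filesystems), a prefixed path is uncompressed, while an unprefixed one falls back to the builtin open:

```python
import bz2
import gzip

# Hypothetical sketch: dispatch on the compression prefix of a
# chained URL of the form "<prefix>://<name>::<target>".
_DECOMPRESSORS = {"gz": gzip.open, "bz2": bz2.open}

def prefixed_open(urlpath: str, mode: str = "rt"):
    protocol, sep, rest = urlpath.partition("://")
    if sep and protocol in _DECOMPRESSORS:
        # e.g. "gz://file.txt::/local/file.txt.gz" -> uncompress /local/file.txt.gz
        _member, _, target = rest.partition("::")
        return _DECOMPRESSORS[protocol](target, mode)
    # No known prefix: behave like the builtin open (compressed bytes stay compressed)
    return open(urlpath, mode)
```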

Additional notes

This PR should close #2813

It should also close PR #2811, since the oscar dataset script won't try to uncompress twice anymore.

Note that I had to temporarily remove support for passing TAR and ZIP files to data_files for streaming to make this work, since it is otherwise ambiguous whether a zip file passed as data_files should be uncompressed or not. IMO we can make it work again by changing the syntax to make the glob explicit:

load_dataset("json", data_files="zip://*.jsonl::https://foo.bar/archive.zip")

This is exactly the same convention as fsspec, and it removes all ambiguity.
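A rough sketch of how such an explicit member glob could be resolved, using only the stdlib here (the real resolution would go through fsspec's ZipFileSystem):

```python
import fnmatch
import io
import zipfile

def glob_zip_members(zip_bytes: bytes, pattern: str) -> list:
    """Toy resolution of the member pattern in "zip://<pattern>::<url>":
    list the archive entries whose names match the glob."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if fnmatch.fnmatch(name, pattern)]
```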

cc @albertvillanova @lewtun


@lewtun lewtun left a comment


this is really elegant @lhoestq !

from the user perspective, do i need to know about the url prefix convention when using load_dataset? if yes, perhaps we can add some details in the documentation to explain how this works - what do you think?

extension: str = None  # extension of the filename to strip, e.g. ".gz" to get file.txt from file.txt.gz

def __init__(
    self, fo: str = "", target_protocol: Optional[str] = None, target_options: Optional[dict] = None, **kwargs
would it make sense to use a more descriptive name for fo? e.g. filepath, since we explain in the docstring that we use fsspec.open?


@lhoestq lhoestq Aug 23, 2021


fo is the variable name in fsspec to wrap any file to open. I went with the same convention.
It is used when unchaining a chained URL. See the note at the end of https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining

For developers: this “chaining” methods works by formatting
the arguments passed to open_* into target_protocol (a simple
string) and target_options (a dict) and also optionally fo
(target path, if a specific file is required). In order for an
implementation to chain successfully like this, it must look
for exactly those named arguments.
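Following that note, one plausible (purely illustrative) decomposition of the outermost layer of a chained URL into those named arguments:

```python
def unchain_outer(url: str) -> dict:
    # Toy decomposition of the outermost layer of a chained URL into
    # the named arguments mentioned in the fsspec note above.
    # Not fsspec's actual parser; for illustration only.
    outer, _, inner = url.partition("::")
    protocol, _, fo = outer.partition("://")
    return {
        "protocol": protocol,                          # e.g. "gz"
        "fo": fo,                                      # target path, e.g. "file.txt" (may be empty)
        "target_protocol": inner.partition("://")[0],  # e.g. "https"
        "target_options": {},                          # options for the inner filesystem
    }
```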


ok sounds good! since we're sticking close to fsspec let's keep it as fo


lhoestq commented Aug 23, 2021

Thanks for the feedback :) I will also complete the documentation to explain this convention


lhoestq commented Aug 23, 2021

I just added some documentation about how streaming works with chained URLs.

I will also add some docs about how to use chained URLs directly in load_dataset in #2662, since #2662 does change the documentation already and to avoid having to resolve conflicts.


@lewtun lewtun left a comment


the new docs are great and very instructive - thanks for adding them! i left a few minor nits

.. code-block::

    >>> from datasets.utils.streaming_download_manager import xopen
    >>> chained_url = "zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip"

in a separate pr / location in the docs, it would be great to see this syntax explained a little bit.



lhoestq commented Aug 23, 2021

Merging this one now, next step is to resolve the conflicts in #2662 and update the docs for URL chaining :)

There is also the glob feature of zip files that I need to add, to be able to do this for example:

load_dataset("json", data_files="zip://*::https://foo.bar/archive.zip")

@lhoestq lhoestq merged commit 9adc7db into master Aug 23, 2021
@lhoestq lhoestq deleted the add-url-prefix-convention-for-many-compression-formats branch August 23, 2021 15:59
@lhoestq lhoestq mentioned this pull request Aug 26, 2021
Successfully merging this pull request may close these issues.

Remove compression from xopen