Add url prefix convention for many compression formats (#2822)
* remove compression="infer" in xopen

* add fs protocols for bz2, lz4, xz and zstd

* test streaming gz, lz4, bz2, xz and zst

* fix test

* fix tar streaming

* temporarily remove zip and tar data_files streaming

* lewis' comments

* docs on how streaming works with chained URLs

* severo's comment

* lewis' comments
lhoestq authored Aug 23, 2021
1 parent 72ba8c3 commit 9adc7db
Showing 14 changed files with 457 additions and 167 deletions.
105 changes: 105 additions & 0 deletions docs/source/dataset_streaming.rst
@@ -164,3 +164,108 @@ It is possible to get a ``torch.utils.data.IterableDataset`` from a :class:`data
{'input_ids': tensor([[101, 11047, 10497, 7869, 2352...]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0...]]), 'attention_mask': tensor([[1, 1, 1, 1, 1...]])}
For now, only the PyTorch format is supported but support for TensorFlow and others will be added soon.


How does dataset streaming work?
--------------------------------------------------

The StreamingDownloadManager
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The standard (i.e. non-streaming) way of loading a dataset has two steps:

1. download and extract the raw data files of the dataset by using the :class:`datasets.DownloadManager`
2. process the data files to generate the Arrow file used to load the :class:`datasets.Dataset` object.

For example, in non-streaming mode a file is simply downloaded like this:

.. code-block::

   >>> from datasets import DownloadManager
   >>> url = "https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt"
   >>> filepath = DownloadManager().download(url)  # the file is downloaded here
   >>> print(filepath)
   '/Users/user/.cache/huggingface/datasets/downloads/16b702620cad8d485bafea59b1d2ed69e796196e6f2c73f005dee935a413aa19.ab631f60c6cb31a079ecf1ad910005a7c009ef0f1e4905b69d489fb2bd162683'
   >>> with open(filepath) as f:
   ...     print(f.read())

When you load a dataset in streaming mode, the download manager that is used instead is the :class:`datasets.StreamingDownloadManager`.
Instead of downloading and extracting all the data when you load the dataset, the work is done lazily: the file only starts to be downloaded and extracted when ``open`` is called.
This is made possible by extending ``open`` to support opening remote files via HTTP.
In each dataset script, ``open`` is replaced by our function ``xopen``, which extends ``open`` to stream data from remote files.

Here is a sample code that shows what is done under the hood:

.. code-block::

   >>> from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen
   >>> url = "https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt"
   >>> urlpath = StreamingDownloadManager().download(url)
   >>> print(urlpath)
   'https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt'
   >>> with xopen(urlpath) as f:
   ...     print(f.read())  # the file is actually downloaded here

As you can see, since it is possible to open remote files directly from a URL, the streaming download manager simply returns the URL instead of a path to a locally downloaded file.

The file is then downloaded in a streaming fashion: it is fetched progressively as you iterate over the data file.
This is made possible by ``fsspec``, a library that allows opening and iterating over remote files.
You can find more information about ``fsspec`` in `its documentation <https://filesystem-spec.readthedocs.io/>`_.
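
To give a rough idea of what happens under the hood, here is a minimal sketch that uses ``fsspec`` directly (independently of the ``datasets`` helpers, reusing the example URL above) to iterate over a remote text file; the bytes are fetched progressively as you iterate:

.. code-block::

   >>> import fsspec
   >>> url = "https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt"
   >>> with fsspec.open(url, "r") as f:  # the file content is not downloaded up front
   ...     for line in f:  # data is fetched progressively as you iterate
   ...         print(line)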

Compressed files and archives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may have noticed that, for a text file, the streaming download manager returns the exact same URL that was given as input.
However, if you use ``download_and_extract`` on a compressed file instead, the output URL will be a chained URL.
Chained URLs are used by ``fsspec`` to navigate inside remote compressed archives.

Here are some examples of chained URLs:

.. code-block::

   >>> from datasets.utils.streaming_download_manager import xopen
   >>> chained_url = "zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip"
   >>> with xopen(chained_url) as f:
   ...     print(f.read()[:100])
   '{"data": [{"title": "Brain", "paragraphs": [{"context": "Another approach to brain function is to ex'
   >>> chained_url2 = "gzip://mkqa.jsonl::/~https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz"
   >>> with xopen(chained_url2) as f:
   ...     print(f.readline()[:100])
   '{"query": "how long did it take the twin towers to be built", "answers": {"en": [{"type": "number_wi'

We also extended some functions from ``os.path`` to work with chained URLs.
For example, ``os.path.join`` is replaced by our function ``xjoin``, which extends ``os.path.join`` to work with chained URLs:

.. code-block::

   >>> from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen, xjoin
   >>> url = "https://adversarialqa.github.io/data/aqa_v1.0.zip"
   >>> archive_path = StreamingDownloadManager().download_and_extract(url)
   >>> print(archive_path)
   'zip://::https://adversarialqa.github.io/data/aqa_v1.0.zip'
   >>> filepath = xjoin(archive_path, "combined", "train.json")
   >>> print(filepath)
   'zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip'
   >>> with xopen(filepath) as f:
   ...     print(f.read()[:100])
   '{"data": [{"title": "Brain", "paragraphs": [{"context": "Another approach to brain function is to ex'

You can also take a look at the ``fsspec`` documentation about URL chaining `here <https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining>`_.

.. note::

    Streaming data from TAR archives is currently highly inefficient and requires a lot of bandwidth. We are working on optimizing this to offer you the best performance, stay tuned!

Dataset script compatibility
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that you are aware of how dataset streaming works, you can make sure your dataset scripts work in streaming mode (a minimal sketch follows the lists below):

1. make sure you use ``open`` to open the data files: it is extended to work with remote files
2. if you have to deal with archives like ZIP files, make sure you use ``os.path.join`` to navigate in the archive

Currently a few Python functions or classes are not supported for dataset streaming:

- ``pathlib.Path`` and all its methods are not supported; please use ``os.path.join`` and string objects instead
- ``os.walk``, ``os.listdir`` and ``glob.glob`` are not supported yet
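
For illustration, here is a minimal sketch of a ``_generate_examples`` method that follows these rules; it is not taken from an actual dataset script, and the archive layout and field names are hypothetical:

.. code-block::

   import json
   import os

   def _generate_examples(self, archive_path):
       # sketch of a dataset builder method; in streaming mode archive_path can be
       # a chained URL such as "zip://::https://host/archive.zip"
       filepath = os.path.join(archive_path, "train.json")  # becomes xjoin when streaming
       with open(filepath) as f:  # becomes xopen when streaming
           data = json.load(f)
       for idx, example in enumerate(data["data"]):
           yield idx, {"text": example["text"]}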
17 changes: 3 additions & 14 deletions src/datasets/config.py
@@ -120,21 +120,10 @@
logger.info("Disabling Apache Beam because USE_BEAM is set to False")


USE_RAR = os.environ.get("USE_RAR", "AUTO").upper()
RARFILE_VERSION = "N/A"
RARFILE_AVAILABLE = False
if USE_RAR in ("1", "ON", "YES", "AUTO"):
    try:
        RARFILE_VERSION = version.parse(importlib_metadata.version("rarfile"))
        RARFILE_AVAILABLE = True
        logger.info("rarfile available.")
    except importlib_metadata.PackageNotFoundError:
        pass
else:
    logger.info("Disabling rarfile because USE_RAR is set to False")


# Optional compression tools
RARFILE_AVAILABLE = importlib.util.find_spec("rarfile") is not None
ZSTANDARD_AVAILABLE = importlib.util.find_spec("zstandard") is not None
LZ4_AVAILABLE = importlib.util.find_spec("lz4") is not None


# Cache location
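
These availability flags are plain ``importlib.util.find_spec`` checks. As a hedged sketch (not necessarily how the library itself consumes them), such a flag would typically be used to fail early with a helpful message before an optional compression backend is needed:

from datasets import config

def require_zstandard():
    # hypothetical helper: raise a clear error if the optional backend is missing
    if not config.ZSTANDARD_AVAILABLE:
        raise ImportError("Please install zstandard to read .zst files: pip install zstandard")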
11 changes: 10 additions & 1 deletion src/datasets/filesystems/__init__.py
@@ -1,4 +1,5 @@
import importlib
from typing import List

import fsspec

@@ -10,9 +11,17 @@
if _has_s3fs:
    from .s3filesystem import S3FileSystem  # noqa: F401

COMPRESSION_FILESYSTEMS: List[compression.BaseCompressedFileFileSystem] = [
    compression.Bz2FileSystem,
    compression.GzipFileSystem,
    compression.Lz4FileSystem,
    compression.XzFileSystem,
    compression.ZstdFileSystem,
]

# Register custom filesystems
fsspec.register_implementation(compression.gzip.GZipFileSystem.protocol, compression.gzip.GZipFileSystem)
for fs_class in COMPRESSION_FILESYSTEMS:
    fsspec.register_implementation(fs_class.protocol, fs_class)


def extract_path_from_uri(dataset_path: str) -> str:
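
Once registered, these protocols can be used in ``fsspec`` chained URLs. A small hedged sketch, where the remote file and its contents are hypothetical:

import fsspec

# "data.jsonl.bz2" is exposed as a single file named "data.jsonl" inside the bz2:// filesystem
with fsspec.open("bz2://data.jsonl::https://example.com/data.jsonl.bz2", "r") as f:
    first_line = f.readline()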
168 changes: 168 additions & 0 deletions src/datasets/filesystems/compression.py
@@ -0,0 +1,168 @@
import os
from typing import Optional

import fsspec
from fsspec.archive import AbstractArchiveFileSystem
from fsspec.utils import DEFAULT_BLOCK_SIZE


class BaseCompressedFileFileSystem(AbstractArchiveFileSystem):
    """Read contents of compressed file as a filesystem with one file inside."""

    root_marker = ""
    protocol: str = (
        None  # protocol passed in prefix to the url. ex: "gzip", for gzip://file.txt::http://foo.bar/file.txt.gz
    )
    compression: str = None  # compression type in fsspec. ex: "gzip"
    extension: str = None  # extension of the filename to strip. ex: ".gz" to get file.txt from file.txt.gz

    def __init__(
        self, fo: str = "", target_protocol: Optional[str] = None, target_options: Optional[dict] = None, **kwargs
    ):
        """
        The compressed file system can be instantiated from any compressed file.
        It reads the contents of compressed file as a filesystem with one file inside, as if it was an archive.
        The single file inside the filesystem is named after the compressed file,
        without the compression extension at the end of the filename.
        Args:
            fo (:obj:``str``): Path to compressed file. Will fetch file using ``fsspec.open()``
            mode (:obj:``str``): Currently, only 'rb' accepted
            target_protocol(:obj:``str``, optional): To override the FS protocol inferred from a URL.
            target_options (:obj:``dict``, optional): Kwargs passed when instantiating the target FS.
        """
        super().__init__(self, **kwargs)
        # always open as "rb" since fsspec can then use the TextIOWrapper to make it work for "r" mode
        self.file = fsspec.open(
            fo, mode="rb", protocol=target_protocol, compression=self.compression, **(target_options or {})
        )
        self.info = self.file.fs.info(self.file.path)
        self.compressed_name = os.path.basename(self.file.path.split("::")[0])
        self.uncompressed_name = self.compressed_name[: self.compressed_name.rindex(".")]
        self.dir_cache = None

    @classmethod
    def _strip_protocol(cls, path):
        # compressed file paths are always relative to the archive root
        return super()._strip_protocol(path).lstrip("/")

    def _get_dirs(self):
        if self.dir_cache is None:
            f = {**self.info, "name": self.uncompressed_name}
            self.dir_cache = {f["name"]: f}

    def cat(self, path: str):
        return self.file.open().read()

    def _open(
        self,
        path: str,
        mode: str = "rb",
        block_size=None,
        autocommit=True,
        cache_options=None,
        **kwargs,
    ):
        path = self._strip_protocol(path)
        if mode != "rb":
            raise ValueError(f"Tried to read with mode {mode} on file {self.file.path} opened with mode 'rb'")
        if path != self.uncompressed_name:
            raise FileNotFoundError(f"Expected file {self.uncompressed_name} but got {path}")
        return self.file.open()


class Bz2FileSystem(BaseCompressedFileFileSystem):
    """Read contents of BZ2 file as a filesystem with one file inside."""

    protocol = "bz2"
    compression = "bz2"
    extension = ".bz2"


class GzipFileSystem(BaseCompressedFileFileSystem):
    """Read contents of GZIP file as a filesystem with one file inside."""

    protocol = "gzip"
    compression = "gzip"
    extension = ".gz"


class Lz4FileSystem(BaseCompressedFileFileSystem):
    """Read contents of LZ4 file as a filesystem with one file inside."""

    protocol = "lz4"
    compression = "lz4"
    extension = ".lz4"


class XzFileSystem(BaseCompressedFileFileSystem):
    """Read contents of .xz (LZMA) file as a filesystem with one file inside."""

    protocol = "xz"
    compression = "xz"
    extension = ".xz"


class ZstdFileSystem(BaseCompressedFileFileSystem):
    """
    Read contents of zstd file as a filesystem with one file inside.
    Note that reading in binary mode with fsspec isn't supported yet:
    /~https://github.com/indygreg/python-zstandard/issues/136
    """

    protocol = "zstd"
    compression = "zstd"
    extension = ".zst"

    def __init__(
        self,
        fo: str,
        mode: str = "rb",
        target_protocol: Optional[str] = None,
        target_options: Optional[dict] = None,
        block_size: int = DEFAULT_BLOCK_SIZE,
        **kwargs,
    ):
        super().__init__(
            fo=fo,
            mode=mode,
            target_protocol=target_protocol,
            target_options=target_options,
            block_size=block_size,
            **kwargs,
        )
        # We need to wrap the zstd decompressor to avoid this error in fsspec==2021.7.0 and zstandard==0.15.2:
        #
        # File "/Users/user/.virtualenvs/hf-datasets/lib/python3.7/site-packages/fsspec/core.py", line 145, in open
        #     out.close = close
        # AttributeError: 'zstd.ZstdDecompressionReader' object attribute 'close' is read-only
        #
        # see /~https://github.com/intake/filesystem_spec/issues/725
        _enter = self.file.__enter__

        class WrappedFile:
            def __init__(self, file_):
                self._file = file_

            def __enter__(self):
                self._file.__enter__()
                return self

            def __exit__(self, *args, **kwargs):
                self._file.__exit__(*args, **kwargs)

            def __iter__(self):
                return iter(self._file)

            def __next__(self):
                return next(self._file)

            def __getattr__(self, attr):
                return getattr(self._file, attr)

        def fixed_enter(*args, **kwargs):
            return WrappedFile(_enter(*args, **kwargs))

        self.file.__enter__ = fixed_enter
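
With these filesystems registered, the streaming download manager can hand back chained URLs that use the new protocols. A hedged sketch with a hypothetical remote ``.zst`` file (the ``zstandard`` package must be installed):

from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen

url = "https://example.com/data.jsonl.zst"  # hypothetical remote file
urlpath = StreamingDownloadManager().download_and_extract(url)
# urlpath should now be a chained URL that uses the new "zstd://" protocol
with xopen(urlpath) as f:
    print(f.readline())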
1 change: 0 additions & 1 deletion src/datasets/filesystems/compression/__init__.py

This file was deleted.


1 comment on commit 9adc7db

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010701 / 0.011353 (-0.000652) 0.004086 / 0.011008 (-0.006922) 0.036589 / 0.038508 (-0.001919) 0.041225 / 0.023109 (0.018116) 0.359508 / 0.275898 (0.083610) 0.392124 / 0.323480 (0.068644) 0.009396 / 0.007986 (0.001410) 0.005063 / 0.004328 (0.000734) 0.010501 / 0.004250 (0.006250) 0.046559 / 0.037052 (0.009506) 0.372839 / 0.258489 (0.114350) 0.406192 / 0.293841 (0.112351) 0.027122 / 0.128546 (-0.101424) 0.008722 / 0.075646 (-0.066925) 0.298849 / 0.419271 (-0.120422) 0.053150 / 0.043533 (0.009617) 0.355580 / 0.255139 (0.100441) 0.400323 / 0.283200 (0.117123) 0.118248 / 0.141683 (-0.023435) 2.120152 / 1.452155 (0.667997) 2.155194 / 1.492716 (0.662478)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.014519 / 0.018006 (-0.003487) 0.478632 / 0.000490 (0.478142) 0.002804 / 0.000200 (0.002604) 0.000078 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.043492 / 0.037411 (0.006081) 0.026967 / 0.014526 (0.012441) 0.030913 / 0.176557 (-0.145644) 0.145577 / 0.737135 (-0.591559) 0.031404 / 0.296338 (-0.264935)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.407365 / 0.215209 (0.192156) 4.141050 / 2.077655 (2.063395) 2.077630 / 1.504120 (0.573511) 1.871158 / 1.541195 (0.329963) 1.923131 / 1.468490 (0.454641) 0.363253 / 4.584777 (-4.221524) 4.995971 / 3.745712 (1.250259) 5.623255 / 5.269862 (0.353393) 2.772526 / 4.565676 (-1.793150) 0.043042 / 0.424275 (-0.381233) 0.006017 / 0.007607 (-0.001591) 0.530965 / 0.226044 (0.304921) 5.292137 / 2.268929 (3.023209) 2.658698 / 55.444624 (-52.785926) 2.239946 / 6.876477 (-4.636531) 2.332951 / 2.142072 (0.190879) 0.509457 / 4.805227 (-4.295771) 0.116491 / 6.500664 (-6.384173) 0.061044 / 0.075469 (-0.014425)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 126.911067 / 1.841788 (125.069279) 14.936440 / 8.074308 (6.862132) 31.399334 / 10.191392 (21.207942) 0.872395 / 0.680424 (0.191971) 0.618952 / 0.534201 (0.084751) 0.264083 / 0.579283 (-0.315200) 0.581173 / 0.434364 (0.146809) 0.392311 / 0.540337 (-0.148026) 1.217468 / 1.386936 (-0.169468)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010856 / 0.011353 (-0.000497) 0.004043 / 0.011008 (-0.006965) 0.036767 / 0.038508 (-0.001741) 0.041144 / 0.023109 (0.018035) 0.338018 / 0.275898 (0.062120) 0.373891 / 0.323480 (0.050411) 0.009390 / 0.007986 (0.001404) 0.005966 / 0.004328 (0.001637) 0.010611 / 0.004250 (0.006360) 0.044464 / 0.037052 (0.007412) 0.336793 / 0.258489 (0.078304) 0.421893 / 0.293841 (0.128052) 0.027478 / 0.128546 (-0.101069) 0.008828 / 0.075646 (-0.066818) 0.299660 / 0.419271 (-0.119612) 0.053742 / 0.043533 (0.010209) 0.342950 / 0.255139 (0.087811) 0.371937 / 0.283200 (0.088738) 0.117360 / 0.141683 (-0.024323) 2.046747 / 1.452155 (0.594592) 2.121901 / 1.492716 (0.629185)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.078887 / 0.018006 (0.060881) 0.485888 / 0.000490 (0.485398) 0.053158 / 0.000200 (0.052958) 0.000534 / 0.000054 (0.000480)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.041436 / 0.037411 (0.004025) 0.025883 / 0.014526 (0.011357) 0.033666 / 0.176557 (-0.142891) 0.151491 / 0.737135 (-0.585645) 0.035275 / 0.296338 (-0.261063)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.402978 / 0.215209 (0.187769) 4.059305 / 2.077655 (1.981650) 2.053137 / 1.504120 (0.549017) 1.853276 / 1.541195 (0.312081) 1.893028 / 1.468490 (0.424538) 0.362644 / 4.584777 (-4.222133) 5.221594 / 3.745712 (1.475882) 6.752943 / 5.269862 (1.483081) 3.434059 / 4.565676 (-1.131617) 0.042890 / 0.424275 (-0.381385) 0.006219 / 0.007607 (-0.001388) 0.538977 / 0.226044 (0.312932) 5.366086 / 2.268929 (3.097157) 2.600468 / 55.444624 (-52.844156) 2.199339 / 6.876477 (-4.677137) 2.231395 / 2.142072 (0.089323) 0.487271 / 4.805227 (-4.317957) 0.114549 / 6.500664 (-6.386115) 0.060380 / 0.075469 (-0.015089)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 123.302070 / 1.841788 (121.460283) 14.840054 / 8.074308 (6.765746) 30.843536 / 10.191392 (20.652144) 0.889644 / 0.680424 (0.209220) 0.583946 / 0.534201 (0.049745) 0.261144 / 0.579283 (-0.318139) 0.574732 / 0.434364 (0.140368) 0.357287 / 0.540337 (-0.183050) 1.222354 / 1.386936 (-0.164582)
