diff --git a/datasets/svhn/README.md b/datasets/svhn/README.md new file mode 100644 index 00000000000..81c43efc57f --- /dev/null +++ b/datasets/svhn/README.md @@ -0,0 +1,207 @@ +--- +annotations_creators: +- machine-generated +- expert-generated +language_creators: +- machine-generated +languages: +- en +licenses: +- other +multilinguality: +- monolingual +size_categories: +- 100K, + 'digits': { + 'bbox': [ + [36, 7, 13, 32], + [50, 7, 12, 32] + ], + 'label': [6, 9] + } +} +``` + +#### cropped_digits + +Character level ground truth in an MNIST-like format. All digits have been resized to a fixed resolution of 32-by-32 pixels. The original character bounding boxes are extended in the appropriate dimension to become square windows, so that resizing them to 32-by-32 pixels does not introduce aspect ratio distortions. Nevertheless this preprocessing introduces some distracting digits to the sides of the digit of interest. + +``` +{ + 'image': , + 'label': 1 +} +``` + +### Data Fields + +#### full_numbers + +- `image`: A `PIL.Image.Image` object containing the image. Note that when accessing the image column: `dataset[0]["image"]` the image file is automatically decoded. Decoding of a large number of image files might take a significant amount of time. Thus it is important to first query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]` +- `digits`: a dictionary containing digits' bounding boxes and labels + - `bbox`: a list of bounding boxes (in the [coco](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/#coco) format) corresponding to the digits present on the image + - `label`: a list of integers between 0 and 9 representing the digit. + +#### cropped_digits + +- `image`: A `PIL.Image.Image` object containing the image. Note that when accessing the image column: `dataset[0]["image"]` the image file is automatically decoded. 
Decoding of a large number of image files might take a significant amount of time. Thus it is important to first query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]` +- `label`: an integer between 0 and 9 representing the digit. + +### Data Splits + +#### full_numbers + +The data is split into training, test, and extra sets. The training set contains 33402 images, the test set 13068, and the extra set 202353. + +#### cropped_digits + +The data is split into training, test, and extra sets. The training set contains 73257 images, the test set 26032, and the extra set 531131. + +The extra set can be used as additional training data. It was obtained in a similar manner to the training and test sets, but with an increased detection threshold in order to generate this large amount of labeled data. The SVHN extra subset is thus somewhat biased toward less difficult detections, and is therefore easier than SVHN train/SVHN test. + +## Dataset Creation + +### Curation Rationale + +From the paper: +> As mentioned above, the venerable MNIST dataset has been a valuable goal post for researchers seeking to build better learning systems whose benchmark performance could be expected to translate into improved performance on realistic applications. However, computers have now reached essentially human levels of performance on this problem—a testament to progress in machine learning and computer vision. The Street View House Numbers (SVHN) digit database that we provide can be seen as similar in flavor to MNIST (e.g., the images are of small cropped characters), but the SVHN dataset incorporates an order of magnitude more labeled data and comes from a significantly harder, unsolved, real world problem. Here the gap between human performance and state of the art feature representations is significant. 
Going forward, we expect that this dataset may fulfill a similar role for modern feature learning algorithms: it provides a new and difficult benchmark where increased performance can be expected to translate into tangible gains on a realistic application. + +### Source Data + +#### Initial Data Collection and Normalization + +From the paper: +> The SVHN dataset was obtained from a large number of Street View images using a combination +of automated algorithms and the Amazon Mechanical Turk (AMT) framework, which was +used to localize and transcribe the single digits. We downloaded a very large set of images from +urban areas in various countries. + +#### Who are the source language producers? + +[More Information Needed] + +### Annotations + +#### Annotation process + +From the paper: +> From these randomly selected images, the house-number patches were extracted using a dedicated sliding window house-numbers detector using a low threshold on the detector’s confidence score in order to get a varied, unbiased dataset of house-number signs. These low precision detections were screened and transcribed by AMT workers. + +#### Who are the annotators? + +The AMT workers. + +### Personal and Sensitive Information + +[More Information Needed] + +## Considerations for Using the Data + +### Social Impact of Dataset + +[More Information Needed] + +### Discussion of Biases + +[More Information Needed] + +### Other Known Limitations + +[More Information Needed] + +## Additional Information + +### Dataset Curators + +Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu and Andrew Y. Ng + +### Licensing Information + +Non-commercial use only. 
+ +### Citation Information + +``` +@article{netzer2011reading, + title={Reading digits in natural images with unsupervised feature learning}, + author={Netzer, Yuval and Wang, Tao and Coates, Adam and Bissacco, Alessandro and Wu, Bo and Ng, Andrew Y}, + year={2011} +} +``` + +### Contributions + +Thanks to [@mariosasko](https://github.com/mariosasko) for adding this dataset. \ No newline at end of file diff --git a/datasets/svhn/dataset_infos.json b/datasets/svhn/dataset_infos.json new file mode 100644 index 00000000000..e0eb0c9b596 --- /dev/null +++ b/datasets/svhn/dataset_infos.json @@ -0,0 +1 @@ +{"full_numbers": {"description": "SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting.\nIt can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images)\nand comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). 
SVHN is obtained from house numbers in Google Street View images.\n", "citation": "@article{netzer2011reading,\n title={Reading digits in natural images with unsupervised feature learning},\n author={Netzer, Yuval and Wang, Tao and Coates, Adam and Bissacco, Alessandro and Wu, Bo and Ng, Andrew Y},\n year={2011}\n}\n", "homepage": "http://ufldl.stanford.edu/housenumbers/", "license": "Custom (non-commercial)", "features": {"image": {"id": null, "_type": "Image"}, "digits": {"feature": {"bbox": {"feature": {"dtype": "int32", "id": null, "_type": "Value"}, "length": 4, "id": null, "_type": "Sequence"}, "label": {"num_classes": 10, "names": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "svhn", "config_name": "full_numbers", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 390404309, "num_examples": 33402, "dataset_name": "svhn"}, "test": {"name": "test", "num_bytes": 271503052, "num_examples": 13068, "dataset_name": "svhn"}, "extra": {"name": "extra", "num_bytes": 1868720340, "num_examples": 202353, "dataset_name": "svhn"}}, "download_checksums": {"http://ufldl.stanford.edu/housenumbers/train.tar.gz": {"num_bytes": 404141560, "checksum": "4b17bb33b6cd8f963493168f80143da956f28ec406cc12f8e5745a9f91a51898"}, "http://ufldl.stanford.edu/housenumbers/test.tar.gz": {"num_bytes": 276555967, "checksum": "57ac9ceb530e4aa85b55d991be8fc49c695b3d71c6f6a88afea86549efde7fb5"}, "http://ufldl.stanford.edu/housenumbers/extra.tar.gz": {"num_bytes": 1955489752, "checksum": "e857e27d1e65bd1e7d3959b094061777f6506bbc39889a0df3bba6a729d60f9c"}}, "download_size": 2636187279, "post_processing_size": null, "dataset_size": 2530627701, "size_in_bytes": 5166814980}, "cropped_digits": {"description": "SVHN 
is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting.\nIt can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images)\nand comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.\n", "citation": "@article{netzer2011reading,\n title={Reading digits in natural images with unsupervised feature learning},\n author={Netzer, Yuval and Wang, Tao and Coates, Adam and Bissacco, Alessandro and Wu, Bo and Ng, Andrew Y},\n year={2011}\n}\n", "homepage": "http://ufldl.stanford.edu/housenumbers/", "license": "Custom (non-commercial)", "features": {"image": {"id": null, "_type": "Image"}, "label": {"num_classes": 10, "names": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "image-classification", "image_column": "image", "label_column": "label", "labels": null}], "builder_name": "svhn", "config_name": "cropped_digits", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 128364360, "num_examples": 73257, "dataset_name": "svhn"}, "test": {"name": "test", "num_bytes": 44464040, "num_examples": 26032, "dataset_name": "svhn"}, "extra": {"name": "extra", "num_bytes": 967853504, "num_examples": 531131, "dataset_name": "svhn"}}, "download_checksums": {"http://ufldl.stanford.edu/housenumbers/train_32x32.mat": {"num_bytes": 182040794, "checksum": "435e94d69a87fde4fd4d7f3dd208dfc32cb6ae8af2240d066de1df7508d083b8"}, "http://ufldl.stanford.edu/housenumbers/test_32x32.mat": {"num_bytes": 64275384, "checksum": 
"cdce80dfb2a2c4c6160906d0bd7c68ec5a99d7ca4831afa54f09182025b6a75b"}, "http://ufldl.stanford.edu/housenumbers/extra_32x32.mat": {"num_bytes": 1329278602, "checksum": "a133a4beb38a00fcdda90c9489e0c04f900b660ce8a316a5e854838379a71eb3"}}, "download_size": 1575594780, "post_processing_size": null, "dataset_size": 1140681904, "size_in_bytes": 2716276684}} \ No newline at end of file diff --git a/datasets/svhn/dummy/cropped_digits/1.0.0/dummy_data.zip b/datasets/svhn/dummy/cropped_digits/1.0.0/dummy_data.zip new file mode 100644 index 00000000000..27a15f0f198 Binary files /dev/null and b/datasets/svhn/dummy/cropped_digits/1.0.0/dummy_data.zip differ diff --git a/datasets/svhn/dummy/full_numbers/1.0.0/dummy_data.zip b/datasets/svhn/dummy/full_numbers/1.0.0/dummy_data.zip new file mode 100644 index 00000000000..68dac93d05c Binary files /dev/null and b/datasets/svhn/dummy/full_numbers/1.0.0/dummy_data.zip differ diff --git a/datasets/svhn/svhn.py b/datasets/svhn/svhn.py new file mode 100644 index 00000000000..9f1fdcd6d56 --- /dev/null +++ b/datasets/svhn/svhn.py @@ -0,0 +1,199 @@ +# coding=utf-8 +# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Street View House Numbers (SVHN) dataset.""" + +import io +import os + +import h5py +import numpy as np +import scipy.io as sio + +import datasets +from datasets.tasks import ImageClassification + + +logger = datasets.logging.get_logger(__name__) + + +_CITATION = """\ +@article{netzer2011reading, + title={Reading digits in natural images with unsupervised feature learning}, + author={Netzer, Yuval and Wang, Tao and Coates, Adam and Bissacco, Alessandro and Wu, Bo and Ng, Andrew Y}, + year={2011} +} +""" + +_DESCRIPTION = """\ +SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. +It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) +and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images. 
+""" + +_HOMEPAGE = "http://ufldl.stanford.edu/housenumbers/" + +_LICENSE = "Custom (non-commercial)" + +_URLs = { + "full_numbers": [ + "http://ufldl.stanford.edu/housenumbers/train.tar.gz", + "http://ufldl.stanford.edu/housenumbers/test.tar.gz", + "http://ufldl.stanford.edu/housenumbers/extra.tar.gz", + ], + "cropped_digits": [ + "http://ufldl.stanford.edu/housenumbers/train_32x32.mat", + "http://ufldl.stanford.edu/housenumbers/test_32x32.mat", + "http://ufldl.stanford.edu/housenumbers/extra_32x32.mat", + ], +} + +_DIGIT_LABELS = [str(num) for num in range(10)] + + +class SVHN(datasets.GeneratorBasedBuilder): + """Street View House Numbers (SVHN) dataset.""" + + VERSION = datasets.Version("1.0.0") + + BUILDER_CONFIGS = [ + datasets.BuilderConfig( + name="full_numbers", + version=VERSION, + description="Contains the original, variable-resolution, color house-number images with character level bounding boxes.", + ), + datasets.BuilderConfig( + name="cropped_digits", + version=VERSION, + description="Character level ground truth in an MNIST-like format. All digits have been resized to a fixed resolution of 32-by-32 pixels. The original character bounding boxes are extended in the appropriate dimension to become square windows, so that resizing them to 32-by-32 pixels does not introduce aspect ratio distortions. 
Nevertheless this preprocessing introduces some distracting digits to the sides of the digit of interest.", + ), + ] + + def _info(self): + if self.config.name == "full_numbers": + features = datasets.Features( + { + "image": datasets.Image(), + "digits": datasets.Sequence( + { + "bbox": datasets.Sequence(datasets.Value("int32"), length=4), + "label": datasets.ClassLabel(num_classes=10), + } + ), + } + ) + else: + features = datasets.Features( + { + "image": datasets.Image(), + "label": datasets.ClassLabel(num_classes=10), + } + ) + return datasets.DatasetInfo( + description=_DESCRIPTION, + features=features, + supervised_keys=None, + homepage=_HOMEPAGE, + license=_LICENSE, + citation=_CITATION, + task_templates=[ImageClassification(image_column="image", label_column="label")] + if self.config.name == "cropped_digits" + else None, + ) + + def _split_generators(self, dl_manager): + if self.config.name == "full_numbers": + train_archive, test_archive, extra_archive = dl_manager.download(_URLs[self.config.name]) + for path, f in dl_manager.iter_archive(train_archive): + if path.endswith("digitStruct.mat"): + train_annot_data = f.read() + break + for path, f in dl_manager.iter_archive(test_archive): + if path.endswith("digitStruct.mat"): + test_annot_data = f.read() + break + for path, f in dl_manager.iter_archive(extra_archive): + if path.endswith("digitStruct.mat"): + extra_annot_data = f.read() + break + train_archive = dl_manager.iter_archive(train_archive) + test_archive = dl_manager.iter_archive(test_archive) + extra_archive = dl_manager.iter_archive(extra_archive) + train_filepath, test_filepath, extra_filepath = None, None, None + else: + train_annot_data, test_annot_data, extra_annot_data = None, None, None + train_archive, test_archive, extra_archive = None, None, None + train_filepath, test_filepath, extra_filepath = dl_manager.download(_URLs[self.config.name]) + return [ + datasets.SplitGenerator( + name=datasets.Split.TRAIN, + gen_kwargs={ + "annot_data": 
train_annot_data, + "files": train_archive, + "filepath": train_filepath, + }, + ), + datasets.SplitGenerator( + name=datasets.Split.TEST, + gen_kwargs={ + "annot_data": test_annot_data, + "files": test_archive, + "filepath": test_filepath, + }, + ), + datasets.SplitGenerator( + name="extra", + gen_kwargs={ + "annot_data": extra_annot_data, + "files": extra_archive, + "filepath": extra_filepath, + }, + ), + ] + + def _generate_examples(self, annot_data, files, filepath): + if self.config.name == "full_numbers": + + def _get_digits(bboxes, h5_file): + def key_to_values(key, bbox): + if bbox[key].shape[0] == 1: + return [int(bbox[key][0][0])] + else: + return [int(h5_file[bbox[key][i][0]][()].item()) for i in range(bbox[key].shape[0])] + + bbox = h5_file[bboxes[0]] + assert bbox.keys() == {"height", "left", "top", "width", "label"} + bbox_columns = [key_to_values(key, bbox) for key in ["left", "top", "width", "height", "label"]] + return [ + {"bbox": [left, top, width, height], "label": label % 10} + for left, top, width, height, label in zip(*bbox_columns) + ] + + with h5py.File(io.BytesIO(annot_data), "r") as h5_file: + for path, f in files: + root, ext = os.path.splitext(path) + if ext != ".png": + continue + img_idx = int(os.path.basename(root)) - 1 + yield img_idx, { + "image": {"path": path, "bytes": f.read()}, + "digits": _get_digits(h5_file["digitStruct/bbox"][img_idx], h5_file), + } + else: + data = sio.loadmat(filepath) + for i, (image_array, label) in enumerate(zip(np.rollaxis(data["X"], -1), data["y"])): + yield i, { + "image": image_array, + "label": label.item() % 10, + } diff --git a/setup.py b/setup.py index 82f32bae269..6f775f98c27 100644 --- a/setup.py +++ b/setup.py @@ -140,6 +140,7 @@ # datasets dependencies "bs4", "conllu", + "h5py", "langdetect", "lxml", "mwparserfromhell", diff --git a/src/datasets/streaming.py b/src/datasets/streaming.py index f8ef260da72..94d7e9de171 100644 --- a/src/datasets/streaming.py +++ b/src/datasets/streaming.py @@ 
-25,6 +25,8 @@ xpathrglob, xpathstem, xpathsuffix, + xsio_loadmat, + xsplitext, xwalk, ) @@ -73,6 +75,7 @@ def wrapper(*args, **kwargs): patch_submodule(module, "os.path.join", xjoin).start() patch_submodule(module, "os.path.dirname", xdirname).start() patch_submodule(module, "os.path.basename", xbasename).start() + patch_submodule(module, "os.path.splitext", xsplitext).start() # allow checks on paths patch_submodule(module, "os.path.isdir", wrap_auth(xisdir)).start() patch_submodule(module, "os.path.isfile", wrap_auth(xisfile)).start() @@ -88,6 +91,7 @@ def wrapper(*args, **kwargs): patch.object(module.Path, "suffix", property(fget=xpathsuffix)).start() patch_submodule(module, "pd.read_csv", wrap_auth(xpandas_read_csv), attrs=["__version__"]).start() patch_submodule(module, "pd.read_excel", xpandas_read_excel, attrs=["__version__"]).start() + patch_submodule(module, "sio.loadmat", wrap_auth(xsio_loadmat), attrs=["__version__"]).start() # xml.etree.ElementTree for submodule in ["ElementTree", "ET"]: patch_submodule(module, f"{submodule}.parse", wrap_auth(xet_parse)).start() diff --git a/src/datasets/utils/streaming_download_manager.py b/src/datasets/utils/streaming_download_manager.py index 46b40c1211f..85db96fda9a 100644 --- a/src/datasets/utils/streaming_download_manager.py +++ b/src/datasets/utils/streaming_download_manager.py @@ -161,6 +161,33 @@ def xbasename(a): return posixpath.basename(a) +def xsplitext(a): + """ + This function extends os.path.splitext to support the "::" hop separator. It supports both paths and urls. + + A shorthand, particularly useful where you have multiple hops, is to “chain” the URLs with the special separator "::". + This is used to access files inside a zip file over http for example. + + Let's say you have a zip file at https://host.com/archive.zip, and you want to access the file inside the zip file at /folder1/file.txt. 
+ Then you can just chain the url this way: + + zip://folder1/file.txt::https://host.com/archive.zip + + The xsplitext function allows you to apply the splitext on the first path of the chain. + + Example:: + + >>> xsplitext("zip://folder1/file.txt::https://host.com/archive.zip") + ('zip://folder1/file::https://host.com/archive.zip', '.txt') + """ + a, *b = a.split("::") + if is_local_path(a): + return os.path.splitext(Path(a).as_posix()) + else: + a, ext = posixpath.splitext(a) + return "::".join([a] + b), ext + + def xisfile(path, use_auth_token: Optional[Union[str, bool]] = None) -> bool: """Extend `os.path.isfile` function to support remote files. @@ -551,6 +578,15 @@ def xpandas_read_excel(filepath_or_buffer, **kwargs): return pd.read_excel(BytesIO(filepath_or_buffer.read()), **kwargs) +def xsio_loadmat(filepath_or_buffer, use_auth_token: Optional[Union[str, bool]] = None, **kwargs): + import scipy.io as sio + + if hasattr(filepath_or_buffer, "read"): + return sio.loadmat(filepath_or_buffer, **kwargs) + else: + return sio.loadmat(xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token), **kwargs) + + def xet_parse(source, parser=None, use_auth_token: Optional[Union[str, bool]] = None): """Extend `xml.etree.ElementTree.parse` function to support remote files. 
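The `xsio_loadmat` helper in this hunk accepts either an open file-like object or a path/URL, and decides between them by checking for a `read` attribute. The dispatch pattern on its own can be sketched with the standard library only. This is an illustration under stated assumptions, not the library's code: `read_payload` is a hypothetical name, and the built-in `open` stands in for `xopen`, so remote and chained URLs are not handled here.

```python
import io
import os
import tempfile


def read_payload(filepath_or_buffer):
    """Return raw bytes from an open binary file-like object or a local path.

    Hypothetical helper: it mirrors only the "buffer or path" dispatch.
    """
    if hasattr(filepath_or_buffer, "read"):
        # Already an open buffer: read from it directly.
        return filepath_or_buffer.read()
    # Otherwise treat the argument as a path. The helper in the diff calls
    # xopen(...) at this point so that remote URLs also work.
    with open(filepath_or_buffer, "rb") as f:
        return f.read()


# Both call styles yield the same bytes.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "payload.bin")
    with open(tmp_path, "wb") as f:
        f.write(b"svhn")
    from_path = read_payload(tmp_path)

from_buffer = read_payload(io.BytesIO(b"svhn"))
```

Checking for `read` rather than for a concrete type keeps the helper usable with any binary stream, including the objects yielded while iterating over an archive.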
diff --git a/tests/test_streaming_download_manager.py b/tests/test_streaming_download_manager.py index c5f4dc7a5a2..fa773a0cfd9 100644 --- a/tests/test_streaming_download_manager.py +++ b/tests/test_streaming_download_manager.py @@ -27,6 +27,7 @@ xpathrglob, xpathstem, xpathsuffix, + xsplitext, ) from .utils import require_lz4, require_zstandard @@ -229,6 +230,28 @@ def test_xdirname(input_path, expected_path): assert output_path == _readd_double_slash_removed_by_path(Path(expected_path).as_posix()) +@pytest.mark.parametrize( + "input_path, expected_path_and_ext", + [ + ( + str(Path(__file__).resolve()), + (str(Path(__file__).resolve().with_suffix("")), str(Path(__file__).resolve().suffix)), + ), + ("https://host.com/archive.zip", ("https://host.com/archive", ".zip")), + ("zip://file.txt::https://host.com/archive.zip", ("zip://file::https://host.com/archive.zip", ".txt")), + ("zip://folder::https://host.com/archive.zip", ("zip://folder::https://host.com/archive.zip", "")), + ("zip://::https://host.com/archive.zip", ("zip://::https://host.com/archive.zip", "")), + ], +) +def test_xsplitext(input_path, expected_path_and_ext): + output_path, ext = xsplitext(input_path) + expected_path, expected_ext = expected_path_and_ext + output_path = _readd_double_slash_removed_by_path(Path(output_path).as_posix()) + expected_path = _readd_double_slash_removed_by_path(Path(expected_path).as_posix()) + assert output_path == expected_path + assert ext == expected_ext + + def test_xopen_local(text_path): with xopen(text_path, "r", encoding="utf-8") as f, open(text_path, encoding="utf-8") as expected_file: assert list(f) == list(expected_file)
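The remote-URL cases in `test_xsplitext` can be reproduced without installing `datasets`. The sketch below is a simplified stand-in for the non-local branch of `xsplitext`: the name `split_chained_ext` is hypothetical, and the local-path branch (which goes through `os.path.splitext` and `Path.as_posix`) is deliberately left out.

```python
import posixpath


def split_chained_ext(urlpath):
    """Split the extension off the first hop of a '::'-chained URL.

    Hypothetical stand-in for the remote branch of xsplitext.
    """
    # Only the first hop names the file of interest; the remaining hops
    # describe where the containing archive lives and are left untouched.
    first_hop, *other_hops = urlpath.split("::")
    root, ext = posixpath.splitext(first_hop)
    return "::".join([root] + other_hops), ext


chained = split_chained_ext("zip://file.txt::https://host.com/archive.zip")
plain = split_chained_ext("https://host.com/archive.zip")
no_ext = split_chained_ext("zip://folder::https://host.com/archive.zip")
```

Splitting on `"::"` before calling `posixpath.splitext` is what keeps the `.zip` of the outer archive from being mistaken for the extension of the inner file.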