Releases: huggingface/datasets
1.14.0
Dataset changes
- Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
- Update: SUPERB - use Audio features #3101 (@anton-l)
- Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)
General improvements and bug fixes
- Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
- Fix project description in PyPI #3103 (@albertvillanova)
- Align tqdm control with cache control #3031 (@mariosasko)
- Add paper BibTeX citation #3107 (@albertvillanova)
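The FSTimeoutError change above works because Python exception handlers also match subclasses: catching the built-in TimeoutError catches any backend-specific timeout error derived from it. A minimal illustration (the subclass here is a stand-in, not fsspec's actual FSTimeoutError):

```python
# stand-in for a backend-specific timeout error such as fsspec's FSTimeoutError
class BackendTimeoutError(TimeoutError):
    pass

def read_with_retry():
    try:
        raise BackendTimeoutError("storage backend timed out")
    except TimeoutError as err:  # the parent class catches the subclass too
        return f"caught: {err}"

print(read_with_retry())  # caught: storage backend timed out
```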
1.13.3
Dataset changes
- Update: Adapt all audio datasets #3081 (@patrickvonplaten)
Bug fixes
- Update BibTeX entry #3090 (@albertvillanova)
- Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
- Fix Audio feature mp3 resampling #3096 (@albertvillanova)
1.13.2
Bug fixes
- Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
- Remove _resampler from Audio fields #3086 (@albertvillanova)
1.13.1
Bug fixes
- Fix loading a metric with internal import #3077 (@albertvillanova)
1.13.0
Dataset changes
- New: CaSiNo #2867 (@kushalchawla)
- New: Mostly Basic Python Problems #2893 (@lvwerra)
- New: OpenAI's HumanEval #2897 (@lvwerra)
- New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
- New: SEDE #2942 (@Hazoom)
- New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
- New: AMI #2853 (@cahya-wirawan)
- New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
- New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
- New: KanHope #2985 (@adeepH)
- New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
- New: SwedMedNER #2940 (@bwang482)
- New: SberQuAD #3039 (@Alenush)
- New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
- New: Greek Legal Code #2966 (@christospi)
- New: Story Cloze Test #3067 (@zaidalyafeai)
- Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
- Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
- Update: TriviaQA - add web and wiki config #2949 (@shirte)
- Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
- Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
- Update: Biosses - fix column names #3054 (@bwang482)
- Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
- Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
- Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
- Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
- Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
- Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)
Metric changes
- Update: meteor - update from nltk update #2946 (@lhoestq)
- Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
- Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
- Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)
Dataset features
- Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
- Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
- Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
- New utilities:
- Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
- Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
Dataset cards
- Add arXiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)
General improvements and bug fixes
- Fix filter leaking #3019 (@lhoestq)
- Calling filter several times in a row was not returning the right results in 1.12.0 and 1.12.1
- Update BibTeX entry #2928 (@albertvillanova)
- Fix exception chaining #2911 (@albertvillanova)
- Add regression test for null Sequence #2929 (@albertvillanova)
- Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
- Fix fn kwargs in filter #2950 (@lhoestq)
- Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
- Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
- Fix missing conda deps #2952 (@lhoestq)
- Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
- Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
- Fix CI doc build #2961 (@albertvillanova)
- Run tests in parallel #2954 (@albertvillanova)
- Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
- Take namespace into account in caching #2938 (@lhoestq)
- Make Dataset.map accept list of np.array #2990 (@albertvillanova)
- Fix loading compressed CSV without streaming #2994 (@albertvillanova)
- Fix json loader when conversion not implemented #3000 (@lhoestq)
- Remove all query parameters when extracting protocol #2996 (@albertvillanova)
- Correct a typo #3007 (@Yann21)
- Fix Windows test suite #3025 (@albertvillanova)
- Remove unused parameter in xdirname #3017 (@albertvillanova)
- Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
- Fix typo #3023 (@qqaatw)
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
- Use cache folder for lockfile #2887 (@Dref360)
- Fix streaming: catch Timeout error #3050 (@borisdayma)
- Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
- Fix task reloading from cache #3059 (@lhoestq)
- Fix test command after refac #3065 (@lhoestq)
- Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
- Update summary on PyPI beyond NLP #3062 (@thomwolf)
- Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
- feat: increase streaming retry config #3068 (@borisdayma)
- Fix pathlib patches for streaming #3072 (@lhoestq)
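The NaN check in #2936 rests on a basic IEEE-754 property: NaN compares unequal to everything, including itself, so equality tests silently miss it. A minimal illustration:

```python
import math

nan = float("nan")

# NaN is never equal to anything, not even itself ...
print(nan == nan)        # False
# ... so equality checks cannot detect it; use math.isnan instead
print(math.isnan(nan))   # True
```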
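The query-parameter fix in #2996 concerns protocol inference: parameters such as `?dl=1` must be dropped from a URL before inspecting the file extension. A small sketch of the idea, with a hypothetical helper and an illustrative extension table rather than the library's actual code:

```python
from urllib.parse import urlparse

# illustrative extension-to-protocol table, not the library's internal one
PROTOCOLS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}

def infer_protocol(url):
    # strip the query string and fragment before inspecting the extension
    path = urlparse(url).path
    for ext, protocol in PROTOCOLS.items():
        if path.endswith(ext):
            return protocol
    return None

print(infer_protocol("https://host/data.csv.gz?dl=1"))  # gzip
```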
Breaking changes:
- Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, you may use dataset_module_factory instead.
1.12.1
1.12.0
New documentation
- New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference
See the new documentation here!
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- Add multi-proc in to_json #2747 (@bhavitvyamalik)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
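The URL prefix convention added in #2822 follows fsspec-style chaining: a protocol prefix names the archive format, and `::` separates the path inside the archive from the archive's URL. A sketch of how such a chained URL decomposes (illustrative parsing, not the library's implementation):

```python
def split_chained_url(url):
    """Split an fsspec-style chained URL like 'zip://file.csv::https://host/archive.zip'."""
    inner, _, outer = url.partition("::")          # path-in-archive vs. archive URL
    protocol, _, path = inner.partition("://")     # archive protocol vs. inner path
    return protocol, path, outer

proto, path, archive = split_chained_url("zip://train.csv::https://host/data.zip")
print(proto, path, archive)  # zip train.csv https://host/data.zip
```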
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update column_names showed as :func: in exploring.rst #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
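The #2856 fix narrows query-string stripping to the exact Dropbox-style `?dl=1` suffix, so URLs whose query strings carry meaning are left untouched. A minimal sketch of that rule (hypothetical helper name, not the library's code):

```python
def strip_dl_query(url):
    # only remove the query string when it is exactly "?dl=1"
    suffix = "?dl=1"
    return url[: -len(suffix)] if url.endswith(suffix) else url

print(strip_dl_query("https://host/data.zip?dl=1"))   # https://host/data.zip
print(strip_dl_query("https://host/get?file=data"))   # https://host/get?file=data
```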
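The timezone fix in #2847 is about timestamp parsing: an offset pattern must accept both `+HH:MM` and `-HH:MM`. A small illustrative regex (not the library's actual pattern):

```python
import re

# illustrative ISO-8601 UTC-offset pattern accepting both signs
OFFSET_RE = re.compile(r"[+-]\d{2}:\d{2}$")

print(bool(OFFSET_RE.search("2021-09-01T12:00:00+02:00")))  # True
print(bool(OFFSET_RE.search("2021-09-01T12:00:00-05:00")))  # True
```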
1.11.0
Datasets Changes
- New: Add Russian SuperGLUE #2668 (@slowwavesleep)
- New: Add Disfl-QA #2473 (@bhavitvyamalik)
- New: Add TimeDial #2476 (@bhavitvyamalik)
- Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
- Fix: Update WikiANN data URL #2710 (@albertvillanova)
- Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
- Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)
General improvements and bug fixes
- fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
- Update BibTeX entry #2706 (@albertvillanova)
- Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
- Add support for disable_progress_bar on Windows #2696 (@mariosasko)
- Ignore empty batch when writing #2698 (@pcuenca)
- Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
- fix: 🐛 fix two typos #2720 (@severo)
- Docs details #2690 (@severo)
- Deal with the bad check in test_load.py #2721 (@mariosasko)
- Pass use_auth_token to request_etags #2725 (@albertvillanova)
- Typo fix tokenize_exemple #2726 (@shabie)
- Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
- Add missing parquet known extension #2733 (@lhoestq)