Releases: huggingface/datasets
1.14.0
Dataset changes
- Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
- Update: SUPERB - use Audio features #3101 (@anton-l)
- Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)
General improvements and bug fixes
- Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
- Fix project description in PyPI #3103 (@albertvillanova)
- Align tqdm control with cache control #3031 (@mariosasko)
- Add paper BibTeX citation #3107 (@albertvillanova)
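The FSTimeoutError change above works because Python exception handlers also match subclasses: catching the built-in TimeoutError catches any backend-specific timeout error derived from it. A minimal illustration (the subclass here is a stand-in, not fsspec's actual FSTimeoutError):

```python
# stand-in for a backend-specific timeout error such as fsspec's FSTimeoutError
class BackendTimeoutError(TimeoutError):
    pass

def read_with_retry():
    try:
        raise BackendTimeoutError("storage backend timed out")
    except TimeoutError as err:  # the parent class catches the subclass too
        return f"caught: {err}"

print(read_with_retry())  # caught: storage backend timed out
```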
1.13.3
Dataset changes
- Update: Adapt all audio datasets #3081 (@patrickvonplaten)
Bug fixes
- Update BibTeX entry #3090 (@albertvillanova)
- Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
- Fix Audio feature mp3 resampling #3096 (@albertvillanova)
1.13.2
Bug fixes
- Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
- Remove _resampler from Audio fields #3086 (@albertvillanova)
1.13.1
Bug fixes
- Fix loading a metric with internal import #3077 (@albertvillanova)
1.13.0
Dataset changes
- New: CaSiNo #2867 (@kushalchawla)
- New: Mostly Basic Python Problems #2893 (@lvwerra)
- New: OpenAI's HumanEval #2897 (@lvwerra)
- New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
- New: SEDE #2942 (@Hazoom)
- New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
- New: AMI #2853 (@cahya-wirawan)
- New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
- New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
- New: KanHope #2985 (@adeepH)
- New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
- New: SwedMedNER #2940 (@bwang482)
- New: SberQuAD #3039 (@Alenush)
- New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
- New: Greek Legal Code #2966 (@christospi)
- New: Story Cloze Test #3067 (@zaidalyafeai)
- Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
- Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
- Update: TriviaQA - add web and wiki config #2949 (@shirte)
- Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
- Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
- Update: Biosses - fix column names #3054 (@bwang482)
- Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
- Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
- Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
- Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
- Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
- Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)
Metric changes
- Update: meteor - update from nltk update #2946 (@lhoestq)
- Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
- Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
- Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)
Dataset features
- Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
- Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
- Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
- New utilities:
- Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
- Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
Dataset cards
- Add arXiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)
General improvements and bug fixes
- Fix filter leaking #3019 (@lhoestq)
- Calling filter several times in a row was not returning the right results in 1.12.0 and 1.12.1
- Update BibTeX entry #2928 (@albertvillanova)
- Fix exception chaining #2911 (@albertvillanova)
- Add regression test for null Sequence #2929 (@albertvillanova)
- Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
- Fix fn kwargs in filter #2950 (@lhoestq)
- Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
- Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
- Fix missing conda deps #2952 (@lhoestq)
- Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
- Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
- Fix CI doc build #2961 (@albertvillanova)
- Run tests in parallel #2954 (@albertvillanova)
- Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
- Take namespace into account in caching #2938 (@lhoestq)
- Make Dataset.map accept list of np.array #2990 (@albertvillanova)
- Fix loading compressed CSV without streaming #2994 (@albertvillanova)
- Fix json loader when conversion not implemented #3000 (@lhoestq)
- Remove all query parameters when extracting protocol #2996 (@albertvillanova)
- Correct a typo #3007 (@Yann21)
- Fix Windows test suite #3025 (@albertvillanova)
- Remove unused parameter in xdirname #3017 (@albertvillanova)
- Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
- Fix typo #3023 (@qqaatw)
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
- Use cache folder for lockfile #2887 (@Dref360)
- Fix streaming: catch Timeout error #3050 (@borisdayma)
- Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
- Fix task reloading from cache #3059 (@lhoestq)
- Fix test command after refac #3065 (@lhoestq)
- Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
- Update summary on PyPI beyond NLP #3062 (@thomwolf)
- Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
- feat: increase streaming retry config #3068 (@borisdayma)
- Fix pathlib patches for streaming #3072 (@lhoestq)
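The NaN check in #2936 rests on a basic IEEE-754 property: NaN compares unequal to everything, including itself, so equality tests silently miss it. A minimal illustration:

```python
import math

nan = float("nan")

# NaN is never equal to anything, not even itself ...
print(nan == nan)        # False
# ... so equality checks cannot detect it; use math.isnan instead
print(math.isnan(nan))   # True
```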
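The query-parameter fix in #2996 concerns protocol inference: parameters such as `?dl=1` must be dropped from a URL before inspecting the file extension. A small sketch of the idea, with a hypothetical helper and an illustrative extension table rather than the library's actual code:

```python
from urllib.parse import urlparse

# illustrative extension-to-protocol table, not the library's internal one
PROTOCOLS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}

def infer_protocol(url):
    # strip the query string and fragment before inspecting the extension
    path = urlparse(url).path
    for ext, protocol in PROTOCOLS.items():
        if path.endswith(ext):
            return protocol
    return None

print(infer_protocol("https://host/data.csv.gz?dl=1"))  # gzip
```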
Breaking changes:
- Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, you may use dataset_module_factory instead.
1.12.1
1.12.0
New documentation
- New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference
See the new documentation here!
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- Add multi-proc in to_json #2747 (@bhavitvyamalik)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
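The URL prefix convention added in #2822 follows fsspec-style chaining: a protocol prefix names the archive format, and `::` separates the path inside the archive from the archive's URL. A sketch of how such a chained URL decomposes (illustrative parsing, not the library's implementation):

```python
def split_chained_url(url):
    """Split an fsspec-style chained URL like 'zip://file.csv::https://host/archive.zip'."""
    inner, _, outer = url.partition("::")          # path-in-archive vs. archive URL
    protocol, _, path = inner.partition("://")     # archive protocol vs. inner path
    return protocol, path, outer

proto, path, archive = split_chained_url("zip://train.csv::https://host/data.zip")
print(proto, path, archive)  # zip train.csv https://host/data.zip
```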
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update column_names showed as :func: in exploring.rst #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
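The #2856 fix narrows query-string stripping to the exact Dropbox-style `?dl=1` suffix, so URLs whose query strings carry meaning are left untouched. A minimal sketch of that rule (hypothetical helper name, not the library's code):

```python
def strip_dl_query(url):
    # only remove the query string when it is exactly "?dl=1"
    suffix = "?dl=1"
    return url[: -len(suffix)] if url.endswith(suffix) else url

print(strip_dl_query("https://host/data.zip?dl=1"))   # https://host/data.zip
print(strip_dl_query("https://host/get?file=data"))   # https://host/get?file=data
```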
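The timezone fix in #2847 is about timestamp parsing: an offset pattern must accept both `+HH:MM` and `-HH:MM`. A small illustrative regex (not the library's actual pattern):

```python
import re

# illustrative ISO-8601 UTC-offset pattern accepting both signs
OFFSET_RE = re.compile(r"[+-]\d{2}:\d{2}$")

print(bool(OFFSET_RE.search("2021-09-01T12:00:00+02:00")))  # True
print(bool(OFFSET_RE.search("2021-09-01T12:00:00-05:00")))  # True
```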
1.11.0
Datasets Changes
- New: Add Russian SuperGLUE #2668 (@slowwavesleep)
- New: Add Disfl-QA #2473 (@bhavitvyamalik)
- New: Add TimeDial #2476 (@bhavitvyamalik)
- Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
- Fix: Update WikiANN data URL #2710 (@albertvillanova)
- Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
- Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)
General improvements and bug fixes
- fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
- Update BibTeX entry #2706 (@albertvillanova)
- Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
- Add support for disable_progress_bar on Windows #2696 (@mariosasko)
- Ignore empty batch when writing #2698 (@pcuenca)
- Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
- fix: 🐛 fix two typos #2720 (@severo)
- Docs details #2690 (@severo)
- Deal with the bad check in test_load.py #2721 (@mariosasko)
- Pass use_auth_token to request_etags #2725 (@albertvillanova)
- Typo fix tokenize_exemple #2726 (@shabie)
- Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
- Add missing parquet known extension #2733 (@lhoestq)