Commit

Release: 1.18.0
lhoestq committed Jan 21, 2022
1 parent 6ca96c7 commit c0aea8d
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -25,7 +25,7 @@
 # The short X.Y version
 version = ""
 # The full version, including alpha/beta/rc tags
-release = "1.17.0"
+release = "1.18.0"


 # -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion setup.py
@@ -228,7 +228,7 @@

 setup(
     name="datasets",
-    version="1.17.1.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="1.18.0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -18,7 +18,7 @@
 # pylint: enable=line-too-long
 # pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "1.17.1.dev0"
+__version__ = "1.18.0"

 import pyarrow
 from packaging import version as _version
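
The comment in setup.py describes the expected version format (x.y.z.dev0, x.y.z.rc1 or x.y.z, dots only). As a rough illustration, a check along those lines might look like the sketch below. The helper name `is_expected_format` and the regular expression are hypothetical and not part of the `datasets` code base; the sketch also assumes that any numeric suffix after `dev` or `rc` is acceptable, which the comment does not state.

```python
import re

# Hypothetical pattern, not taken from the datasets repository.
# Accepts x.y.z, x.y.z.devN and x.y.z.rcN, with dots as the only separators.
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.(dev|rc)\d+)?")


def is_expected_format(version: str) -> bool:
    """Return True if `version` matches the format described in setup.py."""
    return _VERSION_RE.fullmatch(version) is not None


if __name__ == "__main__":
    for candidate in ("1.17.1.dev0", "1.18.0", "1.18.0.rc1", "1.18.0-rc1"):
        print(candidate, is_expected_format(candidate))
```

Under these assumptions, both the old value `1.17.1.dev0` and the new value `1.18.0` pass, while a dashed form such as `1.18.0-rc1` is rejected.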

2 comments on commit c0aea8d

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011072 / 0.011353 (-0.000280) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004796 / 0.011008 (-0.006212) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.034732 / 0.038508 (-0.003776) |
| read_batch_unformated after write_array2d | 0.036420 / 0.023109 (0.013311) |
| read_batch_unformated after write_flattened_sequence | 0.330803 / 0.275898 (0.054905) |
| read_batch_unformated after write_nested_sequence | 0.382682 / 0.323480 (0.059202) |
| read_col_formatted_as_numpy after write_array2d | 0.009742 / 0.007986 (0.001756) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003608 / 0.004328 (-0.000720) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010407 / 0.004250 (0.006157) |
| read_col_unformated after write_array2d | 0.050510 / 0.037052 (0.013457) |
| read_col_unformated after write_flattened_sequence | 0.344404 / 0.258489 (0.085914) |
| read_col_unformated after write_nested_sequence | 0.369120 / 0.293841 (0.075279) |
| read_formatted_as_numpy after write_array2d | 0.047110 / 0.128546 (-0.081436) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012570 / 0.075646 (-0.063076) |
| read_formatted_as_numpy after write_nested_sequence | 0.321377 / 0.419271 (-0.097894) |
| read_unformated after write_array2d | 0.056726 / 0.043533 (0.013193) |
| read_unformated after write_flattened_sequence | 0.330403 / 0.255139 (0.075264) |
| read_unformated after write_nested_sequence | 0.353649 / 0.283200 (0.070449) |
| write_array2d | 0.105706 / 0.141683 (-0.035977) |
| write_flattened_sequence | 1.977034 / 1.452155 (0.524879) |
| write_nested_sequence | 2.078609 / 1.492716 (0.585892) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.268522 / 0.018006 (0.250516) |
| get_batch_of_1024_rows | 0.482403 / 0.000490 (0.481913) |
| get_first_row | 0.032196 / 0.000200 (0.031996) |
| get_last_row | 0.000304 / 0.000054 (0.000250) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.039412 / 0.037411 (0.002000) |
| shard | 0.023747 / 0.014526 (0.009221) |
| shuffle | 0.034001 / 0.176557 (-0.142555) |
| sort | 0.080690 / 0.737135 (-0.656446) |
| train_test_split | 0.034877 / 0.296338 (-0.261462) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.625989 / 0.215209 (0.410780) |
| read 50000 | 5.918052 / 2.077655 (3.840397) |
| read_batch 50000 10 | 2.081826 / 1.504120 (0.577706) |
| read_batch 50000 100 | 1.764401 / 1.541195 (0.223207) |
| read_batch 50000 1000 | 1.811466 / 1.468490 (0.342976) |
| read_formatted numpy 5000 | 0.729458 / 4.584777 (-3.855319) |
| read_formatted pandas 5000 | 6.283200 / 3.745712 (2.537488) |
| read_formatted tensorflow 5000 | 2.916183 / 5.269862 (-2.353678) |
| read_formatted torch 5000 | 1.471653 / 4.565676 (-3.094023) |
| read_formatted_batch numpy 5000 10 | 0.094190 / 0.424275 (-0.330085) |
| read_formatted_batch numpy 5000 1000 | 0.013595 / 0.007607 (0.005988) |
| shuffled read 5000 | 0.725877 / 0.226044 (0.499833) |
| shuffled read 50000 | 7.432641 / 2.268929 (5.163713) |
| shuffled read_batch 50000 10 | 2.797580 / 55.444624 (-52.647044) |
| shuffled read_batch 50000 100 | 2.141819 / 6.876477 (-4.734658) |
| shuffled read_batch 50000 1000 | 2.252072 / 2.142072 (0.110000) |
| shuffled read_formatted numpy 5000 | 0.903347 / 4.805227 (-3.901881) |
| shuffled read_formatted_batch numpy 5000 10 | 0.177137 / 6.500664 (-6.323527) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.069015 / 0.075469 (-0.006454) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.904600 / 1.841788 (0.062813) |
| map fast-tokenizer batched | 16.102169 / 8.074308 (8.027861) |
| map identity | 41.873780 / 10.191392 (31.682388) |
| map identity batched | 1.012720 / 0.680424 (0.332297) |
| map no-op batched | 0.614671 / 0.534201 (0.080470) |
| map no-op batched numpy | 0.554652 / 0.579283 (-0.024632) |
| map no-op batched pandas | 0.657284 / 0.434364 (0.222920) |
| map no-op batched pytorch | 0.377555 / 0.540337 (-0.162782) |
| map no-op batched tensorflow | 0.403528 / 1.386936 (-0.983408) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009021 / 0.011353 (-0.002332) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004265 / 0.011008 (-0.006744) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.041443 / 0.038508 (0.002935) |
| read_batch_unformated after write_array2d | 0.033247 / 0.023109 (0.010137) |
| read_batch_unformated after write_flattened_sequence | 0.335289 / 0.275898 (0.059391) |
| read_batch_unformated after write_nested_sequence | 0.363328 / 0.323480 (0.039848) |
| read_col_formatted_as_numpy after write_array2d | 0.007041 / 0.007986 (-0.000945) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003542 / 0.004328 (-0.000786) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008372 / 0.004250 (0.004122) |
| read_col_unformated after write_array2d | 0.037915 / 0.037052 (0.000863) |
| read_col_unformated after write_flattened_sequence | 0.329461 / 0.258489 (0.070972) |
| read_col_unformated after write_nested_sequence | 0.349771 / 0.293841 (0.055930) |
| read_formatted_as_numpy after write_array2d | 0.045342 / 0.128546 (-0.083204) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014184 / 0.075646 (-0.061462) |
| read_formatted_as_numpy after write_nested_sequence | 0.294046 / 0.419271 (-0.125225) |
| read_unformated after write_array2d | 0.059403 / 0.043533 (0.015870) |
| read_unformated after write_flattened_sequence | 0.347966 / 0.255139 (0.092827) |
| read_unformated after write_nested_sequence | 0.362626 / 0.283200 (0.079427) |
| write_array2d | 0.104087 / 0.141683 (-0.037596) |
| write_flattened_sequence | 1.991183 / 1.452155 (0.539028) |
| write_nested_sequence | 2.065929 / 1.492716 (0.573212) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.251142 / 0.018006 (0.233136) |
| get_batch_of_1024_rows | 0.497580 / 0.000490 (0.497090) |
| get_first_row | 0.006897 / 0.000200 (0.006697) |
| get_last_row | 0.000100 / 0.000054 (0.000046) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.033738 / 0.037411 (-0.003673) |
| shard | 0.023821 / 0.014526 (0.009295) |
| shuffle | 0.027009 / 0.176557 (-0.149547) |
| sort | 0.082134 / 0.737135 (-0.655001) |
| train_test_split | 0.030828 / 0.296338 (-0.265511) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.614085 / 0.215209 (0.398876) |
| read 50000 | 6.130017 / 2.077655 (4.052363) |
| read_batch 50000 10 | 2.286809 / 1.504120 (0.782689) |
| read_batch 50000 100 | 1.941785 / 1.541195 (0.400590) |
| read_batch 50000 1000 | 1.970805 / 1.468490 (0.502315) |
| read_formatted numpy 5000 | 0.749373 / 4.584777 (-3.835404) |
| read_formatted pandas 5000 | 6.373598 / 3.745712 (2.627886) |
| read_formatted tensorflow 5000 | 4.678504 / 5.269862 (-0.591358) |
| read_formatted torch 5000 | 1.485845 / 4.565676 (-3.079832) |
| read_formatted_batch numpy 5000 10 | 0.090117 / 0.424275 (-0.334158) |
| read_formatted_batch numpy 5000 1000 | 0.013633 / 0.007607 (0.006026) |
| shuffled read 5000 | 0.767826 / 0.226044 (0.541782) |
| shuffled read 50000 | 7.630000 / 2.268929 (5.361071) |
| shuffled read_batch 50000 10 | 2.979161 / 55.444624 (-52.465463) |
| shuffled read_batch 50000 100 | 2.330632 / 6.876477 (-4.545845) |
| shuffled read_batch 50000 1000 | 2.307337 / 2.142072 (0.165265) |
| shuffled read_formatted numpy 5000 | 0.922183 / 4.805227 (-3.883044) |
| shuffled read_formatted_batch numpy 5000 10 | 0.187837 / 6.500664 (-6.312827) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.076325 / 0.075469 (0.000856) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.929779 / 1.841788 (0.087992) |
| map fast-tokenizer batched | 15.666850 / 8.074308 (7.592542) |
| map identity | 41.196758 / 10.191392 (31.005366) |
| map identity batched | 1.082362 / 0.680424 (0.401938) |
| map no-op batched | 0.617455 / 0.534201 (0.083254) |
| map no-op batched numpy | 0.564114 / 0.579283 (-0.015169) |
| map no-op batched pandas | 0.705197 / 0.434364 (0.270833) |
| map no-op batched pytorch | 0.394179 / 0.540337 (-0.146158) |
| map no-op batched tensorflow | 0.412103 / 1.386936 (-0.974833) |

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.013815 / 0.011353 (0.002463) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005436 / 0.011008 (-0.005573) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.043424 / 0.038508 (0.004916) |
| read_batch_unformated after write_array2d | 0.042278 / 0.023109 (0.019168) |
| read_batch_unformated after write_flattened_sequence | 0.426895 / 0.275898 (0.150997) |
| read_batch_unformated after write_nested_sequence | 0.447052 / 0.323480 (0.123572) |
| read_col_formatted_as_numpy after write_array2d | 0.011386 / 0.007986 (0.003401) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004749 / 0.004328 (0.000420) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011834 / 0.004250 (0.007583) |
| read_col_unformated after write_array2d | 0.056060 / 0.037052 (0.019007) |
| read_col_unformated after write_flattened_sequence | 0.401440 / 0.258489 (0.142951) |
| read_col_unformated after write_nested_sequence | 0.453898 / 0.293841 (0.160058) |
| read_formatted_as_numpy after write_array2d | 0.050438 / 0.128546 (-0.078108) |
| read_formatted_as_numpy after write_flattened_sequence | 0.016187 / 0.075646 (-0.059459) |
| read_formatted_as_numpy after write_nested_sequence | 0.347728 / 0.419271 (-0.071544) |
| read_unformated after write_array2d | 0.067859 / 0.043533 (0.024326) |
| read_unformated after write_flattened_sequence | 0.408336 / 0.255139 (0.153197) |
| read_unformated after write_nested_sequence | 0.439560 / 0.283200 (0.156360) |
| write_array2d | 0.124597 / 0.141683 (-0.017085) |
| write_flattened_sequence | 2.315763 / 1.452155 (0.863609) |
| write_nested_sequence | 2.339894 / 1.492716 (0.847177) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.359370 / 0.018006 (0.341364) |
| get_batch_of_1024_rows | 0.608771 / 0.000490 (0.608281) |
| get_first_row | 0.045542 / 0.000200 (0.045342) |
| get_last_row | 0.000464 / 0.000054 (0.000410) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.047318 / 0.037411 (0.009907) |
| shard | 0.030830 / 0.014526 (0.016304) |
| shuffle | 0.042065 / 0.176557 (-0.134491) |
| sort | 0.086776 / 0.737135 (-0.650359) |
| train_test_split | 0.039235 / 0.296338 (-0.257103) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.687805 / 0.215209 (0.472595) |
| read 50000 | 6.599803 / 2.077655 (4.522149) |
| read_batch 50000 10 | 2.500299 / 1.504120 (0.996179) |
| read_batch 50000 100 | 2.087917 / 1.541195 (0.546722) |
| read_batch 50000 1000 | 2.190104 / 1.468490 (0.721614) |
| read_formatted numpy 5000 | 0.787077 / 4.584777 (-3.797700) |
| read_formatted pandas 5000 | 6.863191 / 3.745712 (3.117479) |
| read_formatted tensorflow 5000 | 3.515014 / 5.269862 (-1.754848) |
| read_formatted torch 5000 | 1.525978 / 4.565676 (-3.039698) |
| read_formatted_batch numpy 5000 10 | 0.088480 / 0.424275 (-0.335796) |
| read_formatted_batch numpy 5000 1000 | 0.015477 / 0.007607 (0.007870) |
| shuffled read 5000 | 0.794965 / 0.226044 (0.568920) |
| shuffled read 50000 | 8.066844 / 2.268929 (5.797915) |
| shuffled read_batch 50000 10 | 3.245942 / 55.444624 (-52.198682) |
| shuffled read_batch 50000 100 | 2.571215 / 6.876477 (-4.305262) |
| shuffled read_batch 50000 1000 | 2.608948 / 2.142072 (0.466875) |
| shuffled read_formatted numpy 5000 | 0.976232 / 4.805227 (-3.828996) |
| shuffled read_formatted_batch numpy 5000 10 | 0.199525 / 6.500664 (-6.301139) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.083798 / 0.075469 (0.008329) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.203683 / 1.841788 (0.361896) |
| map fast-tokenizer batched | 18.728355 / 8.074308 (10.654047) |
| map identity | 43.597703 / 10.191392 (33.406310) |
| map identity batched | 1.217911 / 0.680424 (0.537487) |
| map no-op batched | 0.717707 / 0.534201 (0.183506) |
| map no-op batched numpy | 0.638286 / 0.579283 (0.059003) |
| map no-op batched pandas | 0.772725 / 0.434364 (0.338361) |
| map no-op batched pytorch | 0.435620 / 0.540337 (-0.104718) |
| map no-op batched tensorflow | 0.457459 / 1.386936 (-0.929478) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011770 / 0.011353 (0.000417) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005400 / 0.011008 (-0.005609) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.040413 / 0.038508 (0.001905) |
| read_batch_unformated after write_array2d | 0.041191 / 0.023109 (0.018082) |
| read_batch_unformated after write_flattened_sequence | 0.414416 / 0.275898 (0.138518) |
| read_batch_unformated after write_nested_sequence | 0.468200 / 0.323480 (0.144721) |
| read_col_formatted_as_numpy after write_array2d | 0.008317 / 0.007986 (0.000332) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004469 / 0.004328 (0.000141) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009255 / 0.004250 (0.005004) |
| read_col_unformated after write_array2d | 0.049411 / 0.037052 (0.012359) |
| read_col_unformated after write_flattened_sequence | 0.429601 / 0.258489 (0.171112) |
| read_col_unformated after write_nested_sequence | 0.468815 / 0.293841 (0.174974) |
| read_formatted_as_numpy after write_array2d | 0.049149 / 0.128546 (-0.079398) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014990 / 0.075646 (-0.060657) |
| read_formatted_as_numpy after write_nested_sequence | 0.336894 / 0.419271 (-0.082377) |
| read_unformated after write_array2d | 0.067261 / 0.043533 (0.023729) |
| read_unformated after write_flattened_sequence | 0.417611 / 0.255139 (0.162472) |
| read_unformated after write_nested_sequence | 0.431438 / 0.283200 (0.148239) |
| write_array2d | 0.113095 / 0.141683 (-0.028588) |
| write_flattened_sequence | 2.401733 / 1.452155 (0.949579) |
| write_nested_sequence | 2.377476 / 1.492716 (0.884759) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.316046 / 0.018006 (0.298039) |
| get_batch_of_1024_rows | 0.575309 / 0.000490 (0.574819) |
| get_first_row | 0.035657 / 0.000200 (0.035457) |
| get_last_row | 0.000249 / 0.000054 (0.000195) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.041215 / 0.037411 (0.003804) |
| shard | 0.032229 / 0.014526 (0.017703) |
| shuffle | 0.039530 / 0.176557 (-0.137027) |
| sort | 0.083519 / 0.737135 (-0.653617) |
| train_test_split | 0.039409 / 0.296338 (-0.256929) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.671804 / 0.215209 (0.456595) |
| read 50000 | 6.668814 / 2.077655 (4.591159) |
| read_batch 50000 10 | 2.677209 / 1.504120 (1.173089) |
| read_batch 50000 100 | 2.311860 / 1.541195 (0.770665) |
| read_batch 50000 1000 | 2.374814 / 1.468490 (0.906324) |
| read_formatted numpy 5000 | 0.794149 / 4.584777 (-3.790628) |
| read_formatted pandas 5000 | 7.091949 / 3.745712 (3.346236) |
| read_formatted tensorflow 5000 | 3.295632 / 5.269862 (-1.974230) |
| read_formatted torch 5000 | 1.542143 / 4.565676 (-3.023534) |
| read_formatted_batch numpy 5000 10 | 0.090467 / 0.424275 (-0.333808) |
| read_formatted_batch numpy 5000 1000 | 0.016017 / 0.007607 (0.008410) |
| shuffled read 5000 | 0.823757 / 0.226044 (0.597713) |
| shuffled read 50000 | 8.301541 / 2.268929 (6.032613) |
| shuffled read_batch 50000 10 | 3.459367 / 55.444624 (-51.985258) |
| shuffled read_batch 50000 100 | 2.749240 / 6.876477 (-4.127237) |
| shuffled read_batch 50000 1000 | 2.814750 / 2.142072 (0.672677) |
| shuffled read_formatted numpy 5000 | 0.981907 / 4.805227 (-3.823320) |
| shuffled read_formatted_batch numpy 5000 10 | 0.195220 / 6.500664 (-6.305445) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.079311 / 0.075469 (0.003842) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.221056 / 1.841788 (0.379269) |
| map fast-tokenizer batched | 18.298719 / 8.074308 (10.224411) |
| map identity | 44.195387 / 10.191392 (34.003995) |
| map identity batched | 1.268960 / 0.680424 (0.588536) |
| map no-op batched | 0.720654 / 0.534201 (0.186453) |
| map no-op batched numpy | 0.666867 / 0.579283 (0.087584) |
| map no-op batched pandas | 0.770422 / 0.434364 (0.336058) |
| map no-op batched pytorch | 0.441162 / 0.540337 (-0.099176) |
| map no-op batched tensorflow | 0.443788 / 1.386936 (-0.943148) |
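
Each benchmark cell above has the form `new / old (diff)`. Judging from the numbers shown, the diff appears to be `new - old`, rounded to six decimals. The sketch below parses one cell and checks that relationship; the helper names `parse_cell` and `is_consistent` and the tolerance are assumptions made for illustration and are not part of the benchmark tooling.

```python
import re

# Hypothetical helpers for reading one "new / old (diff)" cell.
_CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> tuple[float, float, float]:
    """Split a cell such as '0.036420 / 0.023109 (0.013311)' into floats."""
    match = _CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check that diff is new - old, allowing for rounding in the printed values."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    print(parse_cell("0.036420 / 0.023109 (0.013311)"))
    print(is_consistent("0.011072 / 0.011353 (-0.000280)"))
```

The tolerance is needed because the three printed values are rounded independently, so the printed diff can differ from the recomputed one in the last digit.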
