Commit
Release: 1.10.2
lhoestq committed Jul 22, 2021
1 parent f74e245 commit cea1a29
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -25,7 +25,7 @@
 # The short X.Y version
 version = ""
 # The full version, including alpha/beta/rc tags
-release = "1.10.1"
+release = "1.10.2"


 # -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion setup.py
@@ -198,7 +198,7 @@

 setup(
     name="datasets",
-    version="1.10.2.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="1.10.2",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description=DOCLINES[0],
     long_description="\n".join(DOCLINES[2:]),
     author="HuggingFace Inc.",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -18,7 +18,7 @@
 # pylint: enable=line-too-long
 # pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "1.10.2.dev0"
+__version__ = "1.10.2"

 import pyarrow
 from pyarrow import total_allocated_bytes
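
The comment in setup.py names three accepted version shapes: x.y.z.dev0, x.y.z.rc1, and x.y.z, with dots and no dashes. A minimal sketch of a check for those shapes follows. The helper names `is_valid_version` and `is_release` are hypothetical and not part of the repository, and accepting any `rcN` suffix (rather than only `rc1`) is an assumption.

```python
import re

# Assumed pattern for the shapes named in the setup.py comment:
# x.y.z, x.y.z.dev0, or x.y.z.rcN (dots only, no dashes).
VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.dev0|\.rc\d+)?")

# A final release has no .dev0 or .rcN suffix.
RELEASE_RE = re.compile(r"\d+\.\d+\.\d+")


def is_valid_version(version: str) -> bool:
    """Return True if the string matches one of the three accepted shapes."""
    return VERSION_RE.fullmatch(version) is not None


def is_release(version: str) -> bool:
    """Return True only for a plain x.y.z version."""
    return RELEASE_RE.fullmatch(version) is not None
```

Under these assumptions, the pre-commit value "1.10.2.dev0" is valid but not a release, while the committed value "1.10.2" is both.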

2 comments on commit cea1a29

@github-actions

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010249 / 0.011353 (-0.001104) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003735 / 0.011008 (-0.007273) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.034719 / 0.038508 (-0.003790) |
| read_batch_unformated after write_array2d | 0.035826 / 0.023109 (0.012717) |
| read_batch_unformated after write_flattened_sequence | 0.323899 / 0.275898 (0.048001) |
| read_batch_unformated after write_nested_sequence | 0.377579 / 0.323480 (0.054099) |
| read_col_formatted_as_numpy after write_array2d | 0.009246 / 0.007986 (0.001260) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004577 / 0.004328 (0.000248) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010039 / 0.004250 (0.005788) |
| read_col_unformated after write_array2d | 0.039754 / 0.037052 (0.002702) |
| read_col_unformated after write_flattened_sequence | 0.337770 / 0.258489 (0.079281) |
| read_col_unformated after write_nested_sequence | 0.384477 / 0.293841 (0.090636) |
| read_formatted_as_numpy after write_array2d | 0.036661 / 0.128546 (-0.091886) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011112 / 0.075646 (-0.064534) |
| read_formatted_as_numpy after write_nested_sequence | 0.280599 / 0.419271 (-0.138673) |
| read_unformated after write_array2d | 0.064833 / 0.043533 (0.021300) |
| read_unformated after write_flattened_sequence | 0.333882 / 0.255139 (0.078743) |
| read_unformated after write_nested_sequence | 0.382766 / 0.283200 (0.099566) |
| write_array2d | 0.079437 / 0.141683 (-0.062246) |
| write_flattened_sequence | 1.701847 / 1.452155 (0.249692) |
| write_nested_sequence | 1.764488 / 1.492716 (0.271772) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.017198 / 0.018006 (-0.000809) |
| get_batch_of_1024_rows | 0.461096 / 0.000490 (0.460606) |
| get_first_row | 0.002782 / 0.000200 (0.002582) |
| get_last_row | 0.000434 / 0.000054 (0.000379) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.038381 / 0.037411 (0.000970) |
| shard | 0.025006 / 0.014526 (0.010480) |
| shuffle | 0.025535 / 0.176557 (-0.151021) |
| sort | 0.120914 / 0.737135 (-0.616221) |
| train_test_split | 0.027811 / 0.296338 (-0.268527) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.431327 / 0.215209 (0.216118) |
| read 50000 | 4.390998 / 2.077655 (2.313343) |
| read_batch 50000 10 | 1.967808 / 1.504120 (0.463688) |
| read_batch 50000 100 | 1.797684 / 1.541195 (0.256490) |
| read_batch 50000 1000 | 1.665200 / 1.468490 (0.196710) |
| read_formatted numpy 5000 | 0.485331 / 4.584777 (-4.099446) |
| read_formatted pandas 5000 | 5.583692 / 3.745712 (1.837980) |
| read_formatted tensorflow 5000 | 3.640582 / 5.269862 (-1.629279) |
| read_formatted torch 5000 | 1.597524 / 4.565676 (-2.968152) |
| read_formatted_batch numpy 5000 10 | 0.060102 / 0.424275 (-0.364173) |
| read_formatted_batch numpy 5000 1000 | 0.006385 / 0.007607 (-0.001222) |
| shuffled read 5000 | 0.622964 / 0.226044 (0.396920) |
| shuffled read 50000 | 6.239752 / 2.268929 (3.970824) |
| shuffled read_batch 50000 10 | 2.670444 / 55.444624 (-52.774181) |
| shuffled read_batch 50000 100 | 2.051224 / 6.876477 (-4.825253) |
| shuffled read_batch 50000 1000 | 2.076105 / 2.142072 (-0.065968) |
| shuffled read_formatted numpy 5000 | 0.682532 / 4.805227 (-4.122695) |
| shuffled read_formatted_batch numpy 5000 10 | 0.152427 / 6.500664 (-6.348237) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.059261 / 0.075469 (-0.016208) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 15.802987 / 1.841788 (13.961199) |
| map fast-tokenizer batched | 13.705671 / 8.074308 (5.631363) |
| map identity | 37.986066 / 10.191392 (27.794674) |
| map identity batched | 0.840481 / 0.680424 (0.160057) |
| map no-op batched | 0.568186 / 0.534201 (0.033985) |
| map no-op batched numpy | 0.240249 / 0.579283 (-0.339034) |
| map no-op batched pandas | 0.605786 / 0.434364 (0.171422) |
| map no-op batched pytorch | 0.214771 / 0.540337 (-0.325566) |
| map no-op batched tensorflow | 1.051119 / 1.386936 (-0.335817) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009836 / 0.011353 (-0.001517) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003778 / 0.011008 (-0.007230) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.033196 / 0.038508 (-0.005312) |
| read_batch_unformated after write_array2d | 0.033990 / 0.023109 (0.010881) |
| read_batch_unformated after write_flattened_sequence | 0.342928 / 0.275898 (0.067030) |
| read_batch_unformated after write_nested_sequence | 0.388599 / 0.323480 (0.065119) |
| read_col_formatted_as_numpy after write_array2d | 0.008688 / 0.007986 (0.000703) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004680 / 0.004328 (0.000352) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009004 / 0.004250 (0.004753) |
| read_col_unformated after write_array2d | 0.035331 / 0.037052 (-0.001721) |
| read_col_unformated after write_flattened_sequence | 0.346612 / 0.258489 (0.088123) |
| read_col_unformated after write_nested_sequence | 0.373403 / 0.293841 (0.079562) |
| read_formatted_as_numpy after write_array2d | 0.038561 / 0.128546 (-0.089985) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011515 / 0.075646 (-0.064131) |
| read_formatted_as_numpy after write_nested_sequence | 0.277473 / 0.419271 (-0.141798) |
| read_unformated after write_array2d | 0.051510 / 0.043533 (0.007978) |
| read_unformated after write_flattened_sequence | 0.352279 / 0.255139 (0.097140) |
| read_unformated after write_nested_sequence | 0.385289 / 0.283200 (0.102089) |
| write_array2d | 0.074424 / 0.141683 (-0.067259) |
| write_flattened_sequence | 1.799493 / 1.452155 (0.347338) |
| write_nested_sequence | 1.747546 / 1.492716 (0.254830) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.081702 / 0.018006 (0.063696) |
| get_batch_of_1024_rows | 0.456252 / 0.000490 (0.455762) |
| get_first_row | 0.034543 / 0.000200 (0.034343) |
| get_last_row | 0.000947 / 0.000054 (0.000893) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036986 / 0.037411 (-0.000425) |
| shard | 0.026566 / 0.014526 (0.012041) |
| shuffle | 0.025072 / 0.176557 (-0.151485) |
| sort | 0.121108 / 0.737135 (-0.616028) |
| train_test_split | 0.026662 / 0.296338 (-0.269677) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.432303 / 0.215209 (0.217094) |
| read 50000 | 4.596113 / 2.077655 (2.518459) |
| read_batch 50000 10 | 2.079533 / 1.504120 (0.575413) |
| read_batch 50000 100 | 1.835280 / 1.541195 (0.294086) |
| read_batch 50000 1000 | 1.766105 / 1.468490 (0.297615) |
| read_formatted numpy 5000 | 0.528138 / 4.584777 (-4.056639) |
| read_formatted pandas 5000 | 5.603164 / 3.745712 (1.857452) |
| read_formatted tensorflow 5000 | 3.498030 / 5.269862 (-1.771832) |
| read_formatted torch 5000 | 1.586904 / 4.565676 (-2.978773) |
| read_formatted_batch numpy 5000 10 | 0.054743 / 0.424275 (-0.369532) |
| read_formatted_batch numpy 5000 1000 | 0.005230 / 0.007607 (-0.002377) |
| shuffled read 5000 | 0.568150 / 0.226044 (0.342106) |
| shuffled read 50000 | 6.056982 / 2.268929 (3.788054) |
| shuffled read_batch 50000 10 | 2.774409 / 55.444624 (-52.670215) |
| shuffled read_batch 50000 100 | 2.175419 / 6.876477 (-4.701058) |
| shuffled read_batch 50000 1000 | 2.071495 / 2.142072 (-0.070577) |
| shuffled read_formatted numpy 5000 | 0.624328 / 4.805227 (-4.180899) |
| shuffled read_formatted_batch numpy 5000 10 | 0.132617 / 6.500664 (-6.368047) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.054242 / 0.075469 (-0.021227) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 16.132153 / 1.841788 (14.290366) |
| map fast-tokenizer batched | 13.623512 / 8.074308 (5.549204) |
| map identity | 38.387327 / 10.191392 (28.195935) |
| map identity batched | 0.827057 / 0.680424 (0.146634) |
| map no-op batched | 0.547161 / 0.534201 (0.012960) |
| map no-op batched numpy | 0.256053 / 0.579283 (-0.323230) |
| map no-op batched pandas | 0.591733 / 0.434364 (0.157369) |
| map no-op batched pytorch | 0.200629 / 0.540337 (-0.339708) |
| map no-op batched tensorflow | 1.000498 / 1.386936 (-0.386439) |

@github-actions

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010681 / 0.011353 (-0.000672) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004022 / 0.011008 (-0.006986) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.039751 / 0.038508 (0.001243) |
| read_batch_unformated after write_array2d | 0.039490 / 0.023109 (0.016381) |
| read_batch_unformated after write_flattened_sequence | 0.339098 / 0.275898 (0.063200) |
| read_batch_unformated after write_nested_sequence | 0.387923 / 0.323480 (0.064443) |
| read_col_formatted_as_numpy after write_array2d | 0.008591 / 0.007986 (0.000605) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006044 / 0.004328 (0.001715) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011006 / 0.004250 (0.006755) |
| read_col_unformated after write_array2d | 0.042054 / 0.037052 (0.005002) |
| read_col_unformated after write_flattened_sequence | 0.332241 / 0.258489 (0.073752) |
| read_col_unformated after write_nested_sequence | 0.426868 / 0.293841 (0.133027) |
| read_formatted_as_numpy after write_array2d | 0.033265 / 0.128546 (-0.095282) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011645 / 0.075646 (-0.064001) |
| read_formatted_as_numpy after write_nested_sequence | 0.291804 / 0.419271 (-0.127468) |
| read_unformated after write_array2d | 0.055542 / 0.043533 (0.012009) |
| read_unformated after write_flattened_sequence | 0.336757 / 0.255139 (0.081618) |
| read_unformated after write_nested_sequence | 0.367872 / 0.283200 (0.084673) |
| write_array2d | 0.087039 / 0.141683 (-0.054644) |
| write_flattened_sequence | 1.818348 / 1.452155 (0.366193) |
| write_nested_sequence | 1.810091 / 1.492716 (0.317374) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.017630 / 0.018006 (-0.000376) |
| get_batch_of_1024_rows | 0.587440 / 0.000490 (0.586950) |
| get_first_row | 0.003729 / 0.000200 (0.003529) |
| get_last_row | 0.000388 / 0.000054 (0.000333) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.048519 / 0.037411 (0.011108) |
| shard | 0.026758 / 0.014526 (0.012232) |
| shuffle | 0.029864 / 0.176557 (-0.146692) |
| sort | 0.143052 / 0.737135 (-0.594083) |
| train_test_split | 0.030423 / 0.296338 (-0.265915) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.513062 / 0.215209 (0.297853) |
| read 50000 | 5.080072 / 2.077655 (3.002417) |
| read_batch 50000 10 | 2.176134 / 1.504120 (0.672014) |
| read_batch 50000 100 | 1.850025 / 1.541195 (0.308830) |
| read_batch 50000 1000 | 1.872155 / 1.468490 (0.403665) |
| read_formatted numpy 5000 | 0.520451 / 4.584777 (-4.064326) |
| read_formatted pandas 5000 | 6.225584 / 3.745712 (2.479872) |
| read_formatted tensorflow 5000 | 3.720614 / 5.269862 (-1.549247) |
| read_formatted torch 5000 | 1.566565 / 4.565676 (-2.999112) |
| read_formatted_batch numpy 5000 10 | 0.055229 / 0.424275 (-0.369046) |
| read_formatted_batch numpy 5000 1000 | 0.006300 / 0.007607 (-0.001307) |
| shuffled read 5000 | 0.629814 / 0.226044 (0.403770) |
| shuffled read 50000 | 6.330017 / 2.268929 (4.061089) |
| shuffled read_batch 50000 10 | 2.839832 / 55.444624 (-52.604793) |
| shuffled read_batch 50000 100 | 2.320409 / 6.876477 (-4.556068) |
| shuffled read_batch 50000 1000 | 2.207881 / 2.142072 (0.065809) |
| shuffled read_formatted numpy 5000 | 0.674975 / 4.805227 (-4.130252) |
| shuffled read_formatted_batch numpy 5000 10 | 0.144605 / 6.500664 (-6.356059) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.054717 / 0.075469 (-0.020752) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 16.402955 / 1.841788 (14.561168) |
| map fast-tokenizer batched | 14.541293 / 8.074308 (6.466985) |
| map identity | 41.759403 / 10.191392 (31.568011) |
| map identity batched | 0.948295 / 0.680424 (0.267871) |
| map no-op batched | 0.676208 / 0.534201 (0.142007) |
| map no-op batched numpy | 0.289308 / 0.579283 (-0.289975) |
| map no-op batched pandas | 0.678491 / 0.434364 (0.244127) |
| map no-op batched pytorch | 0.224357 / 0.540337 (-0.315981) |
| map no-op batched tensorflow | 1.047602 / 1.386936 (-0.339334) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011387 / 0.011353 (0.000035) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004128 / 0.011008 (-0.006880) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.038167 / 0.038508 (-0.000341) |
| read_batch_unformated after write_array2d | 0.039570 / 0.023109 (0.016460) |
| read_batch_unformated after write_flattened_sequence | 0.333731 / 0.275898 (0.057833) |
| read_batch_unformated after write_nested_sequence | 0.395028 / 0.323480 (0.071548) |
| read_col_formatted_as_numpy after write_array2d | 0.009191 / 0.007986 (0.001206) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006590 / 0.004328 (0.002261) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.013409 / 0.004250 (0.009159) |
| read_col_unformated after write_array2d | 0.042414 / 0.037052 (0.005362) |
| read_col_unformated after write_flattened_sequence | 0.339285 / 0.258489 (0.080796) |
| read_col_unformated after write_nested_sequence | 0.391960 / 0.293841 (0.098120) |
| read_formatted_as_numpy after write_array2d | 0.034603 / 0.128546 (-0.093943) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011240 / 0.075646 (-0.064406) |
| read_formatted_as_numpy after write_nested_sequence | 0.312761 / 0.419271 (-0.106511) |
| read_unformated after write_array2d | 0.058911 / 0.043533 (0.015378) |
| read_unformated after write_flattened_sequence | 0.360862 / 0.255139 (0.105723) |
| read_unformated after write_nested_sequence | 0.387045 / 0.283200 (0.103846) |
| write_array2d | 0.093306 / 0.141683 (-0.048377) |
| write_flattened_sequence | 1.729290 / 1.452155 (0.277136) |
| write_nested_sequence | 1.920260 / 1.492716 (0.427544) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.050852 / 0.018006 (0.032846) |
| get_batch_of_1024_rows | 0.564442 / 0.000490 (0.563953) |
| get_first_row | 0.022922 / 0.000200 (0.022722) |
| get_last_row | 0.000349 / 0.000054 (0.000294) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.038918 / 0.037411 (0.001507) |
| shard | 0.036704 / 0.014526 (0.022178) |
| shuffle | 0.030369 / 0.176557 (-0.146187) |
| sort | 0.144154 / 0.737135 (-0.592981) |
| train_test_split | 0.033911 / 0.296338 (-0.262428) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.522735 / 0.215209 (0.307526) |
| read 50000 | 5.037515 / 2.077655 (2.959861) |
| read_batch 50000 10 | 2.565802 / 1.504120 (1.061682) |
| read_batch 50000 100 | 2.216260 / 1.541195 (0.675065) |
| read_batch 50000 1000 | 2.114039 / 1.468490 (0.645549) |
| read_formatted numpy 5000 | 0.515723 / 4.584777 (-4.069054) |
| read_formatted pandas 5000 | 6.247140 / 3.745712 (2.501428) |
| read_formatted tensorflow 5000 | 5.416167 / 5.269862 (0.146305) |
| read_formatted torch 5000 | 1.702482 / 4.565676 (-2.863195) |
| read_formatted_batch numpy 5000 10 | 0.063091 / 0.424275 (-0.361184) |
| read_formatted_batch numpy 5000 1000 | 0.006253 / 0.007607 (-0.001354) |
| shuffled read 5000 | 0.677881 / 0.226044 (0.451836) |
| shuffled read 50000 | 6.790229 / 2.268929 (4.521301) |
| shuffled read_batch 50000 10 | 3.050846 / 55.444624 (-52.393779) |
| shuffled read_batch 50000 100 | 2.274910 / 6.876477 (-4.601567) |
| shuffled read_batch 50000 1000 | 2.258020 / 2.142072 (0.115947) |
| shuffled read_formatted numpy 5000 | 0.653520 / 4.805227 (-4.151708) |
| shuffled read_formatted_batch numpy 5000 10 | 0.146967 / 6.500664 (-6.353697) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.058203 / 0.075469 (-0.017266) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 16.820525 / 1.841788 (14.978737) |
| map fast-tokenizer batched | 16.202577 / 8.074308 (8.128269) |
| map identity | 40.539311 / 10.191392 (30.347919) |
| map identity batched | 0.922768 / 0.680424 (0.242344) |
| map no-op batched | 0.661208 / 0.534201 (0.127007) |
| map no-op batched numpy | 0.298368 / 0.579283 (-0.280915) |
| map no-op batched pandas | 0.605786 / 0.434364 (0.171422) |
| map no-op batched pytorch | 0.227465 / 0.540337 (-0.312872) |
| map no-op batched tensorflow | 1.156776 / 1.386936 (-0.230160) |
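
Each benchmark cell above has the form `new / old (diff)`. The comments do not say how `diff` is computed, but the numbers are consistent with `new - old` rounded to six decimals. A minimal sketch for reading such a cell under that assumption follows. The helper names `parse_cell`, `diff_is_consistent`, and `new_over_old` are hypothetical, and the tolerance is an assumption chosen to absorb rounding in the last printed digit.

```python
import re

# Matches one cell such as "0.010249 / 0.011353 (-0.001104)".
CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> tuple:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not a 'new / old (diff)' cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def diff_is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check the assumed relation diff == new - old, up to rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


def new_over_old(cell: str) -> float:
    """Ratio of the new value to the old baseline (above 1 means larger)."""
    new, old, _ = parse_cell(cell)
    return new / old
```

A ratio can be easier to compare across metrics than the absolute difference, since the old baselines span several orders of magnitude.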
