
Commit

Fixing things because I am bad at merging
Rocketknight1 committed Sep 15, 2021
1 parent c8f251b commit f1f8888
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/arrow_dataset.py
@@ -484,7 +484,7 @@ class NonExistentDatasetError(Exception):
     pass
 
 
-class Dataset(DatasetInfoMixin, IndexableMixin):
+class Dataset(DatasetInfoMixin, IndexableMixin, TensorflowDatasetMixIn):
     """A Dataset backed by an Arrow table."""
 
     def __init__(
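
The one-line change above appends `TensorflowDatasetMixIn` to the base classes of `Dataset`. As a rough illustration of what adding a mixin to a class's bases does in Python, here is a minimal sketch. The three base classes are empty stand-ins rather than the real `datasets` classes, and `framework_name`, `DatasetBefore`, `DatasetAfter`, and `base_names` are hypothetical names invented for this example.

```python
# Stand-in classes: these are NOT the real `datasets` implementations.
class DatasetInfoMixin:
    """Empty placeholder for the real mixin."""


class IndexableMixin:
    """Empty placeholder for the real mixin."""


class TensorflowDatasetMixIn:
    """Placeholder; `framework_name` is a hypothetical method for illustration."""

    def framework_name(self):
        return "tensorflow"


# Class signature before the commit (mixin missing).
class DatasetBefore(DatasetInfoMixin, IndexableMixin):
    pass


# Class signature after the commit (mixin restored).
class DatasetAfter(DatasetInfoMixin, IndexableMixin, TensorflowDatasetMixIn):
    pass


def base_names(cls):
    """Return the class names in method resolution order."""
    return [c.__name__ for c in cls.__mro__]


# Methods defined on the mixin are only reachable once it is listed as a base.
print(hasattr(DatasetBefore(), "framework_name"))
print(DatasetAfter().framework_name())
print(base_names(DatasetAfter))
```

Because the mixin is listed last, attributes defined on the two earlier bases would take precedence over same-named attributes on the mixin.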

1 comment on commit f1f8888

@github-actions
Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.012736 / 0.011353 (0.001383) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004683 / 0.011008 (-0.006325) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.044459 / 0.038508 (0.005951) |
| read_batch_unformated after write_array2d | 0.044187 / 0.023109 (0.021078) |
| read_batch_unformated after write_flattened_sequence | 0.435298 / 0.275898 (0.159400) |
| read_batch_unformated after write_nested_sequence | 0.489741 / 0.323480 (0.166262) |
| read_col_formatted_as_numpy after write_array2d | 0.010233 / 0.007986 (0.002247) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006053 / 0.004328 (0.001724) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011794 / 0.004250 (0.007544) |
| read_col_unformated after write_array2d | 0.051834 / 0.037052 (0.014782) |
| read_col_unformated after write_flattened_sequence | 0.445652 / 0.258489 (0.187162) |
| read_col_unformated after write_nested_sequence | 0.504472 / 0.293841 (0.210631) |
| read_formatted_as_numpy after write_array2d | 0.039444 / 0.128546 (-0.089102) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012067 / 0.075646 (-0.063579) |
| read_formatted_as_numpy after write_nested_sequence | 0.357938 / 0.419271 (-0.061334) |
| read_unformated after write_array2d | 0.063310 / 0.043533 (0.019777) |
| read_unformated after write_flattened_sequence | 0.435055 / 0.255139 (0.179916) |
| read_unformated after write_nested_sequence | 0.474336 / 0.283200 (0.191136) |
| write_array2d | 0.165041 / 0.141683 (0.023358) |
| write_flattened_sequence | 2.352103 / 1.452155 (0.899948) |
| write_nested_sequence | 2.376197 / 1.492716 (0.883481) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.255247 / 0.018006 (0.237241) |
| get_batch_of_1024_rows | 0.608644 / 0.000490 (0.608154) |
| get_first_row | 0.010598 / 0.000200 (0.010399) |
| get_last_row | 0.000172 / 0.000054 (0.000117) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.047238 / 0.037411 (0.009826) |
| shard | 0.031988 / 0.014526 (0.017462) |
| shuffle | 0.037882 / 0.176557 (-0.138675) |
| sort | 0.162645 / 0.737135 (-0.574490) |
| train_test_split | 0.036625 / 0.296338 (-0.259714) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.556932 / 0.215209 (0.341723) |
| read 50000 | 5.748535 / 2.077655 (3.670880) |
| read_batch 50000 10 | 2.725332 / 1.504120 (1.221212) |
| read_batch 50000 100 | 2.393737 / 1.541195 (0.852542) |
| read_batch 50000 1000 | 2.427049 / 1.468490 (0.958559) |
| read_formatted numpy 5000 | 0.571387 / 4.584777 (-4.013390) |
| read_formatted pandas 5000 | 7.090975 / 3.745712 (3.345262) |
| read_formatted tensorflow 5000 | 5.819122 / 5.269862 (0.549260) |
| read_formatted torch 5000 | 3.640109 / 4.565676 (-0.925568) |
| read_formatted_batch numpy 5000 10 | 0.065994 / 0.424275 (-0.358281) |
| read_formatted_batch numpy 5000 1000 | 0.007621 / 0.007607 (0.000014) |
| shuffled read 5000 | 0.743353 / 0.226044 (0.517308) |
| shuffled read 50000 | 7.355149 / 2.268929 (5.086221) |
| shuffled read_batch 50000 10 | 3.402874 / 55.444624 (-52.041750) |
| shuffled read_batch 50000 100 | 2.752674 / 6.876477 (-4.123802) |
| shuffled read_batch 50000 1000 | 2.914539 / 2.142072 (0.772466) |
| shuffled read_formatted numpy 5000 | 0.764884 / 4.805227 (-4.040343) |
| shuffled read_formatted_batch numpy 5000 10 | 0.170785 / 6.500664 (-6.329879) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.070013 / 0.075469 (-0.005456) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.228124 / 1.841788 (-0.613664) |
| map fast-tokenizer batched | 16.326001 / 8.074308 (8.251692) |
| map identity | 44.810260 / 10.191392 (34.618868) |
| map identity batched | 1.019462 / 0.680424 (0.339039) |
| map no-op batched | 0.708803 / 0.534201 (0.174603) |
| map no-op batched numpy | 0.309671 / 0.579283 (-0.269612) |
| map no-op batched pandas | 0.736732 / 0.434364 (0.302369) |
| map no-op batched pytorch | 0.431914 / 0.540337 (-0.108424) |
| map no-op batched tensorflow | 0.444117 / 1.386936 (-0.942819) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.012757 / 0.011353 (0.001405) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004834 / 0.011008 (-0.006174) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.044949 / 0.038508 (0.006441) |
| read_batch_unformated after write_array2d | 0.043238 / 0.023109 (0.020128) |
| read_batch_unformated after write_flattened_sequence | 0.395436 / 0.275898 (0.119538) |
| read_batch_unformated after write_nested_sequence | 0.490801 / 0.323480 (0.167321) |
| read_col_formatted_as_numpy after write_array2d | 0.010408 / 0.007986 (0.002422) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006345 / 0.004328 (0.002016) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012077 / 0.004250 (0.007827) |
| read_col_unformated after write_array2d | 0.049117 / 0.037052 (0.012064) |
| read_col_unformated after write_flattened_sequence | 0.412029 / 0.258489 (0.153540) |
| read_col_unformated after write_nested_sequence | 0.462997 / 0.293841 (0.169156) |
| read_formatted_as_numpy after write_array2d | 0.038383 / 0.128546 (-0.090163) |
| read_formatted_as_numpy after write_flattened_sequence | 0.013539 / 0.075646 (-0.062107) |
| read_formatted_as_numpy after write_nested_sequence | 0.348590 / 0.419271 (-0.070681) |
| read_unformated after write_array2d | 0.062354 / 0.043533 (0.018821) |
| read_unformated after write_flattened_sequence | 0.412055 / 0.255139 (0.156916) |
| read_unformated after write_nested_sequence | 0.461962 / 0.283200 (0.178762) |
| write_array2d | 0.121918 / 0.141683 (-0.019765) |
| write_flattened_sequence | 2.232137 / 1.452155 (0.779982) |
| write_nested_sequence | 2.288381 / 1.492716 (0.795665) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.354785 / 0.018006 (0.336779) |
| get_batch_of_1024_rows | 0.582217 / 0.000490 (0.581728) |
| get_first_row | 0.046645 / 0.000200 (0.046445) |
| get_last_row | 0.000483 / 0.000054 (0.000429) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.046068 / 0.037411 (0.008657) |
| shard | 0.031043 / 0.014526 (0.016518) |
| shuffle | 0.033345 / 0.176557 (-0.143211) |
| sort | 0.159597 / 0.737135 (-0.577538) |
| train_test_split | 0.035435 / 0.296338 (-0.260904) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.546573 / 0.215209 (0.331364) |
| read 50000 | 5.476165 / 2.077655 (3.398510) |
| read_batch 50000 10 | 2.592783 / 1.504120 (1.088663) |
| read_batch 50000 100 | 2.223854 / 1.541195 (0.682660) |
| read_batch 50000 1000 | 2.305229 / 1.468490 (0.836738) |
| read_formatted numpy 5000 | 0.594846 / 4.584777 (-3.989931) |
| read_formatted pandas 5000 | 7.014213 / 3.745712 (3.268501) |
| read_formatted tensorflow 5000 | 7.170406 / 5.269862 (1.900545) |
| read_formatted torch 5000 | 3.387581 / 4.565676 (-1.178096) |
| read_formatted_batch numpy 5000 10 | 0.065313 / 0.424275 (-0.358962) |
| read_formatted_batch numpy 5000 1000 | 0.006608 / 0.007607 (-0.000999) |
| shuffled read 5000 | 0.702341 / 0.226044 (0.476297) |
| shuffled read 50000 | 6.949387 / 2.268929 (4.680458) |
| shuffled read_batch 50000 10 | 3.296431 / 55.444624 (-52.148193) |
| shuffled read_batch 50000 100 | 2.664121 / 6.876477 (-4.212356) |
| shuffled read_batch 50000 1000 | 2.736361 / 2.142072 (0.594288) |
| shuffled read_formatted numpy 5000 | 0.760833 / 4.805227 (-4.044394) |
| shuffled read_formatted_batch numpy 5000 10 | 0.171651 / 6.500664 (-6.329013) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.074569 / 0.075469 (-0.000900) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.234569 / 1.841788 (-0.607219) |
| map fast-tokenizer batched | 16.506617 / 8.074308 (8.432309) |
| map identity | 45.462834 / 10.191392 (35.271442) |
| map identity batched | 1.021873 / 0.680424 (0.341449) |
| map no-op batched | 0.706307 / 0.534201 (0.172106) |
| map no-op batched numpy | 0.308812 / 0.579283 (-0.270471) |
| map no-op batched pandas | 0.762979 / 0.434364 (0.328615) |
| map no-op batched pytorch | 0.457309 / 0.540337 (-0.083028) |
| map no-op batched tensorflow | 0.495789 / 1.386936 (-0.891147) |
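
Every benchmark entry above is written as `new / old (diff)`, and the reported diff matches `new - old` up to rounding (for example, 0.012736 - 0.011353 = 0.001383). For anyone post-processing these numbers, the following is a minimal sketch of parsing one entry and checking that relationship. The names `CELL`, `parse_cell`, and `relative_change` are hypothetical helpers written for this example and are not part of the benchmark tooling.

```python
import re

# Matches one entry such as "0.012736 / 0.011353 (0.001383)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a 'new / old (diff)' entry into three floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised benchmark entry: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def relative_change(new, old):
    """Fractional change of `new` with respect to `old`."""
    return (new - old) / old


new, old, diff = parse_cell("0.012736 / 0.011353 (0.001383)")

# The printed diff is rounded to six decimals, so allow a small tolerance.
assert abs((new - old) - diff) < 2e-6

print(f"{relative_change(new, old):+.1%}")
```

A relative change is often easier to compare across metrics than the absolute diff, since the timings here span several orders of magnitude.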

