Update datasets/wikicorpus/wikicorpus.py
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
lhoestq and albertvillanova authored Sep 6, 2021
1 parent 382af94 commit c1b2857
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion datasets/wikicorpus/wikicorpus.py
@@ -159,7 +159,7 @@ def _generate_examples(self, dirpath):
     pass
 elif row.startswith("ENDOFARTICLE") or row.startswith("\n"):
     if len(words) > 1:  # some content besides only (. . Fp 0)
-        yield (file_idx, row_idx), {
+        yield f"{file_idx}_{row_idx}", {
             "id": example["id"],
             "title": example["title"],
             "sentence": words,
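The change in this commit replaces a tuple example key with a string key. As a minimal sketch of the new key format only, the snippet below uses a hypothetical `make_key` helper (it is not part of `wikicorpus.py`) to show what the yielded key looks like and that the underscore separator keeps different index pairs distinct.

```python
def make_key(file_idx: int, row_idx: int) -> str:
    """Build a string key in the style of the updated yield.

    Hypothetical helper for illustration; the commit inlines the f-string.
    """
    return f"{file_idx}_{row_idx}"


# The underscore separator keeps (1, 23) and (12, 3) from colliding,
# which plain concatenation ("123" for both) would not.
keys = [make_key(1, 23), make_key(12, 3)]
print(keys)
```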

1 comment on commit c1b2857

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009269 / 0.011353 (-0.002084) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003880 / 0.011008 (-0.007128) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031749 / 0.038508 (-0.006759) |
| read_batch_unformated after write_array2d | 0.035959 / 0.023109 (0.012849) |
| read_batch_unformated after write_flattened_sequence | 0.302235 / 0.275898 (0.026337) |
| read_batch_unformated after write_nested_sequence | 0.336459 / 0.323480 (0.012979) |
| read_col_formatted_as_numpy after write_array2d | 0.008123 / 0.007986 (0.000137) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004947 / 0.004328 (0.000619) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009172 / 0.004250 (0.004922) |
| read_col_unformated after write_array2d | 0.046760 / 0.037052 (0.009708) |
| read_col_unformated after write_flattened_sequence | 0.319316 / 0.258489 (0.060827) |
| read_col_unformated after write_nested_sequence | 0.359816 / 0.293841 (0.065975) |
| read_formatted_as_numpy after write_array2d | 0.022922 / 0.128546 (-0.105625) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007806 / 0.075646 (-0.067841) |
| read_formatted_as_numpy after write_nested_sequence | 0.256121 / 0.419271 (-0.163150) |
| read_unformated after write_array2d | 0.045669 / 0.043533 (0.002136) |
| read_unformated after write_flattened_sequence | 0.298036 / 0.255139 (0.042897) |
| read_unformated after write_nested_sequence | 0.328414 / 0.283200 (0.045214) |
| write_array2d | 0.148612 / 0.141683 (0.006929) |
| write_flattened_sequence | 1.675403 / 1.452155 (0.223248) |
| write_nested_sequence | 1.758165 / 1.492716 (0.265448) |
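Each benchmark value in this report has the form `new / old (diff)`, and the reported numbers are consistent with diff = new - old up to rounding in the last digit (for example, 0.009269 - 0.011353 = -0.002084). A minimal sketch for parsing one such value and checking that relation; `parse_cell` and `CELL_PATTERN` are hypothetical names used only for illustration, not part of the benchmark tooling:

```python
import re
from typing import Tuple

# Pattern for one benchmark value, e.g. "0.009269 / 0.011353 (-0.002084)".
CELL_PATTERN = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split a 'new / old (diff)' string into three floats (hypothetical helper)."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


new, old, diff = parse_cell("0.009269 / 0.011353 (-0.002084)")
# Compare the recomputed difference with the reported one.
print(round(new - old, 6), diff)
```

A positive diff means the new value is larger than the old one for that metric.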

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.234874 / 0.018006 (0.216868) |
| get_batch_of_1024_rows | 0.563245 / 0.000490 (0.562755) |
| get_first_row | 0.004427 / 0.000200 (0.004227) |
| get_last_row | 0.000242 / 0.000054 (0.000188) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036663 / 0.037411 (-0.000748) |
| shard | 0.024055 / 0.014526 (0.009529) |
| shuffle | 0.028950 / 0.176557 (-0.147606) |
| sort | 0.126241 / 0.737135 (-0.610894) |
| train_test_split | 0.030963 / 0.296338 (-0.265376) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.347681 / 0.215209 (0.132472) |
| read 50000 | 3.461011 / 2.077655 (1.383357) |
| read_batch 50000 10 | 1.737232 / 1.504120 (0.233113) |
| read_batch 50000 100 | 1.565553 / 1.541195 (0.024358) |
| read_batch 50000 1000 | 1.691999 / 1.468490 (0.223509) |
| read_formatted numpy 5000 | 0.306638 / 4.584777 (-4.278139) |
| read_formatted pandas 5000 | 4.464232 / 3.745712 (0.718520) |
| read_formatted tensorflow 5000 | 4.762547 / 5.269862 (-0.507314) |
| read_formatted torch 5000 | 2.144671 / 4.565676 (-2.421005) |
| read_formatted_batch numpy 5000 10 | 0.036073 / 0.424275 (-0.388202) |
| read_formatted_batch numpy 5000 1000 | 0.005428 / 0.007607 (-0.002179) |
| shuffled read 5000 | 0.450585 / 0.226044 (0.224540) |
| shuffled read 50000 | 4.513426 / 2.268929 (2.244497) |
| shuffled read_batch 50000 10 | 2.218948 / 55.444624 (-53.225676) |
| shuffled read_batch 50000 100 | 1.865554 / 6.876477 (-5.010922) |
| shuffled read_batch 50000 1000 | 1.948830 / 2.142072 (-0.193242) |
| shuffled read_formatted numpy 5000 | 0.416924 / 4.805227 (-4.388303) |
| shuffled read_formatted_batch numpy 5000 10 | 0.096452 / 6.500664 (-6.404212) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.051416 / 0.075469 (-0.024053) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 97.122653 / 1.841788 (95.280865) |
| map fast-tokenizer batched | 12.970421 / 8.074308 (4.896113) |
| map identity | 26.838049 / 10.191392 (16.646657) |
| map identity batched | 0.804414 / 0.680424 (0.123990) |
| map no-op batched | 0.516402 / 0.534201 (-0.017799) |
| map no-op batched numpy | 0.225059 / 0.579283 (-0.354224) |
| map no-op batched pandas | 0.480092 / 0.434364 (0.045728) |
| map no-op batched pytorch | 0.317885 / 0.540337 (-0.222452) |
| map no-op batched tensorflow | 1.015327 / 1.386936 (-0.371609) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009493 / 0.011353 (-0.001860) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003889 / 0.011008 (-0.007119) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031753 / 0.038508 (-0.006755) |
| read_batch_unformated after write_array2d | 0.037511 / 0.023109 (0.014402) |
| read_batch_unformated after write_flattened_sequence | 0.287247 / 0.275898 (0.011349) |
| read_batch_unformated after write_nested_sequence | 0.330382 / 0.323480 (0.006902) |
| read_col_formatted_as_numpy after write_array2d | 0.008340 / 0.007986 (0.000354) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004924 / 0.004328 (0.000596) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009240 / 0.004250 (0.004990) |
| read_col_unformated after write_array2d | 0.046154 / 0.037052 (0.009102) |
| read_col_unformated after write_flattened_sequence | 0.287124 / 0.258489 (0.028635) |
| read_col_unformated after write_nested_sequence | 0.330112 / 0.293841 (0.036271) |
| read_formatted_as_numpy after write_array2d | 0.023467 / 0.128546 (-0.105079) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007736 / 0.075646 (-0.067911) |
| read_formatted_as_numpy after write_nested_sequence | 0.255554 / 0.419271 (-0.163718) |
| read_unformated after write_array2d | 0.046040 / 0.043533 (0.002508) |
| read_unformated after write_flattened_sequence | 0.279652 / 0.255139 (0.024513) |
| read_unformated after write_nested_sequence | 0.313076 / 0.283200 (0.029876) |
| write_array2d | 0.149395 / 0.141683 (0.007712) |
| write_flattened_sequence | 1.700046 / 1.452155 (0.247892) |
| write_nested_sequence | 1.754496 / 1.492716 (0.261780) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.394371 / 0.018006 (0.376365) |
| get_batch_of_1024_rows | 0.569983 / 0.000490 (0.569493) |
| get_first_row | 0.052009 / 0.000200 (0.051809) |
| get_last_row | 0.000507 / 0.000054 (0.000453) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036409 / 0.037411 (-0.001002) |
| shard | 0.023845 / 0.014526 (0.009319) |
| shuffle | 0.028215 / 0.176557 (-0.148341) |
| sort | 0.145955 / 0.737135 (-0.591181) |
| train_test_split | 0.030502 / 0.296338 (-0.265837) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.348389 / 0.215209 (0.133180) |
| read 50000 | 3.496400 / 2.077655 (1.418745) |
| read_batch 50000 10 | 1.751486 / 1.504120 (0.247366) |
| read_batch 50000 100 | 1.576771 / 1.541195 (0.035576) |
| read_batch 50000 1000 | 1.652430 / 1.468490 (0.183940) |
| read_formatted numpy 5000 | 0.308209 / 4.584777 (-4.276568) |
| read_formatted pandas 5000 | 4.446963 / 3.745712 (0.701251) |
| read_formatted tensorflow 5000 | 5.672957 / 5.269862 (0.403096) |
| read_formatted torch 5000 | 2.130212 / 4.565676 (-2.435464) |
| read_formatted_batch numpy 5000 10 | 0.036488 / 0.424275 (-0.387787) |
| read_formatted_batch numpy 5000 1000 | 0.005248 / 0.007607 (-0.002359) |
| shuffled read 5000 | 0.452399 / 0.226044 (0.226354) |
| shuffled read 50000 | 4.579275 / 2.268929 (2.310347) |
| shuffled read_batch 50000 10 | 2.201722 / 55.444624 (-53.242903) |
| shuffled read_batch 50000 100 | 1.858927 / 6.876477 (-5.017550) |
| shuffled read_batch 50000 1000 | 1.916007 / 2.142072 (-0.226066) |
| shuffled read_formatted numpy 5000 | 0.418095 / 4.805227 (-4.387132) |
| shuffled read_formatted_batch numpy 5000 10 | 0.098070 / 6.500664 (-6.402594) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.133705 / 0.075469 (0.058236) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 100.632181 / 1.841788 (98.790393) |
| map fast-tokenizer batched | 13.247032 / 8.074308 (5.172723) |
| map identity | 26.609225 / 10.191392 (16.417833) |
| map identity batched | 0.749533 / 0.680424 (0.069109) |
| map no-op batched | 0.509009 / 0.534201 (-0.025192) |
| map no-op batched numpy | 0.226000 / 0.579283 (-0.353283) |
| map no-op batched pandas | 0.498654 / 0.434364 (0.064290) |
| map no-op batched pytorch | 0.340975 / 0.540337 (-0.199363) |
| map no-op batched tensorflow | 1.037990 / 1.386936 (-0.348946) |
