Add The Pile Enron Emails subset (#3427)
* Add The Pile Enron Emails subset

* Update dataset card

* Fix style
albertvillanova authored Dec 14, 2021
1 parent 0d814bd commit 7601a7b
Showing 2 changed files with 23 additions and 1 deletion.
14 changes: 14 additions & 0 deletions datasets/the_pile/README.md
@@ -81,6 +81,15 @@ This dataset is in English (`EN`).
}
```

#### enron_emails
```
{
'text': 'Name\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS'
'meta': "{}",
}
```
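
For illustration only (not something this commit adds), here is a minimal sketch of loading the new configuration once the change is released. The config name `enron_emails` comes from the loader changes further down; the single `train` split is an assumption based on how the other subset configs are exposed.

```python
from datasets import load_dataset

# Assumption: like the other subset configs, `enron_emails` exposes a single
# "train" split; the full .jsonl.zst archive is downloaded on first use.
enron = load_dataset("the_pile", "enron_emails", split="train")

print(enron[0]["text"][:200])  # raw e-mail text
print(enron[0]["meta"])        # metadata stored as a string, e.g. "{}"
```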

#### europarl
```
{
@@ -154,6 +163,11 @@ This dataset is in English (`EN`).
- `meta` (dict): Metadata of the data instance with keys:
  - pile_set_name: Name of the subset.

#### enron_emails

- `text` (str): Text.
- `meta` (str): Metadata of the data instance, serialized as a string (see the decoding sketch below).
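
Unlike the `all` configuration above, where `meta` is a dict with a `pile_set_name` key, this subset stores `meta` as a plain string (see the `datasets.Features` entry added in `the_pile.py` below). A hedged sketch of decoding it, assuming the string holds JSON as in the example instance; the helper name `parse_meta` is illustrative, not part of the dataset:

```python
import json

def parse_meta(meta_str):
    # Assumption: `meta` is a JSON-encoded string for this config (often just "{}").
    # Fall back to an empty dict if it does not parse.
    try:
        return json.loads(meta_str)
    except json.JSONDecodeError:
        return {}

# e.g. parse_meta(enron[0]["meta"]) == {}, reusing `enron` from the sketch above.
```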

#### europarl

- `text` (str): Text.
10 changes: 9 additions & 1 deletion datasets/the_pile/the_pile.py
@@ -39,6 +39,7 @@

_LICENSES = {
"all": "Multiple: see each subset license",
"enron_emails": "Unknown",
"europarl": "Unknown",
"free_law": "Unknown",
"hacker_news": "Unknown",
@@ -55,6 +56,7 @@
"validation": ["https://the-eye.eu/public/AI/pile/val.jsonl.zst"],
"test": ["https://the-eye.eu/public/AI/pile/test.jsonl.zst"],
},
"enron_emails": "http://eaidata.bmk.sh/data/enron_emails.jsonl.zst",
"europarl": "https://the-eye.eu/public/AI/pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst",
"free_law": "https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
"hacker_news": "https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz",
@@ -72,6 +74,12 @@
"meta": {"pile_set_name": datasets.Value("string")},
}
),
"enron_emails": datasets.Features(
{
"text": datasets.Value("string"),
"meta": datasets.Value("string"),
}
),
"europarl": datasets.Features(
{
"text": datasets.Value("string"),
@@ -213,7 +221,7 @@ def _generate_examples(self, files):
                        key += 1
        else:
            for subset in files:
-               if subset in {"europarl", "free_law", "nih_exporter", "pubmed", "ubuntu_irc"}:
+               if subset in {"enron_emails", "europarl", "free_law", "nih_exporter", "pubmed", "ubuntu_irc"}:
                    import zstandard as zstd

                    with zstd.open(open(files[subset], "rb"), "rt", encoding="utf-8") as f:
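
The new `enron_emails` entry points at a zstandard-compressed JSON-lines file, so this branch handles it exactly like the other `.jsonl.zst` subsets. As a rough, hedged sketch of what reading such a file looks like (the field names come from the dataset card above, not from the collapsed part of the loader):

```python
import json
import zstandard as zstd

def iter_jsonl_zst(path):
    """Yield one JSON record per line from a zstandard-compressed JSON-lines file."""
    with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# e.g. for record in iter_jsonl_zst("enron_emails.jsonl.zst"):
#          print(record["text"][:80], record.get("meta"))
```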

1 comment on commit 7601a7b

@github-actions


PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010132 / 0.011353 (-0.001221) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004080 / 0.011008 (-0.006928) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031448 / 0.038508 (-0.007060) |
| read_batch_unformated after write_array2d | 0.035051 / 0.023109 (0.011942) |
| read_batch_unformated after write_flattened_sequence | 0.302659 / 0.275898 (0.026761) |
| read_batch_unformated after write_nested_sequence | 0.324804 / 0.323480 (0.001324) |
| read_col_formatted_as_numpy after write_array2d | 0.008127 / 0.007986 (0.000141) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003627 / 0.004328 (-0.000702) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009072 / 0.004250 (0.004822) |
| read_col_unformated after write_array2d | 0.045580 / 0.037052 (0.008528) |
| read_col_unformated after write_flattened_sequence | 0.294628 / 0.258489 (0.036139) |
| read_col_unformated after write_nested_sequence | 0.329850 / 0.293841 (0.036009) |
| read_formatted_as_numpy after write_array2d | 0.030888 / 0.128546 (-0.097658) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008970 / 0.075646 (-0.066676) |
| read_formatted_as_numpy after write_nested_sequence | 0.256502 / 0.419271 (-0.162769) |
| read_unformated after write_array2d | 0.050079 / 0.043533 (0.006546) |
| read_unformated after write_flattened_sequence | 0.288222 / 0.255139 (0.033083) |
| read_unformated after write_nested_sequence | 0.318775 / 0.283200 (0.035576) |
| write_array2d | 0.080574 / 0.141683 (-0.061109) |
| write_flattened_sequence | 1.721976 / 1.452155 (0.269821) |
| write_nested_sequence | 1.771749 / 1.492716 (0.279033) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.282270 / 0.018006 (0.264263) |
| get_batch_of_1024_rows | 0.539905 / 0.000490 (0.539415) |
| get_first_row | 0.004612 / 0.000200 (0.004412) |
| get_last_row | 0.000095 / 0.000054 (0.000041) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.037342 / 0.037411 (-0.000069) |
| shard | 0.021940 / 0.014526 (0.007414) |
| shuffle | 0.028239 / 0.176557 (-0.148317) |
| sort | 0.067933 / 0.737135 (-0.669202) |
| train_test_split | 0.027308 / 0.296338 (-0.269031) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.432062 / 0.215209 (0.216853) |
| read 50000 | 4.333356 / 2.077655 (2.255701) |
| read_batch 50000 10 | 1.919434 / 1.504120 (0.415314) |
| read_batch 50000 100 | 1.734686 / 1.541195 (0.193491) |
| read_batch 50000 1000 | 1.869445 / 1.468490 (0.400955) |
| read_formatted numpy 5000 | 0.444877 / 4.584777 (-4.139900) |
| read_formatted pandas 5000 | 4.753126 / 3.745712 (1.007414) |
| read_formatted tensorflow 5000 | 3.814145 / 5.269862 (-1.455717) |
| read_formatted torch 5000 | 0.914788 / 4.565676 (-3.650888) |
| read_formatted_batch numpy 5000 10 | 0.054244 / 0.424275 (-0.370031) |
| read_formatted_batch numpy 5000 1000 | 0.011962 / 0.007607 (0.004355) |
| shuffled read 5000 | 0.541128 / 0.226044 (0.315083) |
| shuffled read 50000 | 5.420171 / 2.268929 (3.151243) |
| shuffled read_batch 50000 10 | 2.508097 / 55.444624 (-52.936527) |
| shuffled read_batch 50000 100 | 2.033124 / 6.876477 (-4.843353) |
| shuffled read_batch 50000 1000 | 2.071986 / 2.142072 (-0.070086) |
| shuffled read_formatted numpy 5000 | 0.564798 / 4.805227 (-4.240429) |
| shuffled read_formatted_batch numpy 5000 10 | 0.122406 / 6.500664 (-6.378259) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062226 / 0.075469 (-0.013243) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.534516 / 1.841788 (-0.307271) |
| map fast-tokenizer batched | 12.087780 / 8.074308 (4.013472) |
| map identity | 27.396021 / 10.191392 (17.204629) |
| map identity batched | 0.708084 / 0.680424 (0.027660) |
| map no-op batched | 0.523175 / 0.534201 (-0.011026) |
| map no-op batched numpy | 0.495004 / 0.579283 (-0.084279) |
| map no-op batched pandas | 0.505573 / 0.434364 (0.071209) |
| map no-op batched pytorch | 0.318175 / 0.540337 (-0.222162) |
| map no-op batched tensorflow | 0.330537 / 1.386936 (-1.056399) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008492 / 0.011353 (-0.002861) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003903 / 0.011008 (-0.007105) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029879 / 0.038508 (-0.008629) |
| read_batch_unformated after write_array2d | 0.034010 / 0.023109 (0.010901) |
| read_batch_unformated after write_flattened_sequence | 0.305290 / 0.275898 (0.029392) |
| read_batch_unformated after write_nested_sequence | 0.335098 / 0.323480 (0.011618) |
| read_col_formatted_as_numpy after write_array2d | 0.006475 / 0.007986 (-0.001511) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003662 / 0.004328 (-0.000667) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007429 / 0.004250 (0.003178) |
| read_col_unformated after write_array2d | 0.044730 / 0.037052 (0.007678) |
| read_col_unformated after write_flattened_sequence | 0.293422 / 0.258489 (0.034933) |
| read_col_unformated after write_nested_sequence | 0.337160 / 0.293841 (0.043319) |
| read_formatted_as_numpy after write_array2d | 0.031341 / 0.128546 (-0.097205) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009123 / 0.075646 (-0.066523) |
| read_formatted_as_numpy after write_nested_sequence | 0.254529 / 0.419271 (-0.164742) |
| read_unformated after write_array2d | 0.056884 / 0.043533 (0.013351) |
| read_unformated after write_flattened_sequence | 0.296716 / 0.255139 (0.041578) |
| read_unformated after write_nested_sequence | 0.318623 / 0.283200 (0.035423) |
| write_array2d | 0.083493 / 0.141683 (-0.058190) |
| write_flattened_sequence | 1.801611 / 1.452155 (0.349457) |
| write_nested_sequence | 1.891950 / 1.492716 (0.399233) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.282908 / 0.018006 (0.264902) |
| get_batch_of_1024_rows | 0.538098 / 0.000490 (0.537609) |
| get_first_row | 0.002241 / 0.000200 (0.002041) |
| get_last_row | 0.000094 / 0.000054 (0.000039) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.033257 / 0.037411 (-0.004154) |
| shard | 0.020729 / 0.014526 (0.006204) |
| shuffle | 0.028427 / 0.176557 (-0.148130) |
| sort | 0.067127 / 0.737135 (-0.670009) |
| train_test_split | 0.028457 / 0.296338 (-0.267881) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.424201 / 0.215209 (0.208992) |
| read 50000 | 4.242375 / 2.077655 (2.164720) |
| read_batch 50000 10 | 1.864288 / 1.504120 (0.360168) |
| read_batch 50000 100 | 1.622088 / 1.541195 (0.080893) |
| read_batch 50000 1000 | 1.762948 / 1.468490 (0.294458) |
| read_formatted numpy 5000 | 0.445914 / 4.584777 (-4.138863) |
| read_formatted pandas 5000 | 4.704186 / 3.745712 (0.958474) |
| read_formatted tensorflow 5000 | 2.248455 / 5.269862 (-3.021406) |
| read_formatted torch 5000 | 0.930975 / 4.565676 (-3.634702) |
| read_formatted_batch numpy 5000 10 | 0.053629 / 0.424275 (-0.370646) |
| read_formatted_batch numpy 5000 1000 | 0.012088 / 0.007607 (0.004481) |
| shuffled read 5000 | 0.539071 / 0.226044 (0.313027) |
| shuffled read 50000 | 5.372337 / 2.268929 (3.103408) |
| shuffled read_batch 50000 10 | 2.313623 / 55.444624 (-53.131001) |
| shuffled read_batch 50000 100 | 1.925709 / 6.876477 (-4.950768) |
| shuffled read_batch 50000 1000 | 1.960417 / 2.142072 (-0.181655) |
| shuffled read_formatted numpy 5000 | 0.562631 / 4.805227 (-4.242597) |
| shuffled read_formatted_batch numpy 5000 10 | 0.122692 / 6.500664 (-6.377972) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.061053 / 0.075469 (-0.014416) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.554741 / 1.841788 (-0.287046) |
| map fast-tokenizer batched | 12.134389 / 8.074308 (4.060081) |
| map identity | 27.118978 / 10.191392 (16.927586) |
| map identity batched | 0.744653 / 0.680424 (0.064229) |
| map no-op batched | 0.542648 / 0.534201 (0.008447) |
| map no-op batched numpy | 0.494321 / 0.579283 (-0.084962) |
| map no-op batched pandas | 0.515968 / 0.434364 (0.081604) |
| map no-op batched pytorch | 0.328317 / 0.540337 (-0.212021) |
| map no-op batched tensorflow | 0.342089 / 1.386936 (-1.044847) |
