Add The Pile Free Law subset #3359

Merged
merged 2 commits into master from the-pile-free-law
Dec 1, 2021

Conversation

albertvillanova
Member

Add:

  • Free Law subset of The Pile: "free_law" config
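For illustration, a minimal sketch of how the new config might be loaded with the `datasets` library. The dataset identifier `the_pile` is an assumption (inferred from the naming discussed in this thread, not stated in the PR description), and `load_free_law` is a hypothetical helper. Loading needs network access and depends on the source files still being hosted.

```python
# Sketch only: "the_pile" is an assumed dataset identifier;
# "free_law" is the config name given in this PR.
DATASET_ID = "the_pile"
CONFIG_NAME = "free_law"


def load_free_law(split="train", streaming=True):
    """Hypothetical helper: load the Free Law subset of The Pile.

    Requires the `datasets` package and network access.
    """
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    # Streaming avoids downloading the whole subset before iterating.
    return load_dataset(DATASET_ID, CONFIG_NAME, split=split, streaming=streaming)
```

Calling `next(iter(load_free_law()))` would then yield the first record, if the assumed identifier and hosting are still valid.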

Close bigscience-workshop/data_tooling#75.

CC: @StellaAthena

albertvillanova merged commit 702389e into master Dec 1, 2021
albertvillanova deleted the the-pile-free-law branch December 1, 2021 17:30
@StellaAthena

@albertvillanova Is there a specific reason you’re adding the Pile under “the” instead of under “pile”? That does not appear to be consistent with other datasets.

Member Author

albertvillanova commented Dec 2, 2021

Hi @StellaAthena,

I asked myself the same question, but in the end I decided to be consistent with the previously added Pile subsets:

I guess the reason is to stress that the definite article is always used before the name of the dataset (your site says: "The Pile. An 800GB Dataset of Diverse Text for Language Modeling"). Other datasets are not usually preceded by the definite article: we do not say "the SQuAD", "the GLUE" or "the Common Voice"...

CC: @lhoestq

Member

lhoestq commented Dec 6, 2021

I guess the reason is to stress that the definite article is always used before the name of the dataset (your site says: "The Pile. An 800GB Dataset of Diverse Text for Language Modeling").

Yes, that is why it starts with "the".

Development

Successfully merging this pull request may close these issues.

Create license-compliant version of the Pile: FreeLaw
3 participants