PAWS-X: Fix csv Dictreader splitting data on quotes #1763

Merged: 4 commits, Jan 22, 2021

Conversation

gowtham1997
Contributor

from datasets import load_dataset

# Load the English PAWS-X dataset
datasets = load_dataset('paws-x', 'en')
print(len(datasets['train']))             # outputs 49202, but the official dataset has 49401 pairs
print(datasets['train'].unique('label'))  # outputs [1, 0, -1], but labels are binary [0, 1]

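Changed data = csv.DictReader(f, delimiter="\t") to data = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) in the data loader, so that the csv module no longer gives quote characters special treatment. With the default quoting, a field that starts with a double quote is read as a quoted field, which swallows tabs and newlines up to the next quote and merges or truncates rows.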

The results are as expected for all languages after the change.
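
To illustrate the effect of the change, here is a minimal sketch on made-up tab-separated data. The sample rows, the column names, and the read_rows helper are assumptions for illustration only and are not taken from the PAWS-X files or the dataset script. It compares csv.DictReader with the default quoting against quoting=csv.QUOTE_NONE.

```python
import csv
import io

# Hypothetical TSV sample: row 1 has a field that starts with a double quote,
# and the next quote character only appears in row 2.
SAMPLE = (
    'id\tsentence1\tsentence2\tlabel\n'
    '1\t"Hello world\tfoo\t0\n'
    '2\tbar\tbaz"\t1\n'
    '3\tqux\tquux\t1\n'
)


def read_rows(text, **reader_kwargs):
    """Parse tab-separated text into a list of dicts (illustrative helper)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t", **reader_kwargs))


# Default quoting: the leading quote opens a quoted field that runs across
# the line break, so rows 1 and 2 are merged into a single record.
default_rows = read_rows(SAMPLE)

# QUOTE_NONE: quote characters are kept as ordinary text, one record per line.
raw_rows = read_rows(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(raw_rows))
print(default_rows[0])
print(raw_rows[0])
```

In this sketch the merged record under default quoting ends up with fewer fields than the header, so its label is missing. That is the kind of row loss and malformed label that the row-count and label checks above expose.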

lhoestq changed the title from "Fix csv Dictreader splitting data on quotes" to "PAWS-X: Fix csv Dictreader splitting data on quotes" on Jan 22, 2021
Member

lhoestq left a comment

Good catch! And thank you for the fix :)

I also removed the code that could produce -1 labels, and I updated the dataset_infos.json file as well as the README.

lhoestq merged commit 0281f9d into huggingface:master on Jan 22, 2021
gowtham1997 deleted the patch-1 branch on January 22, 2021 10:14
2 participants