added semeval18_emotion_classification dataset #2745

Merged: lhoestq merged 12 commits into huggingface:master on Sep 21, 2021

Conversation

maxpel
Contributor

@maxpel maxpel commented Aug 2, 2021

I added the data set of SemEval 2018 Task 1 (Subtask 5) for emotion detection in three languages.

datasets-cli test datasets/semeval18_emotion_classification/ --save_infos --all_configs

RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_semeval18_emotion_classification

Both commands ran successfully.

I couldn't create the dummy data (the files are TSVs but have a .txt extension; maybe that's the problem?), so the test on the dummy data fails. Maybe someone can help here.

I also formatted the code:

black --line-length 119 --target-version py36 datasets/semeval18_emotion_classification/
isort datasets/semeval18_emotion_classification/
flake8 datasets/semeval18_emotion_classification/

That's the publication for reference:

Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17. https://doi.org/10.18653/v1/S18-1001

@maxpel
Contributor Author

maxpel commented Aug 6, 2021

For training the multilabel classifier, I would combine the labels into a list, for example for the English dataset:

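import pandas as pd

# `path` is the directory holding the extracted files (with a trailing slash); set it yourself.
dfpre = pd.read_csv(path + "2018-E-c-En-train.txt", sep="\t")
# Columns from index 2 onward are the per-emotion labels; merge them into one list per tweet.
dfpre["labels"] = dfpre[dfpre.columns[2:]].values.tolist()
df = dfpre[["Tweet", "labels"]].copy()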
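
For readers without pandas, here is a minimal stdlib-only sketch of the same label-combining step. It runs on a made-up three-emotion sample rather than the real file; the sample rows and the helper name `read_multilabel_rows` are assumptions for illustration only.

```python
import csv
import io

# Made-up sample in the same tab-separated shape: ID, Tweet, then one 0/1 column per emotion.
SAMPLE_TSV = (
    "ID\tTweet\tanger\tjoy\tsadness\n"
    "ex-1\tWhat a great day\t0\t1\t0\n"
    "ex-2\tThis is awful\t1\t0\t1\n"
)


def read_multilabel_rows(handle):
    """Return one dict per tweet with the label columns merged into a list of ints."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    label_columns = reader.fieldnames[2:]  # everything after the first two columns
    return [
        {"Tweet": row["Tweet"], "labels": [int(row[name]) for name in label_columns]}
        for row in reader
    ]


rows = read_multilabel_rows(io.StringIO(SAMPLE_TSV))
print(rows[0])
```

Quoting is disabled here because tweets may contain stray quote characters; whether that is needed for the real files is an assumption worth checking.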

Member

@lhoestq lhoestq left a comment

Hi ! Thanks for adding this one :)

The dataset script looks all good !

For consistency with the other SemEval datasets, could you name this one sem_eval_2018_task_5 please ?

Also could you please add a dataset card ? You can find a template here and a guide here

To be able to properly run our test suite on this dataset we also require dummy data. You can see how to generate them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata

@@ -0,0 +1,154 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
Member

Suggested change
- # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ # Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.

}


class SemEval18EmotionClassification(datasets.GeneratorBasedBuilder):
Member

If you rename the dataset sem_eval_2018_task_5 you will have to rename this class SemEval2018Task5
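
The naming rule implied by this comment (snake_case dataset name, CamelCase builder class) can be sketched as follows. This is an illustrative helper, not the library's own function; `snake_to_camel` is a hypothetical name.

```python
def snake_to_camel(name: str) -> str:
    """Turn a snake_case dataset name into a CamelCase class name."""
    # str.capitalize() leaves purely numeric parts such as "2018" unchanged.
    return "".join(part.capitalize() for part in name.split("_"))


class_name = snake_to_camel("sem_eval_2018_task_5")
print(class_name)  # SemEval2018Task5
```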

@lhoestq
Member

lhoestq commented Sep 7, 2021

Hi @maxpel , have you had a chance to take my comments into account ?

Let me know if you have questions or if I can help :)

@maxpel
Contributor Author

maxpel commented Sep 7, 2021

Hi @lhoestq ! I did take your comments into account, changed the naming and tried to add dummy data (manually). I am not sure if the dummy data is correct, maybe you can take a look at that.
The model card is still missing as I am currently very busy.

@lhoestq
Member

lhoestq commented Sep 9, 2021

Thanks ! The dummy data looks all good, good job :)

The CI error can be fixed by merging master into your branch:

git fetch upstream
git merge upstream/master

@maxpel
Contributor Author

maxpel commented Sep 10, 2021

Hi! I just added the model card and I did the merge you showed above. Should I then add and commit again? The CI error is still there right now.

Member

@lhoestq lhoestq left a comment

I renamed the dataset to sem_eval_2018_task_1, since it's actually Task 1 and not Task 5.
Note that only Subtask 5 of SemEval 2018 Task 1 (the one for emotion classification) is available here.

I also did some other minor changes

Member

@lhoestq lhoestq left a comment

It looks all good to me now :) merging !

Let me know if you have other comments or changes you want to make; we can handle them in another PR.
Thanks a lot for adding this dataset :)

@lhoestq lhoestq merged commit f7d50b6 into huggingface:master Sep 21, 2021
@maxpel
Contributor Author

maxpel commented Oct 27, 2021

@lhoestq Unfortunately, I discovered a problem with the test datasets on the competition page (train and dev are fine). They still contain NONE labels for each of the emotions, for example for English: http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018englishtestfiles/2018-E-c-En-test.zip
Luckily, a zip file with all data of the competition contains the correct labels also for the test set:
http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/SemEval2018-Task1-all-data.zip
What's the best way to correct this?
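
A quick way to check a file for this problem might look like the sketch below. It assumes the same tab-separated layout as the train file (two leading columns, then the emotion columns); the sample rows and the helper name `count_unlabeled_rows` are made up for illustration.

```python
import csv
import io


def count_unlabeled_rows(handle, placeholder="NONE"):
    """Count data rows in which any label column still holds the placeholder."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for row in reader if placeholder in row[2:])


# Made-up sample: two rows without gold labels, one row with labels.
sample = io.StringIO(
    "ID\tTweet\tanger\tjoy\n"
    "t-1\tfirst tweet\tNONE\tNONE\n"
    "t-2\tsecond tweet\tNONE\tNONE\n"
    "t-3\tthird tweet\t0\t1\n"
)
unlabeled = count_unlabeled_rows(sample)
print(unlabeled)  # 2
```

Only the label columns are inspected, so a tweet whose text happens to contain the word NONE is not miscounted.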

@lhoestq
Member

lhoestq commented Oct 29, 2021

Hi ! I think we can edit the sem_eval_2018_task_1.py file to use this URL instead, and maybe update the os.path.join calls to the new paths to the text data in the new ZIP file. Would you like to try to make this work ?
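
Since the folder layout inside the new ZIP isn't stated in this thread, one hedged way to work out the new os.path.join arguments is to search the extracted directory by file name. The helper `find_file` and the folder names in the demo are hypothetical; the train file name is reused from above only as an example.

```python
import os
import tempfile


def find_file(root_dir, filename):
    """Return the path of the first file named `filename` under `root_dir`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


# Demo on a throwaway directory tree; the folder names are invented for illustration.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "some", "nested", "folder")
    os.makedirs(nested)
    target = os.path.join(nested, "2018-E-c-En-train.txt")
    with open(target, "w", encoding="utf-8") as f:
        f.write("ID\tTweet\n")
    found = find_file(root, "2018-E-c-En-train.txt")
    missing = find_file(root, "does-not-exist.txt")
    print(found == target)
```

In the dataset script itself, fixed os.path.join paths are probably preferable once the layout is known; the search is only a discovery aid.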
