added semeval18_emotion_classification dataset #2745
For training the multilabel classifier, I would combine the labels into a list, for example for the English dataset:
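A minimal sketch of that idea, assuming each example carries one boolean column per emotion. The column names in `EMOTIONS`, the helper `combine_labels`, and the sample row are assumptions made for illustration and are not taken from the dataset script:

```python
# Emotion columns assumed for the English subtask (an assumption, adjust to the real features).
EMOTIONS = [
    "anger", "anticipation", "disgust", "fear", "joy", "love",
    "optimism", "pessimism", "sadness", "surprise", "trust",
]


def combine_labels(example):
    """Return a copy of `example` with a multi-hot `labels` list added."""
    combined = dict(example)
    # One 0/1 entry per emotion, in the fixed order given by EMOTIONS.
    combined["labels"] = [int(bool(example[name])) for name in EMOTIONS]
    return combined


# Hypothetical row: every emotion off except joy and optimism.
row = {"ID": "example-1", "Tweet": "What a great day!", **{name: False for name in EMOTIONS}}
row["joy"] = True
row["optimism"] = True

print(combine_labels(row)["labels"])  # [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
```

With a loaded `datasets.Dataset`, the same function could presumably be applied to every row through `dataset.map(combine_labels)`.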
Hi ! Thanks for adding this one :)
The dataset script looks all good !
For consistency with the other SemEval datasets, could you name this one sem_eval_2018_task_5 please ?
Also could you please add a dataset card ? You can find a template here and a guide here
To be able to properly run our test suite on this dataset we also require dummy data. You can see how to generate them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata
@@ -0,0 +1,154 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
Suggested change:
- # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ # Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.
}


class SemEval18EmotionClassification(datasets.GeneratorBasedBuilder):
If you rename the dataset sem_eval_2018_task_5 you will have to rename this class SemEval2018Task5
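The relationship between the two names can be sketched as a snake_case to CamelCase conversion. The helper `snake_to_camel` below is hypothetical and written only for illustration; it is not claimed to be the function the library itself uses:

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case dataset name to a CamelCase builder class name."""
    # Capitalize each underscore-separated part; purely numeric parts stay unchanged.
    return "".join(part.capitalize() for part in name.split("_") if part)


print(snake_to_camel("sem_eval_2018_task_5"))  # SemEval2018Task5
```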
Hi @maxpel , have you had a chance to take my comments into account ? Let me know if you have questions or if I can help :)
Hi @lhoestq ! I did take your comments into account, changed the naming and tried to add dummy data (manually). I am not sure if the dummy data is correct, maybe you can take a look at that.
Thanks ! The dummy data looks all good, good job :) The CI error can be fixed by merging upstream/master:
git fetch upstream
git merge upstream/master
Hi! I just added the model card and I did the merge you showed above. Should I then add and commit again? The CI error is still there right now.
I renamed the dataset sem_eval_2018_task_1 since it's actually task 1 and not 5.
Though only the subtask 5 of the SemEval 2018 Task 1 is available here (the one for emotion classification)
I also did some other minor changes
It looks all good to me now :) merging !
Let me know if you have other comments or changes you wanted to do - we can see in another PR.
Thanks a lot for adding this dataset :)
@lhoestq Unfortunately, I discovered a problem with the test data sets on the competition page (train and dev are fine). They still contain NONE labels for each of the emotions, for example for English: http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018englishtestfiles/2018-E-c-En-test.zip
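One way to spot such a file is sketched below, assuming the files are tab-separated with a header row whose first two columns are `ID` and `Tweet`. The helper `label_columns_all_none`, the shortened column set, and the sample rows are hypothetical and only illustrate the check:

```python
import csv
import io


def label_columns_all_none(tsv_text, id_columns=("ID", "Tweet")):
    """Return True if every label cell in the TSV text is the string "NONE"."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    # Every column that is not an identifier or the tweet text is treated as a label.
    label_names = [col for col in (reader.fieldnames or []) if col not in id_columns]
    return bool(rows) and all(row[col] == "NONE" for row in rows for col in label_names)


# Hypothetical file contents with only two emotion columns, for brevity.
unlabeled = "ID\tTweet\tanger\tjoy\nsample-1\tsome tweet\tNONE\tNONE\n"
labeled = "ID\tTweet\tanger\tjoy\nsample-2\tanother tweet\t0\t1\n"

print(label_columns_all_none(unlabeled))  # True
print(label_columns_all_none(labeled))    # False
```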
Hi ! I think we can edit the sem_eval_2018_task_1.py file to use this URL instead, and maybe update the
I added the data set of SemEval 2018 Task 1 (Subtask 5) for emotion detection in three languages.
Both commands ran successfully.
I couldn't create the dummy data (the files are tsvs but have .txt ending, maybe that's the problem?) and therefore the test on the dummy data fails, maybe someone can help here.
I also formatted the code:
That's the publication for reference:
Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17. https://doi.org/10.18653/v1/S18-1001