added semeval18_emotion_classification dataset #2745

maxpel · 2021-08-02T15:39:55Z

I added the dataset for SemEval 2018 Task 1 (Subtask 5), emotion detection in three languages.

datasets-cli test datasets/semeval18_emotion_classification/ --save_infos --all_configs

RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_semeval18_emotion_classification

Both commands ran successfully.

I couldn't create the dummy data, so the test on the dummy data fails. The files are TSVs but have a .txt extension; maybe that's the problem? Maybe someone can help here.
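One way to build dummy data by hand is to keep only the first few lines of each real file and zip the truncated copies. The sketch below shows that idea only. The function name, the `dummy_data` inner directory and the line limit are assumptions made for illustration; the layout inside the archive has to mirror whatever paths the dataset script reads, which this sketch does not know about.

```python
import zipfile
from pathlib import Path


def write_truncated_zip(source_files, zip_path, inner_dir="dummy_data", max_lines=6):
    """Zip the first `max_lines` lines of each file (hypothetical helper)."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for source in source_files:
            source = Path(source)
            with source.open(encoding="utf-8") as handle:
                # Header line plus a handful of rows is enough for a dummy file.
                head = [line for _, line in zip(range(max_lines), handle)]
            # The .txt extension is kept as-is; the content stays tab-separated.
            archive.writestr(f"{inner_dir}/{source.name}", "".join(head))
```

The file extension does not matter to a tab-separated reader, so the truncated copies keep their original names.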

I also formatted the code:

black --line-length 119 --target-version py36 datasets/semeval18_emotion_classification/
isort datasets/semeval18_emotion_classification/
flake8 datasets/semeval18_emotion_classification/

That's the publication for reference:

Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17. https://doi.org/10.18653/v1/S18-1001

maxpel · 2021-08-06T13:02:38Z

For training the multilabel classifier, I would combine the labels into a list, for example for the English dataset:

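import pandas as pd

# `path` is the directory (with trailing slash) that holds the extracted files.
dfpre = pd.read_csv(path + "2018-E-c-En-train.txt", sep="\t")
# All columns after the first two hold the per-emotion labels.
dfpre["list"] = dfpre[dfpre.columns[2:]].values.tolist()
df = dfpre[["Tweet", "list"]].copy()
df.rename(columns={"list": "labels"}, inplace=True)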
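To see what the snippet above produces without downloading anything, here is a small self-contained sketch that applies the same steps to an in-memory TSV. The two sample rows and the reduced column set (`anger`, `joy`) are made up for illustration; the real files have more emotion columns.

```python
import io

import pandas as pd

# Made-up sample in the same tab-separated layout:
# two leading columns, then one 0/1 column per emotion.
sample = (
    "ID\tTweet\tanger\tjoy\n"
    "1\tso happy today\t0\t1\n"
    "2\tthis is infuriating\t1\t0\n"
)

dfpre = pd.read_csv(io.StringIO(sample), sep="\t")
label_columns = list(dfpre.columns[2:])

# One list of 0/1 values per tweet, ordered like `label_columns`.
df = pd.DataFrame(
    {"Tweet": dfpre["Tweet"], "labels": dfpre[label_columns].values.tolist()}
)
```

Keeping `label_columns` around is useful later, because the position in each list is the only thing that ties a value back to its emotion name.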

lhoestq

Hi ! Thanks for adding this one :)

The dataset script looks all good !

For consistency with the other SemEval datasets, could you name this one sem_eval_2018_task_5 please ?

Also could you please add a dataset card ? You can find a template here and a guide here

To be able to properly run our test suite on this dataset we also require dummy data. You can see how to generate them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata

lhoestq · 2021-08-17T12:43:23Z

datasets/semeval18_emotion_classification/semeval18_emotion_classification.py

@@ -0,0 +1,154 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.


Suggested change

-# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+# Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.

lhoestq · 2021-08-17T12:45:39Z

datasets/semeval18_emotion_classification/semeval18_emotion_classification.py

+}
+
+
+class SemEval18EmotionClassification(datasets.GeneratorBasedBuilder):


If you rename the dataset sem_eval_2018_task_5 you will have to rename this class SemEval2018Task5
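The relationship between the two names in this comment is a plain snake_case to CamelCase conversion. A minimal sketch of that mapping follows; `snake_to_camel` is a hypothetical helper written for illustration, not the library's own function.

```python
def snake_to_camel(name: str) -> str:
    """Turn a snake_case dataset name into a CamelCase class name (illustrative only)."""
    # str.capitalize() leaves purely numeric parts such as "2018" unchanged.
    return "".join(part.capitalize() for part in name.split("_"))


class_name = snake_to_camel("sem_eval_2018_task_5")
```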

lhoestq · 2021-09-07T09:47:54Z

Hi @maxpel , have you had a chance to take my comments into account ?

Let me know if you have questions or if I can help :)

maxpel · 2021-09-07T10:33:27Z

Hi @lhoestq ! I did take your comments into account, changed the naming and tried to add dummy data (manually). I am not sure if the dummy data is correct, maybe you can take a look at that.
The model card is still missing as I am currently very busy.

lhoestq · 2021-09-09T10:13:56Z

Thanks ! The dummy data looks all good, good job :)

The CI error can be fixed by merging master into your branch

git fetch upstream
git merge upstream/master

maxpel · 2021-09-10T12:10:59Z

Hi! I just added the model card and I did the merge you showed above. Should I then add and commit again? The CI error is still there right now.

lhoestq

I renamed the dataset sem_eval_2018_task_1 since it's actually task 1 and not 5.
Though only the subtask 5 of the SemEval 2018 Task 1 is available here (the one for emotion classification)

I also did some other minor changes

lhoestq

It looks all good to me now :) merging !

Let me know if you have other comments or changes you wanted to do - we can see in another PR.
Thanks a lot for adding this dataset :)

maxpel · 2021-10-27T13:29:38Z

@lhoestq Unfortunately, I discovered a problem with the test datasets on the competition page (train and dev are fine). They still contain NONE labels for each of the emotions, for example for English: http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018englishtestfiles/2018-E-c-En-test.zip
Luckily, a zip file with all data of the competition contains the correct labels also for the test set:
http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/SemEval2018-Task1-all-data.zip
What's the best way to correct this?
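A quick way to check whether a given file still carries the NONE placeholders is to count the rows where any label column equals the literal string. This is a sketch under assumptions: `count_unlabeled_rows` is a hypothetical helper, the sample rows are made up, and the label columns are assumed to start at the third column as in the training snippet earlier in this thread.

```python
import io

import pandas as pd


def count_unlabeled_rows(tsv_file, label_start=2):
    """Count rows where at least one label column is the literal string "NONE"."""
    # dtype=str and keep_default_na=False keep every cell as plain text,
    # so placeholder values are compared exactly as written in the file.
    df = pd.read_csv(tsv_file, sep="\t", dtype=str, keep_default_na=False)
    label_columns = df.columns[label_start:]
    return int((df[label_columns] == "NONE").any(axis=1).sum())


# Made-up sample: the first row is unlabeled, the second has real labels.
sample = "ID\tTweet\tanger\tjoy\n1\tfirst tweet\tNONE\tNONE\n2\tsecond tweet\t0\t1\n"
unlabeled = count_unlabeled_rows(io.StringIO(sample))
```

Running the same check against a file from each download location would show which source has usable test labels.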

lhoestq · 2021-10-29T09:22:04Z

Hi ! I think we can edit the sem_eval_2018_task_1.py file to use this URL instead, and maybe update the os.path.join calls to the new paths to the text data in the new ZIP file. Would you like to try to make this work ?

added semeval18_emotion_classification dataset

0d1b82e

lhoestq reviewed Aug 17, 2021


changed name according to guidelines, tried to add dummy data

4ce6c38

maxpel added 2 commits September 10, 2021 14:05

added modelcard

edb535d

Merge remote-tracking branch 'upstream/master' into semeval18

4bb1b85

maxpel and others added 7 commits September 15, 2021 16:54

Merge remote-tracking branch 'upstream/master' into semeval18

73dc36b

fixing ci error

892dca2

fix CI

d0affbe

remove html file

b871fc0

added terms and conditions

4ab1b24

rename to sem_eval_2018_task_1

41dbde5

Merge remote-tracking branch 'upstream/master' into semeval18

ca71510

lhoestq reviewed Sep 21, 2021


style

467998d

lhoestq approved these changes Sep 21, 2021


lhoestq merged commit f7d50b6 into huggingface:master Sep 21, 2021

maxpel mentioned this pull request Jan 7, 2022

Fix sem_eval_2018_task_1 download location #3549

Closed
