added semeval18_emotion_classification dataset #2745

Merged
merged 12 commits into from
Sep 21, 2021
219 changes: 219 additions & 0 deletions datasets/sem_eval_2018_task_1/README.md
@@ -0,0 +1,219 @@
---
annotations_creators:
- crowdsourced
language_creators:
- found
languages:
- en
- ar
- es
licenses:
- unknown
multilinguality:
- multilingual
pretty_name: 'SemEval-2018 Task 1: Affect in Tweets'
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-label-classification
- text-classification-other-emotion-classification
---

# Dataset Card for SemEval-2018 Task 1: Affect in Tweets

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://competitions.codalab.org/competitions/17751
- **Repository:**
- **Paper:** http://saifmohammad.com/WebDocs/semeval2018-task1.pdf
- **Leaderboard:**
- **Point of Contact:** https://www.saifmohammad.com/

### Dataset Summary

Tasks: We present an array of tasks where systems have to automatically determine the intensity of emotions (E) and intensity of sentiment (aka valence V) of the tweeters from their tweets. (The term tweeter refers to the person who has posted the tweet.) We also include a multi-label emotion classification task for tweets. For each task, we provide separate training and test datasets for English, Arabic, and Spanish tweets. The individual tasks are described below:

1. EI-reg (an emotion intensity regression task): Given a tweet and an emotion E, determine the intensity of E that best represents the mental state of the tweeter—a real-valued score between 0 (least E) and 1 (most E).
Separate datasets are provided for anger, fear, joy, and sadness.

2. EI-oc (an emotion intensity ordinal classification task): Given a tweet and an emotion E, classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter.
Separate datasets are provided for anger, fear, joy, and sadness.

3. V-reg (a sentiment intensity regression task): Given a tweet, determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter—a real-valued score between 0 (most negative) and 1 (most positive).

4. V-oc (a sentiment analysis, ordinal classification, task): Given a tweet, classify it into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter.

5. E-c (an emotion classification task): Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.

Here, E refers to emotion, EI refers to emotion intensity, V refers to valence or sentiment intensity, reg refers to regression, oc refers to ordinal classification, and c refers to classification.

Together, these tasks encompass various emotion and sentiment analysis tasks. You are free to participate in any number of tasks and on any of the datasets.

**Currently, only subtask 5 (E-c) is available on the Hugging Face Dataset Hub.**

### Supported Tasks and Leaderboards

### Languages

English, Arabic and Spanish

## Dataset Structure

### Data Instances

An example from the `subtask5.english` config is:

```
{'ID': '2017-En-21441',
'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer. #motivation #leadership #worry",
'anger': False,
'anticipation': True,
'disgust': False,
'fear': False,
'joy': False,
'love': False,
'optimism': True,
'pessimism': False,
'sadness': False,
'surprise': False,
'trust': True}
```
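As a minimal sketch of how such a record can be read, the snippet below lists the emotions marked `True` for the example above. It assumes the record is a plain Python dict with exactly the fields shown; `EMOTIONS` and `active_emotions` are hypothetical names introduced here for illustration, and the loading call in the comment assumes the dataset identifier matches the directory name in this PR.

```python
# Assumption: with the `datasets` library installed, a record like the one
# below might be obtained via
#   load_dataset("sem_eval_2018_task_1", "subtask5.english")["train"][i]
# The identifier is inferred from the directory name and is not verified here.

# The eleven emotion fields, in the order they appear in the example record.
EMOTIONS = [
    "anger", "anticipation", "disgust", "fear", "joy", "love",
    "optimism", "pessimism", "sadness", "surprise", "trust",
]


def active_emotions(record):
    """Return the names of the emotion fields set to True, in EMOTIONS order."""
    return [name for name in EMOTIONS if record[name]]


example = {
    "ID": "2017-En-21441",
    "Tweet": "“Worry is a down payment on a problem you may never have'. "
             "\xa0Joyce Meyer. #motivation #leadership #worry",
    "anger": False, "anticipation": True, "disgust": False, "fear": False,
    "joy": False, "love": False, "optimism": True, "pessimism": False,
    "sadness": False, "surprise": False, "trust": True,
}

print(active_emotions(example))  # ['anticipation', 'optimism', 'trust']
```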

### Data Fields

For every config of subtask 5:
- ID: string id of the tweet
- Tweet: text content of the tweet as a string
- anger: boolean, True if anger represents the mental state of the tweeter
- anticipation: boolean, True if anticipation represents the mental state of the tweeter
- disgust: boolean, True if disgust represents the mental state of the tweeter
- fear: boolean, True if fear represents the mental state of the tweeter
- joy: boolean, True if joy represents the mental state of the tweeter
- love: boolean, True if love represents the mental state of the tweeter
- optimism: boolean, True if optimism represents the mental state of the tweeter
- pessimism: boolean, True if pessimism represents the mental state of the tweeter
- sadness: boolean, True if sadness represents the mental state of the tweeter
- surprise: boolean, True if surprise represents the mental state of the tweeter
- trust: boolean, True if trust represents the mental state of the tweeter

Note that the test set has no labels, and therefore all labels are set to False.
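Multi-label models usually consume these boolean fields as a single 0/1 vector. The sketch below shows one way to build that vector and to compare a gold and a predicted label set with the Jaccard index. The helper names are hypothetical, and the Jaccard index is chosen here only as a common multi-label measure; whether it matches the task's official scoring should be checked against the task paper. Because test records carry all-`False` labels, scores computed against the test split would not be meaningful.

```python
EMOTIONS = [
    "anger", "anticipation", "disgust", "fear", "joy", "love",
    "optimism", "pessimism", "sadness", "surprise", "trust",
]


def to_multi_hot(record):
    """Map the eleven boolean fields to a list of 0/1 values in EMOTIONS order."""
    return [int(bool(record[name])) for name in EMOTIONS]


def jaccard(gold, pred):
    """Jaccard index between two multi-hot vectors.

    Two empty label sets are treated as a perfect match (1.0); this
    convention is an assumption made for the sketch.
    """
    gold_set = {i for i, v in enumerate(gold) if v}
    pred_set = {i for i, v in enumerate(pred) if v}
    union = gold_set | pred_set
    if not union:
        return 1.0
    return len(gold_set & pred_set) / len(union)


# Start from an all-False record and switch on a few labels.
blank = dict.fromkeys(EMOTIONS, False)
gold = to_multi_hot({**blank, "anticipation": True, "optimism": True, "trust": True})
pred = to_multi_hot({**blank, "optimism": True, "joy": True})

print(gold)                 # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
print(jaccard(gold, pred))  # 0.25 (one shared label out of four distinct labels)
```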

### Data Splits

| Language | Train | Dev | Test |
| ----- | ------ | ----- | ---- |
| English | 6,838 | 886 | 3,259|
| Arabic | 2,278 | 585 | 1,518|
| Spanish | 3,561 | 679 | 2,854|
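To see how each language's tweets are distributed over the splits, the counts in the table can be turned into per-language totals and fractions. This is a small arithmetic sketch using only the numbers above; `SPLITS` and `split_shares` are hypothetical names.

```python
# Split sizes copied from the table above.
SPLITS = {
    "English": {"train": 6838, "dev": 886, "test": 3259},
    "Arabic": {"train": 2278, "dev": 585, "test": 1518},
    "Spanish": {"train": 3561, "dev": 679, "test": 2854},
}


def split_shares(counts):
    """Return each split's fraction of the language's total tweet count."""
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


for language, counts in SPLITS.items():
    total = sum(counts.values())
    shares = split_shares(counts)
    summary = ", ".join(f"{split}: {share:.1%}" for split, share in shares.items())
    print(f"{language}: {total} tweets ({summary})")
```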


## Dataset Creation

### Curation Rationale

### Source Data

Tweets

#### Initial Data Collection and Normalization

#### Who are the source language producers?

Twitter users.

### Annotations

#### Annotation process

We presented one tweet at a time to the annotators and asked which of the following options best described the emotional state of the tweeter:

- anger (also includes annoyance, rage)
- anticipation (also includes interest, vigilance)
- disgust (also includes disinterest, dislike, loathing)
- fear (also includes apprehension, anxiety, terror)
- joy (also includes serenity, ecstasy)
- love (also includes affection)
- optimism (also includes hopefulness, confidence)
- pessimism (also includes cynicism, no confidence)
- sadness (also includes pensiveness, grief)
- surprise (also includes distraction, amazement)
- trust (also includes acceptance, liking, admiration)
- neutral or no emotion

Example tweets were provided in advance with examples of suitable responses.

On the Figure Eight task settings, we specified that we needed annotations from seven people for each tweet. However, because of the way the gold tweets were set up, they were annotated by more than seven people. The median number of annotations was still seven. In total, 303 people annotated between 10 and 4,670 tweets each. A total of 174,356 responses were obtained.

Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17. https://doi.org/10.18653/v1/S18-1001

#### Who are the annotators?

Crowdworkers on Figure Eight.

### Personal and Sensitive Information

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

### Other Known Limitations

## Additional Information

### Dataset Curators

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh and Svetlana Kiritchenko

### Licensing Information

See the official [Terms and Conditions](https://competitions.codalab.org/competitions/17751#learn_the_details-terms_and_conditions)

### Citation Information

@InProceedings{SemEval2018Task1,
author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
address = {New Orleans, LA, USA},
year = {2018}}

### Contributions

Thanks to [@maxpel](https://github.com/maxpel) for adding this dataset.
1 change: 1 addition & 0 deletions datasets/sem_eval_2018_task_1/dataset_infos.json
@@ -0,0 +1 @@
{"subtask5.english": {"description": " SemEval-2018 Task 1: Affect in Tweets: SubTask 5: Emotion Classification.\n This is a dataset for multilabel emotion classification for tweets.\n 'Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.'\n It contains 22467 tweets in three languages manually annotated by crowdworkers using Best\u2013Worst Scaling.\n", "citation": "@InProceedings{SemEval2018Task1,\n author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},\n title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},\n booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},\n address = {New Orleans, LA, USA},\n year = {2018}}\n", "homepage": "https://competitions.codalab.org/competitions/17751", "license": "", "features": {"ID": {"dtype": "string", "id": null, "_type": "Value"}, "Tweet": {"dtype": "string", "id": null, "_type": "Value"}, "anger": {"dtype": "bool", "id": null, "_type": "Value"}, "anticipation": {"dtype": "bool", "id": null, "_type": "Value"}, "disgust": {"dtype": "bool", "id": null, "_type": "Value"}, "fear": {"dtype": "bool", "id": null, "_type": "Value"}, "joy": {"dtype": "bool", "id": null, "_type": "Value"}, "love": {"dtype": "bool", "id": null, "_type": "Value"}, "optimism": {"dtype": "bool", "id": null, "_type": "Value"}, "pessimism": {"dtype": "bool", "id": null, "_type": "Value"}, "sadness": {"dtype": "bool", "id": null, "_type": "Value"}, "surprise": {"dtype": "bool", "id": null, "_type": "Value"}, "trust": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "sem_eval2018_task1", "config_name": "subtask5.english", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 809768, "num_examples": 
6838, "dataset_name": "sem_eval2018_task1"}, "test": {"name": "test", "num_bytes": 384519, "num_examples": 3259, "dataset_name": "sem_eval2018_task1"}, "validation": {"name": "validation", "num_bytes": 104660, "num_examples": 886, "dataset_name": "sem_eval2018_task1"}}, "download_checksums": {"http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/English/2018-E-c-En-train.zip": {"num_bytes": 359408, "checksum": "7a64a0ffc7d54505ae6556d17d37ad56bd8817ef5724c6e3782909e3a3bca0ae"}, "http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/English/2018-E-c-En-dev.zip": {"num_bytes": 48375, "checksum": "3279ba27452162b1ce0f58b23442ca3fb57c749c3dae7944cbda3ea0984c8a1e"}, "http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018englishtestfiles/2018-E-c-En-test.zip": {"num_bytes": 174899, "checksum": "9afa650190d749561749348e360fd1fc0d0a80c5f374d12cc5ef4b9a9ffc4430"}}, "download_size": 582682, "post_processing_size": null, "dataset_size": 1298947, "size_in_bytes": 1881629}, "subtask5.spanish": {"description": " SemEval-2018 Task 1: Affect in Tweets: SubTask 5: Emotion Classification.\n This is a dataset for multilabel emotion classification for tweets.\n 'Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.'\n It contains 22467 tweets in three languages manually annotated by crowdworkers using Best\u2013Worst Scaling.\n", "citation": "@InProceedings{SemEval2018Task1,\n author = {Mohammad, Saif M. 
and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},\n title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},\n booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},\n address = {New Orleans, LA, USA},\n year = {2018}}\n", "homepage": "https://competitions.codalab.org/competitions/17751", "license": "", "features": {"ID": {"dtype": "string", "id": null, "_type": "Value"}, "Tweet": {"dtype": "string", "id": null, "_type": "Value"}, "anger": {"dtype": "bool", "id": null, "_type": "Value"}, "anticipation": {"dtype": "bool", "id": null, "_type": "Value"}, "disgust": {"dtype": "bool", "id": null, "_type": "Value"}, "fear": {"dtype": "bool", "id": null, "_type": "Value"}, "joy": {"dtype": "bool", "id": null, "_type": "Value"}, "love": {"dtype": "bool", "id": null, "_type": "Value"}, "optimism": {"dtype": "bool", "id": null, "_type": "Value"}, "pessimism": {"dtype": "bool", "id": null, "_type": "Value"}, "sadness": {"dtype": "bool", "id": null, "_type": "Value"}, "surprise": {"dtype": "bool", "id": null, "_type": "Value"}, "trust": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "sem_eval2018_task1", "config_name": "subtask5.spanish", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 362549, "num_examples": 3561, "dataset_name": "sem_eval2018_task1"}, "test": {"name": "test", "num_bytes": 288692, "num_examples": 2854, "dataset_name": "sem_eval2018_task1"}, "validation": {"name": "validation", "num_bytes": 67259, "num_examples": 679, "dataset_name": "sem_eval2018_task1"}}, "download_checksums": {"http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/Spanish/2018-E-c-Es-train.zip": {"num_bytes": 156975, "checksum": "28547e933b3087b8a82d7997e15021ef2f3680f6a1b134ca41766ce44034a276"}, 
"http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/Spanish/2018-E-c-Es-dev.zip": {"num_bytes": 30152, "checksum": "399cd39ae7dc00b11b2f319dfbb9360614e86c92898318fdfd06af46a81f5ebe"}, "http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018spanishtestfiles/2018-E-c-Es-test.zip": {"num_bytes": 126924, "checksum": "3909e38a167ec40250b0b78f254e03fc3fb79ac7790bce6b695ef273a1d289d1"}}, "download_size": 314051, "post_processing_size": null, "dataset_size": 718500, "size_in_bytes": 1032551}, "subtask5.arabic": {"description": " SemEval-2018 Task 1: Affect in Tweets: SubTask 5: Emotion Classification.\n This is a dataset for multilabel emotion classification for tweets.\n 'Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.'\n It contains 22467 tweets in three languages manually annotated by crowdworkers using Best\u2013Worst Scaling.\n", "citation": "@InProceedings{SemEval2018Task1,\n author = {Mohammad, Saif M. 
and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},\n title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},\n booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},\n address = {New Orleans, LA, USA},\n year = {2018}}\n", "homepage": "https://competitions.codalab.org/competitions/17751", "license": "", "features": {"ID": {"dtype": "string", "id": null, "_type": "Value"}, "Tweet": {"dtype": "string", "id": null, "_type": "Value"}, "anger": {"dtype": "bool", "id": null, "_type": "Value"}, "anticipation": {"dtype": "bool", "id": null, "_type": "Value"}, "disgust": {"dtype": "bool", "id": null, "_type": "Value"}, "fear": {"dtype": "bool", "id": null, "_type": "Value"}, "joy": {"dtype": "bool", "id": null, "_type": "Value"}, "love": {"dtype": "bool", "id": null, "_type": "Value"}, "optimism": {"dtype": "bool", "id": null, "_type": "Value"}, "pessimism": {"dtype": "bool", "id": null, "_type": "Value"}, "sadness": {"dtype": "bool", "id": null, "_type": "Value"}, "surprise": {"dtype": "bool", "id": null, "_type": "Value"}, "trust": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "sem_eval2018_task1", "config_name": "subtask5.arabic", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 414458, "num_examples": 2278, "dataset_name": "sem_eval2018_task1"}, "test": {"name": "test", "num_bytes": 278715, "num_examples": 1518, "dataset_name": "sem_eval2018_task1"}, "validation": {"name": "validation", "num_bytes": 105452, "num_examples": 585, "dataset_name": "sem_eval2018_task1"}}, "download_checksums": {"http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/Arabic/2018-E-c-Ar-train.zip": {"num_bytes": 142792, "checksum": "cd25acadaf262e1e8dfb27c4d12f392ccb9caf648933a183fc0c83255a86f4a1"}, 
"http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/E-c/Arabic/2018-E-c-Ar-dev.zip": {"num_bytes": 37428, "checksum": "177e1eee9967cd5dd4b4853ef0cde694b9c20a7b4eb8bfbcb82b11d53cbd30f9"}, "http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/AIT2018-TEST-DATA/semeval2018arabictestfiles/2018-E-c-Ar-test.zip": {"num_bytes": 97606, "checksum": "4f1fc9f082c08c29b0acec180ebcb10ff425b96c117d8aa86a13ea092fce59f3"}}, "download_size": 277826, "post_processing_size": null, "dataset_size": 798625, "size_in_bytes": 1076451}}
Binary file not shown.
Binary file not shown.
Binary file not shown.