SemEval-2018 Task 1: Affect in Tweets (AIT-2018)

Organized by felipebravom

SemEval-2018 Task 1: Affect in Tweets

SemEval-2018: International Workshop on Semantic Evaluation will be held in conjunction with NAACL-2018 in New Orleans, LA, USA, June 5-6, 2018. 

Cite this paper for the task: SemEval-2018 Task 1: Affect in Tweets. Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

@InProceedings{SemEval2018Task1,
 author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
 title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
 booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
 address = {New Orleans, LA, USA},
 year = {2018}} 

The Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders, is available here. The EEC was the mystery test set added to the tweets test sets for the English EI-reg and V-reg tasks. Below is the *SEM paper describing the EEC dataset and the bias evaluation.

Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM), New Orleans, LA, USA, June 2018.

Join the official task mailing group: EmotionIntensity@googlegroups.com

It is crucial that you join the mailing list to receive the latest news and updates. Also note that even if you join the mailing list now, you will be able to see all messages posted earlier. 

The evaluation phase has concluded. Over 70 teams participated. The official results have been posted.

For teams that participated: please register your team (via this Google form) by February 1, 2018 (this is mandatory).


Background and Significance: We use language to communicate not only the emotion or sentiment we are feeling but also the intensity of the emotion or sentiment. For example, our utterances can convey that we are very angry, slightly sad, absolutely elated, etc. Here, intensity refers to the degree or amount of an emotion or degree of sentiment. We will refer to emotion-related categories such as anger, fear, sentiment, and arousal, by the term affect. Existing affect datasets are mainly annotated categorically without an indication of intensity. Further, past shared tasks have almost always been framed as classification tasks (identify one among n affect categories for this sentence). In contrast, it is often useful for applications to know the degree to which affect is expressed in text.

Tasks: We present an array of tasks where systems have to automatically determine the intensity of emotions (E) and intensity of sentiment (aka valence V) of the tweeters from their tweets. (The term tweeter refers to the person who has posted the tweet.) We also include a multi-label emotion classification task for tweets. For each task, we provide separate training and test datasets for English, Arabic, and Spanish tweets. The individual tasks are described below:

  1. EI-reg (an emotion intensity regression task): Given a tweet and an emotion E, determine the  intensity of E that best represents the mental state of the tweeter—a real-valued score between 0 (least E) and 1 (most E).
    • Separate datasets are provided for anger, fear, joy, and sadness.

  2. EI-oc (an emotion intensity ordinal classification task): Given a tweet and an emotion E, classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter.
    • Separate datasets are provided for anger, fear, joy, and sadness.

  3. V-reg (a sentiment intensity regression task): Given a tweet, determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter—a real-valued score between 0 (most negative) and 1 (most positive).

  4. V-oc (a sentiment analysis, ordinal classification, task): Given a tweet, classify it into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter.

  5. E-c (an emotion classification task): Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.

Here, E refers to emotion, EI refers to emotion intensity, V refers to valence or sentiment intensity, reg refers to regression, oc refers to ordinal classification, c refers to classification. 

Together, these tasks encompass various emotion and sentiment analysis tasks. You are free to participate in any number of tasks and on any of the datasets. Further details on each of the tasks are provided below.


1. Task EI-reg: Detecting Emotion Intensity (regression)

Given:

  • a tweet

  • an emotion E (anger, fear, joy, or sadness)

Task: determine the  intensity of E that best represents the mental state of the tweeter—a real-valued score between 0 and 1:

  • a score of 1: highest amount of E can be inferred

  • a score of 0: lowest amount of E can be inferred

For each language: 4 training sets and 4 test sets: one for each emotion E.

(Note that the absolute scores have no inherent meaning -- they are used only as a means to convey that the instances with higher scores correspond to a greater degree of E than instances with lower scores.)


2. Task EI-oc: Detecting Emotion Intensity (ordinal classification)

Given:

  • a tweet

  • an emotion E (anger, fear, joy, or sadness)

Task: classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter:

  • 0: no E can be inferred

  • 1: low amount of E can be inferred

  • 2: moderate amount of E can be inferred

  • 3: high amount of E can be inferred

For each language: 4 training sets and 4 test sets: one for each emotion E.


3. Task V-reg: Detecting Valence or Sentiment Intensity (regression)

Given:

  • a tweet

Task: determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter—a real-valued score between 0 and 1:

  • a score of 1: most positive mental state can be inferred

  • a score of 0: most negative mental state can be inferred

For each language: 1 training set, 1 test set.

(Note that the absolute scores have no inherent meaning -- they are used only as a means to convey that the instances with higher scores correspond to a greater degree of positive sentiment than instances with lower scores.) 


4. Task V-oc: Detecting Valence (ordinal classification) -- This is the traditional Sentiment Analysis Task

Given:

  • a tweet

Task: classify the tweet into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter:

  • 3: very positive mental state can be inferred

  • 2: moderately positive mental state can be inferred

  • 1: slightly positive mental state can be inferred

  • 0: neutral or mixed mental state can be inferred

  • -1: slightly negative mental state can be inferred

  • -2: moderately negative mental state can be inferred

  • -3: very negative mental state can be inferred

For each language: 1 training set, 1 test set. 


5. Task E-c: Detecting Emotions (multi-label classification) -- This is a traditional Emotion Classification Task

Given:

  • a tweet

Task: classify the tweet as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter:

  • anger (also includes annoyance and rage) can be inferred
  • anticipation (also includes interest and vigilance) can be inferred
  • disgust (also includes disinterest, dislike and loathing) can be inferred
  • fear (also includes apprehension, anxiety, concern, and terror) can be inferred
  • joy (also includes serenity and ecstasy) can be inferred
  • love (also includes affection) can be inferred
  • optimism (also includes hopefulness and confidence) can be inferred
  • pessimism (also includes cynicism and lack of confidence) can be inferred
  • sadness (also includes pensiveness and grief) can be inferred
  • surprise (also includes distraction and amazement) can be inferred
  • trust (also includes acceptance, liking, and admiration) can be inferred


For each language: 1 training set, 1 test set.

(Note that the set of emotions includes the eight basic emotions as per Plutchik (1980), as well as a few other emotions that are common in tweets (love, optimism, and pessimism).)


Paper: Participants will be given the opportunity to write a system-description paper that describes their system, resources used, results, and analysis. This paper will be part of the official SemEval-2018 proceedings. The paper is to be four pages long plus two pages at most for references. The papers are to follow the format and style files provided by ACL/NAACL/EMNLP-2018.

Related Past Shared Tasks on Affect Intensity

EVALUATION

The full official evaluation script that covers all subtasks is available here. You should run the script on your system's predictions for purposes such as cross-validation experiments, tracking progress on the development set, and checking the format of your submission.

The CodaLab website for the 2017 task is still open. You can train on the official 2017 training data and test on the official 2017 test set and compare against the best 2017 systems on the Leaderboard.

For the Tasks EI-reg, EI-oc, V-reg, and V-oc 

 

Official Competition Metric: For each task, language, and affect category, systems are evaluated by calculating the Pearson Correlation Coefficient with the Gold ratings/labels.

  • The correlation scores across all four emotions will be averaged (macro-average) to determine the bottom-line competition metric for EI-reg and EI-oc by which the submissions will be ranked for those tasks.

  • The correlation scores for valence will be used as the bottom-line competition metric for V-reg and V-oc by which the submissions will be ranked for those tasks.
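
To make the official metric concrete, here is a minimal Python sketch of how the macro-averaged Pearson score for EI-reg/EI-oc might be computed. The function name and input layout are illustrative assumptions; the official evaluation script linked above remains the authoritative implementation.

from scipy.stats import pearsonr

# Illustrative sketch: macro-averaged Pearson correlation over the four emotions.
# gold_by_emotion and pred_by_emotion are assumed to be dicts mapping an emotion
# name to a list of gold/predicted intensity scores, aligned by tweet ID.
def macro_avg_pearson(gold_by_emotion, pred_by_emotion):
    emotions = ["anger", "fear", "joy", "sadness"]
    correlations = []
    for emotion in emotions:
        r, _p_value = pearsonr(gold_by_emotion[emotion], pred_by_emotion[emotion])
        correlations.append(r)
    return sum(correlations) / len(correlations)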

 

Secondary Evaluation Metrics: Apart from the official competition metric described above, some additional metrics will also be calculated for your submissions. These are intended to provide a different perspective on the results. 

The secondary metric used for the regression tasks:

  • Pearson correlation for a subset of the test set that includes only those tweets with intensity scores greater than or equal to 0.5.

The secondary metrics used for the ordinal classification tasks:

  • Pearson correlation for a subset of the test set that includes only those tweets with intensity classes low X, moderate X, or high X (where X is an emotion). We will refer to this set of tweets as the some-emotion subset.
  • Weighted quadratic kappa on the full test set.
  • Weighted quadratic kappa on the some-emotion subset of the test set.
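
As a rough sketch of how these secondary metrics for the ordinal-classification tasks could be reproduced (the helper below and its input layout are assumptions; the official evaluation script is authoritative):

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Illustrative sketch for EI-oc: gold and pred are lists of integer intensity
# classes (0 = no emotion, 1 = low, 2 = moderate, 3 = high), aligned by tweet ID.
def secondary_metrics(gold, pred):
    r_full, _ = pearsonr(gold, pred)
    kappa_full = cohen_kappa_score(gold, pred, weights="quadratic")

    # "Some-emotion" subset: tweets whose gold class is low, moderate, or high.
    pairs = [(g, p) for g, p in zip(gold, pred) if g > 0]
    gold_se, pred_se = zip(*pairs)
    r_se, _ = pearsonr(gold_se, pred_se)
    kappa_se = cohen_kappa_score(list(gold_se), list(pred_se), weights="quadratic")
    return r_full, kappa_full, r_se, kappa_se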

 

For the Task E-c 

 

Official Competition Metric: For each language, systems are evaluated by calculating multi-label accuracy (or Jaccard index). Since this is a multi-label classification task, each tweet can have one or more gold emotion labels, and one or more predicted emotion labels. Multi-label accuracy is defined as the size of the intersection of the predicted and gold label sets divided by the size of their union. This measure is calculated for each tweet t, and then is averaged over all tweets in the dataset T:

\textrm{Accuracy (Jaccard)} = \frac{1}{|T|} \sum_{t \in T} \frac{|G_t \cap P_t|}{|G_t \cup P_t|}

where G_t is the set of gold labels for tweet t, P_t is the set of predicted labels for tweet t, and T is the set of tweets.
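
As a concrete, hedged illustration of the multi-label accuracy defined above, here is a minimal Python sketch. It assumes gold and predicted labels are given as sets of emotion names per tweet, and the handling of two empty label sets ('neutral or no emotion') is an assumption; the official evaluation script remains authoritative.

# Illustrative sketch of multi-label (Jaccard) accuracy for the E-c task.
# gold_sets and pred_sets are assumed to be aligned lists of Python sets of
# emotion names per tweet, e.g., {"joy", "optimism"}; an empty set stands for
# 'neutral or no emotion'.
def jaccard_accuracy(gold_sets, pred_sets):
    total = 0.0
    for gold, pred in zip(gold_sets, pred_sets):
        if not gold and not pred:      # both 'neutral or no emotion' (assumed to count as 1)
            total += 1.0
        else:
            total += len(gold & pred) / len(gold | pred)
    return total / len(gold_sets)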


Secondary Evaluation Metrics: Apart from the official competition metric (multi-label accuracy), we will also calculate micro-averaged F-score and macro-averaged F-score for your submissions. These additional metrics are intended to provide a different perspective on the results. 

Micro-averaged F-score is calculated as follows:

F_{micro} = \frac{2 \, P_{micro} \, R_{micro}}{P_{micro} + R_{micro}}, \quad \textrm{where} \quad P_{micro} = \frac{\sum_{e \in E} TP_e}{\sum_{e \in E} (TP_e + FP_e)} \quad \textrm{and} \quad R_{micro} = \frac{\sum_{e \in E} TP_e}{\sum_{e \in E} (TP_e + FN_e)}

where E is the given set of eleven emotions, and TP_e, FP_e, and FN_e are the numbers of true positives, false positives, and false negatives for emotion e.

 

Macro-averaged F-score is calculated as follows:

F_{macro} = \frac{1}{|E|} \sum_{e \in E} F_e

where F_e is the F-score computed for emotion e.

Terms and Conditions

By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, send us an email and we will consider whether an exception can be made.

  • By submitting results to this competition, you consent to the public release of your scores at this website and at SemEval-2018 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • A participant can be involved in exactly one team (no more). If there are reasons why it makes sense for you to be on more than one team, then email us before the evaluation period begins. In special circumstances this may be allowed.

  • Each team must create and use exactly one CodaLab account.

  • Team constitution (members of a team) cannot be changed after the evaluation period has begun.

  • During the evaluation period:

    • Each team can submit as many as fifty submissions. However, only the final submission will be considered the official submission to the competition.

    • You will not be able to see results of your submission on the test set.

    • You will be able to see any warnings and errors for each of your submissions.

    • The leaderboard is disabled.

  • Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.

  • We will make the final submissions of the teams public at some point after the evaluation period.

  • The organizers and their affiliated institutions make no warranties regarding the datasets provided, including, but not limited to, their correctness or completeness. They cannot be held liable for providing access to the datasets or for how the datasets are used.

  • Each task participant will be assigned other teams' system-description papers to review, using the START system. The papers will thus be peer reviewed.
  • The dataset should only be used for scientific or research purposes. Any other use is explicitly prohibited.

  • The datasets must not be redistributed or shared in part or full with any third party. Redirect interested parties to this website.

  • If you use any of the datasets provided here, cite this paper: Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. Semeval-2018 Task 1: Affect in tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

Organizers of the shared task:

 

    Saif M. Mohammad

    saif.mohammad@nrc-cnrc.gc.ca

    National Research Council Canada

 

    Felipe Bravo-Marquez

    fbravoma@waikato.ac.nz

    The University of Waikato

 

    Mohammad Salameh

    msalameh@qatar.cmu.edu

    Carnegie Mellon University, Qatar

 

    Svetlana Kiritchenko

    svetlana.kiritchenko@nrc-cnrc.gc.ca

    National Research Council Canada

 

Post emails about the task on the task mailing list: EmotionIntensity@googlegroups.com

If you need to send an email to only the task organizers, send it to: aff-int-organizers@googlegroups.com

DATA

(See 'Terms and Conditions' page for terms of use.) 

 

POST-COMPETITION: The official competition is now over, but you are welcome to develop and test new solutions on this website. All data with gold labels (training, development, and test) are available here. The test data in this archive do not include the instances from the Equity Evaluation Corpus (EEC) used for bias evaluation. The EEC corpus is available here.

 

 

If you use any of the data below, please cite this paper:

 

SemEval-2018 Task 1: Affect in Tweets. Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

 

@inproceedings{SemEval2018Task1,
author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
address = {New Orleans, LA, USA},
year = {2018}}

Further details of the English data creation methodology are available in this paper:

Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories. Saif M. Mohammad and Svetlana Kiritchenko. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.

@inproceedings{LREC18-TweetEmo,
author = {Mohammad, Saif M. and Kiritchenko, Svetlana},
title = {Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories},
booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference},
year = {2018},
address={Miyazaki, Japan}}

 

The Spanish and Arabic data creation followed the same approach with some implementation differences (as stated in the SemEval-2018 Task 1 paper above).

 

The Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders, is available here. The EEC was the mystery test set added to the tweets test sets for the English EI-reg and V-reg tasks. Below is the *SEM paper describing the EEC dataset and the bias evaluation.

 

Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM), New Orleans, LA, USA, June 2018.

 

@InProceedings{SA-Biases2018,
 author = {Kiritchenko, Svetlana and Mohammad, Saif M.},
 title = {Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems},
 booktitle = {Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM)},
 address = {New Orleans, LA, USA},
 year = {2018}} 

 

 

SemEval-2018 Affect in Tweets DIstant Supervision Corpus (SemEval-2018 AIT DISC)

This corpus of tweets was collected by polling the Twitter API for tweets that included emotion-related words such as '#angry', 'annoyed', 'panic', 'happy', 'elated', 'surprised', etc. The full list of query terms will be made available February 2018 (after the evaluation period). You are free to use this corpus to make submissions for any of the five tasks. 

 

Training, Development, and Test Datasets: For five tasks and three languages


The evaluation phase has concluded. The gold labels for the test data have been made available. The post-evaluation phase will stay open, and you can continue to upload submissions if you wish. However, make sure not to train in any way on the gold labels for the test data.

In the data files below, E refers to emotion, EI refers to emotion intensity, V refers to valence or sentiment intensity, reg refers to regression, oc refers to ordinal classification, c refers to classification. All test sets were released January 5, 2018. 

  • EI-reg:

    • English (Note: This particular training set was created from a BWS annotation effort in 2016. The development and test sets were created from a common 2017 annotation effort. Thus, the scores for tweets across the training and development sets or across the training and test sets are not directly comparable. However, the scores in each dataset indicate relative positions of the tweets in that dataset.)  

      • Training Set (taken from EmoInt 2017, re-released Aug, 2017; last updated Nov. 23, 2017)      

      • Development Set (released Sep. 25, 2017; last updated Nov. 23, 2017)

      • Test Set

    • Arabic

    • Spanish

      • Training Set (released Oct. 12, 2017; last updated Nov. 23, 2017)      

      • Development Set (released Oct. 12, 2017; last updated Nov. 23, 2017)  

  • EI-oc:

  • V-reg:

  • V-oc:

  • E-c: (Regarding the notation in these files: For a given emotion, 1 means emotion can be inferred, whereas 0 means emotion cannot be inferred. 0's for all of the 11 emotions means 'neutral or no emotion'.)


Note: The November 23, 2017 update to the data includes these changes:

  • All files have a header row indicating what each column stands for.
  • IDs have a new format to make them all consistent. We also drop 'train', 'dev', and 'test' from the IDs.
  • New tweets added to training and development data.
  • Some tweets moved from the development set to the training set to ensure consistency: a tweet that occurs in multiple datasets (say for an emotion and valence) will be in the training set everywhere or development set everywhere.
  • Some file names changed to make the naming consistent across all files. 

Note: The Arabic E-c training and development data were updated November 28, 2017. The update removes three duplicate tweets from the training set and one duplicate from the development set.

Even though the changes are small, it is crucial that you delete old copies of the data and download the data again with these updates.

Note: The datasets above share a large number of common tweets; however, they were often created from independent annotations by different annotators. Further, decisions on where to mark thresholds in the different datasets were made independently as well. For example, in E-c we chose a somewhat generous criterion: if at least two out of seven people indicate that a certain emotion can be inferred, then that emotion is chosen as one of the labels for the tweet (likely along with another emotion with 3, 4, or 5 votes). Thus, a small number of inconsistencies in the annotations across different datasets is expected. For example, a tweet may be marked as 'no anger' in EI-oc, but may have 'anger' as one of its labels in E-c. Of course, such instances are greatly outnumbered by consistent annotations across the datasets.
 

Query Terms: This distribution includes lists of query terms used to poll Twitter to obtain tweets. The training, development, and test sets for SemEval-2018 Task 1 were created by sampling from these tweets. The distant supervision corpora released as part of the competition were also created by sampling from the remaining tweets. We include query terms used for all three languages (English, Arabic, and Spanish) in the corresponding folders. The English folder includes two subfolders: (1) EmoInt-2017: for the query terms used to collect the WASSA 2017 shared task (EmoInt) tweets (which in turn formed the training data for SemEval-2018 Task 1), and (2) SemEval2018-Task1: for the query terms used to collect the dev and test set tweets of SemEval-2018 Task 1.


Additional Mystery Test Set for Some Tasks:

For certain task--language combinations, in addition to the main test set, we will also include a mystery test set. No details will be given about the source of this mystery test set until after the evaluation period. Below are some notes about this:
  • The mystery test set will be provided for EI-reg--English and V-reg--English.
  • The primary evaluation of your system will not be based on the mystery test set. The leaderboard will only show results on the main test set (not the mystery test set).
  • The main test set and the mystery test set will be combined into one test file (for EI-reg--English and V-reg--English). (The mystery test set instances will have 'mystery' as part of their id.) Thus for all practical purposes, you do not have to do anything different than what you were already planning to do. Run your system on the test files for the task you are participating in and upload the submission.
  • You should not make any changes to your system by looking at any test set (main or mystery). The same system as developed for the main test set is to be used on the mystery test set.
  • We will provide details about the mystery test set after the evaluation period. Results on the mystery test set will be provided at least two weeks after the evaluation period. These results will not be shown on the leaderboard. The later date (compared to results on the main test set) is because we have to evaluate this outside of CodaLab.

Submission format:

A valid submission for CodaLab is a zip-compressed file with files containing the predictions made for all the subtasks you want to participate in. Note that even if you upload results from multiple submissions onto the leaderboard, only your latest submission is displayed there. During the evaluation period, each team can submit as many as fifty submissions. However, only the final submission will be considered the official submission to the competition. (Make sure to upload it to the leaderboard.) This means that your final submission must include your entries for all the tasks you want to participate in.

Submitted files must have the same format as the training and test files after replacing the NONEs in the last columns with your system's predictions. The filenames associated with each subtask and the corresponding line formats are given below:

  • EI-reg:

header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Score

data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_score

Note that the emotion name must be in English even for Spanish and Arabic data.

    • English

      • EI-reg_en_anger_pred.txt

      • EI-reg_en_fear_pred.txt

      • EI-reg_en_sadness_pred.txt

      • EI-reg_en_joy_pred.txt

    • Arabic

      • EI-reg_ar_anger_pred.txt

      • EI-reg_ar_fear_pred.txt

      • EI-reg_ar_sadness_pred.txt

      • EI-reg_ar_joy_pred.txt

    • Spanish

      • EI-reg_es_anger_pred.txt

      • EI-reg_es_fear_pred.txt

      • EI-reg_es_sadness_pred.txt

      • EI-reg_es_joy_pred.txt

 

  • EI-oc:

header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Class

data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_class

    • English

      • EI-oc_en_anger_pred.txt

      • EI-oc_en_fear_pred.txt

      • EI-oc_en_sadness_pred.txt

      • EI-oc_en_joy_pred.txt

    • Arabic

      • EI-oc_ar_anger_pred.txt

      • EI-oc_ar_fear_pred.txt

      • EI-oc_ar_sadness_pred.txt

      • EI-oc_ar_joy_pred.txt

    • Spanish

      • EI-oc_es_anger_pred.txt

      • EI-oc_es_fear_pred.txt

      • EI-oc_es_sadness_pred.txt

      • EI-oc_es_joy_pred.txt

 

  • V-reg:

header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Score

data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_score

    • English

      • V-reg_en_pred.txt

    • Arabic

      • V-reg_ar_pred.txt

    • Spanish

      • V-reg_es_pred.txt

 

  • V-oc:

header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Class

data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_class

    • English

      • V-oc_en_pred.txt

    • Arabic

      • V-oc_ar_pred.txt

    • Spanish

      • V-oc_es_pred.txt

 

  • E-c:

header row: ID[tab]Tweet[tab]anger[tab]anticipation[tab]disgust[tab]fear[tab]joy[tab]love[tab]optimism[tab]pessimism[tab]sadness[tab]surprise[tab]trust

data row: $id[tab]$tweet[tab]$anger_val[tab]$anticipation_val[tab]$disgust_val[tab]$fear_val[tab]$joy_val[tab]

$love_val[tab]$optimism_val[tab]$pessimism_val[tab]$sadness_val[tab]$surprise_val[tab]$trust_val

(Note: Each emotion value (e.g., $love_val) takes binary values: 1 means emotion can be inferred, whereas 0 means emotion cannot be inferred. 0's for all of the 11 emotions means 'neutral or no emotion'.)

    • English

      • E-C_en_pred.txt

    • Arabic

      • E-C_ar_pred.txt

    • Spanish

      • E-C_es_pred.txt


Participants are not required to participate in all subtasks. A valid submission must provide at least all the files associated with one combination of subtask and language.

Example of a valid combination of files:

  • EI-reg_en_anger_pred.txt

  • EI-reg_en_fear_pred.txt

  • EI-reg_en_sadness_pred.txt

  • EI-reg_en_joy_pred.txt

A zip file containing the above files will count as participating only in the EI-reg task for English.

Example of an invalid combination of files:

  • EI-oc_en_sadness_pred.txt

  • EI-reg_en_joy_pred.txt

  • EI-reg_es_joy_pred.txt
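
As a rough illustration of how a valid submission such as the EI-reg English example above might be assembled, here is a hedged Python sketch. The helper names and the predictions data structure are assumptions; the file names and column layout follow the format described above.

import zipfile

# Illustrative sketch: write EI-reg English prediction files in the tab-separated
# format described above and bundle them into a zip for upload to CodaLab.
def write_pred_file(path, rows):
    # rows: iterable of (id, tweet, affect_dimension, intensity_score) tuples,
    # i.e., the test-file columns with NONE replaced by the system's prediction.
    with open(path, "w", encoding="utf-8") as f:
        f.write("ID\tTweet\tAffect Dimension\tIntensity Score\n")
        for tweet_id, tweet, emotion, score in rows:
            f.write(f"{tweet_id}\t{tweet}\t{emotion}\t{score:.3f}\n")

def build_submission(predictions_by_emotion, zip_path="submission.zip"):
    # predictions_by_emotion: assumed dict mapping "anger"/"fear"/"joy"/"sadness"
    # to the corresponding list of prediction rows.
    with zipfile.ZipFile(zip_path, "w") as zf:
        for emotion, rows in predictions_by_emotion.items():
            filename = f"EI-reg_en_{emotion}_pred.txt"
            write_pred_file(filename, rows)
            zf.write(filename)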

Schedule:

  • Training data ready: September 25, 2017

  • Evaluation period starts: January 8, 2018

  • Evaluation period ends: January 28, 2018

  • Results posted: Feb 5, 2018

  • System description paper submission deadline: Mon 5 Mar, 2018 by 23:59 GMT -12:00.

  • Author notifications : Mon 02 Apr, 2018

  • Camera ready submissions due: Mon 16 Apr, 2018

Manual Annotation: Obtaining real-valued annotations poses several challenges. Respondents are faced with a higher cognitive load when asked for real-valued scores as opposed to simply classifying terms into pre-chosen discrete classes. Besides, it is difficult for an annotator to remain consistent across their annotations. Further, the same score may correspond to different degrees of sentiment in the minds of different annotators. One could overcome these problems by providing annotators with pairs of terms and asking which is stronger in terms of association with the property of interest (a comparative approach); however, that requires a much larger set of annotations (on the order of N×N, where N is the number of instances to be annotated).

Best–Worst Scaling (BWS), also sometimes referred to as Maximum Difference Scaling (MaxDiff), is an annotation scheme that exploits the comparative approach to annotation (Louviere and Woodworth, 1990; Cohen, 2003; Louviere et al., 2015). Annotators are given four items (4-tuple) and asked which item is the Best (highest in terms of the property of interest) and which is the Worst (least in terms of the property of interest). These annotations can then be easily converted into real-valued scores of association between the items and the property, which eventually allows for creating a ranked list of items as per their association with the property of interest.

Kiritchenko and Mohammad (2016, 2017) show that ranking of terms remains remarkably consistent even when the annotation process is repeated with a different set of annotators. See the hyperlinked webpages for details on Reliability of the Annotations and a comparison of BWS with Rating Scales.

We created all the datasets for this task using Best–Worst Scaling.
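
As an informal illustration of how Best–Worst Scaling judgments can be turned into real-valued scores, the sketch below uses the simple counting procedure (an item's score is the proportion of times it was chosen best minus the proportion of times it was chosen worst, rescaled here to [0, 1]). The exact conversion and rescaling used by the organizers are described in the task paper and in Kiritchenko and Mohammad (2016, 2017); treat this as an assumption-laden sketch, not their implementation.

from collections import defaultdict

# Illustrative sketch of the simple counting procedure for BWS:
# each annotation is (items, best, worst), where items is the 4-tuple shown
# to the annotator and best/worst are the items they selected.
def bws_scores(annotations):
    best = defaultdict(int)
    worst = defaultdict(int)
    appearances = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            appearances[item] += 1
        best[b] += 1
        worst[w] += 1
    # Raw score in [-1, 1]: %best minus %worst; rescaled to [0, 1] (assumed convention).
    return {item: ((best[item] - worst[item]) / appearances[item] + 1) / 2
            for item in appearances}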

 

Papers:

You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. You must report all resources used in the system-description paper.

Baseline System

In order to assist testing of ideas, we also provide the AffectiveTweets package that you can use and build on. A common use of the package is to generate feature vectors from various resources and append them to one's own feature representation of the tweet. The use of this package is completely optional. It is available here. Instructions for using the package are available here.

The AffectiveTweets package was used by the teams that ranked first, second, and third in the WASSA-2017 Shared Task on Emotion Intensity.


Word-Emotion and Word-Sentiment Association lexicons

Large lists of manually created and automatically generated word-emotion and word-sentiment association lexicons are available here.

References:

  • Emotion Intensities in Tweets. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the sixth joint conference on lexical and computational semantics (*Sem), August 2017, Vancouver, Canada.

  • WASSA-2017 Shared Task on Emotion Intensity. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.

  • Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories. Saif M. Mohammad and Svetlana Kiritchenko. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.
  • Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. Semeval-2018 Task 1: Affect in tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.
  • Picard, R. W. (1997, 2000). Affective computing. MIT press.

  • Using Hashtags to Capture Fine Emotion Categories from Tweets. Saif M. Mohammad, Svetlana Kiritchenko, Computational Intelligence, Volume 31, Issue 2, Pages 301-326, May 2015.

  • Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Kiritchenko, S. and Mohammad, S. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL-2017), Vancouver, Canada, 2017.

  • Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.

  • Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6 (3), 169-200.

  • #Emotional Tweets, Saif Mohammad, In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*Sem), June 2012, Montreal, Canada.

  • Portable Features for Classifying Emotional Text, Saif Mohammad, In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2012, Montreal, Canada.

  • Strapparava, C., & Mihalcea, R. (2007). Semeval-2007 task 14: Affective text. In Proceedings of SemEval-2007, pp. 70-74, Prague, Czech Republic.

  • From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales, Saif Mohammad, In Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), June 2011, Portland, OR.

  • Plutchik, R. (1980). A general psychoevolutionary theory of emotion. Emotion: Theory, research, and experience, 1(3), 3-33.

  • Stance and Sentiment in Tweets. Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media, In Press.

  • Determining Word-Emotion Associations from Tweets by Multi-Label Classification. Felipe Bravo-Marquez, Eibe Frank, Saif Mohammad, and Bernhard Pfahringer. In Proceedings of the 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI'16), Omaha, Nebraska, USA.

  • Challenges in Sentiment Analysis. Saif M. Mohammad, A Practical Guide to Sentiment Analysis, Springer, 2016.

  • Osgood, C. E., Suci, G. J., & Tannenbaum, P. (1957). The measurement of meaning. University of Illinois Press.

  • Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best-Worst Scaling. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2016. San Diego, CA.

  • Ortony, A., Clore, G. L., & Collins, A. (1988). The Cognitive Structure of Emotions. Cambridge University Press.

  • Semeval-2016 Task 7: Determining Sentiment Intensity of English and Arabic Phrases. Svetlana Kiritchenko, Saif M. Mohammad, and Mohammad Salameh. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-16). June 2016. San Diego, California.

  • Alm, C. O. (2008). Affect in text and speech. ProQuest.

  • Aman, S., & Szpakowicz, S. (2007). Identifying expressions of emotion in text. In Text, Speech and Dialogue, Vol. 4629 of Lecture Notes in Computer Science, pp. 196-205.

  • The Effect of Negators, Modals, and Degree Adverbs on Sentiment Composition. Svetlana Kiritchenko and Saif M. Mohammad, In Proceedings of the NAACL 2016 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), June 2016, San Diego, California.

  • Sentiment Analysis: Detecting Valence, Emotions, and Other Affectual States from Text. Saif M. Mohammad, Emotion Measurement, 2016.

  • NRC-Canada-2014: Detecting Aspects and Sentiment in Customer Reviews, Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif M. Mohammad. In Proceedings of the eighth international workshop on Semantic Evaluation Exercises (SemEval-2014), August 2014, Dublin, Ireland.

  • Barrett, L. F. (2006). Are emotions natural kinds?. Perspectives on psychological science, 1(1), 28-58.

FAQ

 

Q. Why restrict the number of official submissions to one?

A. Since this is a competition, we do not want teams to submit a large number of submissions using different parameters and systems without being confident which will work best.

Even though the number of official submissions is restricted to one, the gold data will be released soon after the evaluation period. Thus you can use it to determine results from many different system variants. You are strongly encouraged to report these additional results in the system-description paper in addition to the official submission results.

 

Number of submissions allowed per team on CodaLab in the test phase is restricted to 50. However, only your final valid submission will be your official submission to the competition.


Q. How do I include more than one score on the leaderboard?

A. CodaLab allows only one score on the leaderboard per user.

Directions for Participating via CodaLab
Steps:
  1. Create an account in CodaLab (https://competitions.codalab.org/). Sign in.

  2. Edit your profile appropriately. Make sure to add a team name, and enter names of team members. (Go to "Settings", and look under "Competition settings".)

  3. Read information on all the pages of the task website.

  4. Download data: training, development, and test (when released)

  5. Run your system on the data and generate a submission file.

  6. Make submissions on the development set (Phase 1).

    • Wait a few moments for the submission to execute.

    • Click on the ‘Refresh Status’ button to check status.

    • Check to make sure submission is successful:

      • System will show status as “Finished”

      • Click on ‘Download evaluation output from scoring step’ to examine the result.

      • If you choose to, you can upload the result on the leaderboard.

    • If unsuccessful, check error log, fix format issues (if any), resubmit updated zip.

    • Number of submissions allowed is restricted to 50.  

  7. Once the evaluation period begins, you can make submissions for the test set (Phase 2). The procedure is similar to that on the dev set. These differences apply:

    • The leaderboard will be disabled until the end of the evaluation period.

    • You cannot see the results of your submission. They will be posted on a later date after the evaluation period ends. 

    • You can still see if your submission was successful or resulted in some error.

    • In case of error, you can view the error log.

    • Number of submissions allowed per team is restricted to 50. However, only your final valid submission will be your official submission to the competition.

System-Description Papers 

Participants who made a submission on the CodaLab website during the official evaluation period are given the opportunity to submit a system-description paper that describes their system, resources used, results, and analysis.  This paper will be part of the official SemEval-2018 proceedings.

If describing only one task, then up to 4 pages + references. If you took part in two or more tasks then you can go up to 6 pages + references. If you took part in four or five tasks and would really like 8 pages + references, send us an email.

See details provided by SemEval here.
Link to submit your system description paper is also provided there. 

Papers are due Mon 26 February, 2018, by 23:59 GMT -12:00.

You do not have to repeat details of the task and data. Just cite the task paper (details below), quickly summarize the tasks you made submissions to, and then get into the details of related work, your submissions, experiments, and results.

Cite the task paper as shown below:

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. Semeval-2018 Task 1: Affect in tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

@InProceedings{SemEval2018Task1,
 author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
 title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
 booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
 address = {New Orleans, LA, USA},
 year = {2018}}

A copy of this paper will be made available in early March. This is after the deadline for your paper submission, but you will be able to see this paper well before the camera-ready deadline. So, after getting access to the task paper, you can still update your paper as you see fit.

The paper below describes how the English data was created:

Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories. Saif M. Mohammad and Svetlana Kiritchenko. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.

@inproceedings{LREC18-TweetEmo,
 author = {Mohammad, Saif M. and Kiritchenko, Svetlana},
 title = {Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories},
 booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference},
 year = {2018},
 address={Miyazaki, Japan}}

Spanish and Arabic data creation followed a largely similar approach, but specific details will be provided in the AIT task paper above.

See the EmoInt-2017 website for the previous iteration of this task, and the corresponding EmoInt-2017 shared task paper (BibTex). System papers from that task can be found in the proceedings (look for papers with EmoInt in the title).

Important Notes:

  • You are not obligated to submit a system-description paper, however, we strongly encourage all participating teams to do so.

  • SemEval seeks to have all participants publish a paper, unless the paper does a poor job of describing their system. Your system rank and scores will not impact whether the paper is accepted or not.

  • Note that SemEval submission is not anonymous; author names should be included.

  • Later, each task participant will be assigned other teams' system-description papers to review, using the START system.

  • All task participant teams should prepare a poster for display at SemEval. One selected team will be asked to prepare a short talk. Details will be provided at a later date.

  • Please do not dwell too much on rankings. Focus instead on analysis and the research questions that your system can help address.

  • References: References do not count against the page limit (6 pages for system description papers). You may have at most *four pages* for references.

  • It may also be helpful to look at some of the papers from past SemEval competitions, e.g., from https://aclweb.org/anthology/S/S16/.

What to include in a system-description paper?

Here are some key pointers:

  • Replicability: Present all details that will allow someone else to replicate your system.

  • Analysis: Focus more on results and analysis and less on discussing rankings. Report results on several variants of the system (even beyond the official submission); present sensitivity analysis of your system's parameters, network architecture, etc.; present ablation experiments showing usefulness of different features and techniques; show comparisons with baselines. You can use the gold labels that we will release later next week for the extra analysis. However, clearly mark what the official submission results were and what the ranks were.

  • Related work: Place your work in context of previously published related work. Cite all data and resources used in your submission.

FAQ

Q. My system did not get a good rank. Should I still write a system-description paper?

Ans. We encourage all participants to submit a system description paper. The goal is to record all the approaches that were used and how effective they were. Do not dwell too much on rankings. Focus instead on analysis and the research questions that your system can help address. What has not worked is also useful information.

You can also write a paper with a focus on testing a hypothesis that your system and this task allow you to explore.

Q. Can we describe results of new techniques that we haven't submitted to the eval phase?
Ans. Yes, you are allowed, and even encouraged. But: clearly mark what the official submission results were and what the ranks were.

Q. I took part in multiple tasks for SemEval-2018 Task1 Affect in Tweets. Specifically EI-reg English, EI-oc Spanish, and E-C English. 
 
- How many papers must I write? 
Ans. One paper describing system and results for all Affect in Tweets tasks.
 
- What should the title prefix look like? 
Ans. Your title should be something like this: "<team name> at SemEval-2018 Task 1: [Some More Title Text]"
The "1" here is because AIT-2018 is SemEval-201 Task 1. Not because of the EI-reg task in Task 1.
 
- How many pages can my paper be?
Ans. Since you took part in multiple tasks, your paper can be up to 6 pages + references. (If describing only one task, then up to 4 pages + references. If you took part in four or five tasks and would really like 8 pages + references, send us an email.)
 
Q. How do I cite the task?
 
Ans. All system papers must cite the task paper: 

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. Semeval-2018 Task 1: Affect in tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

@InProceedings{SemEval2018Task1,
 author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
 title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
 booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
 address = {New Orleans, LA, USA},
 year = {2018}}

Additionally, we will be grateful if you also cite the paper below, which describes how the data was created:

Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories. Saif M. Mohammad and Svetlana Kiritchenko. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.

@inproceedings{LREC18-TweetEmo,
 author = {Mohammad, Saif M. and Kiritchenko, Svetlana},
 title = {Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories},
 booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference},
 year = {2018},
 address={Miyazaki, Japan}}

Pre-Evaluation Period

Start: Aug. 14, 2017, midnight

Evaluation Period

Start: Jan. 8, 2018, midnight

Post-Evaluation Period

Start: Jan. 28, 2018, 11:59 p.m.

Competition Ends

Never
