SemEval-2018 Task 1: Affect in Tweets
SemEval-2018: International Workshop on Semantic Evaluation will be held in conjunction with NAACL-2018 in New Orleans, LA, USA, June 5-6, 2018.
Cite this paper for the task: SemEval-2018 Task 1: Affect in Tweets. Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.
@InProceedings{SemEval2018Task1,
author = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
title = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
address = {New Orleans, LA, USA},
year = {2018}}
The Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders, is available here. The EEC was the mystery test set added to the tweet test sets for the English EI-reg and V-reg tasks. Below is the *SEM paper describing the EEC dataset and the bias evaluation.
Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM), New Orleans, LA, USA, June 2018.
Join the official task mailing group: EmotionIntensity@googlegroups.com
It is crucial that you join the mailing list to receive the latest news and updates. Also note that even if you join the mailing list now, you will be able to see all messages posted earlier.
The evaluation phase has concluded. Over 70 teams participated. The official results have been posted.
If your team participated, please register it (via this Google form) by February 1, 2018 (this is mandatory).
Background and Significance: We use language to communicate not only the emotion or sentiment we are feeling but also the intensity of the emotion or sentiment. For example, our utterances can convey that we are very angry, slightly sad, absolutely elated, etc. Here, intensity refers to the degree or amount of an emotion or the degree of sentiment. We use the term affect to refer to emotion-related categories such as anger, fear, sentiment, and arousal. Existing affect datasets are mainly annotated categorically, without an indication of intensity. Further, past shared tasks have almost always been framed as classification tasks (identify one among n affect categories for this sentence). In contrast, it is often useful for applications to know the degree to which affect is expressed in text.
Tasks: We present an array of tasks where systems have to automatically determine the intensity of emotions (E) and the intensity of sentiment, a.k.a. valence (V), of the tweeters from their tweets. (The term tweeter refers to the person who has posted the tweet.) We also include a multi-label emotion classification task for tweets. For each task, we provide separate training and test datasets for English, Arabic, and Spanish tweets. The individual tasks are described below:
Here, E refers to emotion, EI refers to emotion intensity, V refers to valence or sentiment intensity, reg refers to regression, oc refers to ordinal classification, and c refers to classification.
Together, these tasks encompass various emotion and sentiment analysis tasks. You are free to participate in any number of tasks and on any of the datasets. Further details on each of the tasks are provided below.
1. Task EI-reg: Detecting Emotion Intensity (regression)
Given:
a tweet
an emotion E (anger, fear, joy, or sadness)
Task: determine the intensity of E that best represents the mental state of the tweeter—a real-valued score between 0 and 1:
a score of 1: highest amount of E can be inferred
a score of 0: lowest amount of E can be inferred
For each language: 4 training sets and 4 test sets: one for each emotion E.
(Note that the absolute scores have no inherent meaning -- they are used only as a means to convey that the instances with higher scores correspond to a greater degree of E than instances with lower scores.)
2. Task EI-oc: Detecting Emotion Intensity (ordinal classification)
Given:
a tweet
an emotion E (anger, fear, joy, or sadness)
Task: classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter:
0: no E can be inferred
1: low amount of E can be inferred
2: moderate amount of E can be inferred
3: high amount of E can be inferred
For each language: 4 training sets and 4 test sets: one for each emotion E.
3. Task V-reg: Detecting Valence or Sentiment Intensity (regression)
Given:
a tweet
Task: determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter—a real-valued score between 0 and 1:
a score of 1: most positive mental state can be inferred
a score of 0: most negative mental state can be inferred
For each language: 1 training set, 1 test set.
(Note that the absolute scores have no inherent meaning -- they are used only as a means to convey that the instances with higher scores correspond to a greater degree of positive sentiment than instances with lower scores.)
4. Task V-oc: Detecting Valence (ordinal classification) -- This is the traditional Sentiment Analysis Task
Given:
a tweet
Task: classify the tweet into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter:
3: very positive mental state can be inferred
2: moderately positive mental state can be inferred
1: slightly positive mental state can be inferred
0: neutral or mixed mental state can be inferred
-1: slightly negative mental state can be inferred
-2: moderately negative mental state can be inferred
-3: very negative mental state can be inferred
For each language: 1 training set, 1 test set.
5. Task E-c: Detecting Emotions (multi-label classification) -- This is a traditional Emotion Classification Task
Given:
a tweet
Task: classify the tweet as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust.
For each language: 1 training set, 1 test set.
(Note that the set of emotions includes the eight basic emotions as per Plutchik (1980), as well as a few other emotions that are common in tweets (love, optimism, and pessimism).)
Paper: Participants will be given the opportunity to write a system-description paper that describes their system, resources used, results, and analysis. This paper will be part of the official SemEval-2018 proceedings. The paper may be up to four pages long, plus at most two pages for references. The papers are to follow the format and style files provided by ACL/NAACL/EMNLP-2018.
Related Past Shared Tasks on Affect Intensity
WASSA-2017 Shared Task on Emotion Intensity (EmoInt)
Affect in Tweets is an expanded version of this WASSA-2017 shared task.
The CodaLab website for the 2017 task is still open. You can train on the official 2017 training data and test on the official 2017 test set and compare against the best 2017 systems on the Leaderboard.
SemEval-2016 Shared Task on Determining Sentiment Intensity of English and Arabic Phrases
SemEval-2017, SemEval-2016, SemEval-2015, SemEval-2014, SemEval-2013 Shared Tasks on Sentiment Analysis in Twitter
TASS-2017, TASS-2016, TASS-2015, TASS-2014, TASS-2013, TASS-2012 Shared Tasks on Sentiment Analysis in Twitter in Spanish
The full official evaluation script that covers all subtasks is available here. You should run the script on your system's predictions for purposes such as cross-validation experiments, tracking progress on the development set, and checking the format of your submission.
For the Tasks EI-reg, EI-oc, V-reg, and V-oc
Official Competition Metric: For each task, language, and affect category, systems are evaluated by calculating the Pearson correlation coefficient between their predictions and the gold ratings/labels.
The correlation scores across all four emotions will be averaged (macro-average) to determine the bottom-line competition metric for EI-reg and EI-oc, by which the submissions will be ranked for those tasks.
The correlation score for valence will be used as the bottom-line competition metric for V-reg and V-oc, by which the submissions will be ranked for those tasks.
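For reference, below is a minimal sketch (in Python, using scipy) of how this metric can be computed for the regression tasks. The gold-file names in the example are hypothetical, and the files are assumed to be tab-separated with a header row and the intensity score in the last column, as in the submission format described further below.

    # Minimal sketch of the official metric for the regression tasks (EI-reg, V-reg):
    # Pearson correlation per affect category, macro-averaged over the four emotions
    # for EI-reg. File names below are hypothetical placeholders.
    from scipy.stats import pearsonr

    def read_scores(path):
        """Return the values in the last (tab-separated) column, skipping the header row."""
        with open(path, encoding='utf-8') as f:
            next(f)                                      # skip the header row
            return [float(line.rstrip('\n').split('\t')[-1]) for line in f]

    def pearson(gold_path, pred_path):
        return pearsonr(read_scores(gold_path), read_scores(pred_path))[0]

    emotions = ['anger', 'fear', 'joy', 'sadness']
    scores = [pearson('EI-reg_en_%s_gold.txt' % e, 'EI-reg_en_%s_pred.txt' % e)
              for e in emotions]
    print('macro-averaged Pearson r: %.4f' % (sum(scores) / len(scores)))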
Secondary Evaluation Metrics: Apart from the official competition metric described above, some additional metrics will also be calculated for your submissions. These are intended to provide a different perspective on the results.
The secondary metric used for the regression tasks:
The secondary metrics used for the ordinal classification tasks:
For the Task E-c
Official Competition Metric: For each language, systems are evaluated by calculating multi-label accuracy (the Jaccard index). Since this is a multi-label classification task, each tweet can have one or more gold emotion labels and one or more predicted emotion labels. Multi-label accuracy is defined as the size of the intersection of the predicted and gold label sets divided by the size of their union. This measure is calculated for each tweet t and then averaged over all tweets in the dataset T:
Accuracy = (1/|T|) Σ_{t ∈ T} |G_t ∩ P_t| / |G_t ∪ P_t|
where G_t is the set of gold labels for tweet t, P_t is the set of predicted labels for tweet t, and T is the set of tweets.
Secondary Evaluation Metrics: Apart from the official competition metric (multi-label accuracy), we will also calculate micro-averaged F-score and macro-averaged F-score for your submissions. These additional metrics are intended to provide a different perspective on the results.
Micro-averaged F-score is computed by pooling the true positives, false positives, and false negatives over all tweets and all emotions in E, the given set of eleven emotions, and then computing the F-score from these pooled counts.
Macro-averaged F-score is computed by first calculating an F-score separately for each emotion in E and then averaging the per-emotion scores.
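A minimal Python sketch of these E-c metrics is given below. Multi-label accuracy follows the definition above; for the micro- and macro-averaged F-scores, whose full formulas are not reproduced on this page, the sketch assumes the standard definitions (pooled counts for micro, per-emotion averaging for macro).

    # Multi-label (Jaccard) accuracy, micro-F, and macro-F for the E-c task.
    # gold and pred are lists of label sets, one set of emotion names per tweet.
    def jaccard_accuracy(gold, pred):
        total = 0.0
        for g, p in zip(gold, pred):
            if not g and not p:
                total += 1.0              # both 'neutral or no emotion'; counting this as 1 is a convention
            else:
                total += len(g & p) / len(g | p)
        return total / len(gold)

    def f_score(tp, fp, fn):
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def micro_macro_f(gold, pred, emotions):
        counts = {e: [0, 0, 0] for e in emotions}        # [tp, fp, fn] per emotion
        for g, p in zip(gold, pred):
            for e in emotions:
                if e in p and e in g:
                    counts[e][0] += 1                    # true positive
                elif e in p:
                    counts[e][1] += 1                    # false positive
                elif e in g:
                    counts[e][2] += 1                    # false negative
        micro = f_score(*(sum(c[i] for c in counts.values()) for i in range(3)))
        macro = sum(f_score(*counts[e]) for e in emotions) / len(emotions)
        return micro, macro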
By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, send us an email and we will consider whether an exception can be made.
A participant can be involved in exactly one team (no more). If there are reasons why it makes sense for you to be on more than one team, then email us before the evaluation period begins. In special circumstances this may be allowed.
Each team must create and use exactly one CodaLab account.
Team constitution (members of a team) cannot be changed after the evaluation period has begun.
During the evaluation period:
Each team can make as many as fifty submissions. However, only the final submission will be considered as the official submission to the competition.
You will not be able to see results of your submission on the test set.
You will be able to see any warnings and errors for each of your submissions.
The leaderboard is disabled.
Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.
We will make the final submissions of the teams public at some point after the evaluation period.
The organizers and their affiliated institutions make no warranties regarding the datasets provided, including, but not limited to, their correctness or completeness. They cannot be held liable for providing access to the datasets or for the use of the datasets.
The datasets should only be used for scientific or research purposes. Any other use is explicitly prohibited.
The datasets must not be redistributed or shared in part or full with any third party. Redirect interested parties to this website.
Organizers of the shared task:
National Research Council Canada
The University of Waikato
Mohammad Salameh
Carnegie Mellon University, Qatar
Svetlana Kiritchenko
svetlana.kiritchenko@nrc-cnrc.gc.ca
National Research Council Canada
Post emails about the task on the task mailing list: EmotionIntensity@googlegroups.com
If you need to send an email to only the task organizers, send it to: aff-int-organizers@googlegroups.com
(See 'Terms and Conditions' page for terms of use.)
POST-COMPETITION: The official competition is now over, but you are welcome to develop and test new solutions on this website. All data with gold labels (training, development, and test) are available here. The test data in this archive do not include the instances from the Equity Evaluation Corpus (EEC) used for bias evaluation. The EEC corpus is available here.
If you use any of the data below, please cite this paper:
SemEval-2018 Task 1: Affect in Tweets. Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.
The Spanish and Arabic data creation followed the same approach with some implementation differences (as stated in the SemEval-2018 Task 1 paper above).
The Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders, is available here. The EEC was the mystery test set added to the tweet test sets for the English EI-reg and V-reg tasks. Below is the *SEM paper describing the EEC dataset and the bias evaluation.
Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM), New Orleans, LA, USA, June 2018.
@InProceedings{SA-Biases2018,
author = {Kiritchenko, Svetlana and Mohammad, Saif M.},
title = {Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems},
booktitle = {Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM)},
address = {New Orleans, LA, USA},
year = {2018}}
This corpus of tweets was collected by polling the Twitter API for tweets that included emotion-related words such as '#angry', 'annoyed', 'panic', 'happy', 'elated', 'surprised', etc. The full list of query terms will be made available February 2018 (after the evaluation period). You are free to use this corpus to make submissions for any of the five tasks.
The evaluation phase has concluded. The gold labels for the test data have been made available. The post-evaluation phase will stay open, and you can continue to upload submissions to it if you wish. However, make sure not to train on the gold labels for the test data in any way.
In the data files below, E refers to emotion, EI refers to emotion intensity, V refers to valence or sentiment intensity, reg refers to regression, oc refers to ordinal classification, and c refers to classification. All test sets were released on January 5, 2018.
EI-reg:
English (Note: This particular training set was created from a BWS annotation effort in 2016. The development and test sets were created from a common 2017 annotation effort. Thus, the scores for tweets across the training and development sets or across the training and test sets are not directly comparable. However, the scores in each dataset indicate relative positions of the tweets in that dataset.)
Training Set (taken from EmoInt 2017, re-released Aug, 2017; last updated Nov. 23, 2017)
Development Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Arabic
Training Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Development Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Spanish
Training Set (released Oct. 12, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 12, 2017; last updated Nov. 23, 2017)
EI-oc:
English
Training Set (released Oct. 17, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 17, 2017; last updated Nov. 23, 2017)
Arabic
Training Set (released Oct. 20, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 20, 2017; last updated Nov. 23, 2017)
Spanish
Training Set (released Oct. 19, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 19, 2017; last updated Nov. 23, 2017)
V-reg:
English
Training Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Development Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Arabic
Training Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Development Set (released Sep. 25, 2017; last updated Nov. 23, 2017)
Spanish
Training Set (released Oct. 24, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 24, 2017; last updated Nov. 23, 2017)
V-oc:
English
Training Set (released Oct. 17, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 17, 2017; last updated Nov. 23, 2017)
Arabic
Training Set (released Oct. 20, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 20, 2017; last updated Nov. 23, 2017)
Spanish
Training Set (released Oct. 26, 2017; last updated Nov. 23, 2017)
Development Set (released Oct. 26, 2017; last updated Nov. 23, 2017)
Note: The November 23, 2017 update to the data includes these changes:
Note: The Arabic E-c train and dev data were updated on November 28, 2017. The update removes three duplicate tweets from the training set and one duplicate from the development set.
Even though the changes are small, it is crucial that you delete old copies of the data and download the data again with these updates.
Note: The datasets above share a large number of common tweets; however, they were often created from independent annotations by different annotators. Further, decisions on where to mark thresholds in the different datasets were made independently as well. For example, in E-c we chose a somewhat generous criterion: if at least two out of seven people indicate that a certain emotion can be inferred, then that emotion is chosen as one of the labels for the tweet (likely along with another emotion with 3, 4, or 5 votes). Thus, a small number of inconsistencies in the annotations across different datasets is expected. For example, a tweet may be marked as 'no anger' in EI-oc but may have 'anger' as one of its labels in E-c. Of course, such instances are greatly outnumbered by consistent annotations across the datasets.
Query Terms: This distribution includes lists of query terms used to poll Twitter to obtain tweets. The training, development, and test sets for SemEval-2018 Task 1 were created by sampling from these tweets. The distant supervision corpora released as part of the competition were also created by sampling from the remaining tweets. We include query terms used for all three languages (English, Arabic, and Spanish) in the corresponding folders. The English folder includes two subfolders: (1) EmoInt-2017: for the query terms used to collect the WASSA 2017 shared task (EmoInt) tweets (which in turn formed the training data for SemEval-2018 Task 1) and (2) SemEval2018-Task1: for the query terms used to collect the dev and test set tweets of SemEval-2018 Task 1.
Additional Mystery Test Set for Some Tasks:
Submission format:
A valid submission for CodaLab is a zip-compressed file containing the prediction files for all the subtasks you want to participate in. Note that even if you upload results from multiple submissions onto the leaderboard, only your latest submission is displayed on the leaderboard. During the evaluation period, each team can make as many as fifty submissions. However, only the final submission will be considered as the official submission to the competition. (Make sure to upload it to the leaderboard.) This means that your final submission must contain your entries for all the tasks you want to participate in.
Submitted files must have the same format as the training and test files, after replacing the NONE entries in the last column with your system's predictions. The filenames associated with each subtask and the corresponding line formats are given below; a minimal sketch of generating such a file is given after the examples.
EI-reg:
header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Score
data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_score
Note that the emotion name must be in English even for Spanish and Arabic data.
English
EI-reg_en_anger_pred.txt
EI-reg_en_fear_pred.txt
EI-reg_en_sadness_pred.txt
EI-reg_en_joy_pred.txt
Arabic
EI-reg_ar_anger_pred.txt
EI-reg_ar_fear_pred.txt
EI-reg_ar_sadness_pred.txt
EI-reg_ar_joy_pred.txt
Spanish
EI-reg_es_anger_pred.txt
EI-reg_es_fear_pred.txt
EI-reg_es_sadness_pred.txt
EI-reg_es_joy_pred.txt
EI-oc:
header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Class
data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_class
English
EI-oc_en_anger_pred.txt
EI-oc_en_fear_pred.txt
EI-oc_en_sadness_pred.txt
EI-oc_en_joy_pred.txt
Arabic
EI-oc_ar_anger_pred.txt
EI-oc_ar_fear_pred.txt
EI-oc_ar_sadness_pred.txt
EI-oc_ar_joy_pred.txt
Spanish
EI-oc_es_anger_pred.txt
EI-oc_es_fear_pred.txt
EI-oc_es_sadness_pred.txt
EI-oc_es_joy_pred.txt
V-reg:
header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Score
data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_score
English
V-reg_en_pred.txt
Arabic
V-reg_ar_pred.txt
Spanish
V-reg_es_pred.txt
V-oc:
header row: ID[tab]Tweet[tab]Affect Dimension[tab]Intensity Class
data row: $id[tab]$tweet[tab]$affect_dimension[tab]$intensity_class
English
V-oc_en_pred.txt
Arabic
V-oc_ar_pred.txt
Spanish
V-oc_es_pred.txt
E-c:
header row: ID[tab]Tweet[tab]anger[tab]anticipation[tab]disgust[tab]fear[tab]joy[tab]love[tab]optimism[tab]pessimism[tab]sadness[tab]surprise[tab]trust
data row: $id[tab]$tweet[tab]$anger_val[tab]$anticipation_val[tab]$disgust_val[tab]$fear_val[tab]$joy_val[tab]$love_val[tab]$optimism_val[tab]$pessimism_val[tab]$sadness_val[tab]$surprise_val[tab]$trust_val
(Note: Each emotion value (e.g., $love_val) takes binary values: 1 means emotion can be inferred, whereas 0 means emotion cannot be inferred. 0's for all of the 11 emotions means 'neutral or no emotion'.)
English
E-C_en_pred.txt
Arabic
E-C_ar_pred.txt
Spanish
E-C_es_pred.txt
Participants are not required to participate in all subtasks. A valid submission must provide at least all the files associated with one combination of subtask and language.
Example of a valid combination of files:
EI-reg_en_anger_pred.txt
EI-reg_en_fear_pred.txt
EI-reg_en_sadness_pred.txt
EI-reg_en_joy_pred.txt
A zip file containing only the above files corresponds to participating only in the EI-reg task, for English.
Example of an invalid combination of files (no subtask–language pair is covered by a complete set of prediction files):
EI-oc_en_sadness_pred.txt
EI-reg_en_joy_pred.txt
EI-reg_es_joy_pred.txt
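For illustration, here is a minimal Python sketch of producing one prediction file in the required format for a regression-style subtask. The test-file name and the predict_intensity function are hypothetical placeholders for the released file and your own model.

    # Read a released test file, replace the NONE placeholder in the last column with
    # the system's prediction, and write the result under the prescribed file name.
    def write_predictions(test_path, out_path, predict_intensity):
        with open(test_path, encoding='utf-8') as fin, \
             open(out_path, 'w', encoding='utf-8') as fout:
            fout.write(next(fin))                        # copy the header row as-is
            for line in fin:
                cols = line.rstrip('\n').split('\t')     # ID, Tweet, Affect Dimension, NONE
                cols[-1] = '%.3f' % predict_intensity(cols[1], cols[2])
                fout.write('\t'.join(cols) + '\n')

    # Hypothetical usage for English EI-reg anger (output name as specified above);
    # the constant prediction is only a placeholder for a real model.
    # write_predictions('EI-reg-En-anger-test.txt', 'EI-reg_en_anger_pred.txt',
    #                   lambda tweet, emotion: 0.5)

Remember to zip the resulting files for all subtask–language combinations you are entering into a single archive before uploading to CodaLab.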
Schedule:
Training data ready: September 25, 2017
Evaluation period starts: January 8, 2018
Evaluation period ends: January 28, 2018
Results posted: Feb 5, 2018
System description paper submission deadline: Mon 5 Mar, 2018 by 23:59 GMT -12:00.
Author notifications : Mon 02 Apr, 2018
Camera ready submissions due: Mon 16 Apr, 2018
Manual Annotation: Obtaining real-valued annotations has several challenges. Respondents are faced with a higher cognitive load when asked for real-valued scores as opposed to simply classifying terms into pre-chosen discrete classes. In addition, it is difficult for an annotator to remain consistent across annotations, and the same score may correspond to different degrees of sentiment in the minds of different annotators. One could overcome these problems by providing annotators with pairs of terms and asking which is stronger in terms of association with the property of interest (a comparative approach); however, that requires a much larger set of annotations (of the order of N^2, where N is the number of instances to be annotated).
Best–Worst Scaling (BWS), also sometimes referred to as Maximum Difference Scaling (MaxDiff), is an annotation scheme that exploits this comparative approach to annotation (Louviere and Woodworth, 1990; Cohen, 2003; Louviere et al., 2015). Annotators are given four items (a 4-tuple) and asked which item is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest). These annotations can then be easily converted into real-valued scores of association between the items and the property, which eventually allows for creating a ranked list of items as per their association with the property of interest.
Kiritchenko and Mohammad (2016, 2017) show that the ranking of terms remains remarkably consistent even when the annotation process is repeated with a different set of annotators. See the hyperlinked webpages for details on Reliability of the Annotations and a comparison of BWS with Rating Scales.
We created all the datasets for this task using Best–Worst Scaling.
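As an illustration of how such annotations can be converted into scores, the Python sketch below implements the simple counting procedure commonly used with BWS: an item's score is the proportion of 4-tuples in which it was chosen as best minus the proportion in which it was chosen as worst. The rescaling to [0, 1] is an assumption, and the organizers' exact conversion script may differ in its details.

    from collections import Counter

    def bws_scores(annotations):
        """annotations: (items, best, worst) triples, where items is the 4-tuple
        shown to the annotator and best/worst are the items the annotator chose."""
        appeared, best, worst = Counter(), Counter(), Counter()
        for items, b, w in annotations:
            appeared.update(items)
            best[b] += 1
            worst[w] += 1
        # raw score in [-1, 1]: %best - %worst for each item
        raw = {it: (best[it] - worst[it]) / appeared[it] for it in appeared}
        # rescale to [0, 1] (an assumption; other rescalings are possible)
        return {it: (s + 1) / 2 for it, s in raw.items()}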
Papers:
Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best-Worst Scaling. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2016. San Diego, CA.
Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL-2017), Vancouver, Canada, 2017.
You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. You must report all resources used in the system-description paper.
Baseline System
In order to assist in the testing of ideas, we also provide the AffectiveTweets package, which you can use and build on. A common use of the package is to generate feature vectors from various resources and append them to one's own feature representation of the tweet. The use of this package is completely optional. It is available here. Instructions for using the package are available here.
The AffectiveTweets package was used by the teams that ranked first, second, and third in the WASSA-2017 Shared Task on Emotion Intensity.
Word-Emotion and Word-Sentiment Association lexicons
Large lists of manually created and automatically generated word-emotion and word-sentiment association lexicons are available here.
References:
Emotion Intensities in Tweets. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*SEM), August 2017, Vancouver, Canada.
WASSA-2017 Shared Task on Emotion Intensity. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.
Picard, R. W. (1997, 2000). Affective computing. MIT press.
Using Hashtags to Capture Fine Emotion Categories from Tweets. Saif M. Mohammad, Svetlana Kiritchenko, Computational Intelligence, Volume 31, Issue 2, Pages 301-326, May 2015.
Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Kiritchenko, S. and Mohammad, S. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL-2017), Vancouver, Canada, 2017.
Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6 (3), 169-200.
#Emotional Tweets, Saif Mohammad, In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), June 2012, Montreal, Canada.
Portable Features for Classifying Emotional Text, Saif Mohammad, In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2012, Montreal, Canada.
Strapparava, C., & Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SemEval-2007, pp. 70-74, Prague, Czech Republic.
From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales, Saif Mohammad, In Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), June 2011, Portland, OR.
Plutchik, R. (1980). A general psychoevolutionary theory of emotion. Emotion: Theory, research, and experience, 1(3), 3-33.
Stance and Sentiment in Tweets. Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media, In Press.
Determining Word-Emotion Associations from Tweets by Multi-Label Classification. Felipe Bravo-Marquez, Eibe Frank, Saif Mohammad, and Bernhard Pfahringer. In Proceedings of the 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI'16), Omaha, Nebraska, USA.
Challenges in Sentiment Analysis. Saif M. Mohammad, A Practical Guide to Sentiment Analysis, Springer, 2016.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. (1957). The measurement of meaning. University of Illinois Press.
Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best-Worst Scaling. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. June 2016. San Diego, CA.
Ortony, A., Clore, G. L., & Collins, A. (1988). The Cognitive Structure of Emotions. Cambridge University Press.
SemEval-2016 Task 7: Determining Sentiment Intensity of English and Arabic Phrases. Svetlana Kiritchenko, Saif M. Mohammad, and Mohammad Salameh. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2016). June 2016. San Diego, California.
Alm, C. O. (2008). Affect in text and speech. ProQuest.
Aman, S., & Szpakowicz, S. (2007). Identifying expressions of emotion in text. In Text, Speech and Dialogue, Vol. 4629 of Lecture Notes in Computer Science, pp. 196-205.
The Effect of Negators, Modals, and Degree Adverbs on Sentiment Composition. Svetlana Kiritchenko and Saif M. Mohammad. In Proceedings of the NAACL 2016 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), June 2016, San Diego, California.
Sentiment Analysis: Detecting Valence, Emotions, and Other Affectual States from Text. Saif M. Mohammad, Emotion Measurement, 2016.
NRC-Canada-2014: Detecting Aspects and Sentiment in Customer Reviews, Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif M. Mohammad. In Proceedings of the eighth international workshop on Semantic Evaluation Exercises (SemEval-2014), August 2014, Dublin, Ireland.
Barrett, L. F. (2006). Are emotions natural kinds?. Perspectives on psychological science, 1(1), 28-58.
FAQ
Q. Why restrict the number of official submissions to one?
A. Since this is a competition, we do not want teams to make a large number of submissions with different parameters and systems without being confident about which will work best.
Even though the number of official submissions is restricted to one, the gold data will be released soon after the evaluation period. Thus, you can use it to determine results for many different system variants. You are strongly encouraged to report these additional results in the system-description paper, in addition to the official submission results.
The number of submissions allowed per team on CodaLab in the test phase is restricted to 50. However, only your final valid submission will be your official submission to the competition.
Q. How do I include more than one score on the leaderboard?
A. CodaLab allows only one score on the leaderboard per user.
Create an account in CodaLab (https://competitions.codalab.org/). Sign in.
Edit your profile appropriately. Make sure to add a team name, and enter names of team members. (Go to "Settings", and look under "Competition settings".)
Read information on all the pages of the task website.
Download data: training, development, and test (when released)
Run your system on the data and generate a submission file.
Make submissions on the development set (Phase 1).
Wait a few moments for the submission to execute.
Click on the ‘Refresh Status’ button to check status.
Check to make sure submission is successful:
System will show status as “Finished”
Click on ‘Download evaluation output from scoring step’ to examine the result.
If you choose to, you can upload the result on the leaderboard.
If unsuccessful, check error log, fix format issues (if any), resubmit updated zip.
Number of submissions allowed is restricted to 50.
Once the evaluation period begins, you can make submissions for the test set (Phase 2). The procedure is similar to that on the dev set. These differences apply:
The leaderboard will be disabled until the end of the evaluation period.
You cannot see the results of your submission; they will be posted at a later date, after the evaluation period ends.
You can still see if your submission was successful or resulted in some error.
In case of error, you can view the error log.
Number of submissions allowed per team is restricted to 50. However, only your final valid submission will be your official submission to the competition.
Participants who made a submission on the CodaLab website during the official evaluation period are given the opportunity to submit a system-description paper that describes their system, resources used, results, and analysis. This paper will be part of the official SemEval-2018 proceedings.
If describing only one task, the paper may be up to 4 pages + references. If you took part in two or more tasks, you can go up to 6 pages + references. If you took part in four or five tasks and would really like 8 pages + references, send us an email.
See details provided by SemEval here.
Link to submit your system description paper is also provided there.
Papers are due Mon 26 February, 2018, by 23:59 GMT -12:00.
You do not have to repeat details of the task and data. Just cite the task paper (details below), quickly summarize the tasks you made submissions to, and then get into the details of the related work, your submissions, experiments, and results.
Cite the task paper as shown below:
Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.
A copy of this paper will be made available in early March. This is after the deadline for your paper submission, but you will be able to see the task paper well before the camera-ready deadline. So, once you have access to the task paper, you can still update your paper as you see fit.
The paper below describes how the English data was created:
Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories. Saif M. Mohammad and Svetlana Kiritchenko. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.
Spanish and Arabic data creation followed a largely similar approach, but specific details will be provided in the AIT task paper above.
See the EmoInt-2017 website for the previous iteration of this task, and the corresponding EmoInt-2017 shared task paper (BibTex). System papers from that task can be found in the proceedings (look for papers with EmoInt in the title).
Important Notes:
You are not obligated to submit a system-description paper; however, we strongly encourage all participating teams to do so.
SemEval seeks to have all participants publish a paper, unless the paper does a poor job of describing the system. Your system's rank and scores will not affect whether the paper is accepted.
Note that SemEval submission is not anonymous; author names should be included.
Later, each task participant will be assigned other teams' system-description papers for review, using the START system.
All task participant teams should prepare a poster for display at SemEval. One selected team will be asked to prepare a short talk. Details will be provided at a later date.
Please do not dwell too much on rankings. Focus instead on analysis and the research questions that your system can help address.
References: References do not count against the page limit (6 pages for system description papers). You may have at most *four pages* for references.
It may also be helpful to look at some of the papers from past SemEval competitions, e.g., from https://aclweb.org/anthology/S/S16/.
What to include in a system-description paper?
Here are some key pointers:
Replicability: Present all details that will allow someone else to replicate your system.
Analysis: Focus more on results and analysis and less on discussing rankings. Report results on several variants of the system (even beyond the official submission); present sensitivity analysis of your system's parameters, network architecture, etc.; present ablation experiments showing usefulness of different features and techniques; show comparisons with baselines. You can use the gold labels that we will release later next week for the extra analysis. However, clearly mark what the official submission results were and what the ranks were.
Related work: Place your work in context of previously published related work. Cite all data and resources used in your submission.
FAQ
Q. My system did not get a good rank. Should I still write a system-description paper?
A. We encourage all participants to submit a system-description paper. The goal is to record all the approaches that were used and how effective they were. Do not dwell too much on rankings. Focus instead on analysis and the research questions that your system can help address. What has not worked is also useful information.