WASSA-2017 Shared Task on Emotion Intensity (EmoInt)
Part of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017), which is to be held in conjunction with EMNLP-2017. Note: this webpage is a mirror of the official task webpage.
Join the mailing group: EmotionIntensity@googlegroups.com
ACTION ITEM: If you have downloaded the anger training data prior to this notice, then please download the (updated) anger training set again. The scores for most instances still remain the same, although, some instances now have a slightly different score.
Background and Significance: Existing emotion datasets are mainly annotated categorically without an indication of degree of emotion. Further, the tasks are almost always framed as classification tasks (identify 1 among n emotions for this sentence). In contrast, it is often useful for applications to know the degree to which an emotion is expressed in text. This is the first task where systems have to automatically determine the intensity of emotions in tweets.
Task: Given a tweet and an emotion X, determine the intensity or degree of emotion X felt by the speaker -- a real-valued score between 0 and 1. The maximum possible score 1 stands for feeling the maximum amount of emotion X (or having a mental state maximally inclined towards feeling emotion X). The minimum possible score 0 stands for feeling the least amount of emotion X (or having a mental state maximally away from feeling emotion X). The tweet along with the emotion X will be referred to as an instance. Note that the absolute scores have no inherent meaning -- they are used only as a means to convey that the instances with higher scores correspond to a greater degree of emotion X than instances with lower scores.
Paper: Participants will be given the opportunity to write a system-description paper that describes their system, resources used, results, and analysis. This paper will be part of the official WASSA-2017 proceedings. The paper is to be four pages long plus two pages at most for references. The papers are to follow the format and style files provided by EMNLP-2017.
I am interested. How do I get going?
- Read Competition details on this page.
- Join the mailing group: EmotionIntensity@googlegroups.com
- Download data.
- Directions on participating and making submissions on dev and test sets via CodaLab are here.
This task can be cited as shown below:
WASSA-2017 Shared Task on Emotion Intensity. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.
Evaluation: For each emotion, systems are evaluated by calculating the Pearson Correlation Coefficient with Gold ratings. The correlation scores across all four emotions will be averaged to determine the bottom-line competition metric by which the submissions will be ranked.
Additional metrics: In addition to the bottom-line competition metric described above, the following additional metrics will be provided:
Note that both these additional metrics will be calculated from the same submission zip described above. (Participants need not provide anything extra for these additional metrics.)
The official evaluation script (which also acts as a format checker) is available here. You may want to run it on the training set to determine your progress, and eventually on the test set to check the format of your submission.
Data: Training and test datasets are provided for four emotions: joy, sadness, fear, and anger. For example, the anger training dataset has tweets along with a real-valued score between 0 and 1 indicating the degree of anger felt by the speaker. The test data includes only the tweet text. Gold emotion intensity scores will be released after the evaluation period. Further details of this data are available in this paper:
Without intensity labels:
Note: For your competition submission for the test set, you are allowed to train on the combined Training and Development sets.
This is a *small* set of data that can be used to tune one’s system, but is provided mainly so that one can test submitting output on CodaLab. Please make sure you try submitting your system output on the development set through the CodaLab website, and address any issues that may come up as a result of that, well before evaluation period. Test data will have a format identical to the development set, but it will be much larger in size.
Note: Since the dev set is small in size, results on the data may not be indicative of performance on the test set.
Without intensity labels:
Here are some key points about test-phase submissions:
Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.
Manual Annotation: Manual annotation of the dataset to obtain real-valued scores was done through Best-Worst Scaling (BWS), an annotation scheme shown to obtain very reliable scores (Kiritchenko and Mohammad, 2016). The data is then split into a training set and a test set. The test set released at the start of the evaluation period will not include the real-valued sentiment scores. These scores for the test data, which we will refer to as the Gold data, will be released after evaluation, when the results are posted.
The emotion intensity scores for both training and test data are obtained by crowdsourcing. Standard crowdsourcing best practices were followed such as pre-annotating 5% to 10% of questions internally (by one of the task organizers). These pre-annnotations were used to randomly check quality of crowdsourced responses and inform annotators of errors as and when they make them. (This has been shown to significantly improve annotation quality).
Best-Worst Scaling Questionnaires and Directions to Annotators
Obtaining real-valued sentiment annotations has several challenges. Respondents are faced with a higher cognitive load when asked for real-valued sentiment scores for terms as opposed to simply classifying terms as either positive or negative. It is also difficult for an annotator to remain consistent with his/her annotations. Further, the same sentiment association may map to different sentiment scores in the minds of different annotators; for example, one annotator may assign a score of 0.6 and another 0.8 for the same degree of positive association. One could overcome these problems by providing annotators with pairs of terms and asking which is more positive (a comparative approach), however that requires a much larger set of annotations (order N2, where N is the number of terms to be annotated).
Best-Worst Scaling (BWS), also sometimes referred to as Maximum Difference Scaling (MaxDiff), is an annotation scheme that exploits the comparative approach to annotation (Louviere and Woodworth, 1990; Cohen, 2003; Louviere et al., 2015; Kiritchenko and Mohammad, 2016) while still keeping the number of required annotations small. Annotators are given four items (4-tuple) and asked which item is the Best (highest in terms of the property of interest) and which is the Worst (least in terms of the property of interest). These annotations can then be easily converted into real-valued scores of association between the items and the property, which eventually allows for creating a ranked list of items as per their association with the property of interest.
The questionnaires used to annotate the data are available here:
You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. In order to assist testing of ideas, we also provide a baseline emotion intensity system that you can build on. The use of this system is completely optional. The system is available here. Instructions for using the system with the task data are available here.
Large lists of manually created and automatically generated word-emotion and word-sentiment association lexicons are available here.
System submissions must to have the same format as used in the training and test sets. Each line in the file should include:
Simply replace the NONEs in the last column of the test files with your system's predictions.
A valid submission file for CodaLab is a zip compressed file containing the following files:
Organizers of the shared task:
Saif M. Mohammad
National Research Council Canada
The University of Waikato
European Commission, Brussels
Start: Feb. 20, 2017, midnight
Start: May 2, 2017, midnight
You must be logged in to participate in competitions.Sign In