This is the CodaLab website for SemEval-2018 Task 3: Irony detection in English tweets. The task is part of the 12th workshop on semantic evaluation.
You can join the mailing group of the task.
IMPORTANT: registration for the task closes on 05/01/2018!
The development of the social web has stimulated creative and figurative language use like irony. This frequent use of irony on social media has important implications for natural language processing tasks, which struggle to maintain high performance when applied to ironic text (Liu, 2012; Maynard and Greenwood, 2014; Ghosh and Veale, 2016). Although different definitions of irony co-exist, it is often identified as a trope or figurative language use whose actual meaning differs from what is literally enunciated. As such, modeling irony has a large potential for applications in various research areas, including text mining, author profiling, detecting online harassment, and perhaps one of the most popular applications at present, sentiment analysis.
As described by Joshi et al. (2016), recent approaches to irony can roughly be classified into rule-based and machine learning-based methods. While rule-based approaches mostly rely upon lexical information and require no training, machine learning invariably makes use of training data and exploits different types of information sources, including bags of words, syntactic patterns, sentiment information or semantic relatedness. Recently, deep learning techniques have gained increasing popularity for this task, as they make it possible to integrate semantic relatedness by using, for instance, word embeddings.
To facilitate data collection and annotation, many supervised-learning approaches rely on hashtag-labeled (e.g. #sarcasm) Twitter data, although this has been shown to increase data noise (e.g. Kunneman et al., 2015; Van Hee et al., 2016a, 2016b). For the current task, we collected a dataset for automatic irony detection using the hashtags #irony, #sarcasm and #not and manually annotated the corpus following a fine-grained annotation scheme.
Participants of the task will be given the opportunity to write a paper that describes their system, resources used, results, and analysis, and that will be part of the official SemEval-2018 proceedings. The paper is to be four pages long, plus at most two pages for references, and should follow the provided format and style files.
We propose two different subtasks for the automatic detection of irony on Twitter. For the first subtask, participants should determine whether a tweet is ironic or not (by assigning a binary value 0 or 1). For the second subtask, participants are tasked with distinguishing between non-ironic and ironic tweets, the latter of which are subdivided into three categories. More details of both subtasks are described below.
The first subtask is a two-class (or binary) classification task where the system has to predict whether a tweet is ironic or not. The following sentences present examples of an ironic and non-ironic tweet, respectively.
The second subtask is a multiclass classification task where the system has to predict one out of four labels describing i) verbal irony realized through a polarity contrast, ii) verbal irony without such a polarity contrast (i.e., other verbal irony), iii) descriptions of situational irony, iv) non-irony. A brief description and example for each label are presented below.
Instances containing an evaluative expression whose polarity (positive, negative) is inverted between the literal and the intended evaluation.
In the above examples, the irony results from a polarity inversion between two evaluations. In sentence 4 for instance, the literal evaluation ("I really love this year's summer") is positive, while the intended one, which is implied by the context ("weeks and weeks of awful weather"), is negative.
Instances which show no polarity contrast between the literal and the intended evaluation, but are nevertheless ironic.
Instances describing situational irony, or situations that fail to meet some expectations. As explained by Shelley (2001), firefighters who have a fire in their kitchen while they are out to answer a fire alarm would be a typically ironic situation.
Instances that are clearly not ironic, or which lack context to be sure that they are ironic. Examples:
Systems are evaluated using standard evaluation metrics, including accuracy, precision, recall and F1-score. The submissions will be ranked by F1-score.
Note that while accuracy provides insights into the system performance for all classes, the latter three measures will be calculated for the positive class only (subtask A) or will be reported per class label or macro-averaged (subtask B). Macro-averaging of the F1-score implies that all class labels have equal weight in the final score.
The metrics will be calculated as follows:
The official evaluation script is available at GitHub.
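As a rough illustration of the metrics described above, the sketch below computes accuracy, per-class precision, recall and F1, and a macro-averaged F1 using their standard definitions. It is not the official evaluation script; the function names and the integer labels in the usage note are assumptions made for this example only.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold, pred, label):
    """Precision, recall and F1-score for one class label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    # Fall back to 0.0 when a denominator is empty (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1-scores: every label counts equally."""
    per_class = [precision_recall_f1(gold, pred, label)[2] for label in labels]
    return sum(per_class) / len(per_class)
```

For subtask A, one would call precision_recall_f1(gold, pred, 1) to score the positive class only; for subtask B, macro_f1(gold, pred, [0, 1, 2, 3]) averages over four classes, assuming they are encoded as the integers 0 to 3.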
By submitting results to this competition, you consent to the public release of your scores at the SemEval-2018 workshop and in the associated proceedings, at the task organisers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organisers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organisers.
You further agree that the task organisers are under no obligation to release scores and that scores may be withheld if it is the task organisers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organisers.
You agree not to redistribute the test data except in the manner prescribed by its licence.
Cynthia Van Hee is a post-doctoral researcher at the LT3 Language and Translation Technology Team at Ghent University, active in the field of computational linguistics and machine learning. In the framework of her PhD, she created a theoretical framework of irony and developed a state-of-the-art irony detection system. Her other research interests include sentiment and emotion analysis in social media text and the detection of cyberbullying, which was one of the use cases of the AMiCA (Automatic Monitoring for Cyberspace Applications) project in which she was actively involved. Related to the first topic, she participated in the SemEval-2014 and SemEval-2015 tasks "Sentiment Analysis on Twitter" and "Sentiment Analysis of Figurative Language in Twitter", respectively.
Els Lefever is assistant professor of Terminology and Computational Semantics. She holds a PhD in computer science from Ghent University on "ParaSense: Parallel Corpora for Word Sense Disambiguation" (2012). She has strong expertise in machine learning of natural language and multilingual natural language processing, with a special interest in computational semantics, cross-lingual word sense disambiguation, and automatic terminology extraction and taxonomy learning. Els organised two runs of the SemEval "Crosslingual Word Sense Disambiguation" task and co-organised the SemEval-2014 task on "L2 Writing Assistant" and the SemEval-2016 task on "Taxonomy Extraction Evaluation (TexEval)". She also participated in 5 other SemEval tasks (2007: Web People Search, 2014: Sentiment Analysis in Twitter, 2015: Sentiment Analysis of Figurative Language in Twitter, Aspect Based Sentiment Analysis, and Taxonomy Extraction Evaluation).
Véronique Hoste is full professor and head of LT3. She holds a PhD in linguistics on optimisation issues in machine learning of coreference resolution (2005). She has strong expertise in machine learning of natural language, and more specifically in coreference resolution, word sense disambiguation, terminology extraction, text classification, classifier optimisation, readability prediction and sentiment mining. Véronique participated in the Senseval-2 and Senseval-3 competitions on all-words word sense disambiguation in the early 2000s. In their successor, SemEval, she co-organised tasks on coreference resolution (2010), cross-lingual word sense disambiguation (2010 and 2013), L2 writing (2014) and multilingual aspect-based sentiment analysis (2016), on which she also published several articles.
The task organisers are members of the LT3 (Language and Translation Technology) team at the Faculty of Arts and Philosophy at Ghent University.
Training and test datasets are provided for two subtasks A and B on irony detection in tweets. The training dataset for task A has tweets with a binary value score (0 or 1) indicating whether the tweet is ironic. The training data for subtask B includes tweets with a numeric value corresponding to one of the subcategories (i.e. ironic by clash, other irony, situational irony, non-ironic). The test data includes only the tweet text. Gold irony labels will be released after the evaluation period. Download links and further details on the construction of the dataset are provided below.
We constructed a data set of 3,000 English tweets by searching Twitter for the hashtags #irony, #sarcasm and #not. All tweets were collected between 01/12/2014 and 04/01/2015 and represent 2,676 unique users. To minimize the noise introduced by groundless irony-related hashtags, all tweets were manually labeled using a fine-grained annotation scheme for irony (Van Hee et al., 2016c). Prior to data annotation, the entire corpus was cleaned by removing retweets, duplicates and non-English tweets, and by replacing XML-escaped characters (e.g. &amp;).
The entire corpus was annotated by three students in linguistics, all second-language speakers of English, who each annotated one third of the corpus. Brat (Stenetorp et al., 2012) was used as the annotation tool. To assess the reliability of the annotations, an inter-annotator agreement study was carried out on a subset of the corpus (100 instances), resulting in a Fleiss' kappa score of 0.72 for recognising (the different forms of) ironic instances. Annotation of the corpus resulted in the following distribution of the different classes:
Based on the annotations, 2,396 instances are ironic (1,728 + 267 + 401) while 604 are not. To balance class representation in the corpus, 1,792 non-ironic tweets were added from a background corpus. The tweets were manually checked to ascertain that they are non-ironic and are devoid of irony-related hashtags. This brings the total amount of data to 4,792 tweets (2,396 ironic + 2,396 non-ironic). For the SemEval-2018 competition, this corpus will be randomly split into a training (80% or 3,833 instances) and test (20%, or 958 instances) set. In addition, we will provide a second manually annotated test set containing approximately 1,500 tweets.
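The class-balancing arithmetic in this paragraph can be checked with a few lines of Python. This is only a worked check of the figures quoted above; the variable names are made up for the example, and the three ironic counts are left unnamed because the paragraph does not say which subcategory each one belongs to.

```python
# Figures quoted in the paragraph above.
ironic_subcategory_counts = [1728, 267, 401]
annotated_non_ironic = 604
annotated_total = 3000

# Ironic instances across the three subcategories.
ironic_total = sum(ironic_subcategory_counts)

# The annotated corpus is fully accounted for by ironic + non-ironic tweets.
assert ironic_total + annotated_non_ironic == annotated_total

# Non-ironic tweets needed from the background corpus to balance the classes.
background_needed = ironic_total - annotated_non_ironic

# Size of the balanced corpus: equal numbers of ironic and non-ironic tweets.
balanced_total = 2 * ironic_total

print(ironic_total, background_needed, balanced_total)
```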
System submissions for CodaLab are zip-compressed folders containing a predictions file called predictions-taskA.txt or predictions-taskB.txt, for subtask A and B, respectively. Remember that your submission should only concern one task at a time. There will be separate evaluation periods for the two subtasks.
The evaluation script will check whether the files contain the relevant labels for each task. The script, as well as a sample predictions file, can be found on GitHub.
IMPORTANT: many competitions are run on CodaLab, and only a limited number of submissions can be handled at any given time. As a result, your submission may get 'stuck' (i.e. its status remains 'submitted' even after refreshing) for a while. In this case, patiently try again until your submission does get processed. To avoid stress and a missed deadline, please do not wait until the very last moment to submit your final system :-).
During the practice phase, teams can upload an example submission for Subtask A (i.e. binary irony classification). This submission is optional and requires no system development; it is meant purely as practice for participants who are new to CodaLab (i.e. to find out how submitting and evaluating system output on CodaLab works). The sample file for this phase can be downloaded from GitHub and contains predictions for 10 instances of the training data for Task A. Please note that the file (like the prediction files for final evaluation) should be named 'predictions-taskA.txt' and should be uploaded in a zipped folder named 'submission'.
Want to try?
(*) Make sure that your submission is a zip-file named submission.zip (for Mac users: ensure that it contains no __macosx file).
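The packaging requirements above (a predictions file with the prescribed name inside a zip file named submission.zip) can be scripted. The sketch below is one possible way to do so with Python's standard library. The helper write_submission is hypothetical, and two details are assumptions rather than documented requirements: that the predictions file holds one label per line, and that the file sits at the root of the archive. The sample predictions file on GitHub is the reference for the exact format.

```python
import zipfile
from pathlib import Path


def write_submission(predictions, task="A", out_dir="."):
    """Write predictions-task{A,B}.txt and package it as submission.zip."""
    if task not in ("A", "B"):
        raise ValueError("task must be 'A' or 'B'")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pred_name = f"predictions-task{task}.txt"
    pred_path = out_dir / pred_name
    # Assumption: one label per line, in the same order as the input tweets.
    with open(pred_path, "w", encoding="utf-8", newline="\n") as handle:
        for label in predictions:
            handle.write(f"{label}\n")

    zip_path = out_dir / "submission.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Only the predictions file is added, so no __MACOSX entry can end up inside.
        archive.write(pred_path, arcname=pred_name)
    return zip_path
```

Building the archive programmatically rather than through a desktop "compress" action is one way to keep operating-system metadata out of the zip file.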
During the development phase (25/09/2017 - 15/01/2018), teams can upload a submission for Subtask A (i.e. binary irony classification) for development purposes. They can upload predictions (e.g. obtained via the baseline system) for all instances in the training data for Task A in the same way as for the official evaluation phase. During the development phase, submissions will be evaluated against the gold-standard labels of the training data for subtask A.
Find below a step-by-step guideline to upload your submission on CodaLab during both the development and evaluation phase:
You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. In order to assist testing of ideas, we also provide a baseline irony detection system that you can build on.
Use of this system is completely optional. It is available on GitHub, together with instructions for using it with the task data. The system is based on token unigram features and outputs predictions for subtask A or subtask B and a score (obtained through 10-fold cross-validation on the training sets).
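To make the two ingredients mentioned above more concrete, here is a minimal sketch of token unigram features and a 10-fold cross-validation split using only the Python standard library. It is not the baseline system itself: the whitespace tokenisation, the lowercasing, the fixed random seed and both function names are assumptions made for this illustration, and no classifier is included.

```python
import random
from collections import Counter


def unigram_features(tweet):
    """Bag of lowercased, whitespace-separated tokens with their counts."""
    return Counter(tweet.lower().split())


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In each of the k rounds, a classifier would be trained on the feature vectors of the train indices and scored on the held-out test indices; averaging the k scores gives the cross-validated estimate.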
The leader board will be disabled until the end of the evaluation period, so you will not be able to see the results of your submission. You can, however, see whether the upload of your submission was successful (if not, consult the error log), and you can of course evaluate your system on the training data by making use of the evaluation script (more details below).
The scores of the last submission of each team will be displayed on the leader board after the evaluation period ends. Whether that is the constrained or the unconstrained system (if you have both) is up to the team.
For each team, one team member submits the system. Submissions for one team that are made by different people will not be considered. A maximum of 10 submissions is allowed per team, but only the final submission will be considered as the team's official submission.
During development, participants can evaluate their system on the training data (on their local device) by making use of the official scoring script. The same scoring script runs on CodaLab to evaluate the submissions.
To simulate the scoring of your submission on your local PC, execute the following steps:
Start: Aug. 21, 2017, midnight
Description: The practice phase has finished.
Start: Sept. 25, 2017, midnight
Start: Sept. 25, 2017, midnight
Start: Jan. 15, 2018, midnight
Description: During this phase, submissions for Task A can be evaluated. Max. 10 submissions per team are allowed.
Start: Jan. 22, 2018, midnight
May 1, 2018, midnight