SemEval-2018 task 3 - Irony detection in English tweets

Organized by Eslefeve


SemEval-2018 Task 3: Irony detection in English tweets


This is the CodaLab website for SemEval-2018 Task 3: Irony detection in English tweets. The task is part of the 12th International Workshop on Semantic Evaluation (SemEval-2018). You can join the official mailing group of the task.

Cite this paper for the task: Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. Semeval-2018 Task 3: Irony detection in English Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

IMPORTANT: registration for the task closes on 05/01/2018!


Background and significance:

The development of the social web has stimulated creative and figurative language use like irony. This frequent use of irony on social media has important implications for natural language processing tasks, which struggle to maintain high performance when applied to ironic text (Liu, 2012; Maynard and Greenwood, 2014; Ghosh and Veale, 2016). Although different definitions of irony co-exist, it is often identified as a trope or figurative language use whose actual meaning differs from what is literally enunciated. As such, modeling irony has a large potential for applications in various research areas, including text mining, author profiling, detecting online harassment, and perhaps one of the most popular applications at present, sentiment analysis.

As described by Joshi et al. (2016), recent approaches to irony can roughly be classified into rule-based and machine learning-based methods. While rule-based approaches mostly rely on lexical information and require no training, machine learning-based approaches invariably make use of training data and exploit different types of information sources, including bags of words, syntactic patterns, sentiment information and semantic relatedness. Recently, deep learning techniques have gained popularity for this task, as they make it possible to integrate semantic relatedness by making use of, for instance, word embeddings.

To facilitate data collection and annotation, many supervised-learning approaches rely on hashtag-labeled (e.g. #sarcasm) Twitter data, although this approach has been shown to introduce noise into the data (e.g. Kunneman et al., 2015; Van Hee et al., 2016a, 2016b). For the current task, we collected a dataset for automatic irony detection using the hashtags #irony, #sarcasm and #not, and manually annotated the corpus following a fine-grained annotation scheme.


Participants of the task will be given the opportunity to write a paper describing their system, resources used, results, and analysis, which will be part of the official SemEval-2018 proceedings. The paper is to be four pages long, plus at most two pages for references, and should follow the provided format and style files.

How to participate?

  • read the competition details on this website;
  • download the competition data;
  • join the mailing group to receive announcements and participate in discussions;
  • submit your system(s) in January 2018 (see Schedule for exact dates for each task).


We propose two different subtasks for the automatic detection of irony on Twitter. For the first subtask, participants should determine whether a tweet is ironic or not (by assigning a binary value 0 or 1). For the second subtask, participants are tasked with distinguishing between non-ironic and ironic tweets, the latter of which are subdivided into three categories. More details of both subtasks are described below.

Ironic vs. non-ironic

The first subtask is a two-class (or binary) classification task where the system has to predict whether a tweet is ironic or not. The following sentences present examples of an ironic and non-ironic tweet, respectively.

  • I just love when you test my patience!! #not
  • Had no sleep and have got school now #not happy

Different types of irony

The second subtask is a multiclass classification task where the system has to predict one out of four labels describing i) verbal irony realized through a polarity contrast, ii) verbal irony without such a polarity contrast (i.e., other verbal irony), iii) descriptions of situational irony, iv) non-irony. A brief description and example for each label are presented below.

Verbal irony by means of a polarity contrast

Instances containing an evaluative expression whose polarity (positive, negative) is inverted between the literal and the intended evaluation.


  • I love waking up with migraines #not :'(
  • I really love this year's summer; weeks and weeks of awful weather

In the above examples, the irony results from a polarity inversion between two evaluations. In the second example, for instance, the literal evaluation ("I really love this year's summer") is positive, while the intended one, which is implied by the context ("weeks and weeks of awful weather"), is negative.

Other verbal irony

Instances which show no polarity contrast between the literal and the intended evaluation, but are nevertheless ironic.


  • @someuser Yeah keeping cricket clean, that's what he wants #Sarcasm
  • Human brains disappear every day. Some of them have never even appeared. #brain #humanbrain #Sarcasm

Situational irony

Instances describing situational irony, or situations that fail to meet some expectations. As explained by Shelley (2001), firefighters who have a fire in their kitchen while they are out to answer a fire alarm would be a typically ironic situation.


  • Most of us didn't focus in the #ADHD lecture. #irony
  • Event technology session is having Internet problems. #irony #HSC2024
  • Just saw a non-smoking sign in the lobby of a tobacco company #irony

Non-ironic

Instances that are clearly not ironic, or which lack context to be sure that they are ironic. Examples:

  • And then my sister should be home from college by time I get home from babysitting. And it's payday. THIS IS A GOOD FRIDAY
  • Please dont fuck with me when I first wake up #not a morning person!


References

  • Ghosh, A. and Veale, T.: 2016, Fracking Sarcasm using Neural Network, Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, San Diego, California, pp. 161–169.
  • Joshi, A., Bhattacharyya, P. and Carman, M. J.: 2016, Automatic Sarcasm Detection: A Survey, CoRR abs/1602.03426.
  • Liu, B.: 2012, Sentiment Analysis and Opinion Mining, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.
  • Maynard, D. and Greenwood, M.: 2014, Who cares about Sarcastic Tweets? Investigating the Impact of Sarcasm on Sentiment Analysis, in N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk and S. Piperidis (eds), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association, Reykjavik, Iceland.
  • Shelley, C.: 2001, The bicoherence theory of situational irony, Cognitive Science 25(5), 775–818.
  • Van Hee, C., Lefever, E. and Hoste, V.: 2016a, Exploring the realization of irony in Twitter data, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), Portorož, Slovenia, pp. 1795–1799.
  • Van Hee, C., Lefever, E. and Hoste, V.: 2016b, Monday mornings are my fave : #not Exploring the Automatic Recognition of Irony in English tweets, Proceedings of COLING 2016, 26th International Conference on Computational Linguistics, Osaka, Japan, pp. 2730–2739.
  • Van Hee, C., Lefever, E. and Hoste, V.: 2016c, Guidelines for Annotating Irony in Social Media Text, version 2.0, Technical Report 16-01, LT3, Language and Translation Technology Team – Ghent University.
  • Wallace, B. C.: 2015, Computational irony: A survey and new perspectives, Artificial Intelligence Review 43(4), 467–483.


Systems are evaluated using standard evaluation metrics, including accuracy, precision, recall and F1-score. The submissions will be ranked by F1-score.

Note that while accuracy provides insights into the system performance for all classes, the latter three measures will be calculated for the positive class only (subtask A) or will be reported per class label or macro-averaged (subtask B). Macro-averaging of the F1-score implies that all class labels have equal weight in the final score.
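As a minimal illustration of the macro-averaging described above (this is not the official scoring script, and the gold/predicted labels below are invented toy data), per-class F1-scores can be averaged with equal class weights as follows:

```python
def f1_per_class(gold, pred, label):
    """F1-score for a single class label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(gold, pred):
    # Every class label carries equal weight in the final score.
    labels = sorted(set(gold))
    return sum(f1_per_class(gold, pred, label) for label in labels) / len(labels)

# Toy four-class example (subtask B uses labels 0-3).
gold = [0, 1, 2, 3, 1, 1, 0, 2]
pred = [0, 1, 2, 0, 1, 2, 0, 2]
print(round(macro_f1(gold, pred), 3))  # prints 0.6
```

Note how the never-predicted class 3 scores 0 and pulls the macro-average down, even though overall accuracy is fairly high.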

The metrics will be calculated as follows (TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives, respectively):

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 · Precision · Recall / (Precision + Recall)

Scoring script

The official evaluation script is available at GitHub.

Terms and conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2018 workshop and in the associated proceedings, at the task organisers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organisers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organisers.

You further agree that the task organisers are under no obligation to release scores and that scores may be withheld if it is the task organisers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organisers.

You agree not to redistribute the test data except in the manner prescribed by its licence.


Competition setup

  • Mon 21 Aug 2017: CodaLab competition website ready and made public. Should include basic task description and mailing group information for the task. Trial data ready. Evaluation script ready for participants to download and run on the trial data.
  • Mon 25 Sep 2017: Training data ready. Development data ready. CodaLab competition website updated to include an evaluation script uploaded as part of the competition so that participants can upload submissions on the development set and the script immediately checks the submission for format and computes the results on the development set. Benchmark system should be made available to participants. Also, the organisers should run the submission created with the benchmark system on CodaLab, so that participants can see its results on the Leaderboard.

Competition and evaluation

  • Mon 15 Jan 2018: Evaluation start Task A
  • Mon 22 Jan 2018: Evaluation end Task A, Evaluation start Task B
  • Mon 29 Jan 2018: Evaluation end Task B
  • Mon 05 Feb 2018: Results Task A & Task B posted
  • Fri 09 Feb 2018: Gold labels Task A & Task B released
  • Mon 26 Feb 2018: System description paper submissions due by 23:59 GMT -12:00
  • Mon 05 Mar 2018: Task description paper submissions due by 23:59 GMT -12:00
  • Mon 19 Mar 2018: Paper reviews due (for both systems and tasks)
  • Mon 02 Apr 2018: Author notifications
  • Mon 16 Apr 2018: Camera ready submissions due

Task organisers

Cynthia Van Hee is a post-doctoral researcher at the LT3 Language and Translation Technology Team at Ghent University, active in the field of computational linguistics and machine learning. In the framework of her PhD, she created a theoretic framework of irony and developed a state-of-the-art irony detection system. Her other research interests include sentiment and emotion analysis in social media text and the detection of cyberbullying, which was one of the use cases of the AMiCA (Automatic Monitoring for Cyberspace Applications) project in which she was actively involved. Related to the first topic, she participated in the SemEval-2014 and SemEval-2015 tasks "Sentiment Analysis on Twitter" and "Sentiment Analysis of Figurative Language in Twitter", respectively.

Els Lefever is assistant professor Terminology and Computational Semantics. She holds a PhD in computer science from Ghent University on "ParaSense: Parallel Corpora for Word Sense Disambiguation" (2012). She has a strong expertise in machine learning of natural language and multilingual natural language processing, with a special interest for computational semantics, cross-lingual word sense disambiguation and automatic terminology extraction & taxonomy learning. Els organised two runs of the SemEval "Crosslingual Word Sense Disambiguation" task and co-organised the SemEval-2014 task on "L2 Writing Assistant" and the SemEval-2016 task on "Taxonomy Extraction Evaluation (TexEval)". She also participated to 5 other SemEval tasks (2007: Web People Search, 2014: Sentiment Analysis in Twitter, 2015: Sentiment Analysis of Figurative Language in Twitter, Aspect Based Sentiment Analysis, and Taxonomy Extraction Evaluation).

Véronique Hoste is full professor and head of LT3. She holds a PhD in linguistics on optimisation issues in machine learning of coreference resolution (2005). She has a strong expertise in machine learning of natural language, and more specifically in coreference resolution, word sense disambiguation, terminology extraction, text classification, classifier optimisation, readability prediction, sentiment mining, etc. Véronique already participated in the Senseval 2 and Senseval-3 competitions early 2000 on all-words word sense disambiguation. In the SemEval successor, she co-organised tasks on coreference resolution (2010), cross-lingual word sense disambiguation (2010 and 2013), L2 writing (2014) and multilingual aspect-based sentiment analysis (2016) on which she also published several articles.

The task organisers are members of the LT3 (Language and Translation Technology) team at the Faculty of Arts and Philosophy at Ghent University.


Training and test datasets are provided for two subtasks A and B on irony detection in tweets. The training dataset for task A has tweets with a binary value score (0 or 1) indicating whether the tweet is ironic. The training data for subtask B includes tweets with a numeric value corresponding to one of the subcategories (i.e. ironic by clash, other irony, situational irony, non-ironic). The test data includes only the tweet text. Gold irony labels will be released after the evaluation period. Download links and further details on the construction of the dataset are provided below.


We constructed a dataset of 3,000 English tweets by searching Twitter for the hashtags #irony, #sarcasm and #not. All tweets were collected between 01/12/2014 and 04/01/2015 and represent 2,676 unique users. To minimize the noise introduced by groundless irony-related hashtags, all tweets were manually labeled using a fine-grained annotation scheme for irony (Van Hee et al., 2016c). Prior to data annotation, the entire corpus was cleaned by removing retweets, duplicates and non-English tweets, and by replacing XML-escaped characters (e.g. &amp;).
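The cleaning steps above can be sketched in Python (a toy illustration only: the sample tweets are invented, and filtering non-English tweets would require an additional language-identification tool):

```python
import html

def clean_corpus(tweets):
    """Sketch of the cleaning steps: unescape XML-escaped characters,
    drop retweets, and drop exact duplicates."""
    seen, cleaned = set(), []
    for tweet in tweets:
        text = html.unescape(tweet)   # e.g. "&amp;" -> "&"
        if text.startswith("RT "):    # drop retweets
            continue
        if text in seen:              # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

print(clean_corpus(["Tea &amp; biscuits #irony", "RT great", "Tea & biscuits #irony"]))
```

In this toy run, the retweet is dropped and the third tweet is removed as a duplicate of the first once the XML escape is resolved.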


The entire corpus was annotated by three linguistics students, all second-language speakers of English, who each annotated one third of the corpus. Brat (Stenetorp et al., 2012) was used as the annotation tool. To assess the reliability of the annotations, an inter-annotator agreement study was carried out on a subset of the corpus (100 instances), resulting in a Fleiss' Kappa score of 0.72 for recognising (the different forms of) ironic instances. Annotation of the corpus resulted in the following distribution of the different classes:

  • Verbal irony by means of a polarity contrast: 1,728 instances
  • Other types of verbal irony: 267 instances
  • Situational irony: 401 instances
  • Non-ironic: 604 instances

Train and test corpus

Based on the annotations, 2,396 instances are ironic (1,728 + 267 + 401) while 604 are not. To balance the class representation in the corpus, 1,792 non-ironic tweets were added from a background corpus. These tweets were manually checked to ascertain that they are non-ironic and devoid of irony-related hashtags. This brings the total number of tweets to 4,792 (2,396 ironic + 2,396 non-ironic). For the SemEval-2018 competition, this corpus will be randomly split into a training set (80%, or 3,833 instances) and a test set (20%, or 958 instances). In addition, we will provide a second manually annotated test set containing approximately 1,500 tweets.

Submission format

System submissions for CodaLab are zip-compressed folders containing a predictions file called predictions-taskA.txt or predictions-taskB.txt, for subtask A and B, respectively. Remember that your submission should only concern one task at a time. There will be separate evaluation periods for the two subtasks.

The evaluation script will check whether the files contain the relevant labels for each task. The script, as well as a sample predictions file can be found on GitHub.
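Preparing a submission can be sketched as follows (the predictions are invented toy labels; the file name 'predictions-taskA.txt' comes from this page, while the archive name 'submission.zip' is an assumption based on the 'submission' folder naming used elsewhere on this page):

```python
import zipfile

# Hypothetical predictions for subtask A: one binary label per line,
# in the same order as the instances in the test set.
predictions = [1, 0, 0, 1]

with open("predictions-taskA.txt", "w") as f:
    f.write("\n".join(str(label) for label in predictions) + "\n")

# Zip the predictions file for upload on CodaLab.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions-taskA.txt")
```

The predictions file must contain only the labels, one per line, with no text, metadata or headers.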

IMPORTANT: many competitions run on CodaLab, and only a limited number of submissions can be processed at a time, so your submission may be 'stuck' (i.e. its status remains 'submitted' even after refreshing) for some time. In this case, be patient and try again until your submission is processed. Please do not wait until the very last moment to submit your final system, to avoid stress and missing the deadline :-).

Practice phase

During the practice phase, teams can upload an example submission for Subtask A (i.e. binary irony classification). This submission is optional, is intended purely as practice for participants who are new to CodaLab (i.e. to find out how submitting and evaluating system output on CodaLab works), and does not require any development. The sample file for this phase can be downloaded from GitHub and contains predictions for 10 instances of the training data for Task A. Please note that the file (like the prediction files for final evaluation) should be named 'predictions-taskA.txt' and should be uploaded in a zipped folder named 'submission'.

Want to try?

  • navigate to 'Participate';
  • click 'Submit / View Results';
  • upload your zipped submission folder (*);
  • refresh the status to see the evaluation status;
  • click 'download evaluation output from scoring step' to see the results.

(*) Make sure that your submission is a zip file (for Mac users: ensure that it contains no __MACOSX folder).

Development and evaluation phase

During the development phase (25/09/2017 - 15/01/2018), teams can upload a submission for Subtask A (i.e. binary irony classification) for development purposes. They can upload predictions (e.g. obtained via the baseline system) for all instances in the training data for Task A in the same way as for the official evaluation phase. During the development phase, submissions will be evaluated against the gold-standard labels of the training data for subtask A.

Find below a step-by-step guideline to upload your submission on CodaLab during both the development and evaluation phase:

  • via the command line, navigate on your local PC to a folder named 'submission' containing the output of your system (i.e. a file named 'predictions-taskA.txt' or 'predictions-taskB.txt', depending on the task). Make sure that there is one prediction per line and that the original order of the posts is left unchanged.
  • inside the folder, execute the following command: zip -r ../submission.zip *
  • on CodaLab, navigate to 'Participate' > 'Evaluation Task A' or 'Evaluation Task B' > Submit/View Results and upload your file
  • click 'Refresh status' until your submission receives the 'Finished' status
  • click 'Submit to leaderboard' to push your results to the official scoring board. Please note: as soon as the official evaluation period starts, the scoring board will not be made visible until the end of the evaluation period (end of January 2018).

Baseline system

You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. In order to assist testing of ideas, we also provide a baseline irony detection system that you can build on.

The use of this system is completely optional; it is available on GitHub, along with instructions for using it with the task data. The system is based on token unigram features and outputs predictions for subtask A or subtask B, together with a score (obtained through 10-fold cross-validation on the training sets).
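The token-unigram idea behind the baseline can be illustrated with a toy stand-in (this is not the official baseline system, which uses a proper learner; the tiny count-based classifier and the training tweets below are invented for illustration):

```python
from collections import Counter

def train_unigram_counts(tweets, labels):
    """Collect per-class token-unigram counts for a binary task."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(tweets, labels):
        counts[label].update(text.lower().split())
    return counts

def predict(counts, tweet):
    # Score each class by how often it has seen the tweet's tokens.
    tokens = tweet.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in counts.items()}
    return max(scores, key=scores.get)

# Toy training data: 1 = ironic, 0 = non-ironic.
train = ["i just love exams #not", "great weather today"]
labels = [1, 0]
model = train_unigram_counts(train, labels)
print(predict(model, "love revising #not"))  # prints 1
```

A real unigram baseline would use a vectorizer and a trained classifier rather than raw counts, but the feature representation is the same.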

Directions, notes and reminders for the official evaluation phase

Time schedule
  • Evaluation Task A: 15/01 - 22/01 (i.e. submissions can be made until Sunday 21/01, 23:59)
  • Evaluation Task B: 22/01 - 29/01 (i.e. submissions can be made until Sunday 28/01, 23:59)
  • A participant can be involved in exactly one team. If there are reasons why it makes sense for you to participate in more than one team, then email us before the evaluation period starts. In special circumstances this may be allowed.
  • All registered members of this task are asked to indicate their username, email and team name in a Google spreadsheet (invitation via email). Team constitutions cannot be changed once the evaluation period has started.
System submission
  • Participants can download the test data via GitHub (when released) and run their system on the data to generate a submission file.
  • Participants can submit, for each task, a constrained run (only the provided training data were used to develop the system) and an unconstrained run (additional training data were used). However, only the last submission will appear on the official leaderboard.
  • Submission files should be named 'predictions-taskA.txt' or 'predictions-taskB.txt' and zipped in a folder named 'submission_constrained' or 'submission_unconstrained', depending on the flavour of your system. Submission folders that are named '' will be considered constrained.
  • The predictions file consists of n predictions (one per line/for each instance in the test set) in the relevant order. Only labels should be included in the predictions file, without any text, metadata or headers.
  • When uploading their system, participants will need to fill in some metadata, including the name of their team, a brief description of the method used, etc.
  • Only the last submission (it does not matter whether this is constrained/unconstrained) will be shown on the leaderboard (this will be the official ranking on CodaLab), but the task description paper will list a separate ranking of the best constrained and unconstrained system runs. It will therefore be important to clearly indicate, at the moment of submission, whether the run is constrained or unconstrained.
  • After the evaluation phase has finished, each team will be asked to fill in a form to provide details about their submission so that all systems can be accurately described in the task description paper.


Practical details about the leader board
  • The leader board will be disabled until the end of the evaluation period, so you will not be able to see the results of your submission. You can, however, see whether the upload of your submission was successful (if not, consult the error log), and you can of course evaluate your system on the training data by making use of the evaluation script (more details below).

  • The scores of the last submission of each team will be displayed on the leader board after the evaluation period ends. Whether that is the constrained or unconstrained system (in case you have both), is up to the choice of the team.

  • For each team, one team member submits the system. Submissions for one team made by different people will not be considered; only the final submission counts as the team's official submission. Important: there is a maximum of 10 submission attempts (i.e. in case something goes wrong while uploading, there are problems with the file format, etc.); uploading more than 10 submissions is not possible.

Use of the evaluation script

During development, participants can evaluate their system on the training data (on their local device) by making use of the official scoring script. The same scoring script runs on CodaLab to evaluate the submissions.

To simulate the scoring of your submission on your local PC, execute the following steps:

  1. Create a folder (e.g. named 'demo') with two subfolders: 'ref' and 'res'
  2. Put the gold-standard file in the 'ref' folder and your system output in the 'res' folder
  3. Run the script with the following command: python demo .
  4. Consult the output in the 'scores.txt' file generated in step 3
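The steps above can be simulated with a short sketch (the gold-standard file name inside 'ref' and the labels are hypothetical toy data; the accuracy computation only illustrates the kind of result a scoring script might write to 'scores.txt'):

```python
import os

# Steps 1-2: recreate the folder layout the scoring script expects.
os.makedirs("demo/ref", exist_ok=True)
os.makedirs("demo/res", exist_ok=True)
with open("demo/ref/goldstandard-taskA.txt", "w") as f:   # hypothetical name
    f.write("1\n0\n1\n")
with open("demo/res/predictions-taskA.txt", "w") as f:
    f.write("1\n1\n1\n")

# Steps 3-4: compare predictions against gold labels and write a scores file,
# roughly as a scoring script might.
gold = open("demo/ref/goldstandard-taskA.txt").read().split()
pred = open("demo/res/predictions-taskA.txt").read().split()
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
with open("scores.txt", "w") as f:
    f.write("accuracy: %.2f\n" % accuracy)
```

Running the official script against this layout should produce a 'scores.txt' in the same spirit, with the task's full metric set.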


Practice Task A

Start: Aug. 21, 2017, midnight

Description: The practice phase has finished.

Development Task A

Start: Sept. 25, 2017, midnight

Development Task B

Start: Sept. 25, 2017, midnight

Evaluation Task A

Start: Jan. 15, 2018, midnight

Description: During this phase, submissions for Task A can be evaluated. Max. 10 submissions per team are allowed.

Evaluation Task B

Start: Jan. 22, 2018, midnight

Competition Ends

Jan. 29, 2018, midnight
