The Shared Task on Sarcasm Detection

Organized by debanjanghosh

First phase

Reddit Dataset
Jan. 19, 2020, midnight UTC


Competition Ends
April 16, 2020, 11:59 p.m. UTC

This is the overview page of the sarcasm detection shared task section in the Second Workshop on Figurative Language Processing, co-located with ACL 2020.

Sarcasm Detection

Sarcasm is defined as a form of figurative language, which often presents a sharp, bitter, or cutting expression or remark. Consider the following examples:

  • I love going to the emergency room every Saturday.
  • I sprayed perfume in my eye! I am now a genius!!?
  • Made $174 this month, I'm gonna buy a yacht! #poor #wishIhadabetterjob
  • So much intellectual conversation going on right now I may have to join mensa after this.
  • Pictures of you holding dead animal carcasses are so flattering.

Sarcasm detection has received considerable attention in the NLP community in recent years (Joshi et al., 2017). Computational approaches for sarcasm detection have modeled the utterance either in isolation or together with contextual information such as conversation context, author context, visual context, or cognitive features. The current shared task aims to study the role of conversation context in sarcasm detection (Ghosh et al., 2018).

We will be using two different datasets: Twitter conversations and conversation threads from Reddit. For both datasets, we will provide the immediate context (i.e., only the previous dialogue turn) as well as the full dialogue thread, when available. The goal is to understand how much conversation context is needed or helpful for sarcasm detection.
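The two context settings described above can be sketched as follows. This is only an illustration: the page does not describe the released file format, so the representation of a thread as a list of turns (oldest first) and the function name select_context are assumptions, not part of the official data or scripts.

```python
from typing import List


def select_context(turns: List[str], mode: str = "immediate") -> List[str]:
    """Return the context turns to pair with a response.

    `turns` is assumed to hold the dialogue turns that precede the
    response, oldest first (a hypothetical layout, not the official one).
    mode="immediate" keeps only the previous turn; mode="full" keeps all.
    """
    if mode == "immediate":
        return turns[-1:]  # slicing keeps an empty thread empty
    if mode == "full":
        return list(turns)
    raise ValueError(f"unknown mode: {mode!r}")


# Invented thread, for illustration only.
thread = ["turn A", "turn B", "turn C"]
immediate = select_context(thread, "immediate")
full = select_context(thread, "full")
```

Training the same model once with each setting is one simple way to measure how much the additional turns help.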

The computational models proposed by Ghosh et al. (2018), already released as open-source, will be used as a baseline. Sarcasm data is collected from Twitter using the standard hashtags of #sarcasm and #sarcastic. The Reddit data for training will be a subset of the datasets introduced by Khodak et al. (2017).


  • Joshi, A., Bhattacharyya, P., & Carman, M. J. (2017). Automatic sarcasm detection: A survey. ACM Computing Surveys (CSUR), 50(5), 73.
  • Ghosh, D., Fabbri, A. R., & Muresan, S. (2018). Sarcasm analysis using conversation context. Computational Linguistics, 44(4), 755-792.
  • Khodak, M., Saunshi, N., & Vodrahalli, K. (2017). A large self-annotated corpus for sarcasm. arXiv preprint arXiv:1704.05579.

Important Dates

  • January 19, 2020: 1st CFP for the shared task; CodaLab competition is open; training data and auxiliary scripts can be downloaded
  • February 25, 2020: 2nd CFP for the shared task; test data can be downloaded and results submitted; performance will be tracked on CodaLab dashboard
  • April 16, 2020 (extended from April 11): Last day for submitting predictions on test data
  • April 23, 2020 (extended from April 20): Papers describing the systems are due
  • May 15, 2020: Notifications for papers are sent
  • May 25, 2020: Camera-Ready version of papers
  • July 9, 2020: Workshop


Training and test data have been released. Please check the corresponding GitHub links. 

Contact Info and Emails:
  • Debanjan Ghosh, Educational Testing Service; dghosh [AT] ets [DOT] org
  • Smaranda Muresan, Data Science Institute, Columbia University; smara [AT] columbia [DOT] edu 
  • Avijit Vajpayee, Educational Testing Service; avajpayee [AT] ets [DOT] org

Shared Task on Twitter Datasets and Reddit Datasets for Sarcasm Detection

As mentioned, there will be two datasets available for this shared task. You can elect to participate in the shared task using one or both of the datasets.

For both the Twitter dataset and the Reddit dataset we will provide training and test utterances, so you do not have to use APIs to download data. We also provide a baseline classification model for prediction. The metric for comparison will be the F-measure.

Training phase: Data is released for developing and training your sarcasm detection software. You could do cross-validation on the training data, or partition the training data further to have a held-out set for preliminary evaluations, and/or set apart a subset of the data for development/tuning of hyper-parameters. However the released data is used, the goal is to have N final systems (or versions of a system) ready for test when the test data is released.
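As a rough sketch of the two options mentioned above (a held-out partition and cross-validation), the following uses only the Python standard library and works on example indices, so it makes no assumption about the data format. The function names, the 10% development fraction, the 5 folds, and the seed are illustrative choices, not requirements of the shared task.

```python
import random
from typing import Iterator, List, Tuple


def holdout_split(n_items: int, dev_fraction: float = 0.1,
                  seed: int = 13) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n_items-1 and set apart a development subset."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_dev = int(round(n_items * dev_fraction))
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


def k_fold_indices(n_items: int, k: int = 5,
                   seed: int = 13) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        held_out = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds)
                       if j != i for idx in fold)
        yield train, held_out


train_idx, dev_idx = holdout_split(100, dev_fraction=0.1)
cv_folds = list(k_fold_indices(10, k=5))
```

Fixing the seed keeps the partition reproducible across runs, which makes comparisons between system versions more meaningful.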

Testing phase: Test instances are released. Each team generates predictions for the test instances, for up to N models. The submitted predictions are evaluated (by us) against the true labels. Submissions will be anonymized -- only the highest score of all submitted systems per day will be displayed. The metric will be F-measure with Precision and Recall also available via the detailed results link.
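For preliminary evaluations on your own held-out data, Precision, Recall, and F-measure can be computed as in the sketch below. It is not the official scorer: the label strings are placeholders, and scoring a single positive class is an assumption (the organizers' evaluation script may, for example, average over both classes).

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[str], predicted: Sequence[str],
                        positive: str = "SARCASM") -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class (`positive` is a placeholder label)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels, for illustration only.
gold = ["SARCASM", "SARCASM", "NOT_SARCASM", "NOT_SARCASM"]
pred = ["SARCASM", "NOT_SARCASM", "SARCASM", "NOT_SARCASM"]
p, r, f = precision_recall_f1(gold, pred)
```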

NOTE: If you submit predictions for the Twitter test set, the submission must be named "answer.txt" and compressed into "" for the evaluation to work. Likewise, if you submit predictions for Reddit, the submission must also be named "answer.txt" and compressed into "".
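Packaging a submission could look like the sketch below. Several details are assumptions: the page does not state the line format of "answer.txt" (one prediction per line is assumed here), ZIP compression is assumed, and the archive name is left as a caller-supplied parameter because the required name is not given above. Check the submission instructions for the exact values.

```python
import zipfile
from pathlib import Path
from typing import Sequence


def package_predictions(predictions: Sequence[str], archive_name: str,
                        work_dir: str = ".") -> Path:
    """Write predictions to answer.txt and compress it into a ZIP archive.

    `archive_name` must be supplied by the caller; the required name is
    not stated on this page. One prediction per line is an assumption.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    answer = work / "answer.txt"
    answer.write_text("\n".join(predictions) + "\n", encoding="utf-8")
    archive = work / archive_name
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file at the archive root under the required name.
        zf.write(answer, arcname="answer.txt")
    return archive
```

Storing "answer.txt" at the root of the archive, rather than inside a folder, avoids a common cause of failed evaluations on CodaLab-style platforms.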

We plan to have N=12 (up to 12 submissions per team). This way, a number of versions of your system can be evaluated without overwhelming the competition. You can try as many submissions as are allowed by CodaLab (1000), but only the last 12 submissions will be valid for comparative evaluation with other systems participating in the shared task. To help us write the task summary paper, each submission must provide the mandatory team name/individual method name and method description. In particular, please list any third-party datasets you trained your system on in the method description textbox.

Regarding third-party datasets: we are allowing participants to use standard datasets that have been used in sarcasm detection research in recent years. For instance, for Twitter, you can use Riloff et al. (2013), Ptacek et al. (2014), and so on. For discussion-forum data, feel free to look at the Internet Argument Corpus datasets (e.g., Oraby et al., 2017). However, please state whether you used any such dataset and what impact it had on your system's performance.

For more detailed information regarding the submission of your results please see this.




Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the Figurative Language Processing-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. The inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

Training and testing data will be made available and persist on GitHub after the competition to encourage research in sarcasm detection. You agree not to redistribute these data except in the manner prescribed by their license.

Reddit Dataset

Start: Jan. 19, 2020, midnight

Twitter Dataset

Start: Jan. 19, 2020, midnight

Competition Ends

April 16, 2020, 11:59 p.m.
