This is the overview page for the sarcasm detection shared task of the Second Workshop on Figurative Language Processing, co-located with ACL 2020.
Sarcasm is a form of figurative language that often presents a sharp, bitter, or cutting expression or remark.
Sarcasm detection has received considerable attention in the NLP community in recent years (Joshi, 2016). Computational approaches have modeled the utterance either in isolation or together with contextual information such as conversation context, author context, visual context, or cognitive features. This shared task aims to study the role of conversation context for sarcasm detection (Ghosh et al., 2018).
We will be using two different datasets: Twitter conversations and conversation threads from Reddit. For both datasets, we will provide the immediate context (i.e., only the previous dialogue turn) as well as the full dialogue thread, when available. The goal is to understand how much conversation context is needed or helpful for sarcasm detection.
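To make the two context settings concrete, here is a minimal sketch of how a dialogue-threaded instance could be represented and reduced to its immediate context. The field names (label, response, context) are illustrative assumptions, not the official schema; please consult the released GitHub data for the actual format.

```python
import json

# Illustrative (assumed) record layout: a response plus the ordered dialogue
# turns that precede it. Field names are hypothetical; consult the released
# data for the real schema.
example = {
    "label": "SARCASM",
    "response": "Oh great, another Monday.",
    "context": ["How was your weekend?", "Too short, as always."],
}

def immediate_context(record):
    """Return only the previous dialogue turn (the last context entry)."""
    return record["context"][-1:] if record["context"] else []

def full_context(record):
    """Return the entire dialogue thread preceding the response."""
    return record["context"]

print(json.dumps({"immediate": immediate_context(example),
                  "full": full_context(example)}, indent=2))
```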
The computational models proposed by Ghosh et al. (2018), already released as open source, will be used as a baseline. The Twitter sarcasm data is collected using the standard hashtags #sarcasm and #sarcastic. The Reddit data for training will be a subset of the datasets introduced by Khodak et al. (2017).
Important Dates
Updates
Training and test data have been released. Please check the corresponding GitHub links.
As mentioned, two datasets are available for this shared task. You can elect to participate using either one or both of them.
For both the Twitter and Reddit datasets we will provide training and test utterances, so you do not have to use APIs to download data. We also provide a baseline classification model for prediction. The metric for comparison will be the F-measure.
Training phase: Data is released for developing and training your sarcasm detection software. You can do cross-validation on the training data, partition it further to create a held-out set for preliminary evaluations, and/or set apart a subset of the data for development and hyper-parameter tuning. However the released data is used, the goal is to have N final systems (or versions of a system) ready for testing when the test data is released.
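As an illustration of the held-out strategy mentioned above, the sketch below partitions a list of training instances into training and development folds with scikit-learn. The record layout, label strings, and 80/20 split are assumptions for illustration, not part of the released data or baseline.

```python
from sklearn.model_selection import train_test_split

# `records` is assumed to be a list of training instances with a "label"
# field (see the released data for the actual schema). Toy data below.
records = [
    {"response": "Sure, that will definitely work.", "label": "SARCASM"},
    {"response": "Thanks, that fixed it.", "label": "NOT_SARCASM"},
] * 50

labels = [r["label"] for r in records]

# Hold out 20% of the training data for development/hyper-parameter tuning,
# stratified so both classes appear in each partition.
train_records, dev_records = train_test_split(
    records, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_records), "train /", len(dev_records), "dev")
```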
Testing phase: Test instances are released. Each team generates predictions for the test instances, for up to N models. The submitted predictions are evaluated (by us) against the true labels. Submissions will be anonymized -- only the highest score of all submitted systems per day will be displayed. The metric will be F-measure with Precision and Recall also available via the detailed results link.
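For reference, the F-measure is the harmonic mean of precision and recall, and it can be checked locally along the lines below. The label strings and the choice of treating the sarcastic class as the positive class are assumptions; the organizers' evaluation is authoritative.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold labels and predictions; in practice the gold labels are held by
# the organizers and the predictions come from your system's output.
gold = ["SARCASM", "NOT_SARCASM", "SARCASM", "NOT_SARCASM"]
pred = ["SARCASM", "SARCASM", "SARCASM", "NOT_SARCASM"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", pos_label="SARCASM"
)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```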
NOTE: If you submit predictions for the Twitter test set, the submission must be named "answer.txt" and compressed into "twitter_answer.zip" for the evaluation to work. Likewise, if you submit predictions for Reddit, the submission must also be named "answer.txt" and compressed into a "reddit_answer.zip" file.
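The packaging step described in the note can be scripted with Python's standard zipfile module, as in the sketch below. The contents of answer.txt shown here (an id,label line) are purely hypothetical; follow the official submission instructions for the actual prediction format.

```python
import zipfile

# Write your predictions to answer.txt first. The line format below is
# hypothetical and for illustration only; see the submission instructions
# for the required prediction format.
with open("answer.txt", "w") as f:
    f.write("twitter_1,SARCASM\n")

# The Twitter track expects answer.txt compressed into twitter_answer.zip
# (use reddit_answer.zip for Reddit predictions).
with zipfile.ZipFile("twitter_answer.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("answer.txt", arcname="answer.txt")
```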
We plan to have N=12 (up to 12 submissions per team); this way, several versions of your system can be evaluated without overwhelming the competition. You can try as many submissions as CodaLab allows (1000), but only the last 12 submissions will be valid for comparative evaluation against the other systems participating in the shared task. For the task summary paper, we require that each submission provide a team or individual name, a method name, and a method description. In particular, please list any 3rd-party datasets you trained your system on (if any) in the method description textbox.
Regarding 3rd-party datasets: we allow participants to use standard datasets that have been used in sarcasm detection research in recent years. For instance, for Twitter, you can use Riloff et al. (2013), Ptacek et al. (2014), and so on. For discussion-forum data, feel free to look at the Internet Argument Corpus datasets (e.g., Oraby et al. 2017). However, please state whether you have used any particular dataset and its impact on your system's performance.
For more detailed information regarding the submission of your results please see this.
By submitting results to this competition, you consent to the public release of your scores at the Figurative Language Processing 2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. The inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
Training and testing data will be made available and will persist on GitHub after the competition to encourage research in sarcasm detection. You agree not to redistribute these data except in the manner prescribed by their license.
Start: Jan. 19, 2020, midnight
End: April 16, 2020, 11:59 p.m.