SemEval-2021 Task 12 - Learning with Disagreements

Organized by alexandraG




Modern research in Cognitive Science and Artificial Intelligence (AI) is driven by the availability of large datasets annotated with human judgments. Most annotation projects assume that a single preferred interpretation exists for each item, but this assumption has been shown to be an idealization at best, both in computational linguistics and in computer vision. Virtually all annotation projects for tasks such as anaphora resolution (Poesio et al. 2005, 2006), word sense disambiguation (Passonneau et al., 2006), POS tagging (Plank et al., 2014), sentiment analysis, image classification, natural language inference, and others encounter numerous cases on which humans disagree.

The aim of this shared task is to provide a unified testing framework for learning from disagreements using the best-known datasets containing information about disagreements for interpreting language and classifying images:

1. LabelMe-IC: Image Classification using a subset of LabelMe (Russell et al., 2008), a widely used, community-created image classification dataset in which each image is assigned to one of 8 categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2017) collected crowd labels for these images using Amazon Mechanical Turk (AMT).

2. CIFAR10-IC: Image Classification using a subset of the CIFAR-10 dataset. The entire dataset consists of colour images in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Crowdsourced labels for this dataset were collected by Peterson et al. (2019).

3. PDIS: Information Status Classification using Phrase Detectives. Information Status (IS) classification in the Phrase Detectives dataset (Poesio et al., 2019) involves identifying the information status of a noun phrase: whether that noun phrase refers to new information or to old information.

4. Gimpel-POS: Part-of-Speech tagging using the Gimpel dataset (Gimpel et al., 2011) for Twitter posts. Plank et al. (2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), using these tags as gold, and collected crowdsourced labels.

5. Humour: ranking one-line texts using pairwise funniness judgements (Simpson et al., 2019). Crowdworkers have annotated pairs of texts (a mixture of puns and non-puns) to indicate which of the two is funnier. A gold standard ranking was produced using a large number of redundant annotations. The goal is to infer the gold standard ranking from a reduced number of crowdsourced judgements.

6. CrowdTruth-FD: semantic frame disambiguation using labels acquired with crowdsourcing by the CrowdTruth Project (Dumitrache et al., 2019).

Participants are invited to train models for these six tasks by harnessing the crowd labels. 
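One common way to harness crowd labels, rather than discarding disagreement, is to turn each item's annotations into a probability distribution over classes (a "soft label") and train against that. The following is a minimal sketch under stated assumptions: the function name `soft_label`, the class names, and the example annotations are hypothetical and are not taken from the task's data format or starter code.

```python
from collections import Counter


def soft_label(crowd_labels, classes):
    """Turn one item's crowd labels into a probability distribution over classes.

    Hypothetical helper: the task's real data format may differ.
    """
    if not crowd_labels:
        raise ValueError("item has no crowd labels")
    counts = Counter(crowd_labels)
    total = len(crowd_labels)
    # Classes that no annotator chose get probability 0.
    return [counts[c] / total for c in classes]


# Illustrative example: five annotators label one image (made-up data).
classes = ["cat", "dog", "deer"]
annotations = ["cat", "cat", "dog", "cat", "deer"]
distribution = soft_label(annotations, classes)
print(distribution)  # [0.6, 0.2, 0.2]
```

A distribution like this can serve as a training target in place of a single aggregated label, which keeps the information about how strongly annotators agreed.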



Anca Dumitrache, Lora Aroyo, and Chris Welty. 2019. A crowdsourced frame disambiguation corpus with ambiguity. In Proc. of NAACL.

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 42–47, Portland, Oregon, USA. Association for Computational Linguistics.

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9616–9625.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. Computing Research Repository - CORR.

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014a. Learning part-of-speech taggers with inter-annotator agreement loss. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 742–751, Gothenburg, Sweden. Association for Computational Linguistics.

Massimo Poesio, Uwe Reyle, and Rosemary Stevenson. 2007. Justified sloppiness in anaphoric reference. In H. Bunt and R. Muskens, editors, Computing Meaning, volume 3, pages 11–34. Kluwer.

Massimo Poesio, Jon Chamberlain, and Udo Kruschwitz. 2017. Crowdsourcing. In N. Ide and J. Pustejovsky, editors, The Handbook of Linguistic Annotation, pages 277–295. Springer.

Filipe Rodrigues, Mariana Lourenco, Bernardete Ribeiro, and Francisco Pereira. 2017. Learning supervised topic models for classification and regression from crowds. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP:1–1

Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77.

Edwin Simpson, Erik-Lân Do Dinh, Tristan Miller, and Iryna Gurevych. 2019. Predicting humorousness and metaphor novelty with Gaussian process preference learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5716–5728.


The competition runs in 3 phases. Submissions for each phase are evaluated using 2 metrics.

1. Micro-F1 - a 'hard' metric. This metric evaluates how well the submitted results align with the preferred (gold) interpretation. An item is counted as correct if the submission assigns the maximum probability to the preferred interpretation.

2. Cross entropy - a 'soft' metric. The hard metric ignores probabilities assigned to different labels for a given item. We therefore use cross entropy to compare predicted label probabilities to the distribution of labels assigned by human annotators. This evaluates how well the model's probabilities reflect the level of agreement among annotators, as a model that correctly predicts the distribution of labels produced by the crowd for each item will have low cross entropy. 

An ideal model would have a high micro-averaged F1 and a low cross entropy.
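The two metrics can be illustrated with a minimal sketch. This is not the official scorer: the function names, the input layout (one list of class probabilities per item, gold labels as class indices, crowd distributions as per-item probability lists), and the example numbers are all assumptions made for illustration. The sketch relies on the fact that, when every item receives exactly one predicted label, micro-averaged F1 equals accuracy.

```python
import math


def micro_f1(pred_probs, gold):
    """Hard metric: fraction of items whose highest-probability class is the gold class.

    With exactly one predicted label per item, micro-averaged F1 equals accuracy.
    """
    correct = 0
    for probs, gold_index in zip(pred_probs, gold):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == gold_index:
            correct += 1
    return correct / len(gold)


def cross_entropy(pred_probs, crowd_dists, eps=1e-12):
    """Soft metric: mean cross entropy of predictions against crowd label distributions.

    eps guards against log(0); its value is an arbitrary choice for this sketch.
    """
    total = 0.0
    for probs, target in zip(pred_probs, crowd_dists):
        total -= sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))
    return total / len(pred_probs)


# Made-up example with two items and three classes.
pred = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
gold = [0, 2]
crowd = [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5]]

print(micro_f1(pred, gold))        # 0.5: only the first item's top class matches gold
print(cross_entropy(pred, crowd))  # lower is better
```

In this example the second item is wrong under the hard metric, yet its prediction still places substantial mass on the two classes the crowd was split between, which the soft metric rewards.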



1. Submissions must be made before the end of the evaluation phase.

2. You may submit a total of 5 submissions to "Evaluation Phase".

3. You may submit results for any number of tasks/datasets (at least one dataset), but the overall leaderboard will report the average F1 result across datasets.

4. Use a single crowd learning methodology or framework across all the tasks and datasets. For example, if your framework is to aggregate the crowd labels using Majority Voting and train on the aggregated labels, use this same methodology for all of the tasks you participate in.

5. You must incorporate the crowd labels into your training framework.

6. You may not create multiple accounts or belong to several teams.
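Rule 4 names Majority Voting as one example of a crowd learning methodology. A minimal sketch of that aggregation step follows, assuming crowd labels arrive as a list of labels per item; the function name `majority_vote`, the item identifiers, and the tie-breaking choice are assumptions for illustration and not part of the task definition.

```python
from collections import Counter


def majority_vote(crowd_labels):
    """Return the most frequent label for one item.

    Ties are broken in favour of the label that appears first in the input,
    which is how Counter.most_common orders equal counts (Python 3.7+).
    """
    if not crowd_labels:
        raise ValueError("item has no crowd labels")
    return Counter(crowd_labels).most_common(1)[0][0]


# Made-up crowd annotations keyed by hypothetical item identifiers.
items = {
    "item_1": ["noun", "noun", "verb"],
    "item_2": ["adj", "noun", "adj", "adj"],
}
aggregated = {item_id: majority_vote(labels) for item_id, labels in items.items()}
print(aggregated)  # {'item_1': 'noun', 'item_2': 'adj'}
```

Under rule 4, whichever aggregation or learning scheme is chosen would then be applied unchanged to every dataset a team participates in.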

Alexandra Uma - Queen Mary University of London, United Kingdom

Anca Dumitrache - Talpa Network, Netherlands

Tommaso Fornaciari - Bocconi University, Italy

Edwin Simpson - University of Bristol, United Kingdom

Jon Chamberlain - University of Essex, United Kingdom

Silviu Paun - Queen Mary University of London, United Kingdom

Barbara Plank - IT University of Copenhagen, Denmark

Massimo Poesio - Queen Mary University of London, United Kingdom

  • Trial data ready: July 31, 2020
  • Training data ready: October 1, 2020
  • Test data ready: December 3, 2020
  • Evaluation start: January 10, 2021
  • Evaluation end: January 31, 2021
  • Paper submission due: February 23, 2021
  • Notification to authors: March 29, 2021
  • Camera ready due: April 5, 2021
  • SemEval workshop: Summer 2021

For inquiries, please contact us at

Practice Phase

Start: July 31, 2020, midnight

Description: This is the practice phase of the competition. In this phase, you are provided with datasets for tasks, starting with two tasks - PDIS and CIFAR10-IC. You are expected to craft novel approaches for training the models using the crowd labels. You may use the base models in as a starting point.

Evaluation Phase

Start: Jan. 10, 2021, midnight

Description: In this phase of the competition, the test data is released. Participants are expected to use the models trained in the practice phase to make predictions on the test data and make submissions. A total of 5 submissions are permitted. The submissions are then tested using the test dataset.


Post-Competition Phase

Start: Jan. 31, 2021, midnight

Description: The official competition ends with the evaluation phase. However, the post-competition phase allows you to continue refining and testing your models.


