CODI-CRAC 2021 Shared-Task: Anaphora Resolution in Dialogues

Organized by sopank


Official Ranking: https://docs.google.com/document/d/172Hp24wKTbaaY1veVOLDJg2zWtYCaHiNhLc6RO1Ubjw/edit?usp=sharing

**Updates:** 

Cross-team Analysis (Analysis reports due Sept 6).

We have sent access invites for the folders containing the files relevant to the cross-team analysis to the teams (team creators)/participants who submitted predictions to the evaluation leaderboards. If you made a submission to the eval leaderboard but did not receive an invite, please email us at sopank@andrew.cmu.edu.

Participants who did not submit an official evaluation prediction but would like to take part in the cross-team analysis should also contact us at sopank@andrew.cmu.edu.

 

************

We request that all authors of system papers include one table per system/track/setting in their report, with the following information:

- Track (coref/deixis/bridging)
- Setting (predicted or gold mentions?)
- Baseline(s) used/modified
- Learning framework (i.e., modifications to the baseline(s), including modifications to both training and decoding)
- Markable identification model
- Data used for training
- Data used for development

 

****************

**Fully annotated test data available now (Participate -> Get data)!**

**************** 

Submissions of System Descriptions and Analysis Papers

 

One unique aspect of this shared task is that we are inviting two types of submissions: first, system descriptions (due August 3), and then analysis papers (due September 6).

 

All participants who have taken part in at least one track are invited to submit a system description of up to 5 pages, plus references. Participants in multiple tracks may add 2 extra pages per additional track (i.e., 2 extra pages for 2 tracks, or 4 extra pages for 3 tracks). Submissions may also add up to 2 pages per track for error analysis; this is optional but highly encouraged. Submissions should conform to the EMNLP 2021 format. System descriptions are due on August 3 to the shared task softconf site: https://www.softconf.com/emnlp2021/CODICRAC2021/. Note that submitted drafts will be made publicly available to author teams participating in the Analysis Paper call.

 

We are also making an open call for what we are calling Analysis Papers. The purpose of these papers is to present a vision statement for the field, based either on a cross-cutting comparison across the submitted systems or on a critique of the shared task itself, pointing to gaps in the field not addressed in this shared task. Analysis papers may focus on just one track or on multiple tracks. By August 10, we will make available the submitted system descriptions and a performance table that reports the output on each instance for each of the participating systems. Analysis papers can be between 4 and 6 pages, should conform to the EMNLP 2021 format, and are due on September 6 to the shared task softconf site: https://www.softconf.com/emnlp2021/CODICRAC2021/.

Link to the style files: https://2021.emnlp.org/call-for-papers/style-and-formatting

Please feel free to reach out if you have questions or concerns: sharedtask-codicrac-emnlp2021@googlegroups.com

****************

  1. ALL TEST SETS available now for Eval - Br (Gold) and Eval - DD (Gold). Participate -> Get data!
  2. ALL TEST SETS AVAILABLE NOW for Eval - AR, Eval - Br (Pred), and Eval - DD (Pred).
  3. Deadline for Eval - AR, Eval - Br (Pred), and Eval - DD (Pred) extended to July 10, 2021.
  4. AMI, Persuasion, LIGHT, and Swbd test sets for Eval - AR, Eval - Br (Pred), and Eval - DD (Pred) available now (Participate -> Get data)!
  5. Task description document available here!
  6. Persuasion dev set available now (Participate -> Get data)!
  7. Birds of a Feather session at NAACL at 2pm EST on June 8 (Zoom and other details: Participate -> Get data)!
  8. Baseline and helper scripts released (Participate -> Get data)!
  9. AMI dev set available now (Participate -> Get data)!
  10. LIGHT dev set available now (Participate -> Get data)!

 

Organizers: 

Sopan Khosla (Carnegie Mellon University), Ramesh Manuvinakurike (Intel Labs), Vincent Ng (University of Texas at Dallas), Massimo Poesio (Queen Mary University of London), Michael Strube (Heidelberg Institute for Theoretical Studies), Carolyn Rosé (Carnegie Mellon University)

 

Contact Email: sharedtask-codicrac-emnlp2021@googlegroups.com

 

Welcome to the shared task on Anaphora Resolution in Dialogues. This shared task provides: 

  • Three Tracks 

    • Resolution of anaphoric identity

    • Resolution of bridging references

    • Resolution of discourse deixis/abstract anaphora

  • New paradigm: two-stage shared task to facilitate community-wide visioning

  • New emphasis on less-studied forms of anaphora: Abstract and Bridging

  • New Genre: Conversation

  • New computational techniques: transfer of learned representations across genres

  • New opportunities for interaction between communities: Discourse and Dialogue

  • New data set

This shared task is jointly run through CRAC 2021 and CODI 2021 at EMNLP 2021.

The first release of the data is coming on March 26, 2021!

Birds of a Feather

The Future of Anaphora: Birds of a Feather Session for CODI-CRAC Shared Task at NAACL

We will host a Birds of a Feather session at NAACL at 2pm EST on June 8 (Zoom details: Participate -> Get data)! At this session, we will provide tips for using the provided baselines and scorer code, as well as answer any questions participants might have. All are welcome to participate in the dawning of a new era for anaphora research!

Background

Coreference and anaphora resolution is a long-studied problem in computational linguistics and NLP. Although multiple benchmark datasets have been developed in recent years, the progress in this area has been hindered because most of these corpora do not emphasize potentially difficult cases. For example, datasets like OntoNotes [1], GAP [2], and LitBank [3] only focus on identity coreference and neglect relations like discourse deixis [9] or bridging anaphora [10], both of which introduce interesting research challenges.

Several works have shown the importance of syntax for anaphora resolution [4,5,6,7]. However, such features might not generalize well to conversations, where the language is often ungrammatical and disfluent. In addition, anaphora resolution in dialogue requires systems to ground pronouns to speakers and to track long-distance conversation structure, complexities that are largely absent from the news and Wikipedia articles that make up a large portion of current state-of-the-art coreference resolution datasets.

This shared task goes above and beyond the simple cases of coreference resolution that arguably lead to overestimates of the performance of current SOTA models. Its goal is to bring together researchers from disciplines such as discourse analysis, dialogue systems, machine learning, and linguistics, in order to pave the way for advances in coreference and anaphora resolution.

 

Task

In this shared task, you will contribute approaches/models for addressing three types of anaphoric relations. The shared task is therefore structured into three sub-tasks. You have the option to participate in one or more of these sub-tasks. The three sub-tasks include:

  1. Resolution of anaphoric identity

  2. Resolution of bridging references

  3. Resolution of discourse deixis/abstract anaphora

 

Data

The data for the shared task includes conversations from five different domains:

  • ARRAU (Trains_91): Dev set available now!

  • Switchboard: Dev set available now!

  • AMI: Dev set available now!

  • Persuasion: Dev set available now!

  • Light: Dev set available now!

Please register for this shared-task to get access to the data! (Dataset details available in "Participate" -> "Get Data")

Since the main aim of this shared task is to encourage generalizable models, we will release only the dev and test sets for each of the five domains mentioned above. However, participants are free to use ARRAU_PEAR, ARRAU_RST, ARRAU_Trains93, ARRAU_GNOME, and other external data to train their models.

Annotation Format

The datasets for the shared task will be released in the Universal Anaphora format. We encourage participants to refer to the most up-to-date documentation of the annotation format here.

Sponsors

The annotation of the data was co-sponsored by the Heidelberg Institute for Theoretical Studies gGmbH (HITS) and DALI.

Submission Format

Predictions for each dataset (in each phase) should be placed in a separate directory named after the dataset. These directories should then be placed in a main directory, which is zipped (recursively) and submitted to CodaLab.

Expected Directory Structure:

./Solution

  |__ ARRAU

      |__ prediction_file

  |__ Switchboard

      |__ prediction_file

  |__ AMI

      |__ prediction_file

  |__ Persuasion

      |__ prediction_file

  |__ Light 

      |__ prediction_file

 

Participants are free to submit predictions for one or more datasets. However, the leaderboard will be updated using the latest submission and will not carry over scores from previous submissions.

For example, if a participant's first submission contains only predictions for ARRAU, i.e. ./solution/ARRAU/ARRAU_preds, with the other dataset folders empty, the leaderboard will show the prediction score for ARRAU and 0.0 for the other datasets. If they then want to submit results for Switchboard, the correct way is to submit both ./solution/ARRAU/ARRAU_preds (their best ARRAU predictions) and ./solution/Switchboard/Swbd_preds (the new Switchboard predictions), so that the leaderboard displays non-zero scores for both datasets.
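
As an illustration, the following minimal Python sketch assembles the expected directory layout and zips it recursively. The prediction file names are hypothetical (they follow the example above); only the dataset folder names are prescribed by the structure shown earlier.

```python
# Minimal sketch for packaging a submission (prediction file names are hypothetical).
import shutil
from pathlib import Path

datasets = ["ARRAU", "Switchboard", "AMI", "Persuasion", "Light"]
root = Path("Solution")

# Create one directory per dataset inside the main submission directory.
for name in datasets:
    (root / name).mkdir(parents=True, exist_ok=True)

# Copy your prediction files into the matching dataset folders, e.g.:
# shutil.copy("ARRAU_preds", root / "ARRAU")
# shutil.copy("Swbd_preds", root / "Switchboard")

# Zip the main directory recursively; upload the resulting Solution.zip to CodaLab.
shutil.make_archive("Solution", "zip", root_dir=".", base_dir="Solution")
```

Remember that, as noted above, any dataset whose folder is left empty will be scored as 0.0 on the leaderboard for that submission.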

 

Evaluation

We will evaluate the submissions for anaphoric identity and discourse deixis using CoNLL Avg. F1 score [1]. For bridging, we will report Entity F1 scores. 
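
For reference, assuming the standard CoNLL-2012 definition [1], the CoNLL average is the unweighted mean of the MUC, B³, and CEAF_e F1 scores:

$$\text{CoNLL Avg. F1} = \frac{F_1^{\text{MUC}} + F_1^{\text{B}^3} + F_1^{\text{CEAF}_e}}{3}$$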

The shared-task scorer is derived from the universal-anaphora-scorer, and we encourage participants to use this repository to evaluate their models locally.

 

Timeline

  • Mar 26, 2021 - Training and development data released.

  • May 28 - Baselines and helper scripts released.

  • June 8 - Birds of a Feather session at NAACL 2021.

  • June 21 - Test data for Eval - AR, Eval - Br (Pred), and Eval - DD (Pred) released.

  • July 10 - Submission deadline Eval - AR, Eval - Br (Pred), and Eval - DD (Pred).

  • July 11 - Test data for Eval - Br (Gold) and Eval - DD (Gold) released.

  • July 21 - Submission deadline Eval - Br (Gold) and Eval - DD (Gold).

  • Aug 3 - System descriptions due (Stage 1).

  • Aug 4 - Error-analysis and cross-team discussions start (Stage 2).

  • Sep 6 - Analysis reports due (Stage 2).

  • Sep 20 - Accept/reject notifications.

  • Oct 1 - Camera-ready version due.

  • Nov 7-11 - EMNLP 2021.

References


[1] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task.
[2] Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. Transactions of the Association for Computational Linguistics.
[3] David Bamman, Olivia Lewke, and Anya Mansoor. 2019. An annotated dataset of coreference in English literature. arXiv preprint arXiv:1912.01140.
[4] Jerry Hobbs. 1986. Resolving pronoun references. Readings in Natural Language Processing.
[5] Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
[6] Amir Zeldes and Shuo Zhang. 2016. When annotation schemes change rules help: A configurable approach to coreference resolution beyond OntoNotes. In Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016).
[7] Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics.
[8] Nafise Sadat Moosavi, Leo Born, Massimo Poesio, and Michael Strube. 2019. Using automatically extracted minimum spans to disentangle coreference evaluation from boundary detection. arXiv preprint arXiv:1906.06703.
[9] Bonnie Webber. 1988. Discourse deixis: Reference to discourse segments. In 26th Annual Meeting of the Association for Computational Linguistics.
[10] Massimo Poesio, Rahul Mehta, Axel Maroudas, and Janet Hitzeman. 2004. Learning to resolve bridging references. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain.
[11] Massimo Poesio, Yulia Grishina, Varada Kolhatkar, Nafise Sadat Moosavi, Ina Roesiger, Adam Roussel, Fabian Simonjetz, et al. 2018. Anaphora resolution with the ARRAU corpus. In Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference.


Terms and Conditions

Participants should not share this data outside the shared-task!

Dev - AR

Start: March 26, 2021, midnight

Description: Anaphora Resolution - Train your model on the official training set for Anaphora Resolution. Feel free to use external datasets or knowledge sources. Submit results on the validation data.

Dev - Br

Start: March 26, 2021, midnight

Description: Bridging - Train your model on the official training set for Bridging. Feel free to use external datasets or knowledge sources. Submit results on the validation data.

Dev - DD

Start: March 26, 2021, midnight

Description: Discourse Deixis - Train your model on the official training set for Discourse Deixis. Feel free to use external datasets or knowledge sources. Submit results on the validation data.

Eval - AR

Start: June 21, 2021, midnight

Description: Anaphora Resolution - Evaluate your model on the official test set. You can use the validation set to aid with training.

Eval - Br (Gold)

Start: July 11, 2021, midnight

Description: Bridging - Evaluate your model on the official test set using gold mentions. You can use the validation set to aid with training.

Eval - Br (Pred)

Start: June 21, 2021, midnight

Description: Bridging - Evaluate your model on the official test set using system mentions. You can use the validation set to aid with training.

Eval - DD (Gold)

Start: July 11, 2021, midnight

Description: Discourse Deixis - Evaluate your model on the official test set using gold mentions. You can use the validation set to aid with training.

Eval - DD (Pred)

Start: June 21, 2021, midnight

Description: Discourse Deixis - Evaluate your model on the official test set using system mentions. You can use the validation set to aid with training.

Competition Ends

July 21, 2021, midnight
