Multi-Hop Inference Explanation Regeneration (TextGraphs-14)

Organized by dustalov


TextGraphs-14 Shared Task on Multi-Hop Inference Explanation Regeneration

Multi-hop inference is the task of combining more than one piece of information to solve an inference task, such as question answering. This can take many forms, from combining free-text sentences read from books or the web, to combining linked facts from a structured knowledge base. The Shared Task on Explanation Regeneration asks participants to develop methods to reconstruct gold explanations for elementary science questions, using a new corpus of gold explanations that provides supervision and instrumentation for this multi-hop inference task. Each explanation is represented as an “explanation graph”, a set of atomic facts (between 1 and 16 per explanation, drawn from a knowledge base of 5,000 facts) that, together, form a detailed explanation of the reasoning required to answer a question. Achieving strong performance at rebuilding the gold explanation graphs will require methods that perform multi-hop inference. The explanations include both core scientific facts and detailed world knowledge, allowing this task to appeal to those interested in both multi-hop reasoning and common-sense inference.
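For intuition, here is a small, purely illustrative sketch of one explanation graph as a data structure; the question, answer, fact texts, IDs, and role labels are invented for the example and do not follow the official Worldtree file format.

    # Purely illustrative sketch of an explanation graph for one question:
    # a small set of atomic facts that, combined, explain the answer.
    # The IDs, fact texts, and role labels below are invented for this
    # example and do not follow the official Worldtree file format.
    explanation_graph = {
        "question": "Which form of energy does a plant use to make food?",
        "answer": "sunlight",
        "facts": [
            {"id": "FACT-0001", "role": "CENTRAL",
             "text": "a plant requires sunlight to perform photosynthesis"},
            {"id": "FACT-0002", "role": "CENTRAL",
             "text": "photosynthesis means a plant produces food from sunlight, water, and carbon dioxide"},
            {"id": "FACT-0003", "role": "GROUNDING",
             "text": "sunlight is a kind of light energy"},
        ],
    }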

Important Dates

  • 2020-03-06: Training data release
  • 2020-04-06: Test data release; Evaluation start
  • 2020-09-21 (originally 2020-05-06): Evaluation end
  • 2020-10-02 (originally 2020-05-20): System description paper deadline
  • 2020-10-18 (originally 2020-06-10): Deadline for reviews of system description papers
  • 2020-10-25 (originally 2020-06-24): Author notifications
  • 2020-11-01 (originally 2020-07-11): Camera-ready description paper deadline
  • 2020-12-13 (originally 2020-09-14): TextGraphs-14 workshop

Dates are specified in the ISO 8601 format.

Getting Started

The shared task data distribution includes a baseline that uses a term frequency model (tf-idf) to rank how likely table row sentences are to be part of a given explanation. This baseline achieves 0.255 MAP on the development partition. The complete code and task description are available at https://github.com/cognitiveailab/tg2020task.
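As a rough sketch of how a tf-idf ranker of this kind can be assembled with scikit-learn (this is not the official baseline; the fact sentences and question string are placeholders for the actual Worldtree table rows and question files):

    # Minimal tf-idf ranking sketch (not the official baseline implementation).
    # The fact sentences and question text are placeholders; load them from the
    # Worldtree tables and question files distributed with the shared task.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    facts = [
        "a plant requires sunlight to perform photosynthesis",
        "sunlight is a kind of light energy",
        "an animal requires food for survival",
    ]
    question = "Which form of energy does a plant use to make food? (A) sunlight"

    # Fit tf-idf on the fact sentences and project the question into the same space.
    vectorizer = TfidfVectorizer()
    fact_matrix = vectorizer.fit_transform(facts)
    question_vector = vectorizer.transform([question])

    # With L2-normalised tf-idf vectors, the linear kernel equals cosine similarity.
    scores = linear_kernel(question_vector, fact_matrix).ravel()
    for rank, index in enumerate(scores.argsort()[::-1], start=1):
        print(rank, round(float(scores[index]), 3), facts[index])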

To prepare a submission file for CodaLab, create a ZIP archive containing your predict.txt for the test dataset (cf. make predict-tfidf-test.zip).
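If you would rather assemble the archive without the Makefile target, the sketch below writes a predict.txt and zips it. It assumes the tab-separated (question ID, fact ID) layout produced by the baseline, with each question’s facts listed in ranked order; confirm the exact format against the repository README. The IDs shown are placeholders.

    # Sketch of assembling a CodaLab submission without the Makefile target.
    # Assumes predict.txt holds tab-separated (question ID, fact ID) pairs,
    # one per line, with each question's facts in ranked order; confirm the
    # exact layout against the repository README. The IDs are placeholders.
    import zipfile

    predictions = {
        "Q-0001": ["FACT-0001", "FACT-0003", "FACT-0002"],
    }

    with open("predict.txt", "w") as handle:
        for question_id, ranked_facts in predictions.items():
            for fact_id in ranked_facts:
                handle.write(f"{question_id}\t{fact_id}\n")

    with zipfile.ZipFile("predict-test.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write("predict.txt")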

Citation

@inproceedings{Jansen:19,
  author    = {Jansen, Peter and Ustalov, Dmitry},
  title     = {{TextGraphs~2019 Shared Task on Multi-Hop Inference for Explanation Regeneration}},
  booktitle = {Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)},
  year      = {2019},
  pages     = {63--77},
  doi       = {10.18653/v1/D19-5309},
  isbn      = {978-1-950737-86-4},
  address   = {Hong Kong},
  publisher = {Association for Computational Linguistics},
  language  = {english},
}

Replicability

To encourage transparency and replicability, all teams must publish their code, tuning procedures, and instructions for running their models with their submission of shared task papers.

Contacts

We share the code on GitHub at https://github.com/cognitiveailab/tg2020task and the data at http://cognitiveai.org/explanationbank/.

We welcome questions and discussion about the shared task on the CodaLab forums.

Evaluation

Participating systems will be evaluated using mean average precision (MAP) on the explanation reconstruction task. The example code provided calculates MAP both overall and broken down into specific sub-measures (e.g. the role each sentence plays in an explanation, and whether a sentence has lexical overlap with the question or answer).
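For reference, a minimal sketch of the MAP computation is given below; the official evaluation script in the task repository remains authoritative, and the question and fact IDs are placeholders.

    # Sketch of mean average precision (MAP) over ranked explanation facts.
    # The official evaluation script in the task repository is authoritative;
    # the question and fact IDs below are placeholders.
    def average_precision(ranked_facts, gold_facts):
        """Average precision of one ranked fact list against the gold fact set."""
        hits, precision_sum = 0, 0.0
        for rank, fact_id in enumerate(ranked_facts, start=1):
            if fact_id in gold_facts:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(gold_facts) if gold_facts else 0.0

    def mean_average_precision(predictions, gold):
        """MAP over all questions; both arguments map question IDs to fact IDs."""
        return sum(average_precision(predictions[q], gold[q]) for q in gold) / len(gold)

    gold = {"Q-0001": {"FACT-0001", "FACT-0002"}}
    predictions = {"Q-0001": ["FACT-0001", "FACT-0003", "FACT-0002"]}
    print(mean_average_precision(predictions, gold))  # (1/1 + 2/3) / 2 ≈ 0.83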

Participants are also encouraged, but not required, to report the following measures with their systems:

  1. A histogram of explanation reconstruction performance (MAP) versus the length of the gold explanation being reconstructed (see the sketch after this list)
  2. Overall QA accuracy, as well as explanation reconstruction accuracy for correctly answered questions, if the data is also used to perform the QA task
  3. Though the Worldtree corpus was constructed to automate explanation evaluation, it is still possible some facts may be highly relevant but not included in an explanation. An error analysis of the final system is strongly encouraged to determine the proportion of errors that are genuine errors of various categories, and the proportion of errors that are “also good” explanation sentences.
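A minimal sketch of the per-length breakdown from measure 1 follows; the IDs are placeholders, and the small average-precision helper from the MAP sketch above is repeated so the snippet stands alone.

    # Sketch of measure 1: explanation reconstruction performance (average
    # precision) grouped by the number of facts in the gold explanation.
    # Question and fact IDs are placeholders; the average_precision helper
    # is repeated from the MAP sketch above so this snippet stands alone.
    from collections import defaultdict

    def average_precision(ranked_facts, gold_facts):
        hits, precision_sum = 0, 0.0
        for rank, fact_id in enumerate(ranked_facts, start=1):
            if fact_id in gold_facts:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(gold_facts) if gold_facts else 0.0

    def map_by_gold_length(predictions, gold):
        """MAP per gold-explanation length; inputs map question IDs to fact IDs."""
        buckets = defaultdict(list)
        for question_id, gold_facts in gold.items():
            buckets[len(gold_facts)].append(
                average_precision(predictions[question_id], gold_facts))
        return {length: sum(aps) / len(aps) for length, aps in sorted(buckets.items())}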


Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the TextGraphs-14 workshop and in the associated proceedings, at the task organizers’ discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers’ judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition’s rules. Inclusion of a submission’s scores is not an endorsement of a team or individual’s submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to use or redistribute the shared task data except in the manner prescribed by its licence.


Phases

  • Practice: starts March 1, 2020, midnight UTC
  • Evaluation: starts April 6, 2020, midnight UTC
  • Post-Competition: starts Sept. 22, 2020, midnight UTC
  • Competition end: never

Leaderboard

  # Username   Score
  1 alvysinger 0.5815
  2 webbley    0.5809
  3 aisys      0.5074