Multi-Hop Inference Explanation Regeneration (TextGraphs-13)

Organized by dustalov


April 18, 2019, midnight UTC


July 12, 2019, midnight UTC


Competition Ends

TextGraphs-13 Shared Task on Multi-Hop Inference Explanation Regeneration

Multi-hop inference is the task of combining more than one piece of information to solve an inference task, such as question answering. This can take many forms, from combining free-text sentences read from books or the web, to combining linked facts from a structured knowledge base. The Shared Task on Explanation Regeneration asks participants to develop methods to reconstruct gold explanations for elementary science questions, using a new corpus of gold explanations that provides supervision and instrumentation for this multi-hop inference task. Each explanation is represented as an “explanation graph”, a set of atomic facts (between 1 and 16 per explanation, drawn from a knowledge base of 5,000 facts) that together form a detailed explanation of the reasoning required to answer a question. Linking these facts well enough to rebuild the gold explanation graphs will require methods that perform multi-hop inference. The explanations include both core scientific facts and detailed world knowledge, so the task should appeal to those interested in multi-hop reasoning as well as common-sense inference.

Important Dates

  • 20-05-2019: Training data release
  • 12-07-2019: Test data release; Evaluation start
  • 09-08-2019: Evaluation end
  • 19-08-2019: System description paper deadline
  • 11-09-2019: Deadline for reviews of system description papers
  • 19-09-2019: Author notifications
  • 30-09-2019: Camera-ready description paper deadline
  • 03-11-2019/04-11-2019: TextGraphs-13 workshop

Getting Started

The shared task data distribution includes a baseline that uses a term frequency model (tf-idf) to rank how likely table row sentences are to be a part of a given explanation. The performance of this baseline on the development partition is 0.054 MAP. The complete code and task description are available at
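As a rough illustration of the idea behind such a baseline, the sketch below ranks candidate facts against a question by tf-idf cosine similarity, using only the Python standard library. It is not the baseline shipped with the task: the toy facts, their IDs, the smoothed idf formula, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer; a real system would do better.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption; other weighting schemes are equally valid).
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_facts(question, facts):
    """Rank fact IDs by similarity to the question, most similar first."""
    ids = list(facts)
    vectors, idf = tfidf_vectors([tokenize(facts[i]) for i in ids])
    # Question terms that never occur in any fact are ignored.
    query = {t: tf * idf[t] for t, tf in Counter(tokenize(question)).items() if t in idf}
    scored = [(cosine(query, vec), i) for vec, i in zip(vectors, ids)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy, made-up facts (not taken from the task's knowledge base).
facts = {
    "fact-1": "the sun is a kind of star",
    "fact-2": "a plant requires sunlight to grow",
    "fact-3": "ice is a kind of solid",
}
ranking = rank_facts("what does a plant need to grow", facts)
print(ranking[0])  # the fact sharing the most terms with the question
```

A purely lexical ranker like this can only retrieve facts that share words with the question, which is one reason the task calls for multi-hop methods.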

First, you need to download the training and development datasets (the test dataset is held out):

$ make dataset

Then, you can run the baseline program in Python to make predictions:

$ ./ annotation/expl-tablestore-export-2017-08-25-230344/tables questions/ARC-Elementary+EXPL-Dev.tsv > predict.txt

The predict.txt file has no header and contains one questionID<TAB>explanationID pair per line; the order is important. We also offer the same evaluation code as used on CodaLab:
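A minimal sketch of producing a file in this format from Python follows. It assumes that, for each question, the explanation IDs are written best first, so that line order encodes the ranking; the question and explanation IDs and the function name are hypothetical.

```python
import io


def write_predictions(rankings, stream):
    """Write one questionID<TAB>explanationID pair per line, with no header.

    `rankings` maps each question ID to its explanation IDs, best first
    (assumption: the order of lines is what carries the ranking).
    """
    for question_id, explanation_ids in rankings.items():
        for explanation_id in explanation_ids:
            stream.write(f"{question_id}\t{explanation_id}\n")


# Hypothetical IDs, for illustration only.
rankings = {"Q1": ["fact-2", "fact-3"], "Q2": ["fact-1"]}
buffer = io.StringIO()
write_predictions(rankings, buffer)
output = buffer.getvalue()
print(output, end="")
```

To write an actual file, the same function can be given a handle opened with `open("predict.txt", "w", encoding="utf-8")` instead of the in-memory buffer.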

$ ./ --gold=questions/ARC-Elementary+EXPL-Dev.tsv predict.txt

In order to prepare a submission file for CodaLab, create a ZIP file containing your predict.txt, cf. make
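For those who prefer to build the archive from Python, a sketch using the standard `zipfile` module is shown below. The archive name `submission.zip` and the function name are assumptions; the only requirement taken from the text above is that the ZIP file contains `predict.txt`.

```python
import os
import tempfile
import zipfile


def make_submission(prediction_path, archive_path):
    """Pack the predictions file into a ZIP archive under the name predict.txt."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname stores the file at the archive root, without its directory path.
        archive.write(prediction_path, arcname="predict.txt")


# Demonstration in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    prediction_path = os.path.join(workdir, "predict.txt")
    with open(prediction_path, "w", encoding="utf-8") as handle:
        handle.write("Q1\tfact-2\n")  # hypothetical IDs
    archive_path = os.path.join(workdir, "submission.zip")  # hypothetical name
    make_submission(prediction_path, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
```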


  author    = {Jansen, Peter and Ustalov, Dmitry},
  title     = {{TextGraphs~2019 Shared Task on Multi-Hop Inference for Explanation Regeneration}},
  booktitle = {Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)},
  year      = {2019},
  pages     = {63--77},
  url       = {},
  isbn      = {978-1-950737-86-4},
  address   = {Hong Kong},
  publisher = {Association for Computational Linguistics},
  language  = {english},


We share the code on GitHub at and the data at

We welcome questions and answers on the shared task on GitHub:


Participating systems will be evaluated using mean average precision (MAP) on the explanation reconstruction task. The example code provided calculates this both overall and broken down into specific sub-measures (e.g. the role of sentences in an explanation, and whether a sentence has lexical overlap with the question or answer).
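For reference, MAP can be sketched in a few lines of Python. This is the textbook definition, not the official scorer, which may differ in details such as how missing questions or duplicate predictions are handled; the function names and toy IDs are assumptions, and each ranked list is assumed to contain no duplicates.

```python
def average_precision(ranked_ids, gold_ids):
    """Average precision of one ranked list against a set of gold IDs."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    # Gold facts that are never retrieved contribute zero.
    return precision_sum / len(gold)


def mean_average_precision(predictions, gold):
    """Mean of per-question AP; questions without predictions score 0."""
    scores = [average_precision(predictions.get(q, []), ids) for q, ids in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up IDs.
gold = {"Q1": {"a", "b"}, "Q2": {"c"}}
predictions = {"Q1": ["a", "x", "b"], "Q2": ["y", "c"]}
score = mean_average_precision(predictions, gold)
print(f"{score:.4f}")
```

In the toy example, Q1 has an average precision of (1/1 + 2/3) / 2 = 5/6 and Q2 has 1/2, so the mean is 2/3.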

Participants are also encouraged, but not required, to report the following measures with their systems:

  1. A histogram of explanation reconstruction performance (MAP) versus the length of the gold explanation being reconstructed
  2. If also using the data to perform the QA task, overall QA accuracy as well as explanation reconstruction accuracy for correctly answered questions
  3. Though the Worldtree corpus was constructed to automate explanation evaluation, it is still possible some facts may be highly relevant but not included in an explanation. An error analysis of the final system is strongly encouraged to determine the proportion of errors that are genuine errors of various categories, and the proportion of errors that are “also good” explanation sentences.
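The first of these optional measures could be produced along the following lines. The sketch assumes per-question average precision scores are already available as a dictionary; the function name, the toy scores, and the text-based rendering are illustrative assumptions.

```python
from collections import defaultdict


def map_by_gold_length(ap_scores, gold):
    """Average the per-question AP scores within each gold explanation length."""
    buckets = defaultdict(list)
    for question_id, gold_ids in gold.items():
        # Questions without a score are counted as 0.
        buckets[len(gold_ids)].append(ap_scores.get(question_id, 0.0))
    return {length: sum(vals) / len(vals) for length, vals in sorted(buckets.items())}


# Made-up scores and gold explanations, for illustration only.
ap_scores = {"Q1": 0.8, "Q2": 0.5, "Q3": 0.2}
gold = {"Q1": {"a", "b"}, "Q2": {"c"}, "Q3": {"d", "e"}}
histogram = map_by_gold_length(ap_scores, gold)

# Crude text rendering: one bar per gold explanation length.
for length, value in histogram.items():
    print(f"{length:2d} facts  {value:.3f}  {'#' * round(value * 40)}")
```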

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the TextGraphs-13 workshop and in the associated proceedings, at the task organizers’ discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers’ judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition’s rules. Inclusion of a submission’s scores is not an endorsement of a team or individual’s submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to use or redistribute the shared task data except in the manner prescribed by its licence.


# Username Score
1 ameyag416 0.5625
2 aisys 0.5501
3 redken 0.4945