Multi-hop inference is the task of combining more than one piece of information to solve an inference task, such as question answering. This can take many forms, from combining free-text sentences read from books or the web to combining linked facts from a structured knowledge base. The Shared Task on Explanation Regeneration asks participants to develop methods to reconstruct gold explanations for elementary science questions, using a new corpus of gold explanations that provides supervision and instrumentation for this multi-hop inference task. Each explanation is represented as an “explanation graph”: a set of atomic facts (between 1 and 16 per explanation, drawn from a knowledge base of 5,000 facts) that, together, form a detailed account of the reasoning required to answer and explain a question. Achieving strong performance at rebuilding the gold explanation graphs will require methods that link these facts through multi-hop inference. The explanations include both core scientific facts and detailed world knowledge, allowing this task to appeal to those interested in both multi-hop reasoning and common-sense inference.
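To make the structure of an explanation graph concrete, the sketch below shows a purely hypothetical example as a Python data structure. The question, answer, fact texts, identifiers, and role labels are all invented for illustration; the actual corpus uses its own fact UIDs and annotation format.

```python
# Hypothetical explanation graph: a question, its correct answer, and the set of
# atomic facts that together explain the reasoning. All values are invented.
explanation_graph = {
    "question": "Which form of energy does a plant use to make food?",
    "answer": "light energy",
    "explanation": [
        {"uid": "fact-001", "role": "CENTRAL",   "text": "photosynthesis means a plant uses light energy to make food"},
        {"uid": "fact-002", "role": "GROUNDING", "text": "a plant is a kind of living thing"},
        {"uid": "fact-003", "role": "GROUNDING", "text": "sunlight is a kind of light energy"},
    ],
}
```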
The shared task data distribution includes a baseline that uses a term frequency model (tf-idf) to rank how likely table row sentences are to be part of a given explanation. This baseline achieves 0.054 MAP on the development partition. The complete code and task description are available at https://github.com/umanlp/tg2019task.
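For reference, below is a minimal sketch of such a tf-idf ranking approach. It is not the baseline_tfidf.py script shipped with the task; the function and the example facts are illustrative only.

```python
# Minimal sketch of a tf-idf ranking baseline (not the shipped baseline_tfidf.py):
# rank every tablestore fact by cosine similarity to the question text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_facts(question_text, fact_texts):
    """Return fact indices sorted from most to least similar to the question."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the facts and the question together so they share one vocabulary.
    matrix = vectorizer.fit_transform(fact_texts + [question_text])
    fact_vectors, question_vector = matrix[:-1], matrix[-1]
    scores = cosine_similarity(question_vector, fact_vectors).ravel()
    return scores.argsort()[::-1]

# Illustrative usage with invented fact strings:
facts = ["friction is a kind of force", "a plant requires sunlight to grow"]
print(rank_facts("What does a plant need to grow?", facts))
```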
First, you need to download the training and development datasets (the test dataset is held out):
$ make dataset
Then, you can run the baseline program in Python to make predictions:
$ ./baseline_tfidf.py annotation/expl-tablestore-export-2017-08-25-230344/tables questions/ARC-Elementary+EXPL-Dev.tsv > predict.txt
The format of the predict.txt file is questionID<TAB>explanationID without a header; the order of lines is important. We also offer the same evaluation code as used on CodaLab:
$ ./evaluate.py --gold=questions/ARC-Elementary+EXPL-Dev.tsv predict.txt
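If you build your own system, a minimal sketch of writing predictions in this format is shown below. The helper function and its inputs are assumptions for illustration, not part of the shared task code.

```python
# Minimal sketch of writing predictions in the questionID<TAB>explanationID format.
# One line per (question, fact) pair; lines for the same question must be ordered
# from most to least likely explanation sentence, since MAP depends on the ranking.
def write_predictions(path, ranked_facts_by_question):
    """ranked_facts_by_question: dict mapping question ID -> list of fact IDs, best first."""
    with open(path, "w", encoding="utf-8") as handle:
        for question_id, fact_ids in ranked_facts_by_question.items():
            for fact_id in fact_ids:
                handle.write(f"{question_id}\t{fact_id}\n")
```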
In order to prepare a submission file for CodaLab, create a ZIP file containing your predict.txt, cf. make predict-tfidf.zip.
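If you are not using the provided Makefile, the same archive can be produced directly with the standard zip utility (command shown as an assumed equivalent):

$ zip predict-tfidf.zip predict.txt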
@inproceedings{Jansen:19,
author = {Jansen, Peter and Ustalov, Dmitry},
title = {{TextGraphs~2019 Shared Task on Multi-Hop Inference for Explanation Regeneration}},
booktitle = {Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)},
year = {2019},
pages = {63--77},
url = {https://www.aclweb.org/anthology/D19-5309},
isbn = {978-1-950737-86-4},
address = {Hong Kong},
publisher = {Association for Computational Linguistics},
language = {english},
}
We share the code on GitHub at https://github.com/umanlp/tg2019task and the data at http://cognitiveai.org/explanationbank/.
We welcome questions and discussion about the shared task on the GitHub issue tracker: https://github.com/umanlp/tg2019task/issues.
Participating systems will be evaluated using mean average precision (MAP) on the explanation reconstruction task. The provided evaluation code computes this score both overall and broken down into specific sub-measures (e.g., by the role of sentences in an explanation, and by whether a sentence has lexical overlap with the question or answer).
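As a reminder of what the metric measures, here is a minimal sketch of mean average precision over gold explanation sets; the official evaluate.py script remains the reference implementation.

```python
# Minimal sketch of mean average precision (MAP) for explanation reconstruction.
def average_precision(ranked_fact_ids, gold_fact_ids):
    """Average precision of one ranked list against the gold explanation set."""
    hits, precision_sum = 0, 0.0
    for rank, fact_id in enumerate(ranked_fact_ids, start=1):
        if fact_id in gold_fact_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_fact_ids) if gold_fact_ids else 0.0

def mean_average_precision(predictions, gold):
    """predictions: dict question ID -> ranked fact IDs; gold: dict question ID -> gold fact ID set."""
    return sum(average_precision(predictions[q], gold[q]) for q in gold) / len(gold)
```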
Participants are also encouraged, but not required, to report these sub-measures with their systems.
By submitting results to this competition, you consent to the public release of your scores at the TextGraphs-13 workshop and in the associated proceedings, at the task organizers’ discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers’ judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition’s rules. Inclusion of a submission’s scores is not an endorsement of a team or individual’s submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
You agree not to use or redistribute the shared task data except in the manner prescribed by its licence.
# | Username | Score
---|---|---
1 | ameyag416 | 0.5625
2 | aisys | 0.5622
3 | redken | 0.4945