Multi-Hop Inference Explanation Regeneration (TextGraphs-15) Forum


> Evaluation Criteria: Just checking...

Looking at the codebase for the TextGraphs-15 Shared Task, it appears that this year's evaluation criterion is intended to reward competitors who can correctly identify the 'hard statements' that connect the questions to the answers. This matches the intention expressed during TextGraphs-14 to encourage competitors to find the 'hard statements' rather than just explore those that are textually similar to the Q&A text.

However, it is a little surprising that there now seems to be little incentive to determine the 'core explanation' logic, since reward is given for statements that the experts liked, independent of whether they were in the gold explanation given in the explanations training set. This would seem to convert the Shared Task into a big classification (or score regression) problem, where competitors are most likely to focus on guessing what the expert raters thought of each statement individually, rather than as a sequence of connected statements that form a cohesive explanation.

This Colab quickly shows that getting the ranking right is a bigger win than recovering a good (in this case, the 'gold') explanation:

https://colab.research.google.com/drive/1uexs4-ir0E9dbAsGPbCUJDAhmx0nwRx-?usp=sharing
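As a rough illustration of the point (a toy sketch, not the official scorer: it assumes the metric is an NDCG-style score over the expert relevancy ratings, and the ratings below are made up):

```python
# Toy sketch only -- NOT the official evaluate.py. Assumes an NDCG-style
# metric over graded expert relevancy ratings.
import math

def ndcg(relevancies):
    """NDCG of a ranked list of graded relevancy ratings (higher = better)."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0

# Hypothetical expert ratings for six candidate facts (3 = highly relevant).
well_ranked  = [3, 3, 2, 1, 0, 0]   # highly rated facts placed at the top
badly_ranked = [0, 0, 1, 2, 3, 3]   # same facts, highly rated ones ranked last

print(ndcg(well_ranked))    # 1.0
print(ndcg(badly_ranked))   # ~0.52 -- ordering dominates the score
```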

Can we confirm that the `evaluate.py` script is as intended (where it does not refer to the gold explanation statement sequence)? It would seem that adding something like `if int(data['isGoldWT21'])>0` just before https://github.com/cognitiveailab/tg2021task/blob/main/evaluate.py#L26 would make the criterion more in line with requiring competitors to come up with a good logical explanation, while still capturing the 'hard/important step' aspect...
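For concreteness, a very rough sketch of how that filter might sit in the scoring loop (the field name `isGoldWT21` comes from the ratings file as described above; the other column names are illustrative guesses, and this is not the actual `evaluate.py` code):

```python
# Hypothetical sketch of the suggested gold-only filter -- NOT the actual
# evaluate.py; column names other than 'isGoldWT21' are illustrative.
import csv

def load_expert_ratings(path, gold_only=False):
    """Map (question_id, fact_id) -> expert relevancy rating."""
    ratings = {}
    with open(path, newline='') as f:
        for data in csv.DictReader(f, delimiter='\t'):
            # Proposed change: only score facts that were also part of the
            # WorldTree 2.1 gold explanation for the question.
            if gold_only and not int(data['isGoldWT21']) > 0:
                continue
            ratings[(data['questionID'], data['explanationID'])] = int(data['relevancy'])
    return ratings
```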

It would be great to have confirmation one way or the other, since the strategies we would focus on for this would be rather different.

Posted by: mdda @ March 15, 2021, 5:41 p.m.

Apologies for taking a few days to respond -- this is a very good and interesting comment, and one that we've been wrestling with.

One of the challenges with many-hop multi-hop inference datasets (like WorldTree, where the explanations for a single inference problem can contain 16 or more facts) is relevance versus completeness. From an end-user perspective, we want a system that can generate an explanation that (a) contains all the core/critical facts required to make the inference, and (b) doesn't miss any knowledge or have critical inference gaps. That's essentially what the original WorldTree V1/V2 annotation does -- the explanation authors create a single detailed gold explanation that ideally has full completeness.

From the perspective of us modelers who try to use this data to create end-user systems, this creates a really difficult evaluation problem -- the explanations in WorldTree are huge relative to other multi-hop datasets. What if our model finds a different explanation that's also still good, but not all of its facts were marked as gold, because the original WorldTree explanation authors built their explanation in a slightly or significantly different way (or at a different level of detail)? While we've been wondering how big an issue this is in multi-hop inference, one of the wonderful things we've been able to do from the past two shared tasks is actually measure it: when I've gone back and manually annotated relevance judgments for the top-K retrieved facts from participants' models, it's shown they can be finding highly relevant facts that aren't marked as gold 10-20% of the time. So, for very large multi-hop problems, there is essentially an unsolved evaluation problem in being able to quickly measure both relevance and completeness when doing automatic (non-manual) evaluations.

This relevance vs. completeness measurement problem in many-hop multi-hop inference is unsolved, and right now you can essentially (like Heisenberg) get one but not both. WorldTree V2.1 gives a single complete explanation per question, but does not give exhaustive relevance judgments. The dataset for this shared task gives extremely high quality expert-created relevance judgments for likely candidate facts (as determined by a language model), with graded relevancy ratings (to help address the issue of measuring fact centrality and level of detail), but does not address completeness the way WorldTree V2.1 does. I haven't figured out how to do both yet -- exhaustive relevance and exhaustive completeness in a single dataset -- though I have some ideas to explore to make this more tractable.

So while we did two initial shared tasks on completeness, for this shared task we explore relevance, with the implicit notion that if you do perfectly (or even well) at this exhaustive relevancy task, you'll likely have all the facts you need to make something complete. And you could sort of measure that right now using a union of the WorldTree V2.1 and expert-rating datasets (e.g. take the shortlist a model returns that it determines has non-zero relevancy scores -- does that list contain all the WT2.1 facts (completeness) and all the expert-rated facts (relevancy)?) -- but hopefully we'll develop more explicitly structured ways of doing this in the next few years.
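A minimal sketch of that union-style check (the function and variable names, and the toy fact IDs, are illustrative assumptions, not an official metric):

```python
# Sketch of the "union" check described above: does a model's non-zero-relevancy
# shortlist cover both the WT2.1 gold facts and the expert-rated facts?
def covers_both(predicted_shortlist, gold_wt21_facts, expert_relevant_facts):
    shortlist = set(predicted_shortlist)
    complete = gold_wt21_facts <= shortlist        # all WT2.1 gold facts present
    relevant = expert_relevant_facts <= shortlist  # all expert-rated facts present
    return complete and relevant

# Toy usage with made-up fact IDs:
print(covers_both({'f1', 'f2', 'f3', 'f7'},
                  gold_wt21_facts={'f1', 'f2'},
                  expert_relevant_facts={'f2', 'f3'}))  # True
```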

Posted by: pajansen @ March 17, 2021, 7:28 p.m.