When we apply our existing knowledge to new situations, we demonstrate a kind of understanding of how that knowledge is applied through tasks. Viewed over a conceptual domain, this constitutes a competence. Competence-based evaluation can be seen as a new approach to designing NLP challenges, one that characterizes the underlying operational knowledge a system has of a conceptual domain rather than its performance on individual tasks. In this shared task, we present a challenge that is reflective of the linguistic and cognitive competencies that humans have when speaking and reasoning.
Given the intuition that textual and visual information mutually inform each other for semantic reasoning, we formulate the challenge as a competence-based question answering (QA) task, designed to involve rich semantic annotation and aligned text-video objects.
The Competence-based Multimodal Question Answering task is structured as a set of question-answer pairs querying how well a system understands the semantics of recipes, derived from a collection of cooking recipes and videos. Each question belongs to a "question family" reflecting a specific reasoning competence. The associated R2VQ dataset is designed to test competence-based comprehension by machines over a multimodal recipe collection.
For this task, we have built the R2VQ (Recipe Reading and Video Question Answering) dataset, a collection of recipes sourced from https://recipes.fandom.com/wiki/Recipes_Wiki and foodista.com and labeled according to three distinct annotation layers: (i) Cooking Role Labeling (CRL), (ii) Semantic Role Labeling (SRL), and (iii) aligned image frames taken from Creative Commons cooking videos downloaded from YouTube. It consists of 1,000 recipes, with 800 used for training and 100 recipes each for validation and testing. Participating systems are exposed to this multimodal training set and asked to provide answers to unseen queries, exploiting (i) visual and textual information jointly, or (ii) textual information only.
Each recipe is annotated at the span level for cooking-related Events and their associated Entities (ingredients and props such as tools, containers, and habitats). Ingredients are labeled as either explicit (those listed in the ingredients section of the recipe) or implicit (the intermediate outputs of applying a cooking action to a set of explicit ingredients), with coreference grounding for implicit ingredients (e.g. the implicit ingredient marinade is associated with the cooking event combine(vinegar,soy_sauce,oil)).
CRL is a domain-specific dependency relation annotation for the cooking domain. Each entity is assigned an entity ID in addition to its token ID. Entity IDs appear in the COREF column of the first token of each entity.
The span-level Events and Entities are further annotated for role relations. Events that implicate Entities not explicitly mentioned in the text are marked to reflect that these hidden entities (see the description of hidden arguments below) are necessary to complete the action. These relations are:
The participant relation identifies the constituents of cooking events within the same sentence. In the example below, cardamom_green and ghee are explicit participants of the Fry event, so fry is linked to both cardamom_green and ghee.
The result relation identifies relationships between an event and another entity in the same sentence (a result link cannot be a hidden relation; see below for a description of hidden arguments). In the sentence “Shape with hands into a ball”, the ball is the result of the shape action, which took place on a dropped plum (from an earlier sentence).
The shadow relation expresses a link between events and semantically hidden ingredients. “Cook pasta in a large pot” necessitates water in the pot, which may have been added previously as a hidden argument (see below for a description of hidden arguments).
The tool relation links objects with the events they are used in. Tools may appear in the text (“Cut the pear with a sharp knife”), or they may be hidden (“Cut an apple” requires an unmentioned knife).
Similar to tool, the habitat relation links events with the objects in which they take place. Habitats may appear in the text (“Bake in a preheated oven”), or they may be hidden (“Saute the onion” requires an unmentioned pan).
The drop relation expresses a link between events and syntactically hidden objects, which were typically mentioned in a previous step. In the example below, the serve.6.1.1 event references the pujabi_sawian.5.1.2 from the prior sentence.
Hidden arguments can only appear in the row where the token is the head of the EVENT entity. Each hidden argument is written as Keyword=value, e.g. Drop=mixture, with multiple values separated by : (e.g. Drop=mixture:olive oil) and multiple hidden attributes separated by | (e.g. Drop=mixture:olive oil|Tool=spoon).
A coreference ID is represented as the step_id.sent_id.token_id of the first appearance of the coreferent, e.g. Drop=mixture:olive oil.1.1.3|Tool=spoon (the head of the first appearance of olive oil is the 3rd token of step 1, sentence 1).
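As an illustration, the following minimal sketch parses a hidden-argument field of the form described above; the function name and the output structure are illustrative assumptions, not part of the released tooling:

import re

def parse_hidden_args(field):
    """Parse a hidden-argument field such as 'Drop=mixture:olive oil.1.1.3|Tool=spoon' into
    {'Drop': [('mixture', None), ('olive oil', '1.1.3')], 'Tool': [('spoon', None)]}."""
    parsed = {}
    for attr in field.split("|"):                 # multiple hidden attributes separated by '|'
        keyword, values = attr.split("=", 1)      # e.g. 'Drop' and 'mixture:olive oil.1.1.3'
        entries = []
        for value in values.split(":"):           # multiple values separated by ':'
            # An optional step_id.sent_id.token_id coreference suffix may follow the surface form.
            m = re.match(r"^(.*?)(?:\.(\d+\.\d+\.\d+))?$", value)
            entries.append((m.group(1), m.group(2)))
        parsed[keyword] = entries
    return parsed

print(parse_hidden_args("Drop=mixture:olive oil.1.1.3|Tool=spoon"))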
One of the three layers with which steps in R2VQ are annotated is the Semantic Role Labeling (SRL) layer. SRL is often described informally as the task of automatically answering the question "Who did What to Whom, Where, When, and How?" (Màrquez et al., 2008). More precisely, SRL is usually defined as the task of automatically identifying and labeling argument structures.
Let’s consider the example "John loves Mary". In this case, SRL consists of i) identifying "loves" as a predicate, that is, something that denotes an action or an event; ii) disambiguating the predicate, that is, assigning the most appropriate sense for "loves" in this context; iii) identifying the arguments of each predicate, that is, those parts of the text, "John" and "Mary" that are semantically linked to "loves"; and iv) assigning a semantic role to each predicate-argument pair, e.g., "John" is the Experiencer of the predicate "loves", whereas "Mary" is the Stimulus.
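For concreteness, the four steps above can be pictured as producing a structure like the following (a hypothetical sketch in plain Python; the frame label is an assumed VerbAtlas-style frame, not taken from the dataset):

# (i) predicate identification, (ii) predicate disambiguation,
# (iii) argument identification, (iv) role assignment
srl_output = {
    "sentence": "John loves Mary",
    "predicates": [
        {
            "token": "loves",                 # (i)
            "frame": "LIKE",                  # (ii) assumed VerbAtlas-style frame label
            "arguments": [                    # (iii) + (iv)
                {"span": "John", "role": "Experiencer"},
                {"span": "Mary", "role": "Stimulus"},
            ],
        }
    ],
}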
In the context of our evaluation exercise, we employ SRL in its span-based formulation, tagging the whole span of each argument in a given sentence and not just its syntactic head (e.g., "the broccoli" and not "the"). We chose VerbAtlas (Di Fabio et al., 2019 - http://verbatlas.org/) as our reference inventory of frames and semantic roles and initially labeled the recipes in the dataset automatically, by means of a state-of-the-art system (Conia and Navigli, 2020). Subsequently, we asked human annotators to validate and correct both frames and argument labels to ensure data quality.

Predicate frames: each predicate is labeled according to its VerbAtlas sense/frame in column 10 of the file. A value of '_' means that the corresponding word is not a predicate. In the example below, there is only one predicate, "Cut", with the corresponding sense/frame "CUT" in position 1.
SRL example (omitted some columns for readability):
1 Cut [...] CUT B-V
2 the [...] _ B-Patient
3 broccoli [...] _ I-Patient
4 into [...] _ B-Result
5 flowerets [...] _ I-Result
6 . [...] _ _
Semantic roles: for each predicate, we provide its semantic roles in BIO format (B - Beginning, I - Inside, O - Outside). Note that, for this dataset, we only use B and I to indicate the first token of a span and the rest of the tokens in the same span, respectively. In the example above, "the broccoli" is a Patient of the predicate CUT, with the token "the" as the Beginning of the span (B-Patient) and the token "broccoli" as the Inside of the span (I-Patient). Note that the predicate that refers to a specific column of semantic roles is always labeled with the notation B-V. Should the predicate consist of a multi-word expression, the other tokens apart from the first are labeled as I-V:
1 Deep [...] COOK B-V
2 - [...] _ I-V
3 fry [...] _ I-V
4 till [...] _ B-Result
5 crispy [...] _ I-Result
6 & [...] _ I-Result
7 golden [...] _ I-Result
8 brown [...] _ I-Result
Should the multi-word expression be made of non-adjacent words, tokens apart from the first are instead labeled as D-V:
1 Bring [...] CHANGE_APPEARANCE/STATE B-V
2 the [...] _ B-Patient
3 water [...] _ I-Patient
4 to [...] _ D-V
5 boil [...] _ D-V
6 . [...] _ _
In the case of multiple predicates in the same sentence, there will be multiple semantic role columns, one for each predicate in column 10. For example, if there are two predicates in the sentence, column 11 will indicate the semantic roles for the first predicate, and column 12 will show the semantic roles for the second predicate.
1 Reduce [...] REDUCE_DIMINISH B-V _
2 heat [...] _ B-Attribute _
3 , [...] _ _ _
4 and [...] _ _ _
5 simmer [...] COOK _ B-V
6 for [...] _ _ B-Time
7 1 [...] _ _ I-Time
8 hour [...] _ _ I-Time
9 . [...] _ _ _
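Putting the conventions above together, the following sketch collects the per-predicate role spans from rows like the ones shown. It assumes a simplified view carrying only the token form, the frame column, and one BIO tag per predicate column; the real files contain additional CoNLL-U columns, so this is an illustration rather than a full reader:

def collect_role_spans(rows):
    """Group BIO-tagged role spans per predicate from (form, frame, roles) tuples."""
    n_preds = len(rows[0][2])
    spans = [[] for _ in range(n_preds)]          # one span list per predicate column
    for form, frame, roles in rows:
        for p, tag in enumerate(roles):
            if tag.startswith("B-"):              # start a new span for this predicate
                spans[p].append((tag[2:], [form]))
            elif tag.startswith(("I-", "D-")) and spans[p]:
                spans[p][-1][1].append(form)      # extend the current span
    return spans

rows = [
    ("Reduce", "REDUCE_DIMINISH", ["B-V", "_"]),
    ("heat",   "_",               ["B-Attribute", "_"]),
    (",",      "_",               ["_", "_"]),
    ("and",    "_",               ["_", "_"]),
    ("simmer", "COOK",            ["_", "B-V"]),
    ("for",    "_",               ["_", "B-Time"]),
    ("1",      "_",               ["_", "I-Time"]),
    ("hour",   "_",               ["_", "I-Time"]),
    (".",      "_",               ["_", "_"]),
]
print(collect_role_spans(rows))
# [[('V', ['Reduce']), ('Attribute', ['heat'])], [('V', ['simmer']), ('Time', ['for', '1', 'hour'])]]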
For each cooking action in the R2VQ dataset for which a visual counterpart can be found, a 'key frame triple' corresponding to a representative 3-second segment of a Creative Commons licensed video is included. Video data was sourced both from the YouCook2 dataset and from ad hoc videos found by querying the YouTube API with a given recipe's title. We use the S3D MIL-NCE model for text-to-video retrieval, using spans of text containing a cooking action as input. Due to licensing limitations, coverage is limited and some actions do not have corresponding key frame triples.
The following example frames are taken from recipe r-1948, step 4, event 1. The raw text for this step is "Fry in butter on all sides".
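As a rough illustration of the retrieval step, the sketch below ranks candidate 3-second segments against a text span by cosine similarity; it assumes precomputed joint text/video embeddings such as those produced by an S3D MIL-NCE model, with random vectors standing in for them here:

import numpy as np

def rank_segments(text_emb, segment_embs):
    """Rank candidate video segments by cosine similarity to a text embedding.
    text_emb has shape (d,), segment_embs has shape (n, d); both are assumed to
    live in a shared text-video embedding space."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    segment_embs = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = segment_embs @ text_emb
    return np.argsort(-scores), scores

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
order, scores = rank_segments(rng.normal(size=512), rng.normal(size=(10, 512)))
print(order[:3])  # indices of the three best-matching 3-second segments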
We adopt the concept of "question families" as outlined in the CLEVR dataset (Johnson et al., 2017). While some question families naturally transfer over from the VQA domain (e.g., integer comparison, counting), other concepts such as ellipsis and object lifespan must be employed to cover the full extent of competency within procedural texts.
We start by creating text templates for each question family we identified. Actual questions are created by combining templates with entities/relations sampled from the annotation. Word inflection and manual evaluation are applied to ensure the grammaticality of the questions. Each template is also associated with a functional program, i.e. a set of functions that query and filter the annotated recipe to obtain the answer to that template-based question (a minimal sketch is given after the sample question-answer pairs below).
Sample Question-answer pairs from the training data are as follows:
# question = How many times is the tube pan used?
# answer = 2
# question = What should be added to the bowl?
# answer = the eggs, sugar and butter
Implicit Argument Identification
# question = How do you drain the pasta?
# answer = by using a strainer
Object Lifespan
# question = How did you get the broth?
# answer = by boiling the cold water, meat and onion in the large pan
Event Ordering
# question = Refrigerating the fresh fruit dip until chilled and serving the fresh fruit dip, which comes first?
# answer = the first event
Coreference Location Change
# question = Where was the sesame seed sauce before it was stored in a bottle in the fridge?
# answer = blender
Attribute
# question = How do you serve the vermicelli mixture?
# answer = serve the vermicelli mixture hot or cold
Temporal
# question = For how long do you simmer the soup?
# answer = for 40 to 50 minutes
Result
# question = To what extent do you stir the mixture?
# answer = until the wine is almost evaporated
Cause
# question = Why do you wear plastic gloves?
# answer = as the oil from the peppers has been known to blister skin
Co-Patient
# question = What do you toss slivered mushrooms with?
# answer = with the shallots and ginger
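As referenced above, here is a minimal, hypothetical sketch of a counting template and its functional program; the template syntax, function names, and annotation interface are illustrative assumptions rather than the organizers' actual generation code:

TEMPLATE = "How many times is the <TOOL> used?"

def count_tool_uses(recipe, tool):
    """Functional program for the counting template: filter annotated events
    whose Tool relation points at `tool`, then count them."""
    events = [e for e in recipe["events"] if tool in e.get("tools", [])]
    return len(events)

recipe = {  # toy stand-in for an annotated recipe
    "events": [
        {"verb": "grease", "tools": ["tube pan"]},
        {"verb": "pour",   "tools": ["tube pan"]},
        {"verb": "beat",   "tools": ["mixer"]},
    ],
}
question = TEMPLATE.replace("<TOOL>", "tube pan")
print(question, "->", count_tool_uses(recipe, "tube pan"))  # -> 2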
The data is released under the CC-BY-NC 4.0 license (see Terms and Conditions). If you use the data, please cite the paper from the Reference section.
All systems will be asked to provide answers to the open-ended questions based on the textual and visual information encoded in the dataset, and will be evaluated solely on the answers to those questions.
For all questions, exact match and word-level F1 will be used to compute the results. For unanswerable questions, the word-level F1 score is the same as the exact match score.
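For reference, a minimal SQuAD-style sketch of the two metrics, assuming lower-cased whitespace tokenization (the official scorer may normalize answers differently):

from collections import Counter

def exact_match(prediction, gold):
    return float(prediction.lower().strip() == gold.lower().strip())

def word_f1(prediction, gold):
    """Word-level F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # For unanswerable questions both sides are empty, so F1 equals exact match.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the eggs", "the eggs, sugar and butter"),
      round(word_f1("the eggs", "the eggs, sugar and butter"), 2))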
Participants are not allowed in any way to exploit the question ID information during or after training, in order to improve evaluation results.
Participants should submit a zip file containing a single JSON file with the string r2vq_pred in its name, of the form:

{"Recipe_ID1": {"Question_ID1": "answer1", "Question_ID2": "answer2", "Question_ID3": null, ...}, "Recipe_ID2": {...}}

where Recipe_ID is the # newdoc id from the metadata in the provided CoNLL-U files. Within each recipe, Question_ID is the key of the question to be answered, and the value is the answer, which can be a string (a natural language answer) or null (unanswerable).
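A minimal sketch of packaging predictions in this format (the recipe and question IDs below are placeholders; real submissions must use the IDs from the provided CoNLL-U files, with None serializing to null for unanswerable questions):

import json
import zipfile

predictions = {
    "Recipe_ID1": {"Question_ID1": "answer1", "Question_ID2": "answer2", "Question_ID3": None},
    "Recipe_ID2": {"Question_ID1": "answer1"},
}

# Write the predictions to a JSON file whose name contains the string r2vq_pred.
with open("r2vq_pred.json", "w") as f:
    json.dump(predictions, f)

# The JSON file goes inside a zip archive for submission.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("r2vq_pred.json")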
This section lists the final results on the R2VQ test set from all user submissions ordered by Exact Match score:
BibTeX:
@inproceedings{pustejovsky-etal-2022-semeval,
title={{SemEval-2022} {T}ask 9: {R2VQ} - Competence-based Multimodal Question Answering},
author={Pustejovsky, James and Tu, Jingxuan and Maru, Marco and Conia, Simone and Navigli, Roberto and Rim, Kyeongmin and Lynch, Kelley and Brutti, Richard and Holderness, Eben},
booktitle={Proceedings of the 16th Workshop on Semantic Evaluation (SemEval-2022)},
year={2022}
}
@inproceedings{di-fabio-etal-2019-verbatlas,
title = "{V}erb{A}tlas: {A} Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling",
author = "Di Fabio, Andrea and
Conia, Simone and
Navigli, Roberto",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1058",
doi = "10.18653/v1/D19-1058",
pages = "627--637"
}
@inproceedings{conia-navigli-2020-bridging,
title = "{B}ridging the Gap in Multilingual {S}emantic {R}ole {L}abeling: {A} Language-Agnostic Approach",
author = "Conia, Simone and
Navigli, Roberto",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.120",
pages = "1396--1410"
}
The data of the R2VQ - Competence-based Multimodal Question Answering are released under the CC-BY-NC 4.0 license.
James Pustejovsky, Brandeis University, jamesp@brandeis.edu
Jingxuan Tu, Brandeis University, jxtu@brandeis.edu
Marco Maru, Sapienza University of Rome, maru@di.uniroma1.it
Simone Conia, Sapienza University of Rome, conia@di.uniroma1.it
Roberto Navigli, Sapienza University of Rome, navigli@diag.uniroma1.it
Kyeongmin Rim, Brandeis University, krim@brandeis.edu
Kelley Lynch, Brandeis University, kmlynch@brandeis.edu
Richard Brutti, Brandeis University, richardbrutti@brandeis.edu
Eben Holderness, Brandeis University, egh@brandeis.edu
Start: Aug. 6, 2021, midnight
Description: Task announced; sample data and data specification are published.
Start: Oct. 15, 2021, midnight
Description: Training data is released and participants develop systems to solve the problem.
Start: Dec. 3, 2021, midnight
Description: Evaluation data is uploaded and participants submit results.
Start: Jan. 31, 2022, midnight
Description: Paper submission phase.