When we apply our existing knowledge to new situations, we demonstrate a kind of understanding of how that knowledge is applied through tasks. Viewed over a conceptual domain, this constitutes a competence. Competence-based evaluation can be seen as a new approach to designing NLP challenges, one that characterizes the underlying operational knowledge a system has of a conceptual domain rather than its performance on individual tasks. In this shared task, we present a challenge that is reflective of the linguistic and cognitive competencies that humans have when speaking and reasoning.
Given the intuition that textual and visual information mutually inform each other for semantic reasoning, we formulate the challenge as a competence-based question answering (QA) task, designed to involve rich semantic annotation and aligned text-video objects.
The Competence-based Multimodal Question Answering task is structured as a set of question-answer pairs querying how well a system understands the semantics of recipes, derived from a collection of cooking recipes and videos. Each question belongs to a "question family" reflecting a specific reasoning competence. The associated R2VQ dataset is designed to test competence-based comprehension by machines over a multimodal recipe collection.
For this task, we have built the R2VQ (Recipe Reading and Video Question Answering) dataset, a collection of recipes sourced from https://recipes.fandom.com/wiki/Recipes_Wiki and foodista.com and labeled according to three distinct annotation layers: (i) Cooking Role Labeling (CRL), (ii) Semantic Role Labeling (SRL), and (iii) aligned image frames taken from Creative Commons cooking videos downloaded from YouTube. It consists of 1,000 recipes, with 800 used for training and 100 recipes each for validation and testing. Participating systems are exposed to this multimodal training set and asked to provide answers to unseen queries, exploiting (i) visual and textual information jointly, or (ii) textual information only.
Each recipe is annotated at the span level for cooking-related Events and their associated Entities (ingredients and props such as tools, containers, and habitats). Ingredients are labeled as either explicit (those listed in the ingredients section of the recipe) or implicit (the intermediate outputs of applying a cooking action to a set of explicit ingredients), with coreference grounding for implicit ingredients (e.g. the implicit ingredient marinade is associated with the cooking event combine(vinegar,soy_sauce,oil)).
CRL is a domain-specific dependency relation annotation for the cooking domain. Each entity is assigned an entity ID in addition to its token ID. Entity IDs appear in the COREF column of the first token of each entity.
The span-level Events and Entities are further annotated for role relations. Events that implicate Entities not explicitly mentioned in the text are marked to reflect that these hidden entities (see the description of hidden arguments below) are necessary to complete the action. These relations are:
The participant relation identifies the constituents of cooking events within the same sentence. In the example below, cardamom_green and ghee are explicit participants of the Fry event, so fry is linked to both cardamom_green and ghee.
The result relation identifies relationships between an event and another entity in the same sentence (a result link cannot be a hidden relation; see below for a description of hidden arguments). In the sentence “Shape with hands into a ball”, the ball is the result of the shape action, which took place on a dropped plum (from an earlier sentence).
The shadow relation expresses a link between events and semantically hidden ingredients. “Cook pasta in a large pot” necessitates water in the pot, which may have been added previously as a hidden argument (see below for a description of hidden arguments).
The tool relation links objects with the events they are used in. Tools may appear in the text (“Cut the pear with a sharp knife”), or they may be hidden (“Cut an apple” requires an unmentioned knife).
Similar to tool, the habitat relation links events with the objects in which they take place. Habitats may appear in the text (“Bake in a preheated oven”), or they may be hidden (“Saute the onion” requires an unmentioned pan).
The drop relation expresses a link between events and syntactically hidden objects, which were typically mentioned in a previous step. In the example below, the serve.6.1.1 event references the pujabi_sawian.5.1.2 from the prior sentence.
Hidden arguments can only appear in the row where the token is the head of the EVENT entity. Each hidden argument is written as Keyword=value, e.g. Drop=mixture, with multiple values separated by : (e.g. Drop=mixture:olive oil) and multiple hidden attributes separated by | (e.g. Drop=mixture:olive oil|Tool=spoon).
A coreference ID is represented as the step_id.sent_id.token_id of the first appearance of the coreferent, e.g. Drop=mixture:olive oil.1.1.3|Tool=spoon (the head of the first appearance of olive oil is the 3rd token of step 1, sentence 1).
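As an illustration, the following minimal sketch parses a hidden-argument field of the form described above; the function name and the output structure are illustrative assumptions, not part of the released tooling:

import re

def parse_hidden_args(field):
    """Parse a hidden-argument field such as 'Drop=mixture:olive oil.1.1.3|Tool=spoon' into
    {'Drop': [('mixture', None), ('olive oil', '1.1.3')], 'Tool': [('spoon', None)]}."""
    parsed = {}
    for attr in field.split("|"):                 # multiple hidden attributes separated by '|'
        keyword, values = attr.split("=", 1)      # e.g. 'Drop' and 'mixture:olive oil.1.1.3'
        entries = []
        for value in values.split(":"):           # multiple values separated by ':'
            # An optional step_id.sent_id.token_id coreference suffix may follow the surface form.
            m = re.match(r"^(.*?)(?:\.(\d+\.\d+\.\d+))?$", value)
            entries.append((m.group(1), m.group(2)))
        parsed[keyword] = entries
    return parsed

print(parse_hidden_args("Drop=mixture:olive oil.1.1.3|Tool=spoon"))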
One of the three layers with which steps in R2VQ are annotated is the Semantic Role Labeling (SRL) layer. SRL is often described informally as the task of automatically answering the question "Who did What to Whom, Where, When, and How?" (Màrquez et al., 2008). More precisely, SRL is usually defined as the task of automatically identifying and labeling argument structures.
Let’s consider the example "John loves Mary". In this case, SRL consists of i) identifying "loves" as a predicate, that is, something that denotes an action or an event; ii) disambiguating the predicate, that is, assigning the most appropriate sense for "loves" in this context; iii) identifying the arguments of each predicate, that is, those parts of the text, "John" and "Mary" that are semantically linked to "loves"; and iv) assigning a semantic role to each predicate-argument pair, e.g., "John" is the Experiencer of the predicate "loves", whereas "Mary" is the Stimulus.
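For concreteness, the four steps above can be pictured as producing a structure like the following (a hypothetical sketch in plain Python; the frame label is an assumed VerbAtlas-style frame, not taken from the dataset):

# (i) predicate identification, (ii) predicate disambiguation,
# (iii) argument identification, (iv) role assignment
srl_output = {
    "sentence": "John loves Mary",
    "predicates": [
        {
            "token": "loves",                 # (i)
            "frame": "LIKE",                  # (ii) assumed VerbAtlas-style frame label
            "arguments": [                    # (iii) + (iv)
                {"span": "John", "role": "Experiencer"},
                {"span": "Mary", "role": "Stimulus"},
            ],
        }
    ],
}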
In the context of our evaluation exercise, we employ SRL in its span-based formulation, tagging the whole span of each argument in a given sentence and not just its syntactic head (e.g., "the broccoli" and not "the"). We chose VerbAtlas (Di Fabio et al., 2019 - http://verbatlas.org/) as our reference inventory of frames and semantic roles and initially labeled the recipes in the dataset automatically, by means of a state-of-the-art system (Conia and Navigli, 2020). Subsequently, we asked human annotators to validate and correct both frames and argument labels to ensure data quality.

Predicate frames: each predicate is labeled according to its VerbAtlas sense/frame in column 10 of the file. A value of '_' means that the corresponding word is not a predicate. In the example below, there is only one predicate, "Cut", with the corresponding sense/frame "CUT" in position 1.
SRL example (omitted some columns for readability):
1 Cut [...] CUT B-V
2 the [...] _ B-Patient
3 broccoli [...] _ I-Patient
4 into [...] _ B-Result
5 flowerets [...] _ I-Result
6 . [...] _ _
Semantic roles: for each predicate, we provide its semantic roles in BIO format (B - Beginning, I - Inside, O - Outside). Note that, for this dataset, we only use B and I to indicate the first token of a span and the rest of the tokens in the same span, respectively. In the example above, "the broccoli" is a Patient of the predicate CUT, with the token "the" as the Beginning of the span (B-Patient) and the token "broccoli" as the Inside of the span (I-Patient). Note that the predicate that refers to a specific column of semantic roles is always labeled with the notation B-V. Should the predicate consist of a multi-word expression, the other tokens apart from the first are labeled as I-V:
1 Deep [...] COOK B-V
2 - [...] _ I-V
3 fry [...] _ I-V
4 till [...] _ B-Result
5 crispy [...] _ I-Result
6 & [...] _ I-Result
7 golden [...] _ I-Result
8 brown [...] _ I-Result
Should the multi-word expression be made of non-adjacent words, tokens apart from the first are instead labeled as D-V:
1 Bring [...] CHANGE_APPEARANCE/STATE B-V
2 the [...] _ B-Patient
3 water [...] _ I-Patient
4 to [...] _ D-V
5 boil [...] _ D-V
6 . [...] _ _
In the case of multiple predicates in the same sentence, there will be multiple semantic role columns, one for each predicate in column 10. For example, if there are two predicates in the sentence, column 11 will indicate the semantic roles for the first predicate, and column 12 will show the semantic roles for the second predicate.
1 Reduce [...] REDUCE_DIMINISH B-V _
2 heat [...] _ B-Attribute _
3 , [...] _ _ _
4 and [...] _ _ _
5 simmer [...] COOK _ B-V
6 for [...] _ _ B-Time
7 1 [...] _ _ I-Time
8 hour [...] _ _ I-Time
9 . [...] _ _ _
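Putting the conventions above together, the following sketch collects the per-predicate role spans from rows like the ones shown. It assumes a simplified view carrying only the token form, the frame column, and one BIO tag per predicate column; the real files contain additional CoNLL-U columns, so this is an illustration rather than a full reader:

def collect_role_spans(rows):
    """Group BIO-tagged role spans per predicate from (form, frame, roles) tuples."""
    n_preds = len(rows[0][2])
    spans = [[] for _ in range(n_preds)]          # one span list per predicate column
    for form, frame, roles in rows:
        for p, tag in enumerate(roles):
            if tag.startswith("B-"):              # start a new span for this predicate
                spans[p].append((tag[2:], [form]))
            elif tag.startswith(("I-", "D-")) and spans[p]:
                spans[p][-1][1].append(form)      # extend the current span
    return spans

rows = [
    ("Reduce", "REDUCE_DIMINISH", ["B-V", "_"]),
    ("heat",   "_",               ["B-Attribute", "_"]),
    (",",      "_",               ["_", "_"]),
    ("and",    "_",               ["_", "_"]),
    ("simmer", "COOK",            ["_", "B-V"]),
    ("for",    "_",               ["_", "B-Time"]),
    ("1",      "_",               ["_", "I-Time"]),
    ("hour",   "_",               ["_", "I-Time"]),
    (".",      "_",               ["_", "_"]),
]
print(collect_role_spans(rows))
# [[('V', ['Reduce']), ('Attribute', ['heat'])], [('V', ['simmer']), ('Time', ['for', '1', 'hour'])]]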
For each cooking action in the R2VQ dataset for which a visual counterpart can be found, a 'key frame triple' corresponding to a representative 3-second segment of a Creative Commons licensed video is included. Video data was sourced both from the YouCook2 dataset and from ad hoc videos found by querying the YouTube API with a given recipe's title. We use the S3D MIL-NCE model for text-to-video retrieval, using spans of text containing a cooking action as input. Due to licensing limitations, coverage is limited and some actions do not have corresponding key frame triples.
The following example frames are taken from recipe r-1948, step 4, event 1. The raw text for this step is "Fry in butter on all sides".
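As a rough illustration of the retrieval step, the sketch below ranks candidate 3-second segments against a text span by cosine similarity; it assumes precomputed joint text/video embeddings such as those produced by an S3D MIL-NCE model, with random vectors standing in for them here:

import numpy as np

def rank_segments(text_emb, segment_embs):
    """Rank candidate video segments by cosine similarity to a text embedding.
    text_emb has shape (d,), segment_embs has shape (n, d); both are assumed to
    live in a shared text-video embedding space."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    segment_embs = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = segment_embs @ text_emb
    return np.argsort(-scores), scores

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
order, scores = rank_segments(rng.normal(size=512), rng.normal(size=(10, 512)))
print(order[:3])  # indices of the three best-matching 3-second segments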
We adopt the concept of "question families" as outlined in the CLEVR dataset (Johnson et al., 2017). While some question families naturally transfer over from the VQA domain (e.g., integer comparison, counting), other concepts such as ellipsis and object lifespan must be employed to cover the full extent of competency within procedural texts.
We start by creating text templates for each question family we identified. Actual questions are created by combining templates with entities/relations sampled from the annotation. Word inflection and manual evaluation are applied to ensure the grammaticality of the questions. Each template is also associated with a functional program, i.e. a set of functions that query and filter the annotated recipe to obtain the answer to that template-based question (a minimal sketch is given after the sample question-answer pairs below).
Sample Question-answer pairs from the training data are as follows:
# question = How many times is the tube pan used?
# answer = 2
# question = What should be added to the bowl?
# answer = the eggs, sugar and butter
Implicit Argument Identification
# question = How do you drain the pasta?
# answer = by using a strainer
Object Lifespan
# question = How did you get the broth?
# answer = by boiling the cold water, meat and onion in the large pan
Event Ordering
# question = Refrigerating the fresh fruit dip until chilled and serving the fresh fruit dip, which comes first?
# answer = the first event
Coreference Location Change
# question = Where was the sesame seed sauce before it was stored in a bottle in the fridge?
# answer = blender
Attribute
# question = How do you serve the vermicelli mixture?
# answer = serve the vermicelli mixture hot or cold
Temporal
# question = For how long do you simmer the soup?
# answer = for 40 to 50 minutes
Result
# question = To what extent do you stir the mixture?
# answer = until the wine is almost evaporated
Cause
# question = Why do you wear plastic gloves?
# answer = as the oil from the peppers has been known to blister skin
Co-Patient
# question = What do you toss slivered mushrooms with?
# answer = with the shallots and ginger
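As referenced above, here is a minimal, hypothetical sketch of a counting template and its functional program; the template syntax, function names, and annotation interface are illustrative assumptions rather than the organizers' actual generation code:

TEMPLATE = "How many times is the <TOOL> used?"

def count_tool_uses(recipe, tool):
    """Functional program for the counting template: filter annotated events
    whose Tool relation points at `tool`, then count them."""
    events = [e for e in recipe["events"] if tool in e.get("tools", [])]
    return len(events)

recipe = {  # toy stand-in for an annotated recipe
    "events": [
        {"verb": "grease", "tools": ["tube pan"]},
        {"verb": "pour",   "tools": ["tube pan"]},
        {"verb": "beat",   "tools": ["mixer"]},
    ],
}
question = TEMPLATE.replace("<TOOL>", "tube pan")
print(question, "->", count_tool_uses(recipe, "tube pan"))  # -> 2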
The data is released under the CC-BY-NC 4.0 license (see Terms and Conditions). If you use the data, please cite the paper from the Reference section.
All systems will be asked to provide answers to the open-ended questions based on the textual and visual information encoded in the dataset, and will be evaluated solely on the answers to those questions.
For all questions, exact match and word-level F1 will be used to compute the results. For unanswerable questions, the word-level F1 score is the same as the exact match score.
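For reference, a minimal SQuAD-style sketch of the two metrics, assuming lower-cased whitespace tokenization (the official scorer may normalize answers differently):

from collections import Counter

def exact_match(prediction, gold):
    return float(prediction.lower().strip() == gold.lower().strip())

def word_f1(prediction, gold):
    """Word-level F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # For unanswerable questions both sides are empty, so F1 equals exact match.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the eggs", "the eggs, sugar and butter"),
      round(word_f1("the eggs", "the eggs, sugar and butter"), 2))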
Participants are not allowed in any way to exploit the question ID information during or after training, in order to improve evaluation results.
Participants should submit a zip file containing a single JSON file with the string r2vq_pred in its name, of the form:

{"Recipe_ID1": {"Question_ID1": "answer1", "Question_ID2": "answer2", "Question_ID3": null, ...}, "Recipe_ID2": {...}}

where Recipe_ID is the # newdoc id from the metadata in the provided CoNLL-U files. Within each recipe, Question_ID is the key of the question to be answered, and the value is the answer, which can be a string (a natural language answer) or null (unanswerable).
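A minimal sketch of packaging predictions in this format (the recipe and question IDs below are placeholders; real submissions must use the IDs from the provided CoNLL-U files, with None serializing to null for unanswerable questions):

import json
import zipfile

predictions = {
    "Recipe_ID1": {"Question_ID1": "answer1", "Question_ID2": "answer2", "Question_ID3": None},
    "Recipe_ID2": {"Question_ID1": "answer1"},
}

# Write the predictions to a JSON file whose name contains the string r2vq_pred.
with open("r2vq_pred.json", "w") as f:
    json.dump(predictions, f)

# The JSON file goes inside a zip archive for submission.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("r2vq_pred.json")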
This section lists the final results on the R2VQ test set from all user submissions ordered by Exact Match score:
BibTeX:
@inproceedings{pustejovsky-etal-2022-semeval,
title={{SemEval-2022} {T}ask 9: {R2VQ} - Competence-based Multimodal Question Answering},
author={Pustejovsky, James and Tu, Jingxuan and Maru, Marco and Conia, Simone and Navigli, Roberto and Rim, Kyeongmin and Lynch, Kelley and Brutti, Richard and Holderness, Eben},
booktitle={Proceedings of the 16th Workshop on Semantic Evaluation (SemEval-2022)},
year={2022}
}
@inproceedings{di-fabio-etal-2019-verbatlas,
title = "{V}erb{A}tlas: {A} Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling",
author = "Di Fabio, Andrea and
Conia, Simone and
Navigli, Roberto",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1058",
doi = "10.18653/v1/D19-1058",
pages = "627--637"
}
@inproceedings{conia-navigli-2020-bridging,
title = "{B}ridging the Gap in Multilingual {S}emantic {R}ole {L}abeling: {A} Language-Agnostic Approach",
author = "Conia, Simone and
Navigli, Roberto",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.120",
pages = "1396--1410"
}
The data of the R2VQ - Competence-based Multimodal Question Answering are released under the CC-BY-NC 4.0 license.
James Pustejovsky, Brandeis University, jamesp@brandeis.edu
Jingxuan Tu, Brandeis University, jxtu@brandeis.edu
Marco Maru, Sapienza University of Rome, maru@di.uniroma1.it
Simone Conia, Sapienza University of Rome, conia@di.uniroma1.it
Roberto Navigli, Sapienza University of Rome, navigli@diag.uniroma1.it
Kyeongmin Rim, Brandeis University, krim@brandeis.edu
Kelley Lynch, Brandeis University, kmlynch@brandeis.edu
Richard Brutti, Brandeis University, richardbrutti@brandeis.edu
Eben Holderness, Brandeis University, egh@brandeis.edu
Start: Aug. 6, 2021, midnight
Description: Task announced; sample data and data specification are published.
Start: Oct. 15, 2021, midnight
Description: Training data is released and participants develop systems to solve the problem.
Start: Dec. 3, 2021, midnight
Description: Evaluation data is uploaded and participants submit results.
Start: Jan. 31, 2022, midnight
Description: Paper submission phase.