SemEval-2018 Task 5: Counting Events and Participants in the Long Tail

Organized by filipilievski


25-9-2017: Small modification of the CoNLL format. We refer to the explanation of each subtask for more information.
25-9-2017: First version of the baseline is on the leaderboard and available as code.
29-9-2017: An improved version of the baseline is on the leaderboard and available as code.
8-12-2017: The final version of the guidelines is published online.
11-12-2017: Reorganization of the task documents: each subtask now has a single CoNLL file as input, instead of a CoNLL file per question.
11-12-2017: Final version of the trial data (see the 'Participate' tab) and of the task baseline (code here) made available.
20-12-2017: Test data input released! Please refer to the 'Participate' tab to download the test data.
8-1-2018: Evaluation phase started. See the bottom of the submission page for details.
30-1-2018: Evaluation phase ended. The results will be made official shortly.
5-2-2018: Official results are available.
9-2-2018: Gold data is available.

Welcome!

This is the CodaLab competition for all three subtasks of SemEval-2018 Task 5: Counting Events and Participants within Highly Ambiguous Data Covering a Very Long Tail.

Please join our mailing group to stay informed!

Task Summary

Can you count events? We are hosting a "referential quantification" task that requires systems to provide answers to questions about the number of incidents of an event type (subtasks S1 and S2) or participants in roles (subtask S3).

Given a set of questions and corresponding documents, participating systems need to provide a numeric answer together with the supporting documents that directly relate to and support the answer. Optionally, participants can also provide the text mentions of events in the documents. To correctly answer each question, participating systems must be able to establish the meaning, reference, and identity (i.e. coreference) of events and participants in news articles. A schematic example of the S2 challenge is given below:

The schemas for S1 and S3 are very similar, so we leave them out for brevity.

The data (texts and answers) are prepared in such a way that the task deliberately exhibits large ambiguity and variation, as well as coverage of long-tail phenomena, by including a substantial amount of low-frequency, local events and entities.

Subtasks

The overall competition consists of three subtasks:

  • Subtask 1 (S1): Find the single event that answers the question
  • Subtask 2 (S2): Find all events (if any) that answer the question
  • Subtask 3 (S3): Find all participant-role relations that answer the question

The three subtasks are based on the same kind of data and are evaluated using the same metrics (see Data and Evaluation for details on the task data and evaluation). For each question, participants receive a set of documents from which they need to derive: the numeric answer (how many incidents or participants?), the documents in the set that report on the correct incidents, and the mentions of the events within these documents that refer to the incidents or their subevents according to a given event schema.

 

For further specifics on the individual subtasks, visit the individual subtask tabs.

Question components

The questions are provided as structured JSON. Each question is defined by three components in this JSON structure: an event type and two event properties that act as constraints on the required answer. The two event properties in a question are specifications of the time, the location or the participants of the event, and can vary in granularity (e.g. month, or city, or full name). More explanation on these components follows.

Event types: We consider four event types, described through their representation in WordNet 3.0 (wn30) and FrameNet 1.7 (fn17); only killing and injuring are part of the trial data. Each question is constrained by exactly one event type.

  • killing: at least one person is killed (wn30:killing.n.02, wn30:kill.v.01, fn17:Killing)
  • injuring: at least one person is non-fatally injured (wn30:injure.v.01, wn30:injured.a.01, fn17:Cause_harm, fn17:Experience_bodily_harm)
  • fire_burning: the event of something burning (wn30:fire.n.01, fn17:Fire_burning)
  • job_firing: terminated employment (wn30:displace.v.03, fn17:Firing)

Event properties: We consider three event properties: time, location and participants. For participants, we only consider names with one first name and one last name. Each question contains exactly two event properties. For each property we define several granularities:

  • Time: day (e.g. 1/1/2015), month (e.g. 1/2015), year (e.g. 2015)
  • Location: city (e.g. wiki:Waynesboro, Mississippi), state (e.g. wiki:Mississippi)
  • Participant: first name (e.g. John), last name (e.g. Smith), full name (e.g. John Smith)

Terminology and statistics

  • An answer incident is an event whose properties fit the constraints of a question.
  • An answer document is a document that reports on an answer incident.
  • A confusion incident is an event which fits some, but not all, of the question constraints (e.g. an event that fits the event type and time, but not the location).
  • A confusion document is a document that reports on a confusion incident, and does not report on any of the answer incidents.
  • A noise incident is an event which fits none of the question constraints.
  • A noise document is a document that reports on a noise incident, and does not report on any of the answer or confusion incidents.

With each question we provide a set of documents, only a small subset of which are answer documents, while all remaining documents are confusion or noise documents. Returning a confusion/noise incident or document results in a false positive.
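The distinction between answer, confusion and noise incidents can be sketched as a small function. The dict-based incident and constraint representation here is an illustrative assumption, not the task's exact format:

```python
def classify_incident(incident, constraints):
    """Classify an incident relative to a question's constraints:
    'answer' if it fits all of them, 'confusion' if it fits some,
    and 'noise' if it fits none. Both arguments are plain dicts
    mapping constraint names (event_type, time, location, ...) to
    values; the field names are illustrative only."""
    matched = sum(incident.get(key) == value
                  for key, value in constraints.items())
    if matched == len(constraints):
        return "answer"
    if matched > 0:
        return "confusion"
    return "noise"

# An event that fits the event type and time, but not the location,
# is a confusion incident:
q = {"event_type": "killing", "time": "2017", "location": "Iowa"}
print(classify_incident(
    {"event_type": "killing", "time": "2017", "location": "Texas"}, q))
# prints "confusion"
```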

The large portion of confusion documents results in very high ambiguity of the task data, thus encouraging deep semantic processing to interpret different events, participants and their identities beyond surface form matching. To illustrate this ambiguity, we present several statistics about the subtask 2 questions in the trial data (the statistics for subtasks 1 and 3 are similar):

  • The average number of incidents corresponding to a question is 4.22
  • The average number of gold documents (documents providing evidence to the gold incidents) is 7.68
  • On average, for each gold document there are 89 confusion documents and 54 noise documents.

The data is sampled from local news documents reporting on events and participants that are only relevant within a specific context. As such, shallow strategies based on frequency and popularity are expected to perform poorly. 

Phases

This competition will be run in three phases:

  1. Practice/trial phase (August 14, 2017 - January 7, 2018) - The trial data will be made available in the beginning of this phase. During the practice phase, the task participants can get familiar with the competition and develop their solutions. In December, we will release the test data, so the participants can use the last month of the practice phase to work on the test data.
  2. Evaluation phase (January 8 - January 29, 2018) - During this phase, the task participants submit their solutions on the test data to the competition leaderboard.
  3. Post-evaluation phase (January 30, 2018 - ) - Once the evaluation is done, the task participants can still use the competition page to evaluate their solutions; however, these submissions will be considered out-of-competition.

Organizers

Filip Ilievski, Marten Postma, Piek Vossen (Vrije Universiteit Amsterdam)

Data Description

The data in this task is divided into two parts: trial and test data. Note that there is no training data made available.

Our test data covers three domains: gun violence, fire disasters, and business. The trial data only stems from the gun violence domain.

Trial data

The trial data consists of 424 questions for subtask 1, 469 questions for subtask 2, and 585 questions for subtask 3. Task participants are welcome to train their systems on these 1,478 questions: the folder dev_data contains the answers to the trial questions with the corresponding documents. In addition, the dev_data folder also contains the mentions for all answer documents of one question per subtask. The IDs of these questions are: 1-89170, 2-7074, and 3-59191.

Note: Participants are also allowed to train their systems on external data, including the Gun Violence Database.

Test data

The test data follows the same format as the trial data, with the key difference that it covers three domains: gun violence, fire disasters, and business. In addition, for the test data we do not provide the gold answers. The test data consists of 4,485 questions in total: 1,032 questions for subtask 1, 997 questions for subtask 2, and 2,456 questions for subtask 3.

Similarly to the trial data, we have also annotated a subset (not all) of the test questions for mention-level evaluation. However, in the case of the test data, we do not specify which documents of which questions were annotated with mentions. Participants should therefore generate mention annotations for all the answer documents, while we evaluate only the documents that have also been annotated with gold mentions.

The evaluation on this test data via CodaLab happens in January, but task participants are welcome to download and explore the data in December.

Data representation 


Question representation - We provide the participants with a structured representation of each question. This relieves the burden of question parsing. Example of a question representation:

  • Subtask: S2
  • Event type: injuring
  • Time: 2017
  • Location: wiki:Iowa

Document representation - For each document, we provide its title, content (tokenized), and creation time.

Answer representation - The participants are asked to submit two types of answers per subtask:

  1. A single JSON file containing the numeric answer for each question and the set of supporting documents
  2. A single CoNLL file that contains cross-document event coreference annotations of the input documents

Evaluation

Evaluation in this task is performed on three levels: incident-level, document-level, and mention-level.

  1. The incident-level evaluation compares the numeric answer provided by the system to the gold answer for each of the questions. The comparison is done in two ways: by exact matching and by Root Mean Square Error (RMSE) for difference scoring. For example, consider a task that consists of two questions with gold answers 1 and 4. If the system answers are 1 and 7 correspondingly, then the exact-match accuracy is 0.5 and the RMSE is √((0² + 3²)/2) = √4.5 ≈ 2.12. If the system answers are 1 and 3 correspondingly, then the exact-match accuracy is again 0.5, but the RMSE is lower: √((0² + 1²)/2) = √0.5 ≈ 0.71. The scores per subtask are then averaged over all questions to compute a single incident-level evaluation score.
  2. The document-level evaluation compares the set of answer documents between the system and the gold standard. The sets of documents for each question are compared using the customary metrics of precision, recall and F1-score. For example, say that the gold documents for a question have IDs 1, 3, 5, and 7, and the system provided the documents 3, 4, and 5. Then the precision is 2/3 ≈ 0.67, the recall is 2/4 = 0.5, and the F1-score is ≈ 0.57. The scores per subtask are then averaged over all questions to compute a single document-level evaluation score.
  3. The mention-level evaluation is a cross-document event coreference evaluation. Mention-level evaluation is only done for questions with the event types 'killing' or 'injuring'. We apply the customary metrics to score the event coreference of systems: BCUB, BLANC, CEAF_E, CEAF_M, and MUC. The final F1-score is the average of the F1-scores of the individual coreference metrics. The set of mentions to annotate should conform to the schema defined in the annotation guidelines for this task (further details on the extent of mentions, distinguishing quantified events and subevents can be found in the guidelines):

Event schema

Guidelines for mention annotation

The guidelines can be found here.
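The incident- and document-level scoring described above can be sketched in a few lines (the mention-level coreference metrics use the standard scorers and are not reproduced here):

```python
import math

def incident_scores(gold, system):
    """Exact-match accuracy and RMSE over the numeric answers
    (one gold/system pair per question)."""
    exact = sum(g == s for g, s in zip(gold, system)) / len(gold)
    rmse = math.sqrt(sum((g - s) ** 2 for g, s in zip(gold, system)) / len(gold))
    return exact, rmse

def document_scores(gold_docs, system_docs):
    """Precision, recall and F1 over sets of answer-document IDs."""
    tp = len(gold_docs & system_docs)
    precision = tp / len(system_docs) if system_docs else 0.0
    recall = tp / len(gold_docs) if gold_docs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The worked examples from the text:
print(incident_scores([1, 4], [1, 7]))            # accuracy 0.5, RMSE ≈ 2.12
print(document_scores({1, 3, 5, 7}, {3, 4, 5}))   # P ≈ 0.67, R = 0.5, F1 ≈ 0.57
```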

Surface form baseline

Filtering of documents 
This baseline uses surface forms based on the question components to find answer documents in the data. We only consider documents that contain the label of the event type or at least one of its WordNet synonyms. The labels of locations and participants are queried directly in the document texts (e.g. if the requested location is the US state of Texas, then we only consider documents that contain the surface form "Texas"; similarly for participants such as "John"). The temporal constraint is handled differently: we only consider documents whose publishing date falls within the time requested in the question.

Inferring incidents per subtask
  • For subtask 1, this baseline assumes that all documents that fit the created constraints are referring to the same incident. If there is no such document, then the baseline does not answer the question (because S1 always has at least one supporting document).
  • For subtask 2, we assume that none of the documents are coreferential. Hence, if 10 documents match the constraints, we infer that there are also 10 corresponding incidents. 
  • This baseline does not address subtask 3, because it does not reason over participants.
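The filtering and inference steps above can be sketched as follows. The synonym sets and the document representation (token list plus publication date) are illustrative assumptions, and only the year granularity of the temporal constraint is handled here:

```python
from datetime import date

# Illustrative synonym sets; the actual baseline expands the event-type
# label with its WordNet synonyms.
EVENT_SYNONYMS = {
    "killing": {"kill", "killed", "killing", "murder"},
    "injuring": {"injure", "injured", "injuring", "wounded"},
}

def matches(doc_tokens, doc_date, question):
    """Does a document pass all surface-form constraints of a question?"""
    tokens = {t.lower() for t in doc_tokens}
    if not tokens & EVENT_SYNONYMS[question["event_type"]]:
        return False
    for name in question.get("participant", {}).values():
        if not all(part.lower() in tokens for part in name.split()):
            return False
    for label in question.get("location", {}).values():
        if label.lower() not in tokens:
            return False
    time = question.get("time", {})
    if "year" in time and doc_date.year != int(time["year"]):
        return False
    return True

def answer_s2(docs, question):
    """S2 baseline: no two documents are coreferential, so every
    matching document counts as one incident."""
    hits = [doc_id for doc_id, (tokens, dct) in docs.items()
            if matches(tokens, dct, question)]
    return {"numerical_answer": len(hits), "answer_docs": hits}
```

For S1 the same filter applies, but all matching documents are assumed to describe a single incident.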

Mention annotation
Regarding mention annotation, we annotate mentions of events of type "killing" and "injuring" when these surface forms or their WordNet synonyms are found as tokens in a document. We assume that all mentions of the same event type within a document are coreferential, whereas mentions found in different documents are not.
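This mention-annotation rule can be sketched as follows, assuming docs maps a document ID to its token list and synonyms maps an event type to its surface forms (illustrative sets, not the actual WordNet expansion):

```python
def annotate_mentions(docs, synonyms):
    """Baseline mention annotation: within a document, all mentions of
    the same event type share one coreference chain; chains are never
    shared across documents."""
    annotations = {}  # doc_id -> list of (token_index, chain_id)
    next_chain = 1
    for doc_id, tokens in docs.items():
        doc_chains = {}  # one chain per event type in this document
        for index, token in enumerate(tokens):
            for event_type, forms in synonyms.items():
                if token.lower() in forms:
                    if event_type not in doc_chains:
                        doc_chains[event_type] = next_chain
                        next_chain += 1
                    annotations.setdefault(doc_id, []).append(
                        (index, doc_chains[event_type]))
    return annotations
```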
 
Code
The code of this baseline is publicly available on GitHub. Task participants are welcome to use this baseline as a starting point when building their solutions.

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2018 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree to respect the following statements about the dataset:

  1. Vrije Universiteit Amsterdam makes no warranties regarding the Dataset, including but not limited to being up-to-date, correct or complete. Vrije Universiteit Amsterdam cannot be held liable for providing access to the Dataset or usage of the Dataset.
  2. The Dataset should only be used for scientific or research purposes. Any other use is explicitly prohibited.
  3. The Dataset must not be provided or shared in part or full with any third party. 
  4. The researcher takes full responsibility for usage of the Dataset at any time.
  5. Vrije Universiteit Amsterdam reserves the right to terminate the researcher’s access to the Dataset at any time.
  6. The place of jurisdiction is Amsterdam, The Netherlands. 
  7. The data is distributed under ‘Fair use’ (Fair use policy in USA, Fair use policy in EU). Copyright will remain with the owners of the content. We will remove data from a publisher upon request.
  8. If any part of this agreement is legally invalid, this shall not affect the remaining agreement.

Subtask 1: Find the single event that answers the question

Subtask 1 consists of event-based questions with exactly one answer incident. The main goal in subtask 1 is then to find the documents which provide evidence for the single event that answers the question.

In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So if the query is limited to killing events, we also expect the mentions of the incident itself, shootings and injuries to be annotated. Although these annotations cannot be directly mapped to the answer, they help in understanding how the event identification in the selected documents relates to the higher-level quantification task.

System input

Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.

  1. Event types: Each question contains one out of the following four event types: killing, injuring, fire_burning, and job_firing. For the trial data of this subtask, we only consider two event types: killing and injuring. 
  2. Event properties: Each question contains two out of the following three event properties: location, time, and participants.

For each question, the system input will consist of:

  • a question in a structured format
  • a CoNLL file containing the documents which should be used to determine the answer to the question.

Question in JSON

An example of a question with the event properties participant and time can be found below.

"1-89170": {

        "event_type": "injuring",

        "participant": {

            "full_name": "Akia Thomas"

      },

        "subtask": 1,

        "time": {

            "month": "01/2017"

        },

        "verbose_question": "Which ['injuring'] event happened in 01/2017 (month) that involve the name Akia Thomas (full_name) ?"

    }

Some observations about the file format:

  • The keys of this JSON file are question IDs ("1-89170" in this example)
  • each question contains one event type ("event_type")
  • each question contains exactly two of the three event properties ("participant", "time", "location").
  • each question contains a "subtask" field, which is always 1 for S1
  • each question contains a field "verbose_question", which is the summary of the question in free text, for human readers.
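As a quick illustration of these observations, the example question can be loaded and checked with a few lines of Python (a sketch, not part of the official tooling):

```python
import json

question_file = """
{
    "1-89170": {
        "event_type": "injuring",
        "participant": {"full_name": "Akia Thomas"},
        "subtask": 1,
        "time": {"month": "01/2017"},
        "verbose_question": "Which ['injuring'] event happened in 01/2017 (month) that involve the name Akia Thomas (full_name) ?"
    }
}
"""

questions = json.loads(question_file)
for question_id, question in questions.items():
    # exactly two of the three event properties must be present
    properties = [k for k in ("participant", "time", "location") if k in question]
    assert question["subtask"] == 1
    assert "event_type" in question and len(properties) == 2
```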

CoNLL

All input documents in a tokenized format can be found in a file called docs.conll.

This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Some observations about the file format:

  • every document starts with a line starting with #begin document (DOC_ID);
  • the line after that always provides the document creation time.
  • each line consists of four columns: token identifier, token, discourse type (DCT, TITLE or BODY), and coreference chain identifier (default value is a dash '-')
  • every document ends with a line #end document
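Given the four-column format described above, the input can be parsed with a short function (a sketch under the stated assumptions, not official tooling):

```python
from collections import namedtuple

Token = namedtuple("Token", "token_id text discourse_type coref_chain")

def parse_conll(lines):
    """Parse the docs.conll input into {doc_id: [Token, ...]}.
    `lines` is any iterable of lines (an open file works)."""
    docs, doc_id, tokens = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith("#begin document"):
            # document ID sits between the parentheses
            doc_id = line[line.index("(") + 1 : line.index(")")]
            tokens = []
        elif line.startswith("#end document"):
            docs[doc_id] = tokens
        elif line:
            token_id, text, discourse_type, chain = line.split()
            tokens.append(Token(token_id, text, discourse_type, chain))
    return docs

sample = """#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
#end document"""
docs = parse_conll(sample.splitlines())
```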

Answer format

Systems can provide one or both of the following two output formats: a JSON file with answers, and a single CoNLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.

JSON

Example of a JSON file:

"1-89170": {

        "answer_docs": 

              [

                "1a45d73a21522536c411807219ed553e",

                "f016114ddb55b3f5c16fea2f8d1f2ec7"

            ],

        "numerical_answer": 1

    }, .....

 

Observations about the JSON format:

  • the answer file is a JSON file, in which each question is an entry in the JSON.
  • For each question, the key answer_docs lists the supporting documents for the single incident that answers the question (in the example, both document identifiers provide information about the single answer incident). Since S1 questions always have exactly one answer incident, numerical_answer is always 1.

CoNLL

Example of a CoNLL file annotated for event coreference:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Subtask 2: Find all events (if any) that answer the question

Subtask 2 consists of event-based questions with any number of (zero to N) answer incidents. The goal of subtask 2 is then: 1) to determine the number of answer incidents, and 2) to find the documents which provide evidence for the answer. To make the task more realistic, we also include questions whose answer is zero.

In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So if the query is limited to killing events, we also expect the mentions of the incident itself, shootings and injuries to be annotated. Although these annotations cannot be directly mapped to the answer, they help in understanding how the event identification in the selected documents relates to the higher-level quantification task.

System input

Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.

  1. Event types: Each question contains one out of the following four event types: killing, injuring, fire_burning, and job_firing. For the trial data of this subtask, we only consider two event types: killing and injuring. 
  2. Event properties: Each question contains two out of the following three event properties: location, time, and participants.

For each question, the system input will consist of:

  • a question in a structured format
  • a CoNLL file containing the documents which should be used to determine the answer to the question.

Question in JSON

An example of a question can be found below with the event properties participant and time.

"2-7074": {

        "event_type": "killing",

        "participant": {

            "first": "Sean"

        },

        "subtask": 2,

        "time": {

            "year": "2017"

        },

        "verbose_question": "How many ['killing'] events happened in 2017 (year) that involve the name Sean (first) ?"

    }

Some observations about the file format:

  • The keys of this JSON file are question IDs ("2-7074" in this example)
  • each question contains one event type ("event_type")
  • each question contains exactly two of the three event properties ("participant", "time", "location").
  • each question contains a "subtask" field, which is always 2 for S2
  • each question contains a field "verbose_question", which is the summary of the question in free text, for human readers.

CoNLL

All input documents in a tokenized format can be found in a file called docs.conll.

This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Some observations about the file format:

  • every document starts with a line starting with #begin document (DOC_ID);
  • the line after that always provides the document creation time.
  • each line consists of four columns: token identifier, token, discourse type (DCT, TITLE or BODY), and coreference chain identifier (default value is a dash '-')
  • every document ends with a line #end document

Answer format

Systems can provide one or both of the following two output formats: a JSON file with answers, and a single CoNLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.

 

JSON

Example of a JSON file:

"2-7074": {

        "answer_docs": [

                "748f14771b3febdc874b7827d151b6e0",

                "6c9fa7f335e78ca818125c626d3bc216",

                "ea781ee5a57a46b285d834708fee8c0d",

                "abc4c58e9b7621b10a4732a98dc273b3"

            ],

        "numerical_answer": 2

    }, ....

 

 

Observations about the JSON format:

  • the answer file is a JSON file, in which each question is an entry in the JSON.
  • For each question, there are two keys: numerical_answer (how many incidents satisfy the question criteria? In the example, two incidents satisfy the criteria) and answer_docs (which documents provide the system with the information needed to answer the question? In the example, the first two document identifiers provide information about one incident and the other two about the other incident).

CoNLL

Example of a CoNLL file annotated for event coreference:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Subtask 3: Find all participant-role relations that answer the question

How many people were killed or injured? Subtask 3 consists of participant questions in which we are interested in the outcome of the incident for the people involved. The answer is therefore a number from 0 to N representing the event outcomes of a certain type. In the case of gun violence, a single incident can have mixed outcomes, in which some people are injured and others die. Answering the question requires understanding, across documents, how many people were injured or died in incidents that match the question constraints.

The goal of subtask 3 is then: 1) to determine the number of events that have the specified participant-role outcome as an answer (people injured or people killed), and 2) to find the documents which provide evidence for the answer. Note that this subtask requires further reasoning over the outcome roles (being injured or being killed) that participants play in the answer incidents: it is not enough to decide whether there is some killing/injuring incident relevant to the question; the system must also determine how many casualties there were. Quantification of the participants, as in "two people killed", is considered a quantification of the killing event. We count how many people were killed or injured as the final outcome of the event's development, as this is how the structured data is recorded. This means that if a person was initially injured and died later, we count this casualty as 1 killing and 0 injuries.

In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that this schema does NOT ask for annotation of participants in roles, only of mentions of the (sub)events in relation to the question. Please also note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So if the query is limited to killing events, we also expect the mentions of the incident itself, shootings and injuries to be annotated. Although these annotations cannot be directly mapped to the answer, they help in understanding how the event identification in the selected documents relates to the higher-level quantification task.

System input

Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.

  1. Event types: Each question contains one out of the following four event types: killing, injuring, fire_burning, and job_firing. For the trial data of this subtask, we only consider two event types: killing and injuring. 
  2. Event properties: Each question contains two out of the following three event properties: location, time, and participants.

For each question, the system input will consist of:

  • a question in a structured format
  • a CoNLL file containing the documents which should be used to determine the answer to the question.

Question in JSON

An example of a question can be found below with the event properties location and time.

    "3-59191": {

        "event_type": "killing",

        "location": {

            "state": "http://dbpedia.org/resource/Missouri"

        },

        "subtask": 3,

        "time": {

            "day": "26/01/2017"

        },

        "verbose_question": "How many people were killed in 26/01/2017 (day) in ('Missouri',) (state) ?"

    }

Some observations about the file format:

  • The keys of this JSON file are question IDs ("3-59191" in this example).
  • Each question contains one event type ("event_type").
  • Each question contains exactly two of the three event properties ("participant", "time", "location").
  • Each question contains a "subtask" field, which is always 3 for S3.
  • Each question contains a "verbose_question" field: a free-text summary of the question for human readers.
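
The question file is plain JSON, so it can be read with the standard library; a minimal sketch, parsing the example entry above:

```python
import json

# The example question above, embedded as a string for illustration;
# in practice you would load the question file shipped with the subtask data.
raw = """{
  "3-59191": {
    "event_type": "killing",
    "location": {"state": "http://dbpedia.org/resource/Missouri"},
    "subtask": 3,
    "time": {"day": "26/01/2017"},
    "verbose_question": "How many people were killed in 26/01/2017 (day) in ('Missouri',) (state) ?"
  }
}"""

questions = json.loads(raw)
for q_id, q in questions.items():
    # Exactly two of the three event properties are present per question.
    constraints = {k: q[k] for k in ("location", "time", "participant") if k in q}
    print(q_id, q["event_type"], constraints)
```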

CoNLL

All input documents in a tokenized format can be found in a file called docs.conll.

This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Some observations about the file format:

  • every document starts with a line starting with #begin document (DOC_ID);
  • the line after that always provides the document creation time.
  • each line consists of four columns: token identifier, token, discourse type (DCT, TITLE, or BODY), and coreference chain identifier (default value is a dash '-')
  • every document ends with a line #end document
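
A minimal reader for this four-column format might look as follows (a sketch assuming whitespace-separated columns; this is not the official task code):

```python
import re

def parse_conll(lines):
    """Split a docs.conll stream into {doc_id: [(token_id, token, part, coref)]}."""
    docs, current_id, rows = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith("#begin document"):
            current_id = re.search(r"\((.+?)\)", line).group(1)
            rows = []
        elif line.startswith("#end document"):
            docs[current_id] = rows
            current_id = None
        elif line and current_id is not None:
            fields = line.split()
            if len(fields) == 4:  # token_id, token, discourse type, coref chain
                rows.append(tuple(fields))
    return docs

# Usage on a fragment of the example above:
sample = """#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
#end document""".splitlines()
docs = parse_conll(sample)
```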

Answer format

Systems can provide one or both of the following two output formats: a JSON file with answers, and a single CoNLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.

JSON

Example of a JSON file:

"3-59191": {

        "answer_docs": [

                "f5e081d0b616c05ba2c77dcc84df443a"

            ],

        "numerical_answer": 3,

    },....

Observations about the JSON format:

  • the answer file is a JSON file in which each question is an entry.
  • For each question, there are two keys: numerical_answer (how many people in relevant incidents satisfy the question criteria; in the example, three people do) and answer_docs (the supporting documents for the incidents, i.e. the documents that provide the system with the information needed to answer the question; in the example, a single document provides the information about the one answer incident in which three people were killed).
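
Producing this answer file is straightforward with the standard json module; a sketch using the example values above:

```python
import json

# The example answer above, built and serialised with the standard library.
answers = {
    "3-59191": {
        "numerical_answer": 3,  # three people satisfy the question criteria
        "answer_docs": ["f5e081d0b616c05ba2c77dcc84df443a"],  # supporting document(s)
    },
}
serialized = json.dumps(answers)
# Writing s3/answers.json would then be:
#   with open("s3/answers.json", "w") as f:
#       f.write(serialized)
```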

CoNLL

Example of a CoNLL file annotated for event coreference:

#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -

Answer formats

The task participants provide at most two different outputs:

  1. For the incident- and the document-level evaluation, systems provide a single JSON file per subtask. The keys in this JSON file represent question IDs. For each question ID, there is a numeric answer ("numerical_answer") and a set of documents that provide evidence for the answer ("answer_docs"). For example, the answer to questions from subtask 2 can be represented as follows:

    {'2-101': { 'numerical_answer': 3, 'answer_docs': ['8', '11', '15', '17', '87'] }, ..., '2-897': {'numerical_answer': 1, 'answer_docs': ['36', '39']}}

    Participants taking part in all three subtasks should prepare three JSON files following the format above, one per subtask.

  2. For the mention-level evaluation, systems are asked to provide one CoNLL file that contains event mention coreference on a cross-document level for all documents. Each document in the CoNLL file starts with a #begin document row and ends with #end document. Each row of the document in the CoNLL file represents a single token, with the following fields: token_id, token, document_part (whether it is the title, the content, or the document creation time), and coreference_chain. The format is identical to the example CoNLL file annotated for event coreference shown above.

Note: The systems can decide to answer/annotate only a subset of the questions. Our scripts are flexible in this respect, and we also report the number of questions answered. Regarding event coreference, the scripts will evaluate only those documents that were annotated, ignoring all others.

Answer Structure

The submissions are formatted as single .zip files. The content inside the submitted .zip file has the following structure:

  s1/
    answers.json
    docs.conll
  s2/
    answers.json
    docs.conll
  s3/
    answers.json
    docs.conll
 
You can find an example submission .zip for both the trial and the test data in the Get Data section (Participate -> Get Data). Make sure you are comfortable with the data format before you start submitting your real runs for this task. For the convenience of our task participants, we log various outputs of the evaluation script, which the task participants can access when they make a submission. These logs will, for instance, warn participants when the JSON with answers is missing for a subtask, or contains a nonexistent question ID.
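
The submission layout above can also be assembled programmatically; a minimal in-memory sketch using Python's zipfile module (with placeholder contents; in practice you would add your real answers.json and docs.conll files):

```python
import io
import zipfile

# Build the three-subtask submission layout in memory with placeholder
# contents (hypothetical); in practice, use z.write() on your real files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for subtask in ("s1", "s2", "s3"):
        z.writestr(f"{subtask}/answers.json", "{}")  # answers per subtask
        z.writestr(f"{subtask}/docs.conll", "")      # optional coreference output
names = zipfile.ZipFile(buf).namelist()
```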

Submission process and the Leaderboard

To upload a submission, please refer to the Participate Section. An example submission can be found in the trial_data_final.zip package (for the trial data), and similarly in test_data.zip for the test data. Please zip your submission in the following way:

  • cd example_submission
  • zip -r submission.zip *

After that, please go to Participate -> Submit / View results and click Submit to upload your zip file. After uploading the zip file, you will see the following in the Status column: Submitting, Submitted, Finished.

Once you have uploaded a submission without any errors, you can submit it to the competition leaderboard, which can be seen in the Section Results. Make sure to check the logs for warnings.

For the latest valid submission, the leaderboard shows the following 13 scores: 

  1. For subtask 1: document-level F1-score, mention-level average F1-score, and number of answered questions.
  2. For subtask 2: incident-level accuracy and RMSE, document-level F1-score, mention-level average F1-score, and number of answered questions.
  3. For subtask 3: incident-level accuracy and RMSE, document-level F1-score, mention-level average F1-score, and number of answered questions.

Please note that there is only one event coreference evaluation, which is present in all three subtasks. The metric mention-level average F1-score hence represents the same evaluation across three subtasks.

Finally, we would like to emphasize the following important details about the evaluation phase and beyond:

  1. It is still possible to evaluate on the trial data.
  2. The maximum number of submissions during the evaluation phase is 10.
  3. The results on the leaderboard are hidden during the evaluation phase.
  4. After the evaluation phase (Post-Evaluation phase), only the latest valid submission will be shown on the leaderboard.
  5. The official results will be posted on 5-2-2018.
  6. We strongly encourage you to submit a paper describing your system, which will be part of the official SemEval proceedings. The deadline for participant papers is Monday 26 Feb 2018.

Practice

Start: Aug. 14, 2017, midnight

Evaluation

Start: Jan. 8, 2018, midnight

Post-Evaluation

Start: Jan. 30, 2018, midnight

Competition Ends

Never
