5-2-2018: official results are available
9-2-2018: gold data is available
This is the CodaLab Competition for all three Subtasks of SemEval-2018 Task 5: Counting Events and Participants within Highly Ambiguous Data covering a very long tail.
Please join our mailing group to stay informed!
Can you count events? We are hosting a "referential quantification" task that requires systems to provide answers to questions about the number of incidents of an event type (subtasks S1 and S2) or participants in roles (subtask S3).
Given a set of questions and corresponding documents, the participating systems need to provide a numeric answer together with the supporting documents that directly relate to and support the answer. Optionally, participants can also provide the text mentions of events in the documents. To correctly answer each question, participating systems must be able to establish the meaning, reference, and identity (i.e. coreference) of events and participants in news articles. A schematic example of the S2 challenge is given below:
The schemas for S1 and S3 are very similar, so we leave them out for brevity.
The data (texts and answers) is deliberately prepared so that the task exhibits large ambiguity and variation, as well as coverage of long-tail phenomena, by including a substantial amount of low-frequency, local events and entities.
The overall competition consists of three subtasks: S1 (event-based questions with exactly one answer incident), S2 (event-based questions with zero to N answer incidents), and S3 (participant-based questions about the number of people killed or injured).
The three subtasks are based on the same kind of data and are evaluated using the same metrics (see Data and Evaluation for details on the task data and evaluation). Participants receive a set of documents from which they need to derive: the numeric answer (how many incidents or participants?), the documents from the set that report on the answer incidents, and the mentions within these documents that refer to the incident, or to subevents of the incident, according to a given event schema.
For further specifics on the individual subtasks, visit the individual subtask tabs.
The questions are provided as structured JSON. Each question is defined through three components in this JSON structure: the event type and two event properties that act as constraints on the required answer. The two event properties are each a specification of the time, the location, or the participants of the event, and the specification can vary in granularity (e.g. month, city, or full name). More explanation on these components follows.
Event types: We consider four event types in this task, described through their representation in WordNet 3.0/FrameNet 1.7 (only killing and injuring are part of the trial data). Each question is constrained by exactly one event type.
| Event type | Definition | Representation in WN30/FN17 |
| --- | --- | --- |
| killing | at least one person is killed | wn30:killing.n.02 wn30:kill.v.01 fn17:Killing |
| injuring | at least one person is non-fatally injured | wn30:injure.v.01 |
| fire_burning | the event of something burning | wn30:fire.n.01 |
| job_firing | someone's employment is terminated | wn30:displace.v.03 fn17:Firing |
Event properties: We consider three event properties: time, location, and participant. Each question contains exactly two given event properties. For participants, we only consider names consisting of one first name and one last name. For each property we define several granularities:
| Event property | Granularity |
| --- | --- |
| Time | Day (e.g. 1/1/2015), Month, Year |
| Location | City (e.g. wiki: Waynesboro, Mississippi), State |
| Participant | First name (e.g. John), Last name (e.g. Smith), Full name (e.g. John Smith) |
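The time formats differ per granularity; the trial questions shown under the subtask tabs use day "26/01/2017", month "01/2017", and year "2017". Purely as an illustration, and assuming these formats hold throughout, a time constraint can be normalized into an inclusive date range like this (a sketch, not part of the task data or scripts):

```python
import calendar
from datetime import date

def time_constraint_to_range(granularity, value):
    """Map a question's time constraint to an inclusive (start, end) date range.

    Formats assumed from the trial questions: day "26/01/2017",
    month "01/2017", year "2017".
    """
    if granularity == "day":
        d, m, y = map(int, value.split("/"))
        start = end = date(y, m, d)
    elif granularity == "month":
        m, y = map(int, value.split("/"))
        start = date(y, m, 1)
        end = date(y, m, calendar.monthrange(y, m)[1])  # last day of the month
    elif granularity == "year":
        y = int(value)
        start, end = date(y, 1, 1), date(y, 12, 31)
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return start, end

print(time_constraint_to_range("month", "01/2017"))
# -> (datetime.date(2017, 1, 1), datetime.date(2017, 1, 31))
```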
We distinguish the following kinds of incidents and documents:

- An answer incident is an event whose properties fit the constraints of a question. An answer document is a document that reports on an answer incident.
- A confusion incident is an event which fits some, but not all, of the question constraints (e.g. an event that fits the event type and time, but not the location). A confusion document is a document that reports on a confusion incident and does not report on any of the answer incidents.
- A noise incident is an event which fits none of the question constraints. A noise document is a document that reports on a noise incident and does not report on any of the answer or confusion incidents.

With each question we provide a set of documents, only a small subset of which are answer documents; all remaining documents are confusion or noise documents. Returning a confusion/noise incident or document results in a false positive.
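To make these definitions concrete, here is a minimal Python sketch that classifies an incident with respect to a question's constraints. The flat dictionaries are hypothetical simplifications for illustration only; they are not the task's data format.

```python
# Minimal sketch: classify an incident relative to a question's constraints.
# The flat dicts below are hypothetical simplifications, not the official format.

def classify_incident(incident, question):
    """Return 'answer', 'confusion', or 'noise' for one incident."""
    checks = [incident.get(prop) == value for prop, value in question.items()]
    if all(checks):
        return "answer"      # fits all constraints of the question
    if any(checks):
        return "confusion"   # fits some, but not all, constraints
    return "noise"           # fits none of the constraints

question = {"event_type": "killing", "time": "01/2017", "location": "Waynesboro"}
incident = {"event_type": "killing", "time": "01/2017", "location": "Jackson"}
print(classify_incident(incident, question))  # -> confusion
```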
The large portion of confusion documents results in very high ambiguity of the task data, thus encouraging deep semantic processing to interpret different events, participants, and their identities beyond surface-form matching. To illustrate this ambiguity, we present several statistics about the questions from subtask 2 in the trial data (the statistics for subtasks 1 and 3 are similar):
The data is sampled from local news documents reporting on events and participants that are only relevant within a specific context. As such, shallow strategies based on frequency and popularity are expected to perform poorly.
This competition will be run in three phases:
Filip Ilievski, Marten Postma, Piek Vossen (Vrije Universiteit Amsterdam)
The data in this task is divided into two parts: trial and test data. Note that there is no training data made available.
Our test data covers three domains: gun violence, fire disasters, and business. The trial data stems only from the gun violence domain.
The trial data consists of 424 questions for subtask 1, 469 questions for subtask 2, and 585 questions for subtask 3. Task participants are welcome to train their systems on these 1,478 trial questions. Specifically, the folder dev_data contains the answers to the trial questions together with the corresponding documents. In addition, the dev_data folder contains the mention annotations for all answer documents of one question per subtask. The IDs of these questions are: 1-89170, 2-7074, and 3-59191.
Note: Participants are also allowed to train their systems on external data, including the Gun Violence Database.
The test data follows the same format as the trial data, with the key difference that it covers three domains: gun violence, fire disasters, and business. In addition, for the test data we do not provide the gold answers. The test data consists of 4,485 questions in total: 1,032 questions for subtask 1, 997 questions for subtask 2, and 2,456 questions for subtask 3.
As with the trial data, we have annotated a subset (not all) of the test questions with mention-level annotations. However, for the test data we do not specify which documents of which questions were annotated with mentions. Participants should therefore generate mention annotations for all the answer documents; we evaluate only the documents that have also been annotated with gold mentions.
The evaluation on this test data via CodaLab happens in January, but task participants are welcome to download and explore the data in December.
Question representation - We provide the participants with a structured representation of each question, which relieves them of the burden of question parsing.
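For example, question 1-89170 from the trial data (also shown under the Subtask 1 tab) is represented as:

"1-89170": {
    "event_type": "injuring",
    "participant": {
        "full_name": "Akia Thomas"
    },
    "subtask": 1,
    "time": {
        "month": "01/2017"
    },
    "verbose_question": "Which ['injuring'] event happened in 01/2017 (month) that involve the name Akia Thomas (full_name) ?"
}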
Document representation - For each document, we provide its title, content (tokenized), and creation time.
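As an illustration, here is a minimal Python sketch for reading the CONLL-style docs.conll file shown under the subtask tabs. The column layout (token identifier, token/date, TITLE/BODY/DCT, coreference column) and the whitespace delimiter are assumed from the excerpts on this page; this is not an official reader.

```python
from collections import defaultdict

def read_docs_conll(path):
    """Parse docs.conll into {doc_id: [(token_id, token, part, coref), ...]}.

    Column layout assumed from the excerpts on this page:
    <doc_id>.<sentence>.<token>   <token/date>   <TITLE|BODY|DCT>   <coref>
    """
    docs = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):   # skip #begin/#end markers
                continue
            fields = line.split()
            if len(fields) < 4:
                continue
            token_id, token, part, coref = fields[:4]
            doc_id = token_id.split(".")[0]
            docs[doc_id].append((token_id, token, part, coref))
    return docs

docs = read_docs_conll("docs.conll")
print(len(docs), "documents loaded")
```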
Answer representation - The participants are asked to submit two types of answers per subtask: a JSON file with the numeric answer and the answer documents for each question, and (optionally) a CONLL file with mention-level event coreference annotations.
Evaluation in this task is performed on three levels: incident-level, document-level, and mention-level.
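The official scores are produced by the organizers' scoring scripts. Purely as an informal illustration of what document-level scoring involves, precision, recall, and F1 over the returned answer documents for a single question might look like the following sketch (an assumption for illustration, not the official scorer):

```python
def document_f1(predicted_docs, gold_docs):
    """Illustrative precision/recall/F1 over answer documents for one question.

    Sketch only: returned confusion or noise documents count as false positives,
    missed answer documents as false negatives.
    """
    predicted, gold = set(predicted_docs), set(gold_docs)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two returned documents, one of which is a gold answer document.
print(document_f1(["d1", "d2"], ["d1"]))  # -> 0.666...
```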
The guidelines can be found here.
By submitting results to this competition, you consent to the public release of your scores at the SemEval-2018 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
You agree to respect the following statements about the dataset:
Subtask 1 consists of event-based questions with exactly one answer incident. The main goal in subtask 1 is then to find the documents which provide evidence for the single event that answers the question.
In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So even if the query is limited to killing events, we expect mentions of the incident itself, shootings, and injuries to be annotated as well. Although these annotations cannot be mapped directly to the answer, they help us understand how the event identification in the selected documents relates to the higher-level quantification task.
Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.
For each question, the system input will consist of: the structured question representation (see below) and the full collection of input documents (docs.conll).
An example of a question can be found below with the event properties participant and time.
"1-89170": {
"event_type": "injuring",
"participant": {
"full_name": "Akia Thomas"
},
"subtask": 1,
"time": {
"month": "01/2017"
},
"verbose_question": "Which ['injuring'] event happened in 01/2017 (month) that involve the name Akia Thomas (full_name) ?"
}
Some observations about the file format:
All input documents in a tokenized format can be found in a file called docs.conll.
This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
Some observations about the file format:
Systems can provide one or both of the following two outputs: a JSON file with answers, and a single CONLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.
Example of a JSON file:
"1-89170": {
"answer_docs":
[
"1a45d73a21522536c411807219ed553e",
"f016114ddb55b3f5c16fea2f8d1f2ec7"
],
"numerical_answer": 1
}, .....
Observations about the JSON format: for each question, there are two keys: numerical_answer (how many incidents satisfy the question constraints; in the example, a single incident) and answer_docs (the documents that support the answer; in the example, two documents report on the single answer incident).
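As an illustration, a minimal sketch that assembles and writes such an answers file (the output file name is a placeholder; consult the example submission package mentioned under Evaluation for the expected file names, and the values below are taken from the example above):

```python
import json

# Values taken from the example above; the file name is a placeholder.
answers = {
    "1-89170": {
        "numerical_answer": 1,
        "answer_docs": [
            "1a45d73a21522536c411807219ed553e",
            "f016114ddb55b3f5c16fea2f8d1f2ec7",
        ],
    },
}

with open("s1_answers.json", "w", encoding="utf-8") as f:
    json.dump(answers, f, indent=4)
```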
Example of a CONLL file annotated for event coreference:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
Subtask 2 consists of event-based questions with any number of answer incidents (zero to N). The goal of subtask 2 is twofold: 1) to determine the number of answer incidents, and 2) to find the documents which provide evidence for the answer. To make the task more realistic, we also include questions whose answer is zero.
In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So even if the query is limited to killing events, we expect mentions of the incident itself, shootings, and injuries to be annotated as well. Although these annotations cannot be mapped directly to the answer, they help us understand how the event identification in the selected documents relates to the higher-level quantification task.
System input
Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.
For each question, the system input will consist of: the structured question representation (see below) and the full collection of input documents (docs.conll).
An example of a question can be found below with the event properties participant and time.
"2-7074": {
"event_type": "killing",
"participant": {
"first": "Sean"
},
"subtask": 2,
"time": {
"year": "2017"
},
"verbose_question": "How many ['killing'] events happened in 2017 (year) that involve the name Sean (first) ?"
}
Some observations about the file format:
All input documents in a tokenized format can be found in a file called docs.conll.
This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
Some observations about the file format:
Systems can provide one or both of the following two outputs: a JSON file with answers, and a single CONLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.
Example of a JSON file:
"2-7074": {
"answer_docs": [
"748f14771b3febdc874b7827d151b6e0",
"6c9fa7f335e78ca818125c626d3bc216",
"ea781ee5a57a46b285d834708fee8c0d",
"abc4c58e9b7621b10a4732a98dc273b3"
],
"numerical_answer": 2
}, ....
Observations about the JSON format: for each question, there are two keys: numerical_answer (how many incidents satisfy the question constraints; in the example, two incidents) and answer_docs (the documents that support the answer; in the example, four documents report on the two answer incidents).
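Note that numerical_answer counts incidents rather than documents. The following sketch illustrates that distinction under a hypothetical mapping from answer documents to the incidents they report on (document and incident IDs are placeholders, not taken from the task data):

```python
# Hypothetical mapping from answer documents to the incident they report on.
doc_to_incident = {
    "doc_a": "incident_1",
    "doc_b": "incident_1",
    "doc_c": "incident_2",
    "doc_d": "incident_2",
}

answer_docs = list(doc_to_incident)                    # four supporting documents
numerical_answer = len(set(doc_to_incident.values()))  # two distinct incidents

print(numerical_answer, answer_docs)
# -> 2 ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```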
Example of a CONLL file annotated for event coreference:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
How many people were killed or injured? Subtask 3 consists of participant questions where we are interested in the outcome of the incident for the people involved. The answer is therefore a number ranging from 0 to N, representing the event outcomes of a certain type. In the case of gun violence, a single incident can have mixed outcomes, in which a number of people were injured and others died. Answering the question requires understanding, across documents, how many people were injured or died in incidents that match the question constraints.
The goal of subtask 3 is twofold: 1) to determine the number of people with the specified outcome role (injured or killed) across the answer incidents, and 2) to find the documents which provide evidence for the answer. Note that this task requires further reasoning over the outcome roles (being injured or being killed) that participants play in the answer incidents: it is not enough to decide that there is some killing/injuring incident relevant to the question; systems must also determine how many casualties there were. For example, a quantification of the participants, as in "two people killed", is considered a quantification of the killing event. Counts must reflect the final outcome of the event development, as this is how the structured data is recorded: if a person was initially injured and died later, we count this casualty as 1 killing and 0 injuries.
In addition, the task participants can annotate coreferential event mentions according to the event schema specified in the guidelines. Note that this schema does NOT ask for annotation of participants in roles, only of mentions of the (sub)events in relation to the question. Please also note that we expect all event mentions that fit the schema to be annotated in the document, regardless of the event type specified in the query. So even if the query is limited to killing events, we expect mentions of the incident itself, shootings, and injuries to be annotated as well. Although these annotations cannot be mapped directly to the answer, they help us understand how the event identification in the selected documents relates to the higher-level quantification task.
System input
Questions consist of an event type and two event properties. We refer to the "Learn the details" tab for more information.
For each question, the system input will consist of: the structured question representation (see below) and the full collection of input documents (docs.conll).
An example of a question can be found below with the event properties location and time.
"3-59191": {
"event_type": "killing",
"location": {
"state": "http://dbpedia.org/resource/Missouri"
},
"subtask": 3,
"time": {
"day": "26/01/2017"
},
"verbose_question": "How many people were killed in 26/01/2017 (day) in ('Missouri',) (state) ?"
}
Some observations about the file format:
All input documents in a tokenized format can be found in a file called docs.conll.
This file serves as the input for each question, i.e. it contains the documents that are provided to determine what the answer is to each question. Hence, all questions have the same input documents. We will use an example to explain the format:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY -
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
Some observations about the file format:
Systems can provide one or both of the following two outputs: a JSON file with answers, and a single CONLL file with mention-level event coreference for all documents. Again, we will use examples to explain the formats.
Example of a JSON file:
"3-59191": {
"answer_docs": [
"f5e081d0b616c05ba2c77dcc84df443a"
],
"numerical_answer": 3,
},....
Observations about the JSON format:
For each question, there are two keys: numerical_answer (how many people in the relevant incidents satisfy the question criteria; in the example, three people) and answer_docs (the supporting documents for the incidents, i.e. the documents that provide the information needed to answer the question; in the example, a single document provides the information about the one answer incident in which three people were killed).
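To make the final-outcome rule concrete, here is a small sketch that counts how many people were killed across the answer incidents of a question. The per-person outcome records are hypothetical, for illustration only; outcomes reflect the final state, so a person who was injured and later died is recorded as killed (1 killing, 0 injuries).

```python
# Hypothetical per-person outcome records for the answer incidents of a question.
# Outcomes reflect the final state: injured-then-died is recorded as "killed" only.
answer_incidents = {
    "incident_1": ["killed", "killed", "injured"],
    "incident_2": ["killed"],
}

def count_outcome(incidents, outcome):
    """Sum the participants with a given final outcome across answer incidents."""
    return sum(people.count(outcome) for people in incidents.values())

print(count_outcome(answer_incidents, "killed"))   # -> 3 (the numerical_answer)
print(count_outcome(answer_incidents, "injured"))  # -> 1
```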
Example of a CONLL file annotated for event coreference:
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
The task participants provide at most two different outputs: a JSON file with the answers per question, and a single CONLL file with mention-level event coreference annotations. An example of the JSON answer format:
{"2-101": {"numerical_answer": 3, "answer_docs": ["8", "11", "15", "17", "87"]}, ..., "2-897": {"numerical_answer": 1, "answer_docs": ["36", "39"]}}
Participants taking part in all three subtasks should prepare three JSON files following the format above, one per subtask.
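Before zipping, it may help to sanity-check the answer files. A minimal sketch follows; the expected keys are taken from the examples above, while the file name and the checks themselves are assumptions, not the official validator.

```python
import json

def check_answers(path):
    """Lightweight sanity check of an answers JSON file (not the official validator)."""
    with open(path, encoding="utf-8") as f:
        answers = json.load(f)
    for q_id, answer in answers.items():
        assert isinstance(answer.get("numerical_answer"), int), \
            f"{q_id}: numerical_answer must be an integer"
        docs = answer.get("answer_docs")
        assert isinstance(docs, list) and all(isinstance(d, str) for d in docs), \
            f"{q_id}: answer_docs must be a list of document identifiers"
    print(f"{len(answers)} answers look well-formed in {path}")

check_answers("s2_answers.json")  # placeholder file name
```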
#begin document (1a45d73a21522536c411807219ed553e);
1a45d73a21522536c411807219ed553e.DCT 2017-01-24 DCT -
1a45d73a21522536c411807219ed553e.t1.0 Hillsborough TITLE -
1a45d73a21522536c411807219ed553e.t1.1 deputies TITLE -
....
1a45d73a21522536c411807219ed553e.b2.7 a BODY -
1a45d73a21522536c411807219ed553e.b2.8 child BODY -
1a45d73a21522536c411807219ed553e.b2.9 was BODY -
1a45d73a21522536c411807219ed553e.b2.10 shot BODY (29997591319998578759991049991)
1a45d73a21522536c411807219ed553e.b2.11 once BODY -
....
1a45d73a21522536c411807219ed553e.b18.29 cocaine BODY -
1a45d73a21522536c411807219ed553e.b18.30 . BODY -
#end document
#begin document (441b8a536eeb16a6d4f94cf018f6bc10);
441b8a536eeb16a6d4f94cf018f6bc10.DCT 2017-03-07 DCT -
441b8a536eeb16a6d4f94cf018f6bc10.t1.0 Hope TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.1 , TITLE -
441b8a536eeb16a6d4f94cf018f6bc10.t1.2 AR TITLE -
...
Note: systems may answer/annotate only a subset of the questions. Our scoring scripts are flexible in this respect, and we also report the number of questions answered. For event coreference, the scripts select only those documents that were annotated for the evaluation and ignore all others.
The submissions are formatted as single .zip files. The content inside the submitted .zip file has the following structure:
To upload a submission, please refer to the Participate Section. An example submission can be found in the trial_data_final.zip package (for the trial data), and similarly in test_data.zip for the test data. Please zip your submission in the following way:
After that, please go to Participate -> Submit / View results and click Submit to upload your zip file. After uploading the zip file, the Status column will show: Submitting, Submitted, Finished.
Once your submission has been uploaded without errors, you can submit it to the competition leaderboard, which can be seen in the Results section. Make sure to check the logs for warnings.
For the latest valid submission, the leaderboard shows the following 13 scores:
Please note that there is only one event coreference evaluation, which is shared by all three subtasks. The mention-level average F1-score therefore represents the same evaluation across the three subtasks.
Finally, we would like to emphasize the following important details about the evaluation phase and beyond:
- it is still possible to evaluate on the trial data;
- the maximum number of submissions during the evaluation phase is 10;
- the results on the leaderboard are hidden during the evaluation phase.