SemEval 2019 Task 2

Organized by alfredo



SemEval 2019 Unsupervised Lexical Semantic Frame Induction Task

Lexical frame induction is defined as the process of grouping verbs and their dependent words into typed-feature structures (i.e., frames) in a fully unsupervised manner. The Berkeley FrameNet database is the best-known resource of such typed-feature structures. While lexical frame resources have proved helpful (even essential) in a range of NLP tasks and linguistic investigations, building them for new languages and domains is resource intensive and thus expensive. This problem can be alleviated using unsupervised lexical frame induction methods. The goal of this task is to develop a benchmark and allow comparison of unsupervised frame induction systems for building lexical frame resources for verbs and their arguments.


Please join the dedicated Google group for inquiries about this shared task.

To contact the task organizers directly, please send an email to





Task Description and Subtasks

Target verbs and their arguments in a test set must be clustered into automatically learned frame structures. The test set consists of approximately 5000 frames from 1000 randomly chosen sentences from the PTB 3.0. The organizers provide free evaluation access to the full PTB 3.0 (courtesy of the LDC under the specified evaluation license).

The chosen verbs (and their argument structures) in the test sentences will be revealed to participants at a later stage, close to the evaluation period. However, a development dataset containing gold annotations is available and can be acquired by participants after providing the signed LDC license agreement. The gold annotations for the full test set will be revealed to participants only after the evaluation and submission periods.

The assumed frame structures have a head and an arbitrary number of slots/roles. A verb lexicalizes the head of a frame, and (some of) the verb's arguments are the slot fillers for the frame that the verb evokes. For the test set, the positions of the target verbs and their arguments (roles/frame elements) are given (i.e., subcategorization frames are assumed a priori rather than being determined by systems).

Further instructions regarding obtaining the data and the format of files can be found in the Datasets tab.

Participants are allowed to use additional corpora and tools for training as long as these resources/tools do not contain or use explicit (supervised) semantic annotations regarding word senses, frame groupings, or semantic roles. For instance, while using a lexical semantic resource such as Princeton WordNet is not allowed, participants can (and are encouraged to) use WordNet-like lexical databases that are built automatically.

Participants are invited to partake in one or more of the following subtasks:

Subtask 1: Grouping Verbs to Frame Type Clusters

The aim of this subtask is to identify verbs that evoke the same frame.

For this subtask, participants are required to assign occurrences of the target verbs to a number of clusters, such that verbs belonging to the same cluster evoke the same frame type. Our gold annotations for this subtask are based on FrameNet definitions of frames (where applicable). Following the FrameNet project, a frame is a conceptual structure that describes a certain type of object, situation, or event. In our task, the focus is on event frames evoked by verbs. For instance, in the following examples:

a. Trump leads the world, backward.
b. Disrespecting international laws leads to many complications.
c. Rosenzweig heads the climate impacts section at NASA's Goddard Institute.

we expect the verb `to lead' in ex. a and `to head' in ex. c to end up in one cluster (e.g., call it Leadership, after FrameNet), whereas `to lead' in ex. b will end up in another cluster (e.g., call it Cause) together with instances of verbs such as `originate' and `produce' (when they are used in the same sense). As exemplified above, Subtask 1 goes beyond verb-sense induction by requiring that synonymous, troponymous, and (even) antonymous senses of verbs be grouped together.



Subtask 2.1: Clustering arguments of verbs to frame-specific slots

The aim of this subtask is clustering arguments of verbs according to their corresponding FrameNet Frame Elements.

For this subtask, "semantic arguments" of verbs must be grouped into a number of frame-specific slots, similar to FrameNet. That is, we assume argument groupings are specific to frame types and are not necessarily shared with other frames. As a result, participating in Subtask 2.1 requires participation in Subtask 1, since the evaluation of argument clusters is done per frame cluster. However, one could build frame-specific slot clusters by using a heuristic/assumption such as assuming one frame per verb form.


Subtask 2.2: Clustering arguments of verbs to generic roles

The aim of this subtask is to cluster arguments of verbs according to their Semantic Roles.

In contrast to Subtask 2.1, here arguments are clustered into a set of generic semantic roles that are defined independently of frame definitions. Hence, this subtask is very similar to unsupervised semantic role labeling (e.g., see SemEval 2015 Task 18). Clustering verbs to frames (i.e., Subtask 1) is not mandatory for this subtask, and clusters that assign arguments to the latent semantic roles are evaluated disregarding the frames that the verbs belong to. Our gold annotations are based on an inventory of semantic roles adapted from the VerbNet project.


Evaluation Setting

The evaluation framework consists of a number of clustering evaluation measures. The induced groupings for frame types/slots/roles are evaluated as clusters of examples, which are compared against the gold-standard annotations. The scorer reports performance with respect to a range of evaluation metrics; however, for the official ranking of systems we use

  • Purity, inverse Purity, and their harmonic mean, as proposed by Steinbach et al. (2000);
  • and the harmonic mean of BCubed Precision and Recall, as proposed by Bagga and Baldwin (1998).
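To make these measures concrete, here is a minimal sketch of both scores in Python. The function names and the dict-based input representation (instance id mapped to cluster label) are our own illustration, not the official scorer's interface:

```python
from collections import Counter

def purity(system, gold):
    """Fraction of instances that share their system cluster's majority gold class.
    system, gold: dicts mapping instance id -> cluster/class label."""
    clusters = {}
    for inst, c in system.items():
        clusters.setdefault(c, []).append(gold[inst])
    return sum(max(Counter(members).values())
               for members in clusters.values()) / len(system)

def f_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def purity_f(system, gold):
    """Purity, inverse Purity (roles swapped), and their harmonic mean."""
    pu = purity(system, gold)
    ipu = purity(gold, system)
    return pu, ipu, f_score(pu, ipu)

def bcubed_f(system, gold):
    """Harmonic mean of item-averaged BCubed precision and recall."""
    insts = list(system)
    prec = rec = 0.0
    for i in insts:
        same_cluster = [j for j in insts if system[j] == system[i]]
        same_class = [j for j in insts if gold[j] == gold[i]]
        overlap = sum(1 for j in same_cluster if gold[j] == gold[i])
        prec += overlap / len(same_cluster)
        rec += overlap / len(same_class)
    return f_score(prec / len(insts), rec / len(insts))
```

For example, a system that splits a single gold class across two clusters gets perfect Purity but reduced inverse Purity, which the harmonic mean penalizes.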

The scorer program can be downloaded from here (source code is also available on GitHub) and run on your local machine (requires Java 1.8).


Evaluation Baselines

For Subtask 1, we report results from clusterings formed by assuming that each verb form evokes one frame (i.e., one frame per verb), by assigning each verb occurrence to its own cluster (i.e., one cluster per instance), and by grouping verbs into randomly generated clusters (i.e., a random baseline). Additionally, we report a baseline using the method proposed by Kallmeyer et al. (2018) [1].

For Subtasks 2.1 and 2.2, we group verb arguments by the type of their syntactic relation to the head verb, e.g., subjects form one cluster, objects form another cluster, and so on (i.e., one cluster per syntactic category), as well as a random baseline.
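The Subtask 1 baselines above are trivial to reproduce. The following sketch shows all three, assuming a hypothetical representation of instances as (instance id, verb lemma) pairs; none of these function names come from the official code:

```python
import random

def one_frame_per_verb(instances):
    """Baseline: every occurrence of the same verb lemma evokes the same frame."""
    return {inst_id: lemma for inst_id, lemma in instances}

def one_cluster_per_instance(instances):
    """Baseline: every verb occurrence gets its own singleton cluster."""
    return {inst_id: inst_id for inst_id, _ in instances}

def random_clusters(instances, k, seed=0):
    """Baseline: assign each occurrence to one of k clusters uniformly at random."""
    rng = random.Random(seed)
    return {inst_id: rng.randrange(k) for inst_id, _ in instances}
```

The one-frame-per-verb baseline is surprisingly strong for Purity-style measures, since most verb forms evoke only one or two frames in newswire text.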






Call for Participation

SemEval-2019 Shared Task #2: Unsupervised Frame Induction



This shared task should be interesting to researchers working on:

* Frame Induction (FrameNet)
* Word Sense Induction 
* Unsupervised Semantic Role Labelling (VerbNet)
* Lexicography

========== Background ==========

Unsupervised identification of semantic frames for lexical items is a challenging task that is useful in many contexts: developing and maintaining lexical frame databases, gaining insight into lexical item behavior at the syntax--semantics interface, as well as practical applications such as information extraction. The 2019 SemEval Task #2 invites participants to tackle this challenge in any of the following sub-tasks:

* Sub-Task 1 -- Verb clustering: identify verbs that evoke the same frame. The task is similar to word sense induction but goes beyond it since, additionally, word senses may need further grouping into one category. For example, given uses of the verb buy, a system must distinguish between its different senses, such as purchase (FrameNet's Commerce_buy) and accepting/rejecting (FrameNet's Be_in_agreement_on_assessment). Additionally, it must be able to cluster `to buy' and `to purchase' together when they are both used in the sense of Commerce_buy.

* Sub-Task 2-1 -- Clustering arguments of verbs according to the corresponding FrameNet Frame Elements: Systems are given a set of pre-determined slot fillers. These slot fillers must be clustered such that the resulting grouping resembles FrameNet's definitions/categorizations of fillers. (For this sub-task, participating in Sub-Task 1 is necessary.)

* Sub-Task 2-2 -- Clustering arguments of verbs according to their Semantic Roles: Similar to Sub-Task 2-1, pre-determined arguments of a set of target verbs must be clustered into a number of semantic/thematic role categories (Agent, Theme, Patient, etc.). We adapted role definitions from VerbNet for developing the test data for this sub-task.

We have provided newly annotated data from a subset of the PTB's WSJ section. Since all the tasks are done in an unsupervised manner, no training data is provided. However, for the purpose of development, we provide 600 annotated records.

Further information about the data, including how to obtain it, can be found via the URL at the top of the page.

We welcome submissions of new systems as well as methods described previously elsewhere. Participants are welcome to use additional data in any form as long as they (1) do not use semantically annotated data (e.g., role annotations, or lexical resources such as FrameNet, WordNet, etc.) and (2) acknowledge the use of any resource other than the PTB.

We will evaluate systems against a number of clustering evaluation measures.

To contact task organizers, please send an email to
Google Groups:!forum/semeval_frameinduction
CodaLab Page:


Important Dates:

Please note that the development/trial dataset is available now for download. The expected schedule for the evaluation phase:

15 Jan 2019: Start evaluation
31 Jan 2019: End evaluation
05 Feb 2019: Results posted
28 Feb 2019: System and Task description paper submissions due by 23:59 GMT
Summer 2019: SemEval 2019

Location: TBA




To obtain a valid license for the full test and training sets, participants must agree to the "SemEval 2019 Task 2 Evaluation Agreement" as set by the LDC to access the Penn Treebank 3.0 (PTB). In a nutshell, the LDC's agreement states that 1) the data can only be used for the purposes of this shared task, and 2) users' access to the data is temporary and they must destroy the data at the completion of the shared task. Accepting this agreement is necessary since the test set is part of the PTB: a subset of 1000 sentences randomly chosen from the texts in the Wall Street Journal section of the PTB.


To obtain a valid license:

Once your request is processed by the LDC, you will receive additional instructions from the LDC (and the organizers) for obtaining the data, including a dev set which contains gold annotations for 200 frames. Please note that obtaining the license from the LDC may take a few days.

The full blind test set will be made available here later (around December, close to the evaluation period).

A minimal example can be downloaded from here.


Data Format

Participants are provided with a CoNLL-U version of the tokenized and automatically parsed (in the Universal Dependencies formalism) sentences of PTB 3.0. For instance, for the input sentence

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

which appears as the first sentence of section wsj_0001 of the PTB, the CoNLL-U file contains the following record:


#20001001
1 Pierre Pierre PROPN NNP _ 2 compound _ _
2 Vinken Vinken PROPN NNP _ 9 nsubj _ _
3 , , PUNCT , _ 2 punct _ _
4 61 61 NUM CD _ 5 nummod _ _
5 years year NOUN NNS _ 6 nmod:npmod _ _
6 old old ADJ JJ _ 2 amod _ _
7 , , PUNCT , _ 2 punct _ _
8 will will AUX MD _ 9 aux _ _
9 join join VERB VB _ 0 root _ _
10 the the DET DT _ 11 det _ _
11 board board NOUN NN _ 9 dobj _ _
12 as as ADP IN _ 15 case _ _
13 a a DET DT _ 15 det _ _
14 nonexecutive nonexecutive ADJ JJ _ 15 amod _ _
15 director director NOUN NN _ 9 nmod _ _
16 Nov. Nov. PROPN NNP _ 9 nmod:tmod _ _
17 29 29 NUM CD _ 16 nummod _ _
18 . . PUNCT . _ 9 punct _ _

in which the first line (starting with #) gives the identifier assigned to this sentence. These identifiers, together with the positions assigned to the word forms in the sentence (the numbers in the first column), are used to form evaluation instances. For example, for Subtask 1, the following string:

#20001001 9 join.Becoming-a-member

states that the verb at position 9 (with the lemma `join') of the sentence with id #20001001 evokes a frame of type Becoming-a-member; note the format we use to convey this information, i.e.:

#sent-id position verb-lemma.ftype

In a similar way, for subtask 2.1 we have the record:

#20001001 9 join.Becoming-a-member vinken-:-2-:-New-member board-:-11-:-Group

and for Subtask 2.2 we have

#20001001 9 join.Becoming-a-member vinken-:-2-:-Agent board-:-11-:-Patient

(note the difference between the slot/role labels). Here, we assume that the frame has two core arguments.

As exemplified, for Subtasks 2.1 and 2.2, the records have the format:

#sent-id position verb-lemma.ftype (word-:-position-:-slot-type)+

in which the word, position, and slot-type of each filler are delimited by the character sequence -:-

UPDATE: If a slot/argument filler consists of more than one word, its elements are attached to each other with '_' (the underscore character). For example, in `John works in New York', for New York as an argument filler we will have a representation such as New_York-:-4_5-:-Location; this states that tokens 4 and 5 in the sentence (i.e., New + York) are annotated as a filler of type Location. Please see more examples in the trial data.
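A record in this format can be parsed in a few lines. The following is an illustrative sketch (the function name and the tuple-based return value are our own, not part of the task's tooling); it handles the underscore convention for multi-word fillers:

```python
def parse_record(line):
    """Parse one annotation record, e.g.
    '#20001001 9 join.Becoming-a-member vinken-:-2-:-New-member board-:-11-:-Group'
    Returns (sent_id, verb_position, verb_lemma, frame_type, fillers), where
    fillers is a list of (words, positions, slot_type) tuples."""
    fields = line.split()
    sent_id = fields[0].lstrip('#')
    position = int(fields[1])
    # split only on the first '.', since frame types contain hyphens but no dots
    lemma, ftype = fields[2].split('.', 1)
    fillers = []
    for field in fields[3:]:
        word, pos, slot = field.split('-:-')
        # multi-word fillers join their tokens and positions with '_'
        fillers.append((word.split('_'),
                        [int(p) for p in pos.split('_')],
                        slot))
    return sent_id, position, lemma, ftype, fillers
```

Splitting on the three-character sequence -:- is safe even though slot labels such as New-member contain plain hyphens.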

In the trial dataset, alongside the positional information that determines the shape of frames, the labels for frame types and slots/roles are provided too. In the final test set, the labels that must be predicted by systems are replaced by the keyword UKN. For instance, the test set for Subtask 1 will look like:

#20001001 9 join.UKN

and for Subtask 2.1:

#20001001 9 join.UKN vinken-:-2-:-UKN board-:-11-:-UKN

and for Subtask 2.2 we have

#20001001 9 join.NA vinken-:-2-:-UKN board-:-11-:-UKN

(Note that for Subtask 2.2, prediction of the frame type is not necessary.)
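Producing a submission then amounts to replacing each UKN placeholder with a predicted cluster label while leaving everything else in the record untouched. A hypothetical sketch (the function name and the dict-keyed prediction lookups are our own illustration, not the official submission tooling):

```python
def fill_predictions(test_lines, frame_labels, slot_labels=None):
    """Replace UKN placeholders in test records with predicted cluster labels.
    frame_labels: dict (sent_id, verb_position) -> frame cluster label
    slot_labels:  dict (sent_id, verb_position, filler_position_str) -> slot label
    """
    out = []
    for line in test_lines:
        fields = line.split()
        sent_id = fields[0].lstrip('#')
        vpos = int(fields[1])
        lemma, _, ftype = fields[2].partition('.')
        if ftype == 'UKN':
            ftype = str(frame_labels[(sent_id, vpos)])
        new_fields = [fields[0], fields[1], lemma + '.' + ftype]
        for field in fields[3:]:
            word, fpos, slot = field.split('-:-')
            if slot == 'UKN' and slot_labels is not None:
                slot = str(slot_labels[(sent_id, vpos, fpos)])
            new_fields.append('-:-'.join([word, fpos, slot]))
        out.append(' '.join(new_fields))
    return out
```

Note that the NA frame label used in Subtask 2.2 records is passed through unchanged, since only UKN marks a label to be predicted.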


Evaluation Scripts

The scorer program can be downloaded from here (source code is also available on GitHub) and run on your local machine (requires Java 1.8).

Results from the evaluation period can be seen at the following links in tabular format (to sort the tables by different measures, click on the header descriptions). Final submissions from the evaluation and post-evaluation periods will be available here soon.

Submission of System Description Papers -- Important Dates

System description papers will be part of the official SemEval proceedings and must be in ACL style (see the conference page for more details and for downloading the style files).

To submit your paper:

  • Login to the START Conference Manager system at
  • Click "make a new submission", then under "submission categories" select **system description** as the submission type and our task number (i.e., 2) as the task.



  • February 23: System description papers due 
  • March 16: System description paper reviews due
  • March 29: Author notification
  • April 5: Camera ready submissions

Trial subtask 1: grouping verbs to frame

Start: June 30, 2018, midnight

Description: Deprecated Scorer

DEV-OLD 600 instances --Scorer for all subtasks

Start: June 30, 2018, midnight

Description: Scorer for all sub-tasks: Submit your files per instruction in one bundle and see results in the detailed report for all tasks at once.

Final Test Data 4620 Records

Start: Jan. 13, 2019, noon

Description: All Subtasks

Competition Ends

