SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA

Organized by borgr

The task evaluation period was between January 1, 2019 and February 1, 2019. Results were presented at SemEval 2019, held on June 6-7 in Minneapolis, USA (co-located with NAACL-HLT).

The task summary is published as the following paper:

@inproceedings{hershcovich-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 1: Cross-lingual Semantic Parsing with {UCCA}",
    author = "Hershcovich, Daniel  and
      Aizenbud, Zohar  and
      Choshen, Leshem  and
      Sulem, Elior  and
      Rappoport, Ari  and
      Abend, Omri",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/S19-2001",
    doi = "10.18653/v1/S19-2001",
    pages = "1--10"
}

Training, development and test data used in the shared task.

Results can be found under Submissions and Results.

Background

Semantic representation has received growing attention in NLP in the past few years, and many proposals for semantic schemes have recently been put forth. Examples include Abstract Meaning Representation, Broad-coverage Semantic Dependencies, Universal Decompositional Semantics, Parallel Meaning Bank, and Universal Conceptual Cognitive Annotation. These advances in semantic representation, along with corresponding advances in semantic parsing, hold promise to benefit essentially all text understanding tasks, and have already demonstrated applicability to summarization, paraphrase detection, and semantic evaluation (using UCCA; see below).

In addition to their potential applicative value, work on semantic parsing poses interesting algorithmic and modelling challenges, which are often different from those tackled in syntactic parsing, including reentrancy (e.g., for sharing arguments across predicates), and the modelling of the interface with lexical semantics. Semantic parsing into such schemes has been much advanced by recent SemEval workshops, including two tasks on Broad-coverage Semantic Dependency Parsing and two tasks on AMR parsing. We expect a SemEval task on UCCA parsing to have a similar effect. Moreover, given the conceptual similarity between the different semantic representations, it is likely that work on UCCA parsing will directly contribute to the development of other semantic parsing technology. Furthermore, conversion scripts are available between UCCA and the SDP, CoNLL-U and AMR formats. Teams that participated in past shared tasks on SDP, UD and AMR are encouraged to participate using similar systems and a conversion-based protocol.

UCCA is a cross-linguistically applicable semantic representation scheme, building on the established Basic Linguistic Theory typological framework. It has demonstrated applicability to multiple languages, including English, French and German (with pilot annotation projects on Czech, Russian and Hebrew), and stability under translation. It has proven useful for defining semantic evaluation measures for text-to-text generation tasks, including machine translation, text simplification and grammatical error correction.

UCCA supports rapid annotation by non-experts, assisted by an accessible annotation interface. The interface is powered by an open-source, flexible web-application for syntactic and semantic phrase-based annotation in general, and for UCCA annotation in particular.¹

Task Definition

The task consists in parsing text according to the UCCA semantic annotation. The task starts from pre-tokenized text.

UCCA represents the semantics of linguistic utterances as directed acyclic graphs (DAGs), where terminal (childless) nodes correspond to the text tokens, and non-terminal nodes to semantic units that participate in some super-ordinate relation. Edges are labelled, indicating the role of a child in the relation the parent represents. Nodes and edges belong to one of several layers, each corresponding to a “module” of semantic distinctions.

UCCA’s foundational layer covers the predicate-argument structure evoked by predicates of all grammatical categories (verbal, nominal, adjectival and others), the inter-relations between them, and other major linguistic phenomena such as semantic heads and multi-word expressions. It is the only layer for which annotated corpora exist at the moment, and is thus the target of this shared task. The layer’s basic notion is the Scene, describing a state, action, movement or some other relation that evolves in time. Each Scene contains one main relation (marked as either a Process or a State), as well as one or more Participants. For example, the sentence “After graduation, John moved to Paris” (see figure) contains two Scenes, whose main relations are “graduation” and “moved”. “John” is a Participant in both Scenes, while “Paris” only in the latter. Further categories account for inter-Scene relations and the internal structure of complex arguments and relations (e.g. coordination, multi-word expressions and modification).

UCCA distinguishes primary edges, corresponding to explicit relations, from remote edges (shown dashed in the figure) that allow a unit to participate in several super-ordinate relations. Primary edges form a tree in each layer, whereas remote edges enable reentrancy, forming a DAG.

UCCA graphs may contain implicit units with no correspondent in the text. The figure shows the annotation for the sentence “A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice.”. The sentence has been used to compare different semantic dependency schemes. It includes a single Scene, whose main relation is “apply”, a secondary relation “almost impossible”, as well as two complex arguments: “a similar technique” and the coordinated argument “such as cotton, soybeans, and rice.” In addition, the Scene includes an implicit argument, which represents the agent of the “apply” relation.

While parsing technology is well-established for syntactic parsing, UCCA has several properties that distinguish it from syntactic representations, chiefly its tendency to abstract away from syntactic detail that does not affect argument structure. For instance, consider the following examples, in which the concept of a Scene has a different rationale from the syntactic concept of a clause. First, non-verbal predicates in UCCA are represented like verbal ones, such as when they appear in copula clauses or noun phrases. Indeed, in the figure, “graduation” and “moved” are considered separate Scenes, despite appearing in the same clause. Second, in the same example, “John” is marked as a (remote) Participant in the graduation Scene, despite not being explicitly mentioned. Third, consider the possessive construction in “John’s trip home”. While in UCCA “trip” evokes a Scene in which “John” is a Participant, a syntactic scheme would analyze this phrase similarly to “John’s shoes”.

The differences in the challenges posed by syntactic parsing and UCCA parsing, and more generally semantic parsing, motivate the development of targeted parsing technology to tackle them.

The UCCA annotation guidelines can be found on Google Drive.

The list of the UCCA categories relevant to the task can be found on Google Drive.

TUPA

At the time of the shared task, two works had been published on UCCA parsing (Hershcovich et al., 2017; Hershcovich et al., 2018), presenting TUPA, a transition-based DAG parser based on a BiLSTM-based classifier.

Several baselines have been proposed, using different classifiers (sparse perceptron or feedforward neural network), and using conversion-based approaches that use existing parsers for other formalisms to parse UCCA by constructing a two-way conversion protocol between the formalisms.

TUPA showed superior performance over all such approaches, and thus served as a strong baseline for system submissions to the shared task.

The code and documentation for TUPA can be found on GitHub.

More information, including the resources, can be found on the UCCA general resource page.

¹ https://github.com/omriabnd/UCCA-App

Evaluation Criteria

Submission conditions

Participant systems in the task were evaluated in four settings:

English in-domain setting, using the Wiki corpus.
English out-of-domain setting, using the Wiki corpus as training and development data, and 20K Leagues as test data.
German in-domain setting, using the 20K Leagues corpus.
French setting with no training data (except trial data), using the 20K Leagues corpus as development and test data.

To allow both a level-ground comparison between systems and the use of hitherto untried resources, we held both an open and a closed track for submissions in the English and German settings. Closed track submissions are only allowed to use the gold-standard UCCA annotation distributed for the task in the target language, and are limited in their use of additional resources. Concretely, the additional data they are allowed to use is only that used by TUPA, which consists of automatic named entity annotations provided by spaCy¹, and automatic POS tags and syntactic dependency relations provided by UDPipe.² In addition, the closed track allows the use of word embeddings provided by fastText³ for all languages.

Systems in the open track, on the other hand, are allowed to use any additional resource, such as UCCA annotation in other languages, dictionaries or datasets for other tasks, provided that they make sure not to use any additional gold standard annotation over the same text used in the UCCA corpora.⁴ In both tracks, we require that submitted systems not be trained on the development data. Development data can be used for tuning. Due to the absence of an established pilot study for French, we only hold an open track for this setting. Training for French is allowed on the trial data (15 sentences).

The four settings and two tracks result in a total of 7 competitions, where a team may participate in anywhere between 1 and 7 of them. We encourage submissions in each track to use their systems to produce results in all settings. In addition, we encourage closed-track submissions to also submit to the open track.

Formats

For ease of submission, the sdp, conllu, conll, export and amr formats are accepted in addition to UCCA XML files. Such submissions will be automatically converted to UCCA using this script.

To convert manually:

pip install semstr 
python -m semstr.convert [filenames] -f [format] -o [out_dir]

Note that while the NeGra export format preserves all the information in the UCCA graphs, conversion to the sdp, conllu, conll and amr formats is lossy, due to the bilexical dependency structure (and due to reentrancies in AMR not being separated to primary and remote). Below are the labeled scores of converting the English Wiki corpus to these formats and back to the standard format:

Format	Primary LP	Primary LR	Primary LF	Remote LP	Remote LR	Remote LF
sdp	95.7	92.8	94.2	95.0	47.9	63.7
conllu	90.4	89.0	89.7	99.9	47.7	64.6
conll	93.2	92.0	92.6	95.6	48.2	64.1
amr	97.4	97.4	97.4	88.8	88.7	88.8

Scoring

In order to evaluate how similar an output UCCA structure is to a gold UCCA graph, we use the DAG F₁-score. Formally, let G₁ and G₂ be two UCCA annotations that share their set of leaves (tokens) W, with edge sets E₁ and E₂. For a node v in G₁ or G₂, define its yield, yield(v) ⊆ W, as its set of leaf descendants. An edge (v₁, u₁) in G₁ and an edge (v₂, u₂) in G₂ are matching if yield(u₁) = yield(u₂) and they have the same label. Labeled Precision and Recall are defined by dividing the number of matching edges in G₁ and G₂ by |E₁| and |E₂|, respectively. The DAG F₁-score is their harmonic mean. We will report Precision, Recall and F₁ scores both for primary and remote edges. For this task's evaluation, implicit units are disregarded. Also, the measures are indifferent to the position of the Function category.
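
As a rough illustration of the definition above, the following Python sketch computes labeled precision, recall and F₁ from two edge lists. It is not the official semstr evaluator. The (parent, label, child) triple format, the function names, the multiset counting of matches, and the choice to treat G₁ as the system output and G₂ as the gold graph are all assumptions made for the example. It also does not separate primary from remote edges, drop implicit units, or apply normalization.

```python
from collections import Counter


def edge_keys(edges):
    """Map each edge to (label, yield of its child) and count the results.

    `edges` is a list of (parent, label, child) triples (an assumed format).
    Any node that never appears as a parent is treated as a leaf (token).
    """
    children = {}
    for parent, _label, child in edges:
        children.setdefault(parent, []).append(child)

    def leaf_yield(node):
        # The yield of a leaf is the leaf itself; otherwise it is the
        # union of the yields of all children.
        if node not in children:
            return frozenset([node])
        return frozenset().union(*(leaf_yield(c) for c in children[node]))

    return Counter((label, leaf_yield(child)) for _parent, label, child in edges)


def dag_f1(predicted_edges, gold_edges):
    """Return (precision, recall, f1) over labeled edges."""
    predicted = edge_keys(predicted_edges)
    gold = edge_keys(gold_edges)
    # Multiset intersection: an edge matches if label and child yield agree.
    matches = sum((predicted & gold).values())
    precision = matches / sum(predicted.values()) if predicted else 0.0
    recall = matches / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy two-token example: the prediction mislabels one of three edges.
gold = [("root", "H", "scene"), ("scene", "A", "t0"), ("scene", "P", "t1")]
pred = [("root", "H", "scene"), ("scene", "A", "t0"), ("scene", "S", "t1")]
precision, recall, f1 = dag_f1(pred, gold)
```

Because edges are compared through the yield of the child rather than through node identities, the two graphs do not need to share node names, only tokens.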

The Center (C) category is disregarded by the evaluation in the two following cases:

1. If the unique child v of a node u is annotated as C, then v is disregarded. So in this case, if v is a leaf, u will be considered as a leaf instead of v and if v is not a leaf, the child nodes of v will be considered as the child nodes of u.

2. If v is a unique center in a unit u (i.e. the other children of u are not annotated as centers), and w is a unique center in v, then v is disregarded. That is, the child nodes of v (including w) will be considered as the child nodes of u.
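
The two Center rules can be sketched on a simple nested structure. This is only an illustration of the rules as worded above, not the normalization script used by the task: the representation (a unit is either a token string or a list of (label, child) pairs), the use of "C" as the Center label, and the function name are assumptions for the example.

```python
def normalize_centers(unit):
    """Apply the two Center rules bottom-up to a nested unit.

    A unit is either a token (str) or a list of (label, child) pairs.
    This representation is hypothetical and chosen for brevity.
    """
    if isinstance(unit, str):
        return unit
    children = [(label, normalize_centers(child)) for label, child in unit]

    # Rule 1: a unit whose only child is a Center is replaced by that
    # child, so it becomes a leaf or inherits the child's children.
    if len(children) == 1 and children[0][0] == "C":
        return children[0][1]

    # Rule 2: if the unit has exactly one Center v, and v itself has
    # exactly one Center, v is removed and its children move up.
    center_positions = [i for i, (label, _) in enumerate(children) if label == "C"]
    if len(center_positions) == 1:
        i = center_positions[0]
        v = children[i][1]
        if not isinstance(v, str) and sum(label == "C" for label, _ in v) == 1:
            children[i:i + 1] = v
    return children


# Rule 1: the Participant wraps a single Center and collapses to the token.
rule1 = normalize_centers([("A", [("C", "John")]), ("P", "moved")])

# Rule 2: the inner unit with a unique Center is spliced into its parent.
rule2 = normalize_centers([("E", "big"), ("C", [("E", "red"), ("C", "dog")])])
```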

Normalization will be automatically run before the evaluation using this script.

For each of the seven competitions, we will report winning systems according to the Primary F1-score and according to the Remote F1-score.

For a more fine-grained evaluation, Precision, Recall and F1 scores for specific categories (edge labels) will also be reported. UCCA labels can be divided into categories that correspond to Scene elements (States, Processes, Participants, Adverbials), non-Scene elements (Elaborators, Connectors, Centers), and inter-Scene Linkage (Parallel Scenes, Linkage, Ground). We will report performance for each of these sets separately, leaving out Function and Relator units that do not belong to any particular model.
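
The grouping described above could be scored along the following lines. This is a minimal sketch under stated assumptions: the category names are written out as in the text (the actual label strings in the data may differ), each edge is assumed to be already reduced to a (category, yield) pair, and the group names and the `group_scores` function are hypothetical.

```python
from collections import Counter

# Category sets as listed in the text; Function and Relator are left out.
GROUPS = {
    "scene": {"State", "Process", "Participant", "Adverbial"},
    "non_scene": {"Elaborator", "Connector", "Center"},
    "linkage": {"Parallel Scene", "Linkage", "Ground"},
}


def group_scores(predicted, gold):
    """Return {group: (precision, recall, f1)}.

    `predicted` and `gold` are lists of (category, yield) pairs, where
    yield is a frozenset of token positions (an assumed representation).
    """
    scores = {}
    for group, categories in GROUPS.items():
        p = Counter(edge for edge in predicted if edge[0] in categories)
        g = Counter(edge for edge in gold if edge[0] in categories)
        matches = sum((p & g).values())
        precision = matches / sum(p.values()) if p else 0.0
        recall = matches / sum(g.values()) if g else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[group] = (precision, recall, f1)
    return scores


gold_edges = [
    ("Participant", frozenset({0})),
    ("Process", frozenset({1})),
    ("Function", frozenset({2})),
]
predicted_edges = [
    ("Participant", frozenset({0})),
    ("State", frozenset({1})),
    ("Relator", frozenset({2})),
]
scores = group_scores(predicted_edges, gold_edges)
```

In this toy input the Function and Relator edges fall outside every group, so they affect none of the three scores.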

To evaluate manually:

pip install semstr
python -m semstr.evaluate [predicted_file_or_dir] [reference_file_or_dir]

¹ http://spacy.io. We use spaCy 2.0.12 with the en_core_web_lg, fr_core_news_md and de_core_news_sm models.
² http://ufal.mff.cuni.cz/udpipe. We use UDPipe 1.2 with the CoNLL 2018 baseline models trained on English-EWT, French-GSD and German-GSD.
³ http://fasttext.cc
⁴ We are not aware of any such annotation, but include this restriction for completeness.

Terms and Conditions

Competitors are not allowed to use the test set or the dev set for training, to use external data in competitions where it is stated they should not, or to violate any other rule of the competition.

Groups should not submit more than one system unless the systems differ in a meaningful way from one another. If unsure, contact the organizers.

All data released for this task is released under the Creative Commons License (the licenses can also be found with the data).

Organizers of the competition might choose to publicize, analyze and change in any way any content sent as a part of this task. Whenever appropriate, an academic citation for the sending group will be added (e.g. in a paper summarizing the task).

Competitors should comply with any general rules of SemEval.

The organizers are free to penalize or disqualify participants for any violation of the above rules, or for misuse, unethical behaviour or other behaviours they agree are not acceptable in a scientific competition in general and in this one in particular.

Submissions

Participants in the task submitted their results in the following format:

One zip file including two main folders named "closed" and "open": one for the closed tracks and the other for the open tracks.
Each directory should contain a subfolder for each setting the team is participating in, named accordingly.

The options for settings are "UCCA_English-Wiki", "UCCA_English-20K", "UCCA_German-20K", "UCCA_French-20K".

For example, the directory for UCCA_English-20K open track should be named "open/UCCA_English-20K".
Each track directory should contain a file for each sentence with the predicted annotation. The supported formats are UCCA xml files, conllu, sdp and amr.
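
The directory layout described above could be packaged with a short script such as the following. It is a sketch only: the `build_submission` function, the local output directory and the zip file name are hypothetical, and the script merely copies whatever prediction files it finds without checking their format. The closed track omits the French setting, since French is open-track only.

```python
import zipfile
from pathlib import Path

# Settings per track, using the folder names given in the text.
SETTINGS = {
    "closed": ["UCCA_English-Wiki", "UCCA_English-20K", "UCCA_German-20K"],
    "open": ["UCCA_English-Wiki", "UCCA_English-20K", "UCCA_German-20K",
             "UCCA_French-20K"],
}


def build_submission(output_root, zip_path):
    """Zip <output_root>/<track>/<setting>/* into the expected layout.

    Settings without a local directory are skipped, so a team can take
    part in any subset of the seven competitions. Returns the archive
    member names that were written.
    """
    output_root = Path(output_root)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for track, settings in SETTINGS.items():
            for setting in settings:
                directory = output_root / track / setting
                if not directory.is_dir():
                    continue
                for path in sorted(directory.iterdir()):
                    if path.is_file():
                        name = f"{track}/{setting}/{path.name}"
                        archive.write(path, name)
                        added.append(name)
    return added


if __name__ == "__main__":
    # Hypothetical paths: predictions under ./out, archive in ./submission.zip
    print(build_submission("out", "submission.zip"))
```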

Submit your system outputs here.

Results

The winners of the evaluation phase are:

UCCA_English-Wiki_closed track: the winner is hlt@suda team with 0.774 labeled averaged F1 score.

UCCA_English-Wiki_open track: the winner is hlt@suda team with 0.805 labeled averaged F1 score.

UCCA_English-20K_closed track: the winner is hlt@suda team with 0.727 labeled averaged F1 score.

UCCA_English-20K_open track: the winner is hlt@suda team with 0.767 labeled averaged F1 score.

UCCA_German-20K_closed track: the winner is hlt@suda team with 0.832 labeled averaged F1 score.

UCCA_German-20K_open track: the winner is hlt@suda team with 0.849 labeled averaged F1 score.

UCCA_French-20K_open track: the winner is hlt@suda team with 0.752 labeled averaged F1 score.

The full results of the evaluation phase can be found here.

The results of the post evaluation phase can be found here.

Baseline Models

Baseline models are available here.

To run these models, first install tupa:

pip install tupa==1.3.8

Then run:

python -m tupa <DATA> -m <MODEL> -o <OUTDIR>

For example,

python -m tupa dev/closed/UCCA_English-Wiki -m ucca-bilstm-20180917 -o out/closed/UCCA_English-Wiki

These models are the baseline models for the following competition tracks:

Model	Track
ucca-bilstm-20180917	closed/UCCA_English-Wiki
ucca-bilstm-20180917	closed/UCCA_English-20K
ucca-de-bilstm-20180917	closed/UCCA_German-20K
ucca-amr-dm-ud-bilstm-20180917	open/UCCA_English-Wiki
ucca-amr-dm-ud-bilstm-20180917	open/UCCA_English-20K
ucca-ud-de-bilstm-20180917	open/UCCA_German-20K
ucca-ud-fr-bilstm-20180917	open/UCCA_French-20K
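
To run the matching baseline for a given track, the table can be turned into a lookup that assembles the command shown earlier. This is a convenience sketch, not part of TUPA: the `tupa_command` helper is hypothetical, and the "dev" and "out" directory roots simply follow the example command above. Only the `-m` and `-o` options from that command are used.

```python
import sys

# Track -> baseline model, copied from the table above.
BASELINES = {
    "closed/UCCA_English-Wiki": "ucca-bilstm-20180917",
    "closed/UCCA_English-20K": "ucca-bilstm-20180917",
    "closed/UCCA_German-20K": "ucca-de-bilstm-20180917",
    "open/UCCA_English-Wiki": "ucca-amr-dm-ud-bilstm-20180917",
    "open/UCCA_English-20K": "ucca-amr-dm-ud-bilstm-20180917",
    "open/UCCA_German-20K": "ucca-ud-de-bilstm-20180917",
    "open/UCCA_French-20K": "ucca-ud-fr-bilstm-20180917",
}


def tupa_command(track, data_root="dev", out_root="out"):
    """Build the argument list for `python -m tupa <DATA> -m <MODEL> -o <OUTDIR>`."""
    model = BASELINES[track]
    return [
        sys.executable, "-m", "tupa",
        f"{data_root}/{track}",
        "-m", model,
        "-o", f"{out_root}/{track}",
    ]


command = tupa_command("closed/UCCA_English-Wiki")
# To actually run it (requires tupa and the model files to be available):
# import subprocess; subprocess.run(command, check=True)
```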

Practice

Start: Aug. 20, 2018, midnight

Description: Try training on trial data and evaluating on development data.

Evaluation

Start: Jan. 10, 2019, midnight

Description: Train on official training data, tune on development data and upload parsed test data for evaluation.

Post-Evaluation

Start: Feb. 1, 2019, midnight

Competition Ends

Never
