The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.
In addition to generally advancing the state of the art at all prediction levels for modern neural MT, our specific goals are:
Official task webpage: QE Shared Task 2020
This submission platform covers Task 1: Sentence-level *Direct Assessment*.
In Task 1, participating systems are required to score sentences according to Direct Assessment scores. Submissions will be evaluated according to how well they score translations. We thus expect an absolute quality score for each sentence translation (z-standardised DA).
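As background, z-standardisation rescales each annotator's raw DA ratings to zero mean and unit variance, removing individual scoring biases before scores are averaged. The released data already contains standardised scores, so the Python sketch below is for reference only; the function name and example values are ours, not part of the official tooling.

```python
import numpy as np

# A minimal sketch of per-annotator z-standardisation; names and values
# are illustrative, not part of the official data pipeline.
def z_standardise(scores_by_annotator):
    z = {}
    for annotator, scores in scores_by_annotator.items():
        scores = np.asarray(scores, dtype=float)
        mean, std = scores.mean(), scores.std()
        # Guard against a zero standard deviation (constant scores).
        z[annotator] = (scores - mean) / std if std > 0 else scores - mean
    return z

# Hypothetical raw DA scores from two annotators with different habits.
raw = {"a1": [70, 80, 90], "a2": [20, 50, 80]}
print(z_standardise(raw))
```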
Submission Format
Your system's output for a given subtask should contain segment-level scores for the translations, formatted in the following way:
<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
Where:
- LANGUAGE PAIR is the ID (e.g., en-de) of the language pair of the plain text translation file you are scoring.
- METHOD NAME is the name of your quality estimation method.
- SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
- SEGMENT SCORE is the predicted score for the particular segment.

Each field should be delimited by a single tab character.
To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt, and package them in a zipped file (.zip).
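As an illustration, here is a minimal Python sketch that writes a correctly formatted predictions.txt and packages it for upload. The method name, scores, and the archive name submission.zip are hypothetical placeholders.

```python
import zipfile

# Hypothetical predictions: (segment number, z-standardised DA score) pairs.
# Segment numbers must match the line numbering of the translation file.
predictions = [(0, 0.512), (1, -0.173), (2, 1.048)]
lang_pair = "en-de"
method_name = "MyQEMethod"  # placeholder method name

# One tab-delimited line per segment, in the required field order.
with open("predictions.txt", "w", encoding="utf-8") as f:
    for seg_num, score in predictions:
        f.write(f"{lang_pair}\t{method_name}\t{seg_num}\t{score:.6f}\n")

# Package the predictions file as a zip for upload.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions.txt")
```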
Submissions will be evaluated in terms of Pearson's correlation between the predicted and the human DA scores. These are the official evaluation scripts. The main evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages in the Wikipedia domain. Therefore, the average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages.
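To make the metric concrete, the sketch below computes Pearson's correlation with scipy, along with the multilingual average used for ranking. All score and correlation values are made up for illustration; this is not the official evaluation script linked above.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical predicted and gold DA scores for one language pair.
pred = np.array([0.51, -0.17, 1.05, -0.80])
gold = np.array([0.40, -0.30, 0.90, -0.60])
r, _ = pearsonr(pred, gold)
print(f"Pearson r: {r:.3f}")

# The multilingual ranking averages Pearson r over the Wikipedia-domain
# language pairs; the r values below are invented for illustration.
per_language = {"en-de": 0.41, "en-zh": 0.38, "ro-en": 0.79}
print(f"Average r: {np.mean(list(per_language.values())):.3f}")
```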
The provided QE labelled data is publicly available under Creative Commons Attribution Share Alike 4.0 International (https://github.com/facebookresearch/mlqe/blob/master/LICENSE).
Participants are allowed to explore any additional data and resources deemed relevant.
Each participating team can submit at most 30 systems per language pair and system type (max 5 per day), except for the multilingual track of Task 1 (max 5 in total).
Start: April 19, 2020, midnight
End: Never