Sentence-level Post-Editing Effort QE shared task 2021

Organized by fblain


QE Shared Task 2021

This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It covers estimation at the sentence and word levels. The main new elements introduced this year are: (i) a zero-shot sentence-level prediction task to encourage language-independent and unsupervised approaches; and (ii) a task on predicting catastrophic (critical) translation errors, i.e. errors that make the translation convey a completely different meaning and could therefore lead to negative effects such as safety risks. In addition, we release new test sets for 2020's Tasks 1 and 2, and extend the Wikipedia post-editing training data from 2 to 7 languages. Finally, for all tasks, participants will be asked to report their model size (disk space without compression and number of parameters) with their submission, and systems can also be ranked on that basis.

In addition to generally advancing the state of the art in quality estimation, our specific goals are:

  • to extend the MLQE-PE public benchmark datasets,
  • to investigate new language-independent approaches, especially for zero-shot prediction,
  • to study the feasibility of unsupervised approaches, especially for zero-shot prediction, and
  • to create a new task focusing on critical error detection.

Official task webpage: QE Shared Task 2021

This submission platform covers Task 2: Sentence-level *Post-Editing Effort*.

In Task 2, participating systems are required to score sentence translations with HTER (Human-targeted Translation Edit Rate), i.e. the proportion of edits needed to turn the machine translation into its human post-edit. Submissions will be evaluated on how well they score translations; we thus expect an absolute quality score for each sentence translation.
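
For intuition only, the target label can be approximated as the token-level edit distance between the machine translation and its human post-edit, normalized by the post-edit length. The official labels are produced with TER tooling, which also accounts for block shifts, so the sketch below is an illustration rather than the official labelling procedure; the example sentences are made up.

    # Minimal sketch of an HTER-style score: token-level Levenshtein distance
    # between the MT output and its human post-edit, normalized by the
    # post-edit length. Official labels use TER tooling (which also handles
    # shifts), so this is an approximation for intuition only.

    def hter_approx(mt: str, post_edit: str) -> float:
        hyp, ref = mt.split(), post_edit.split()
        # Standard dynamic-programming edit distance over tokens.
        dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            dp[i][0] = i
        for j in range(len(ref) + 1):
            dp[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[len(hyp)][len(ref)] / max(len(ref), 1)

    # Hypothetical example: one inserted word out of six post-edit tokens.
    print(hter_approx("the cat sat on mat", "the cat sat on the mat"))  # ~0.167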

Submission Format

For a given subtask, the output of your system should contain segment-level scores for the translations, formatted in the following way:

Line 1:

<DISK FOOTPRINT (in bytes, without compression)>

Line 2:

<NUMBER OF PARAMETERS>

Lines 3 onwards, one line per test sample:

<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:

  • LANGUAGE PAIR is the ID (e.g., en-de) of the language pair of the plain text translation file you are scoring.
  • METHOD NAME is the name of your quality estimation method.
  • SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
  • SEGMENT SCORE is the predicted score for the particular segment.

Each field should be delimited by a single tab character.

To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt, and package them in a zipped file (.zip).
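
As an illustration, the sketch below writes such a predictions.txt file for hypothetical scores and packages it as a zip archive. The language pair, method name, scores, and model-size figures are placeholders, not values from this task.

    import zipfile

    # Hypothetical inputs: one predicted HTER score per test segment.
    language_pair = "en-de"               # placeholder language pair
    method_name = "MY-QE-SYSTEM"          # placeholder method name
    scores = [0.12, 0.43, 0.07]           # placeholder predictions
    disk_footprint_bytes = 1_200_000_000  # model size on disk, uncompressed
    num_parameters = 300_000_000          # number of model parameters

    with open("predictions.txt", "w", encoding="utf-8") as f:
        f.write(f"{disk_footprint_bytes}\n")
        f.write(f"{num_parameters}\n")
        for idx, score in enumerate(scores):  # segment numbers start at 0
            f.write(f"{language_pair}\t{method_name}\t{idx}\t{score}\n")

    # Package the file as required by the submission system.
    with zipfile.ZipFile("submission.zip", "w") as zf:
        zf.write("predictions.txt")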

Submissions will be evaluated in terms of Pearson correlation between the predicted and gold sentence-level HTER scores. These are the official evaluation scripts.
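
For a quick local sanity check before submitting, Pearson correlation between your predictions and any gold HTER labels you hold out can be computed with scipy. This is not the official evaluation script, and the file names below are assumptions (one score per line in each file).

    from scipy.stats import pearsonr

    # Local sanity check (not the official script): compare predictions
    # against held-out gold HTER labels, one score per line in each file.
    with open("predictions_dev.txt") as f:   # assumed file of predicted scores
        pred = [float(line.strip()) for line in f]
    with open("gold_hter_dev.txt") as f:     # assumed file of gold HTER labels
        gold = [float(line.strip()) for line in f]

    r, p_value = pearsonr(pred, gold)
    print(f"Pearson r = {r:.4f}")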

The provided QE labelled data is publicly available under the Creative Commons Attribution Share Alike 4.0 International license (https://github.com/facebookresearch/mlqe/blob/master/LICENSE). Participants are allowed to use any additional data and resources they deem relevant. Each participating team can submit at most 30 systems per language pair and system type (max 5 per day), except for the multilingual track of Task 1 (max 10 in total).

Phases and start dates:

  • Multilingual: June 25, 2021, midnight
  • English-German: June 10, 2021, midnight
  • English-Chinese: June 10, 2021, midnight
  • Romanian-English: June 10, 2021, midnight
  • Estonian-English: June 10, 2021, midnight
  • Nepalese-English: June 10, 2021, midnight
  • Sinhala-English: June 10, 2021, midnight
  • Russian-English: June 10, 2021, midnight
  • English-Czech: June 25, 2021, midnight
  • English-Japanese: June 25, 2021, midnight
  • Pashto-English: June 25, 2021, midnight
  • Khmer-English: June 25, 2021, midnight

Competition Ends: Never
