Sentence-level Critical Error Detection shared task 2021

Organized by fblain


QE Shared Task 2021

This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run time, without relying on reference translations. It covers estimation at the sentence and word levels. The main new elements introduced this year are: (i) a zero-shot sentence-level prediction task to encourage language-independent and unsupervised approaches; and (ii) a task on predicting catastrophic (critical) translation errors, i.e. errors that make the translation convey a completely different meaning and could therefore lead to negative consequences such as safety risks. In addition, we release new test sets for 2020's Tasks 1 and 2, and an extended version of the Wikipedia post-editing training data, covering 7 languages instead of 2. Finally, for all tasks, participants will be asked to report their model size (disk footprint without compression and number of parameters) with their submission, and systems can also be ranked on that basis.

In addition to generally advancing the state of the art in quality estimation, our specific goals are:

  • to extend the MLQE-PE public benchmark datasets,
  • to investigate new language-independent approaches, especially for zero-shot prediction,
  • to study the feasibility of unsupervised approaches, especially for zero-shot prediction, and
  • to create a new task focusing on critical error detection.

Official task webpage: QE Shared Task 2021

This submission platform covers Task 3: Sentence-level *Critical Error Detection*.

In Task 3, participating systems are required to classify each sentence translation as containing a critical error or not. Submissions will be evaluated according to how well they classify translations. We thus expect a binary score/label (0 or 1) for each sentence translation.

Submission Format

The output of your system for a given subtask should contain segment-level scores for the translations, formatted in the following way:

Line 1:

<DISK FOOTPRINT (in bytes, without compression)>

Line 2:

<NUMBER OF PARAMETERS>

Lines 3 to n+2, where n is the number of test samples:

<LANGUAGE PAIR> <METHOD NAME> <SEGMENT ID> <SEGMENT SCORE>

Where:

  • LANGUAGE PAIR is the ID (e.g., en-de) of the language pair of the plain text translation file you are scoring.
  • METHOD NAME is the name of your quality estimation method.
  • SEGMENT ID is the sentence pair identifier from the test file (1st column).
  • SEGMENT SCORE is the predicted binary label (0 or 1) for the particular segment.

Each field should be delimited by a single tab character.
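
For illustration, the first few lines of a valid predictions file could look as follows (the method name and all values are placeholders, and fields are tab-separated):

    524288000
    110000000
    en-de	MY_QE_METHOD	0	1
    en-de	MY_QE_METHOD	1	0
    en-de	MY_QE_METHOD	2	0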

To allow the automatic evaluation of your predictions, please submit them in a file named predictions.txt, packaged in a zipped file (.zip).
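
As a sketch of how such a file could be produced and packaged, the following Python snippet writes the two header lines followed by one tab-separated line per segment and zips the result. The model statistics, method name, and predictions below are placeholders to be replaced with your own values:

    import zipfile

    # Placeholder values -- replace with your actual model statistics and outputs.
    disk_footprint_bytes = 1234567890      # model size on disk, in bytes, without compression
    num_parameters = 110000000             # number of model parameters
    language_pair = "en-de"
    method_name = "MY_QE_METHOD"
    predictions = {0: 1, 1: 0, 2: 0}       # segment ID -> binary critical-error label

    # Write the two header lines, then one tab-separated line per segment.
    with open("predictions.txt", "w") as f:
        f.write(f"{disk_footprint_bytes}\n")
        f.write(f"{num_parameters}\n")
        for segment_id, label in predictions.items():
            f.write(f"{language_pair}\t{method_name}\t{segment_id}\t{label}\n")

    # Package the predictions file into a zip archive for submission.
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("predictions.txt")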

Submissions will be evaluated in terms of the Matthews correlation coefficient (MCC) between the predicted labels and the human labels. These are the official evaluation scripts.
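
For reference, MCC can be computed with scikit-learn as in the minimal sketch below; the labels shown are made up for illustration, and the official evaluation scripts linked above remain authoritative:

    from sklearn.metrics import matthews_corrcoef

    # Hypothetical example: gold and predicted binary labels for each segment.
    gold_labels = [0, 1, 0, 0, 1, 1]
    predicted_labels = [0, 1, 1, 0, 1, 0]

    # MCC ranges from -1 (total disagreement) to 1 (perfect prediction).
    mcc = matthews_corrcoef(gold_labels, predicted_labels)
    print(f"MCC: {mcc:.4f}")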

The provided QE labelled data is publicly available under a Creative Commons Attribution Share Alike 4.0 International licence (https://github.com/facebookresearch/mlqe/blob/master/LICENSE). Participants are allowed to explore any additional data and resources deemed relevant. Each participating team can submit at most 30 systems for each language pair and type of system (max 5 per day), except for the multilingual track of Task 1 (max 5 in total).

English-German

Start: July 1, 2021, midnight

English-Chinese

Start: July 1, 2021, midnight

English-Czech

Start: July 1, 2021, midnight

English-Japanese

Start: July 1, 2021, midnight

Competition Ends

Never
