Word-level Post-Editing Effort QE shared task 2020

Organized by erickrf


April 19, 2020, midnight UTC


Competition Ends

QE Shared Task 2020

The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.

In addition to generally advancing the state of the art at all prediction levels for modern neural MT, our specific goals are:

  • to create a new set of public benchmarks for tasks in quality estimation,
  • to investigate models for predicting DA scores and their relationship with models trained for predicting post-editing effort,
  • to study the feasibility of multilingual (or even language-independent) approaches to QE,
  • to study the influence of source-language document-level context for the task of QE, and
  • to analyse the applicability of NMT model information for QE.

Official task webpage: QE Shared Task 2020

This submission platform covers Task 2: Word-level *Post-Editing Effort*.

In Task 2, participating systems are required to detect errors on both the source side (to detect which words caused errors) and the target side (to detect mistranslated or missing words):

  • Target. Each token is tagged as either OK or BAD. Additionally, each gap between two words is tagged as BAD if one or more missing words should have been there, and OK otherwise. Note that the number of tags for each target sentence is 2*N+1, where N is the number of tokens in the sentence.
  • Source. Tokens are tagged as OK if they were correctly translated, and BAD otherwise. Gaps are not tagged.

Submission Format

The output for the word-level subtask can be up to two separate files: one with MT labels (for words and gaps) and another with source labels. You can submit for either of these subtasks or for both, independently. The output format should be the same as in the .tags and .source_tags files in the training data; i.e., the .tags file should be formatted as:

GAP_1 WORD_1 GAP_2 WORD_2 ... GAP_n WORD_n GAP_n+1

and the .source_tags file should be:

WORD_1 WORD_2 ... WORD_n

where each WORD_i and GAP_i is either OK or BAD. Tags must be separated by whitespace (spaces or tab characters).

For MT labels, each sentence will therefore correspond to 2n+1 tags (where n is the number of words in the sentence), alternating gaps and words, in order. For source labels, each sentence will correspond to n tags.
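The interleaving described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the official tooling: the names format_mt_tags and check_mt_line are hypothetical, and the checks simply restate the 2n+1 rule.

```python
from typing import List

VALID_TAGS = {"OK", "BAD"}


def format_mt_tags(gap_tags: List[str], word_tags: List[str]) -> str:
    """Interleave tags as GAP_1 WORD_1 ... GAP_n WORD_n GAP_n+1."""
    if len(gap_tags) != len(word_tags) + 1:
        raise ValueError("expected n+1 gap tags for n word tags")
    tags = [gap_tags[0]]
    # After the first gap, every word is followed by the gap to its right.
    for word, gap in zip(word_tags, gap_tags[1:]):
        tags.extend([word, gap])
    invalid = [t for t in tags if t not in VALID_TAGS]
    if invalid:
        raise ValueError(f"tags must be OK or BAD, got {invalid}")
    return " ".join(tags)


def check_mt_line(line: str, n_tokens: int) -> bool:
    """Return True if the line holds exactly 2n+1 valid tags."""
    tags = line.split()  # splits on spaces and tabs alike
    return len(tags) == 2 * n_tokens + 1 and all(t in VALID_TAGS for t in tags)
```

For a two-word sentence with a missing word at the end, format_mt_tags(["OK", "OK", "BAD"], ["OK", "BAD"]) yields a line of five tags, which check_mt_line accepts for n_tokens=2.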

For example, consider the following MT sentence and its post-edited version. The wrong words in the MT are übergeordnete, von and selbst:

anschließend wird in jeder Methode die übergeordnete Superclass-Version von selbst aufgerufen .

anschließend wird in jeder Methode die Superclass-Version dieser Methode aufgerufen .

For this translation, the output tags should be as follows (25 tags for 12 tokens, alternating gaps and words):

OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD OK BAD OK OK OK OK OK

In this example, all BAD words can be either removed or replaced by other words in the post-edited text; since no insertions are necessary, all gaps are tagged as OK.
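The tagging of this example can be approximated with a small Python sketch. It is not the procedure used to produce the official labels: here Python's difflib.SequenceMatcher stands in for the alignment between MT and post-edit, and the function name tags_from_post_edit is hypothetical. Deleted or replaced MT words become BAD, and a pure insertion marks the gap where the missing words belong.

```python
from difflib import SequenceMatcher


def tags_from_post_edit(mt_tokens, pe_tokens):
    """Approximate gap and word tags by aligning MT tokens to the post-edit."""
    n = len(mt_tokens)
    word_tags = ["OK"] * n
    gap_tags = ["OK"] * (n + 1)  # gap i sits immediately before word i
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            # MT words that were removed or substituted in the post-edit.
            for i in range(i1, i2):
                word_tags[i] = "BAD"
        elif op == "insert":
            # Words present only in the post-edit: flag the gap they fall into.
            gap_tags[i1] = "BAD"
    return gap_tags, word_tags


mt = ("anschließend wird in jeder Methode die übergeordnete "
      "Superclass-Version von selbst aufgerufen .").split()
pe = ("anschließend wird in jeder Methode die Superclass-Version "
      "dieser Methode aufgerufen .").split()

gap_tags, word_tags = tags_from_post_edit(mt, pe)
bad_words = [tok for tok, tag in zip(mt, word_tags) if tag == "BAD"]
print(bad_words)              # ['übergeordnete', 'von', 'selbst']
print(gap_tags.count("BAD"))  # 0
```

A replacement that is longer than the words it replaces is treated here as a plain substitution, so no gap is flagged in that case; a real alignment tool may decide differently.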


To allow the automatic evaluation of your predictions, please submit them in files named as follows:

  • Words in the MT: predictions_mt.txt
  • Source words: predictions_src.txt

and package them in a single zipped file (.zip).

If you only have predictions for one of the subtasks, include only that file in your submission. If one of the files is missing, the scoring program will simply assign a score of 0 to the missing predictions.
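Packaging can be done with Python's standard zipfile module. The sketch below is a minimal example under two assumptions that the page does not state explicitly: that the files sit at the top level of the archive, and that each file holds one sentence per line. The function name package_submission is hypothetical.

```python
import io
import zipfile


def package_submission(out, mt_lines=None, src_lines=None):
    """Write whichever prediction files are available into one .zip archive.

    `out` may be a filename such as "submission.zip" or a binary file object.
    """
    if mt_lines is None and src_lines is None:
        raise ValueError("provide predictions for at least one subtask")
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        # Assumption: one line of tags per sentence, files at the archive root.
        if mt_lines is not None:
            zf.writestr("predictions_mt.txt", "\n".join(mt_lines) + "\n")
        if src_lines is not None:
            zf.writestr("predictions_src.txt", "\n".join(src_lines) + "\n")
    return out


# In-memory demo with MT predictions only; pass a path to write to disk instead.
buffer = package_submission(io.BytesIO(), mt_lines=["OK OK OK"])
```

Here only predictions_mt.txt ends up in the archive, so the source-word subtask would receive a score of 0 as described above.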

Submissions will be evaluated in terms of MCC (Matthews correlation coefficient).
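For a binary OK/BAD tagging, MCC can be computed from the confusion counts as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is not the official scoring program; how that program pools tags across sentences, or whether it scores words and gaps separately, is not specified here and is left as an assumption. Returning 0 when the denominator is zero is a common convention, also assumed here.

```python
from math import sqrt


def mcc(gold, pred, positive="BAD"):
    """Matthews correlation coefficient for two equal-length tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    tp = fp = tn = fn = 0
    for g, p in zip(gold, pred):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: a degenerate confusion matrix scores 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


gold = "OK BAD OK BAD".split()
pred = "OK BAD BAD BAD".split()
score = mcc(gold, pred)  # TP=2, FP=1, TN=1, FN=0, so 2 / sqrt(12)
```

Because the formula is symmetric in the two classes, choosing OK rather than BAD as the positive label gives the same value.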

The data is publicly available, but since it has been provided by our industry partners, it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.

Participants are allowed to explore any additional data and resources deemed relevant.

The provided QE labelled data is publicly available under Creative Commons Attribution Share Alike 4.0 International (https://github.com/facebookresearch/mlqe/blob/master/LICENSE).

Each participating team can submit at most 30 systems for each of the language pairs of each subtask (max 5 a day).



