Document-level QE shared task 2018

Organized by fblain


Start: June 15, 2018, midnight UTC

Document-level task 2018

This is a completely new task. It is based on data from the Amazon Product Reviews dataset, specifically a selection of Sports and Outdoors product titles and descriptions in English that has been machine translated into French using a state-of-the-art online neural MT system. The most popular products (those with the most reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and not always complete sentences. The data was annotated for errors at the word level using a fine-grained error taxonomy (MQM).

MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation has issues that affect the reading of the text) and style (the translation has stylistic problems, such as the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides being identified and classified according to this typology (by applying a specific tag), each error receives a severity level that shows its impact on the overall meaning, style, and fluency of the translation. An error can be minor (if it doesn’t lead to a loss of meaning and it doesn’t confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).

The word-level error annotations and their severity levels can be extrapolated to phrases, sentences and documents. For this task, we concentrate on documents, where a document contains the product title and description for a given product. The document-level scores were generated from the word-level errors and their severity using the method in this paper (footnote 6). The dataset is the largest collection with manually annotated word-level errors released to date.
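To make the idea of deriving a document-level score from word-level errors and their severities more concrete, here is a minimal sketch. It is not the method from the referenced paper, which is not reproduced on this page: the severity weights, the normalization by document length and the 0 to 100 scale are all assumptions chosen for illustration, and `document_score` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical severity weights, chosen only for illustration.
# The actual weighting is defined in the paper referenced above.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def document_score(severities, n_words, weights=SEVERITY_WEIGHTS):
    """Turn a list of error severities into one document-level score.

    severities: one label per annotated error, e.g. ["minor", "major"].
    n_words: number of words in the translated document.
    Returns a score where 100 means no annotated errors (assumed scale).
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    counts = Counter(severities)
    # Weighted error count: each error contributes its severity weight.
    penalty = sum(weights[label] * count for label, count in counts.items())
    # Normalize by document length so long documents are not penalized
    # merely for containing more words.
    return 100.0 * (n_words - penalty) / n_words


if __name__ == "__main__":
    # One minor and one major error in a 60-word document.
    print(document_score(["minor", "major"], 60))
```

Under these assumed weights a single critical error costs as much as ten minor ones, so the score can become negative for short documents with severe errors.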

Submission Format and Requirements

Your system should produce scores for the translations at the document level, formatted in the following way:



  • METHOD NAME is the name of your quality estimation method.
  • DOCUMENT NUMBER is the line number of the plain text translation file you are scoring.
  • DOCUMENT SCORE is the predicted score for the particular document.

The predictions should be sorted by ascending DOCUMENT NUMBER, and each field should be delimited by a single tab character.

Example of the document-level format:


The example shows that the documents named "doc0000", "doc0001" and "doc0002" have predicted quality scores of 00.000, 11.111 and 22.222, respectively.

To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt
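As an illustration of the requirements above, the following sketch writes a submission file with one tab-delimited line per document, sorted by ascending document identifier. Several details are assumptions rather than part of the task description: the field order (method name, document, score) simply follows the order of the field descriptions above, the three-decimal formatting is arbitrary, and `MYMETHOD`, `format_predictions` and `write_predictions` are hypothetical names.

```python
def format_predictions(method_name, scores):
    """Return submission lines, one per document.

    scores: dict mapping a document identifier to its predicted score.
    Identifiers are sorted in ascending order; zero-padded strings such
    as "doc0000" and plain integers both sort as expected.
    """
    lines = []
    for doc_id in sorted(scores):
        # Fields are separated by a single tab character.
        lines.append(f"{method_name}\t{doc_id}\t{scores[doc_id]:.3f}")
    return lines


def write_predictions(method_name, scores, path="predictions.txt"):
    """Write the formatted lines to the file name required for submission."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in format_predictions(method_name, scores):
            handle.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical method name and scores, for illustration only.
    write_predictions("MYMETHOD", {"doc0001": 11.111, "doc0000": 0.0})
```

Keeping the formatting separate from the file writing makes it easy to check the lines before anything is submitted.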


Quality Estimation Shared Task

The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

  • To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language pairs where source segments were translated by both statistical phrase-based and neural MT systems.
  • To study the predictability of missing words in the MT output. To do so, for the first time we provide data annotated for such errors at training time.
  • To study the predictability of source words that lead to errors in the MT output. To do so, for the first time we provide source segments annotated for such errors at the word level.
  • To study the effectiveness of manually assigned labels for phrases. For that we provide a dataset where each phrase was annotated by human translators.
  • To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, as well as post-editor ID.
  • To study quality prediction for documents from errors annotated at word-level with added severity judgements. This will be done using a new corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived.


Official task webpage: QE Shared Task 2018

Evaluation is performed against the true labels and/or rankings using the following metrics:

  • Scoring: Pearson's correlation (primary), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
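For reference, the three scoring metrics can be computed as in the sketch below. This is not the official evaluation script, which is not shown on this page; it uses only the standard definitions of Pearson's correlation, mean absolute error and root mean squared error, and the function names are chosen here for illustration.

```python
import math


def pearson(y_true, y_pred):
    """Pearson's correlation between gold and predicted scores."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        # The correlation is undefined when either list is constant.
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error: average of |gold - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0]
    predicted = [2.0, 4.0, 6.0]
    print(pearson(gold, predicted), mae(gold, predicted), rmse(gold, predicted))
```

Pearson's correlation rewards predictions that move linearly with the gold scores even if their scale is off, which is why error-based metrics such as MAE and RMSE are useful alongside it.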

The data is publicly available, but since it has been provided by our industry partners, it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.



