This is a completely new task. It is based on data from the Amazon Product Reviews dataset: a selection of English product titles and descriptions from the Sports and Outdoors category, machine translated into French using a state-of-the-art online neural MT system. The most popular products (those with the most reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and are not always complete sentences. The data was annotated for errors at the word level using a fine-grained error taxonomy (MQM).
MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (problems in the translation affect the reading of the text) and style (the translation has stylistic problems, like the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides being identified and classified according to this typology (by applying a specific tag), each error receives a severity level that indicates its impact on the overall meaning, style, and fluency of the translation. An error can be minor (if it doesn't lead to a loss of meaning and doesn't confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).
The word-level error annotations and their severity levels can be extrapolated to phrases, sentences and documents. For this task, we concentrate on documents, where a document contains the product title and description for a given product. The document-level scores were generated from the word-level errors and their severity using the method in this paper (footnote 6). The dataset is the largest collection with manually annotated word-level errors released to date.
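The exact scoring method is defined in the cited paper and is not reproduced on this page. Purely as an illustration of how word-level errors and severities could be aggregated into a document-level score, the sketch below applies a severity-weighted penalty normalised by document length. The weights in `SEVERITY_WEIGHTS`, the 0 to 100 scaling, and the function name `document_score` are all assumptions made for this example, not the official definition.

```python
# Hypothetical severity weights: NOT taken from the task description.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def document_score(severities, n_words):
    """Aggregate word-level error severities into one document score.

    severities: list of severity labels, one per annotated error
                ("minor", "major" or "critical").
    n_words:    number of words in the translated document.

    Returns an illustrative score where 100 means no annotated errors.
    The score can drop below 0 for very error-dense documents.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    # Sum the (assumed) weight of every annotated error.
    penalty = sum(SEVERITY_WEIGHTS[s] for s in severities)
    # Normalise by length so long and short documents are comparable.
    return 100.0 * (1.0 - penalty / n_words)
```

Under these assumed weights, a 100-word document with one minor and one major error would lose 6 points. Normalising by length keeps a single error in a short title from being treated the same as a single error in a long description.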
Submission Format and Requirements
Your system should output document-level scores for the translations, formatted as follows:
<METHOD NAME> <DOCUMENT NUMBER> <DOCUMENT SCORE>
METHOD NAME is the name of your quality estimation method.
DOCUMENT NUMBER is the line number of the plain text translation file you are scoring.
DOCUMENT SCORE is the predicted score for the particular document.
The predictions should be sorted by ascending DOCUMENT NUMBER, and each field should be delimited by a single tab character.
Example of the document-level format:
The example shows that the documents named "doc0000", "doc0001" and "doc0002" have predicted quality scores of 00.000, 11.111 and 22.222, respectively.
To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt
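As a minimal sketch of producing a file in the format described above, the following Python helper writes one tab-delimited line per document, sorted by ascending document number. The function names, the method name `MYMETHOD` in the usage note, and the three-decimal score formatting are assumptions made for illustration; the task description does not prescribe a precision.

```python
def format_predictions(method_name, scores):
    """Build the submission text.

    method_name: name of the quality estimation method.
    scores:      dict mapping document number -> predicted score.
                 Keys must be mutually comparable (all ints, or all
                 zero-padded strings) so that sorting is well defined.
    """
    lines = []
    # Sort by ascending document number, as the format requires.
    for doc_number in sorted(scores):
        # Fields are separated by a single tab character.
        # Three decimals is an assumption, not a stated requirement.
        lines.append(f"{method_name}\t{doc_number}\t{scores[doc_number]:.3f}")
    return "\n".join(lines) + "\n"


def write_predictions(path, method_name, scores):
    """Write the formatted predictions to the given file path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_predictions(method_name, scores))
```

For example, `write_predictions("predictions.txt", "MYMETHOD", {0: 0.0, 1: 11.111, 2: 22.222})` would write three lines, one per document, in ascending order.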
The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:
Official task webpage: QE Shared Task 2018
Evaluation is performed against the true label and/or ranking using as metrics:
The data is publicly available but, since it has been provided by our industry partners, it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
Start: June 15, 2018, midnight