As in previous years, we frame the problem as a binary task: distinguishing between 'OK' and 'BAD' tokens. Participating systems are required to detect errors for each token in the MT output. In addition, and in contrast to previous years, for the first time we also attempt to predict missing words in the translation. We require participants to label any sequence of one or more missing tokens with a single 'BAD' label, and also to indicate the 'BAD' tokens in the source sentence that are related to the tokens missing from the translation. This is particularly important for spotting adequacy errors in NMT output.
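For illustration, here is a minimal sketch (in Python, with an invented sentence and invented labels) of what the label sequences look like, assuming one gap position before the first MT token, one between each pair of adjacent tokens and one after the last token:

    # Toy illustration of the word-level labelling scheme (invented example).
    # An MT sentence with n tokens receives n word labels ('OK'/'BAD') plus
    # n + 1 gap labels, one for each position where words could be missing.
    mt_tokens   = ["the", "house", "blue"]       # hypothetical MT output
    word_labels = ["OK", "OK", "BAD"]            # "blue" is a translation error
    gap_labels  = ["OK", "OK", "BAD", "OK"]      # words missing after "house"

    assert len(word_labels) == len(mt_tokens)
    assert len(gap_labels) == len(mt_tokens) + 1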
Submission Format
Since this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these label types, or all of them, independently. For each label type, the output of your system should be word-level labels formatted in the following way:
<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
Where:
METHOD NAME is the name of your quality estimation method.
TYPE is the type of label predicted: mt, gap or source.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
BINARY SCORE is either 'OK' for no issue or 'BAD' for any issue.
Each field should be delimited by a single tab character.
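As a minimal sketch of how such a file can be produced, the Python snippet below writes predictions for one segment of the 'mt' label type; the method name, output file name, tokens and labels are placeholders, not part of the official specification:

    # Minimal sketch: write word-level predictions for the 'mt' label type
    # in the tab-separated format described above. The method name, file
    # name, tokens and labels are placeholders, not real predictions.
    method = "MY_QE_SYSTEM"      # <METHOD NAME>
    label_type = "mt"            # <TYPE>: mt, gap or source
    segment_id = 0               # <SEGMENT NUMBER>, starting at 0

    mt_tokens = ["the", "house", "blue"]
    predictions = ["OK", "OK", "BAD"]

    with open("predictions_mt.txt", "w", encoding="utf-8") as out:
        for word_index, (word, label) in enumerate(zip(mt_tokens, predictions)):
            # <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
            out.write("\t".join([method, label_type, str(segment_id),
                                 str(word_index), word, label]) + "\n")

For the 'gap' file, the WORD field would contain the dummy symbol 'gap' instead of an actual token, and WORD INDEX would be the gap index.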
Each participating team can submit at most 2 systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. based on post-editing time, can be submitted as additional runs).
To allow the automatic evaluation of your predictions, please submit them in a file named as follows:
and package them in a single zipped file (.zip).
If you do not have predictions for one of the sub-tasks, only include what you have in your submission. If one of the files is missing, the scoring program will simply assign a score of 0 to the missing predictions.
The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:
Official task webpage: QE Shared Task 2018
Submissions are evaluated in terms of classification performance via the multiplication of the F1-scores for the 'OK' and 'BAD' classes against the true labels, computed independently for each of the three label types (MT words, MT gaps and source words).
We will also provide an overall F1 score that combines the three labels for systems submitting them all.
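As a sketch of the primary metric, the snippet below computes the product of the per-class F1-scores (often referred to as F1-mult) with scikit-learn; the gold and predicted label sequences are invented for illustration:

    # Sketch of the primary metric: the product of the F1-scores of the
    # 'OK' and 'BAD' classes (F1-mult). Labels below are invented.
    from sklearn.metrics import f1_score

    gold = ["OK", "OK", "BAD", "OK", "BAD", "OK"]
    pred = ["OK", "BAD", "BAD", "OK", "OK", "OK"]

    f1_ok, f1_bad = f1_score(gold, pred, average=None, labels=["OK", "BAD"])
    f1_mult = f1_ok * f1_bad
    print(f"F1-OK = {f1_ok:.3f}, F1-BAD = {f1_bad:.3f}, F1-mult = {f1_mult:.3f}")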
The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.