CodaLab - Competition

Word-level QE shared task 2018

Organized by fblain

English-Latvian (NMT)

June 15, 2018, midnight UTC

Current

English-Czech

June 15, 2018, midnight UTC

End

Competition Ends

Never


Word-level QE task 2018

As in previous years, we frame the problem as the binary task of distinguishing between 'OK' and 'BAD' tokens. Participating systems are required to detect errors for each token in the MT output. In addition, and in contrast to previous years, we attempt for the first time to predict missing words in the translation. We require participants to label any sequence of one or more missing tokens with a single 'BAD' label and also to mark as 'BAD' the tokens in the source sentence that are related to the tokens missing in the translated sentence. This is particularly important for spotting adequacy errors in NMT.

Submission Format

Because this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any one of these tasks or for all of them, independently. The output of your system for each type of label should be word-level labels formatted in the following way:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

Where:

METHOD NAME is the name of your quality estimation method.
TYPE is the type of label predicted: mt, gap or source.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD is the actual word. For the 'gap' submission, use a dummy symbol: 'gap'.
BINARY SCORE is either 'OK' for no issue or 'BAD' for any issue.

Each field should be delimited by a single tab character.
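
As an illustration of the format above, the following is a minimal Python sketch that turns per-segment labels into tab-delimited prediction lines. The function name format_predictions and the input layout (a list of (words, labels) pairs, one per segment) are assumptions made for this example and are not part of the official task tooling.

```python
def format_predictions(method, label_type, segments):
    """Build one tab-delimited line per word (or gap).

    method     -- name of the QE method (free text, hypothetical here)
    label_type -- 'mt', 'gap' or 'source'
    segments   -- list of (words, labels) pairs, in test-set order;
                  for 'gap' the words entry is ignored and may be None
    """
    lines = []
    for segment_number, (words, labels) in enumerate(segments):
        if label_type != "gap" and len(words) != len(labels):
            raise ValueError("segment %d: words and labels differ in length"
                             % segment_number)
        for word_index, label in enumerate(labels):
            if label not in ("OK", "BAD"):
                raise ValueError("label must be 'OK' or 'BAD'")
            # Gap submissions use the dummy symbol 'gap' as the word.
            word = "gap" if label_type == "gap" else words[word_index]
            fields = [method, label_type, str(segment_number),
                      str(word_index), word, label]
            # Fields are separated by a single tab character.
            lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    example = [(["the", "cat"], ["OK", "BAD"])]
    for line in format_predictions("MYSYS", "mt", example):
        print(line)
```

Both segment numbers and word indices start at 0, as the field descriptions require.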

Each participating team can submit at most 2 systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).

To allow the automatic evaluation of your predictions, please submit them in files named as follows:

Words in the MT: predictions_mt.txt
Source words: predictions_src.txt
Gaps in the MT: predictions_gaps.txt

and package them in a single zipped file (.zip).

If you don't have predictions for one of the sub-tasks, include only what you have in your submission. If one of the files is missing, the scoring program will simply assign a score of 0 to the missing predictions.
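
The packaging step can be sketched in Python as follows. The file names come from the list above; the function name package_submission, the dictionary keys and the choice to place the files at the top level of the archive are assumptions for illustration, not requirements confirmed by the task description.

```python
import zipfile

# File names expected by the scoring program, keyed by label type.
FILE_NAMES = {
    "mt": "predictions_mt.txt",
    "source": "predictions_src.txt",
    "gap": "predictions_gaps.txt",
}


def package_submission(predictions, zip_path):
    """Write the available prediction files into a single .zip archive.

    predictions -- dict mapping 'mt', 'source' or 'gap' to a list of
                   already formatted prediction lines
    zip_path    -- where to write the archive (hypothetical path)
    Returns the list of file names that were written.
    """
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for label_type, lines in predictions.items():
            if not lines:
                # Sub-tasks without predictions are simply left out.
                continue
            name = FILE_NAMES[label_type]
            archive.writestr(name, "\n".join(lines) + "\n")
            written.append(name)
    return written


if __name__ == "__main__":
    demo = {"mt": ["MYSYS\tmt\t0\t0\tthe\tOK"]}
    print(package_submission(demo, "submission.zip"))
```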

Quality Estimation Shared Task

The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language pairs where source segments were translated by both statistical phrase-based and neural MT systems.
To study the predictability of missing words in the MT output. To do so, for the first time we provide data annotated for such errors at training time.
To study the predictability of source words that lead to errors in the MT output. To do so, for the first time we provide source segments annotated for such errors at the word level.
To study the effectiveness of manually assigned labels for phrases. For that we provide a dataset where each phrase was annotated by human translators.
To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, as well as post-editor ID.
To study quality prediction for documents from errors annotated at word-level with added severity judgements. This will be done using a new corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived.

Official task webpage: QE Shared Task 2018

Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

words in the MT, as in WMT17 ('OK' for correct words, 'BAD' for incorrect words)
gaps in the MT ('OK' for genuine gaps, 'BAD' for gaps indicating missing words)
source words ('BAD' for words that lead to errors in the MT, 'OK' for other words)

We will also provide an overall F1 score that combines the three labels for systems submitting them all.
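
The per-label-type metric described above, the product of the F1-scores for the 'OK' and 'BAD' classes, can be sketched in Python as below. This is an illustrative reimplementation under stated assumptions, not the official scoring program: the function names are hypothetical, labels are assumed to be flattened over all segments, and an F1 with no true positives is taken to be 0. The overall score combining the three label types is not sketched, since the combination method is not specified here.

```python
def f1_for_class(gold, predicted, target):
    """F1-score of one class ('OK' or 'BAD') over parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    if tp == 0:
        # Assumption: with no true positives the F1-score is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_mult(gold, predicted):
    """Product of the F1-scores for the 'OK' and 'BAD' classes."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return (f1_for_class(gold, predicted, "OK")
            * f1_for_class(gold, predicted, "BAD"))


if __name__ == "__main__":
    gold = ["OK", "OK", "BAD", "BAD"]
    predicted = ["OK", "BAD", "BAD", "BAD"]
    print(f1_mult(gold, predicted))
```

Because the two F1-scores are multiplied, a system that labels every token 'OK' (or every token 'BAD') scores 0 under this sketch, whatever the class balance of the data.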

The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.