This task uses a subset of the German-English SMT data from Task 1 in which each phrase (as produced by the decoder) has been annotated as a whole by humans with one of four labels: 'OK'; 'BAD' (the phrase contains one or more errors); 'BAD_word_order' (the phrase is in an incorrect position in the sentence); and 'BAD_omission' (a word is missing before or after the phrase). We divided this task into two subtasks: word-level prediction (Task3a) and phrase-level prediction (Task3b):
The baseline will be the same system as in Task 2. Submissions are evaluated in terms of classification performance via the product of the F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently: MT words, MT gaps and source words.
The baseline will use a set of baseline features (based on black-box sentence-level features) extracted with the Marmot tool and will be trained with the CRFSuite tool. Submissions are evaluated in terms of classification performance via the product of the F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently: MT words, MT gaps and source words.
Submission Format
Task3a: word-level predictions
Because this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any one of these label types or for all of them, independently. The output of your system for each type of label should be word-level labels formatted in the following way:
<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY TAG>
Where:
METHOD NAME
is the name of your quality estimation method.
TYPE
is the type of label predicted: mt, gap or source.
SEGMENT NUMBER
is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX
is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD
is the actual word. For the 'gap' submission, use a dummy symbol: 'gap'.
BINARY TAG
is either 'OK' for no issue or 'BAD' for any issue.
Each field should be delimited by a single tab character.
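The line format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `format_predictions`, the method name `MYQE` and the input layout (a list of segments, each a list of word/tag pairs) are hypothetical and not part of the task definition.

```python
VALID_TYPES = {"mt", "gap", "source"}
VALID_TAGS = {"OK", "BAD"}


def format_predictions(method, label_type, segments):
    """Return one tab-delimited line per word (or gap) in each segment.

    `segments` is a list of segments; each segment is a list of
    (word, tag) pairs. Segment numbers and word indices start at 0.
    """
    if label_type not in VALID_TYPES:
        raise ValueError(f"unknown label type: {label_type!r}")
    lines = []
    for segment_number, segment in enumerate(segments):
        for word_index, (word, tag) in enumerate(segment):
            if tag not in VALID_TAGS:
                raise ValueError(f"unknown tag: {tag!r}")
            # Gap submissions carry the dummy symbol 'gap' instead of a word.
            token = "gap" if label_type == "gap" else word
            fields = [method, label_type, str(segment_number),
                      str(word_index), token, tag]
            lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    example = [[("the", "OK"), ("house", "BAD")], [("fine", "OK")]]
    print("\n".join(format_predictions("MYQE", "mt", example)))
```

Joining the fields with a literal tab, rather than going through a CSV writer, keeps quote characters in tokens untouched.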
Each participating team can submit at most two systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).
To allow the automatic evaluation of your predictions, please submit them in a file named as follows:
and package them in a single zipped file (.zip).
If you don't have predictions for either one of the sub-tasks, only include what you have in your submission. If one of the files is missing, the scoring program will simply assign the score of 0 to the missing predictions.
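Packaging can be sketched with Python's standard `zipfile` module. The required file names are not reproduced on this page, so the names used in the example (`predictions_mt.tsv` and so on) are placeholders only, and `package_submission` is a hypothetical helper. It adds whichever prediction files exist and skips the rest, mirroring the rule that missing files are simply scored 0.

```python
import zipfile
from pathlib import Path


def package_submission(prediction_files, zip_path):
    """Zip the prediction files that exist; return the names included."""
    included = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, prediction_files):
            if path.is_file():
                # Store files at the top level of the archive.
                archive.write(path, arcname=path.name)
                included.append(path.name)
    return included


if __name__ == "__main__":
    # Placeholder names: substitute the names required by the organisers.
    candidates = ["predictions_mt.tsv", "predictions_gap.tsv",
                  "predictions_source.tsv"]
    print(package_submission(candidates, "submission.zip"))
```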
Task3b: phrase-level predictions
Because this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any one of these label types or for all of them, independently. The output of your system for each type of label should be word-level labels formatted in the following way:
<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <TAG>
Where:
METHOD NAME
is the name of your quality estimation method.
TYPE
is the type of label predicted: mt, gap or source.
SEGMENT NUMBER
is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX
is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD
is the actual word. For the 'gap' submission, use a dummy symbol: 'gap'.
TAG
is 'OK' for no issue, or 'BAD' or 'BAD_word_order' for an issue (note that 'BAD_word_order' is only to be predicted for phrases in the MT).
Each field should be delimited by a single tab character.
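Before submitting, it may help to check each line against the field rules above. The following is a minimal sketch, not the official scoring script: `parse_prediction_line` and the `ALLOWED_TAGS` table are hypothetical, and the table encodes the reading that 'BAD_word_order' is accepted only for the mt label type.

```python
# Tags accepted per label type; 'BAD_word_order' is restricted to MT here.
ALLOWED_TAGS = {
    "mt": {"OK", "BAD", "BAD_word_order"},
    "gap": {"OK", "BAD"},
    "source": {"OK", "BAD"},
}


def parse_prediction_line(line):
    """Parse one tab-delimited prediction line, raising ValueError if invalid."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    method, label_type, segment, index, word, tag = fields
    if label_type not in ALLOWED_TAGS:
        raise ValueError(f"unknown label type: {label_type!r}")
    if tag not in ALLOWED_TAGS[label_type]:
        raise ValueError(f"tag {tag!r} not allowed for type {label_type!r}")
    if label_type == "gap" and word != "gap":
        raise ValueError("gap lines must use the dummy symbol 'gap'")
    segment_number, word_index = int(segment), int(index)
    if segment_number < 0 or word_index < 0:
        raise ValueError("segment numbers and indices start at 0")
    return {"method": method, "type": label_type, "segment": segment_number,
            "index": word_index, "word": word, "tag": tag}


if __name__ == "__main__":
    print(parse_prediction_line("MYQE\tmt\t3\t0\tHaus\tBAD_word_order"))
```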
Each participating team can submit at most two systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).
To allow the automatic evaluation of your predictions, please submit them in a file named as follows:
and package them in a single zipped file (.zip).
If you don't have predictions for either one of the sub-tasks, only include what you have in your submission. If one of the files is missing, the scoring program will simply assign the score of 0 to the missing predictions.
The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:
Official task webpage: QE Shared Task 2018
Submissions are evaluated in terms of classification performance via the product of the F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently: MT words, MT gaps and source words.
We will also provide an overall F1 score that combines the three labels for systems submitting them all.
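The per-label-type metric, the product of the F1-scores of the 'OK' and 'BAD' classes, can be sketched as follows. This is an illustration rather than the official scorer: the function names are hypothetical, it handles only the two binary tags, and it does not attempt the combined overall score, whose exact definition is not given here.

```python
def f1_for_class(gold, predicted, label):
    """F1-score of one class, treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    if true_pos == 0:
        return 0.0  # also covers the zero-denominator cases
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def f1_mult(gold, predicted):
    """Product of the F1-scores for the 'OK' and 'BAD' classes."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted tags must have the same length")
    return (f1_for_class(gold, predicted, "OK")
            * f1_for_class(gold, predicted, "BAD"))


if __name__ == "__main__":
    gold = ["OK", "OK", "BAD", "BAD"]
    predicted = ["OK", "BAD", "BAD", "BAD"]
    print(round(f1_mult(gold, predicted), 4))
```

Because the two scores are multiplied, a system that labels everything 'OK' scores 0: its F1 for the 'BAD' class is 0.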
The data is publicly available, but because it has been provided by our industry partners, it is subject to specific terms and conditions. These have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
Start: June 15, 2018, midnight