The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.
In addition to generally advancing the state of the art at all prediction levels for modern neural MT, our specific goals are:
Official task webpage: QE Shared Task 2020
This submission platform covers Task 2: Word-level *Post-Editing Effort*.
In Task 2, participating systems are required to detect errors on both the source side (to detect which words caused errors) and the target side (to detect mistranslated or missing words):
We request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these label types, or all of them, independently. The output of your system for each type of label should be word-level labels formatted in the following way:
<LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
LANGUAGE PAIR is the language pair identifier (e.g., en-de).
METHOD NAME is the name of your quality estimation method.
TYPE is the type of label predicted: mt, gap or source.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This is the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
BINARY SCORE is either 'OK' for no issue or 'BAD' for any issue.
Each field should be delimited by a single tab character.
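The format above can be produced with a short script. The sketch below is illustrative, not the official writer: the method name "baseline_qe", the output filename and the example words and labels are placeholders.

```python
# Sketch: write word-level predictions in the tab-separated submission format.
# Fields: LANGUAGE PAIR, METHOD NAME, TYPE, SEGMENT NUMBER, WORD INDEX, WORD, BINARY SCORE.

def write_predictions(path, lang_pair, method, label_type, sentences):
    """sentences: one list per segment, each a list of (word, label) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for seg_idx, words in enumerate(sentences):
            for word_idx, (word, label) in enumerate(words):
                fields = [lang_pair, method, label_type,
                          str(seg_idx), str(word_idx), word, label]
                f.write("\t".join(fields) + "\n")

# Hypothetical example with two MT segments; for the 'gap' type,
# every word would instead be the dummy symbol 'gap'.
write_predictions("predictions.mt", "en-de", "baseline_qe", "mt",
                  [[("Das", "OK"), ("Haus", "BAD")],
                   [("Ich", "OK")]])
```

Segment and word indices both start at 0, matching the training/test data.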
To allow the automatic evaluation of your predictions, please submit them in a file named as follows:
and package them in a single zipped file (.zip).
If you don't have predictions for one or more of the sub-tasks, include only the files you have in your submission. The scoring program will simply assign a score of 0 to any missing predictions.
Submissions will be evaluated in terms of MCC (Matthews correlation coefficient).
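For reference, MCC over binary OK/BAD labels can be computed from the confusion matrix. The sketch below treats 'BAD' as the positive class; the official scorer may differ in implementation details.

```python
import math

def mcc(y_true, y_pred, positive="BAD"):
    """Matthews correlation coefficient for binary OK/BAD labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

gold = ["OK", "OK", "BAD", "BAD"]
pred = ["OK", "BAD", "BAD", "BAD"]
print(round(mcc(gold, pred), 4))  # → 0.5774
```

MCC ranges from -1 to 1, with 0 indicating no better than chance, which makes it robust to the class imbalance typical of word-level QE data (most words are OK).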
The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
Participants are allowed to explore any additional data and resources deemed relevant.
The provided QE labelled data is publicly available under Creative Commons Attribution Share Alike 4.0 International (https://github.com/facebookresearch/mlqe/blob/master/LICENSE).
Each participating team can submit at most 30 systems for each of the language pairs of each subtask (max 5 a day).
Start: April 19, 2020, midnight