Participating systems are required to score (and rank) sentences according to post-editing effort. Three labels are available: the percentage of edits that need to be fixed (HTER), post-editing time in seconds, and counts of various types of keystrokes. The primary prediction label for the scoring variant will be HTER, but we welcome participants to submit alternative models trained to predict other labels. Predictions according to each alternative label will be evaluated independently. For the ranking variant, predictions can be generated by models built using any of these labels (or a combination of them), as well as using external information.
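For illustration only, the sketch below approximates an HTER-style score as the word-level edit distance between an MT output and its post-edition, divided by the post-edition length. The official HTER is based on TER, which also accounts for block shifts, and the example sentences here are invented.

```python
# Minimal sketch: approximate an HTER-style score as word-level edit
# distance between the MT output and its post-edition, divided by the
# length of the post-edition. Real HTER is computed with TER (which
# also counts block shifts); the sentences below are invented examples.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)]

def approx_hter(mt_output, post_edition):
    hyp, ref = mt_output.split(), post_edition.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

# Made-up segments: 2 edits over 7 post-edited tokens -> ~0.29
print(approx_hter("click the buttons to save file",
                  "click the button to save the file"))
```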
Submission Format
The output of your system for a given subtask should contain segment-level scores for the translations, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>
Where:
METHOD NAME
is the name of your quality estimation method.

SEGMENT NUMBER
is the line number of the plain text translation file you are scoring/ranking.

SEGMENT SCORE
is the predicted (HTER) score for the particular segment - assign all 0's to it if you are only submitting ranking results.

SEGMENT RANK
is the ranking of the particular segment - assign all 0's to it if you are only submitting absolute scores.

Each field should be delimited by a single tab character.
Each participating team can submit at most two systems for each language pair of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).
To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt
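A minimal sketch of writing a correctly formatted predictions.txt follows; the method name, scores, and ranks are invented placeholders rather than real system output.

```python
# Minimal sketch: write segment-level predictions in the required
# tab-separated format. Method name, scores, and ranks below are
# invented placeholders, not real system output.

# One predicted HTER score per line of the plain-text translation file.
predicted_scores = [0.12, 0.47, 0.05]  # hypothetical values

# Rank 1 = best predicted quality; use 0s here if only submitting scores.
ranks = [2, 3, 1]

with open("predictions.txt", "w", encoding="utf-8") as out:
    for seg_num, (score, rank) in enumerate(zip(predicted_scores, ranks), start=1):
        # <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>,
        # delimited by single tab characters.
        out.write(f"MY_QE_METHOD\t{seg_num}\t{score:.4f}\t{rank}\n")
```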
The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks built from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. Beyond advancing the state of the art at all prediction levels, we pursue a number of specific goals.
Official task webpage: QE Shared Task 2018
As in previous years, two variants of the results can be submitted: scoring, i.e. an absolute quality score for each segment (e.g. predicted HTER), and ranking, i.e. an ordering of segments according to their predicted quality.
Evaluation is performed against the true label and/or ranking.
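As an illustration only, the sketch below compares predicted scores against hypothetical true HTER labels using Pearson's r, Spearman's rho, and MAE; these metrics are assumptions chosen as typical for quality estimation evaluation, not a statement of the official metric set.

```python
# Illustration only: compare predicted scores against true HTER labels.
# Pearson's r, Spearman's rho, and MAE are assumed metrics here, chosen
# as typical for QE evaluation; they are not taken from this page.
from scipy.stats import pearsonr, spearmanr

true_hter = [0.10, 0.45, 0.08, 0.30]   # hypothetical gold labels
predicted = [0.15, 0.40, 0.05, 0.35]   # hypothetical system scores

pearson_r, _ = pearsonr(predicted, true_hter)      # scoring variant
spearman_rho, _ = spearmanr(predicted, true_hter)  # ranking variant
mae = sum(abs(p - t) for p, t in zip(predicted, true_hter)) / len(true_hter)

print(f"Pearson r: {pearson_r:.3f}  Spearman rho: {spearman_rho:.3f}  MAE: {mae:.3f}")
```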
The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of the data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.