Sentence-level QE shared task 2018

Organized by fblain

Sentence-level QE task 2018

Participating systems are required to score (and rank) sentences according to post-editing effort. Three labels are available: the percentage of edits needed to fix the translation (HTER), post-editing time in seconds, and counts of various types of keystrokes. The primary prediction label for the scoring variant will be HTER, but we welcome participants to submit alternative models trained to predict other labels. Predictions according to each alternative label will be evaluated independently. For the ranking variant, the predictions can be generated by models built using any of these labels (or a combination of them), as well as using external information.
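For intuition, HTER is the number of edit operations needed to turn the MT output into its post-edited version, normalised by the length of the post-edited version. The sketch below is only an approximation of that label (it uses plain word-level Levenshtein distance rather than the shift-aware TER alignment behind the official labels); the function names and example sentences are illustrative and not part of the task data.

# Illustrative approximation of HTER: word-level edit distance between the MT
# output and its post-edited version, normalised by the post-edit length.
# Real HTER is computed with TER, which also models block shifts.

def word_edit_distance(a, b):
    """Levenshtein distance over token lists (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        curr = [i]
        for j, token_b in enumerate(b, 1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def approx_hter(mt_output, post_edit):
    mt_tokens, pe_tokens = mt_output.split(), post_edit.split()
    return word_edit_distance(mt_tokens, pe_tokens) / max(len(pe_tokens), 1)

# One substitution and one insertion over a 5-token post-edit: HTER = 2/5 = 0.4
print(approx_hter("das ist ein Test", "dies ist ein kleiner Test"))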

Submission Format

For a given subtask, the output of your system should contain scores for the translations at the segment level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:

  • METHOD NAME is the name of your quality estimation method.
  • SEGMENT NUMBER is the line number of the segment in the plain-text translation file you are scoring/ranking.
  • SEGMENT SCORE is the predicted (HTER) score for the particular segment; fill this field with 0's if you are only submitting ranking results.
  • SEGMENT RANK is the rank of the particular segment; fill this field with 0's if you are only submitting absolute scores.

Each field should be delimited by a single tab character.

Each participating team can submit at most two systems for each language pair of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).

To allow the automatic evaluation of your predictions, please submit them in a file named as follows: predictions.txt
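As a concrete illustration, the snippet below writes predictions to predictions.txt in the required tab-separated format; the method name and score values are placeholders chosen for this example, not part of the task.

# Minimal sketch: write sentence-level predictions in the required
# tab-separated format. Method name and scores are placeholder values.

method_name = "MY_QE_SYSTEM"            # placeholder system name
predicted_hter = [0.12, 0.54, 0.03]     # one predicted score per segment (example values)

# Derive ranks from the scores: rank 1 = best (lowest predicted HTER).
order = sorted(range(len(predicted_hter)), key=lambda i: predicted_hter[i])
ranks = [0] * len(predicted_hter)
for rank, idx in enumerate(order, start=1):
    ranks[idx] = rank

with open("predictions.txt", "w", encoding="utf-8") as out:
    for segment_number, (score, rank) in enumerate(zip(predicted_hter, ranks), start=1):
        out.write(f"{method_name}\t{segment_number}\t{score}\t{rank}\n")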

 

Quality Estimation Shared Task

The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions of those used in previous years, with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

  • To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language pairs where source segments were translated by both statistical phrase-based and neural MT systems.
  • To study the predictability of missing words in the MT output. To do so, for the first time we provide data annotated for such errors at training time.
  • To study the predictability of source words that lead to errors in the MT output. To do so, for the first time we provide source segments annotated for such errors at the word level.
  • To study the effectiveness of manually assigned labels for phrases. For that we provide a dataset where each phrase was annotated by human translators.
  • To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, as well as post-editor ID.
  • To study quality prediction for documents from errors annotated at word-level with added severity judgements. This will be done using a new corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived.

 


Official task webpage: QE Shared Task 2018

As in previous years, two variants of the results can be submitted:

  • Scoring: An absolute quality score for each sentence translation according to the type of prediction, to be interpreted as an error metric: lower scores mean better translations.
  • Ranking: A ranking of sentence translations for all source sentences, from best to worst. For this variant, it does not matter how the ranking is produced (from HTER predictions, Likert predictions, post-editing time, etc.). The reference ranking will be defined based on the true HTER scores.

Evaluation is performed against the true label and/or ranking using the following metrics (a short computation sketch follows the list):

  • Scoring: Pearson's correlation (primary), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
  • Ranking: Spearman's rank correlation (primary) and DeltaAvg.
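For reference, all of the above except DeltaAvg can be computed with standard libraries; the sketch below uses SciPy and NumPy on made-up gold and predicted HTER values (DeltaAvg is left to the organisers' official evaluation scripts).

# Sketch: compute the scoring/ranking metrics for sentence-level predictions
# against gold HTER labels, using SciPy and NumPy. Values are made up.
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.array([0.10, 0.35, 0.00, 0.52])   # gold HTER labels (example values)
pred = np.array([0.15, 0.30, 0.05, 0.60])   # system predictions (example values)

pearson, _ = pearsonr(gold, pred)            # primary metric for scoring
spearman, _ = spearmanr(gold, pred)          # primary metric for ranking
mae = np.mean(np.abs(gold - pred))           # mean absolute error
rmse = np.sqrt(np.mean((gold - pred) ** 2))  # root mean squared error

print(f"Pearson={pearson:.3f} Spearman={spearman:.3f} MAE={mae:.3f} RMSE={rmse:.3f}")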

The data is publicly available, but since it has been provided by our industry partners, it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.

English-German (SMT)

Start: June 15, 2018, midnight

English-German (NMT)

Start: June 15, 2018, midnight

German-English

Start: June 15, 2018, midnight

English-Latvian (SMT)

Start: June 15, 2018, midnight

English-Latvian (NMT)

Start: June 15, 2018, midnight

English-Czech

Start: June 15, 2018, midnight

Competition Ends

Never
