Phrase-level QE shared task 2018

Organized by fblain

Phases:

  • Phrase-level predictions: start June 15, 2018, midnight UTC
  • Word-level predictions: start June 15, 2018, midnight UTC
  • Competition ends: never

Phrase-level QE task 2018

This task uses a subset of the German-English SMT data from Task 1 in which each phrase (as produced by the decoder) has been annotated by humans with one of four labels: 'OK'; 'BAD' -- the phrase contains one or more errors; 'BAD_word_order' -- the phrase is in an incorrect position in the sentence; and 'BAD_omission' -- a word is missing before/after a phrase. We divided this task into two subtasks: word-level prediction (Task3a) and phrase-level prediction (Task3b):

  • Task3a -- As training and development data, we provide the tokenised translation output with word-level segmentation for both source and machine-translated sentences, so that this task can be addressed as a word-level prediction task. To annotate omission errors, a gap token is inserted after each token and at the start of the sentence. The token-level labels are computed as follows: all tokens in the target sentence are labelled according to the label of the phrase they belong to. That is, if a phrase is annotated as 'OK', 'BAD' or 'BAD_word_order', all tokens (and gap tokens!) within that phrase receive that same label. The label for tokens between phrases is either 'OK' or 'BAD_omission', where 'BAD_omission' indicates that one or more tokens should appear in that position. The number of tags for each machine-translated sentence is 2*N+1, where N is the number of tokens in that sentence.

    The baseline will be the same system as in Task 2. Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

    • words in the MT, as in WMT17 ('OK' for correct words, 'BAD' for incorrect words);
    • gaps in the MT ('OK' for genuine gaps, 'BAD' for gaps indicating missing words);
    • source words ('BAD' for words that lead to errors in the MT, 'OK' for other words).
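    The Task3a tagging scheme (phrase labels projected onto tokens, plus interleaved gap tags) can be sketched as follows. The helper names are illustrative only, not part of any official script:

    ```python
    def phrase_to_token_tags(phrases, phrase_labels):
        """Project each phrase's label onto every token it contains."""
        tags = []
        for phrase, label in zip(phrases, phrase_labels):
            tags.extend([label] * len(phrase))
        return tags

    def interleave_gaps(token_tags, gap_tags):
        """Build the 2*N+1 tag sequence: a gap tag at the start of the
        sentence and one after every token."""
        assert len(gap_tags) == len(token_tags) + 1
        out = [gap_tags[0]]
        for tok, gap in zip(token_tags, gap_tags[1:]):
            out.extend([tok, gap])
        return out

    # Two phrases over three tokens: "das Haus" (OK) and "ist" (BAD).
    token_tags = phrase_to_token_tags([["das", "Haus"], ["ist"]], ["OK", "BAD"])
    full = interleave_gaps(token_tags, ["OK", "OK", "OK", "BAD_omission"])
    assert len(full) == 2 * 3 + 1  # 2*N+1 tags for N = 3 tokens
    ```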
     
  • Task3b -- As training and development data, we provide the tokenised translation output with phrase-level segmentation (separator: '||'). A gap token is inserted after each phrase and at the start of the sentence. Each gap is labelled 'OK' or 'BAD_omission', where the latter indicates that one or more words are missing. The labels are at the phrase level, so the number of tags for each machine-translated sentence is 2*N+1, where N is the number of phrases in that sentence.

    The baseline will use a set of baseline features (based on black-box sentence-level features) extracted with the Marmot tool and is trained with the CRFSuite tool. Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

    • phrases in the MT ('OK' for correct phrases, 'BAD' for incorrect phrases, 'BAD_word_order' for phrases in an incorrect position in the sentence);
    • gaps in the MT ('OK' for genuine gaps, 'BAD' for gaps indicating missing phrases);
    • source phrases ('BAD' for phrases that lead to errors in the MT, 'OK' for other phrases).
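    Reading the phrase-segmented output described above can be sketched as follows. The '||' separator comes from the task description; the parsing code itself is an illustrative assumption:

    ```python
    def parse_phrase_segmentation(line):
        """Split a '||'-segmented MT sentence into phrases (lists of tokens)."""
        return [phrase.split() for phrase in line.split("||") if phrase.strip()]

    phrases = parse_phrase_segmentation("das Haus || ist || sehr gross")
    # N phrases require 2*N+1 tags: one per phrase plus N+1 gap tags.
    n_tags = 2 * len(phrases) + 1
    assert n_tags == 7
    ```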

 

Submission Format

Task3a: word-level predictions

Since this year we are also interested in evaluating missing words and source words that lead to errors, we request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit for any of these label types, or all of them, independently. The output of your system for each type of label should be word-level labels formatted as follows:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY TAG>

Where:

  • METHOD NAME is the name of your quality estimation method.
  • TYPE is the type of label predicted: mt, gap or source.
  • SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
  • WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
  • WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
  • BINARY TAG is either 'OK' for no issue or 'BAD' for any issue.

Each field should be delimited by a single tab character.
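As a sketch, one way to emit lines in this format (the method name 'MY_QE_SYSTEM' is a placeholder):

```python
def format_prediction(method, label_type, segment, index, word, tag):
    """Produce one tab-delimited submission line:
    <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY TAG>"""
    return "\t".join([method, label_type, str(segment), str(index), word, tag])

line = format_prediction("MY_QE_SYSTEM", "mt", 0, 3, "Haus", "BAD")
assert line.count("\t") == 5  # six tab-delimited fields
# For the 'gap' file, the WORD field is the dummy symbol 'gap':
gap_line = format_prediction("MY_QE_SYSTEM", "gap", 0, 4, "gap", "OK")
```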

Each participating team can submit at most 2 systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).

To allow the automatic evaluation of your predictions, please submit them in a file named as follows:

  • Words in the MT: predictions_mt.txt
  • Source words: predictions_src.txt
  • Gaps in the MT: predictions_gaps.txt

and package them in a single zipped file (.zip).

If you don't have predictions for one or more of the label types, include only the files you have in your submission. If a file is missing, the scoring program will simply assign a score of 0 to the missing predictions.


Task3b: phrase-level predictions

Since this year we are also interested in evaluating missing phrases and source phrases that lead to errors, we request up to three separate files, one for each type of label: MT phrases, MT gaps and source phrases. You can submit for any of these label types, or all of them, independently. The output of your system for each type of label should be formatted as follows:

<METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <TAG>

Where:

  • METHOD NAME is the name of your quality estimation method.
  • TYPE is the type of label predicted: mt, gap or source.
  • SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
  • WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence or the source sentence, or the gap index for MT gaps.
  • WORD is the actual word. For the 'gap' submission, use the dummy symbol 'gap'.
  • TAG is 'OK' for no issue, or either 'BAD' or 'BAD_word_order' for an issue (note that 'BAD_word_order' applies only to phrases in the MT).

Each field should be delimited by a single tab character.

Each participating team can submit at most 2 systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs).

To allow the automatic evaluation of your predictions, please submit them in a file named as follows:

  • Phrases in the MT: predictions_mt.txt
  • Source phrases: predictions_src.txt
  • Gaps in the MT: predictions_gaps.txt

and package them in a single zipped file (.zip).

If you don't have predictions for one or more of the label types, include only the files you have in your submission. If a file is missing, the scoring program will simply assign a score of 0 to the missing predictions.

 

Quality Estimation Shared Task

The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, with all tasks produced from post-editions or annotations by professional translators. The datasets are domain-specific (IT, life sciences, sports and outdoor activities). They are either extensions from those used previous years with more instances and more languages, or new data collected specifically for this year's edition. One important addition is that in 2018 we also include datasets with neural MT outputs. In addition to advancing the state of the art at all prediction levels, our specific goals are:

  • To study the performance of quality estimation approaches on the output of neural MT systems. We will do so by providing datasets for two language pairs where source segments were translated by both statistical phrase-based and neural MT systems.
  • To study the predictability of missing words in the MT output. To do so, for the first time we provide data annotated for such errors at training time.
  • To study the predictability of source words that lead to errors in the MT output. To do so, for the first time we provide source segments annotated for such errors at the word level.
  • To study the effectiveness of manually assigned labels for phrases. For that we provide a dataset where each phrase was annotated by human translators.
  • To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, as well as post-editor ID.
  • To study quality prediction for documents from errors annotated at word-level with added severity judgements. This will be done using a new corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived.

 


Official task webpage: QE Shared Task 2018

Submissions are evaluated in terms of classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for three different types of labels, independently:

  • For word-level predictions (Task3a):
    • words in the MT, as in WMT17 ('OK' for correct words, 'BAD' for incorrect words);
    • gaps in the MT ('OK' for genuine gaps, 'BAD' for gaps indicating missing words);
    • source words ('BAD' for words that lead to errors in the MT, 'OK' for other words).

  • For phrase-level predictions (Task3b):
    • phrases in the MT ('OK' for correct phrases, 'BAD' for incorrect phrases, 'BAD_word_order' for phrases in an incorrect position in the sentence);
    • gaps in the MT ('OK' for genuine gaps, 'BAD' for gaps indicating missing phrases);
    • source phrases ('BAD' for phrases that lead to errors in the MT, 'OK' for other phrases).

 

We will also provide an overall F1 score that combines the three labels for systems submitting them all.
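The multiplied-F1 metric can be sketched as below for the binary 'OK'/'BAD' case. This is a simplified reimplementation for illustration, not the official scoring script (which also handles the 'BAD_word_order' class for Task3b):

```python
def f1(gold, pred, cls):
    """Per-class F1 over parallel tag sequences."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_multiplied(gold, pred):
    """Product of the F1-scores of the 'OK' and 'BAD' classes."""
    return f1(gold, pred, "OK") * f1(gold, pred, "BAD")

score = f1_multiplied(["OK", "OK", "BAD", "BAD"], ["OK", "BAD", "BAD", "BAD"])
assert abs(score - 8 / 15) < 1e-9  # F1('OK') = 2/3, F1('BAD') = 4/5
```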

The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Participants are allowed to explore any additional data and resources deemed relevant.
