The official shared task on Quality Estimation aims to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.
In addition to generally advancing the state of the art at all prediction levels for modern neural MT, our specific goals are:
Official task webpage: QE Shared Task 2020
This submission platform covers Task 3: Document-level fine-grained annotations.
In the Task 3 fine-grained annotation subtask, systems have to predict which text spans contain translation errors, and classify each error as minor, major or critical. Two or more spans can be part of the same error annotation (for example, in agreement errors in which a noun and an adjective are not adjacent).
Submission Format
The system output format is similar to the annotations.tsv files in the training data, but should include the document ID. Each line in the output refers to a single error annotation (containing one or more spans) and should be formatted like this:
<METHOD NAME> <DOCUMENT ID> <LINES> <SPAN START POSITIONS> <SPAN LENGTHS> <SEVERITY>
Where:
METHOD NAME: the name of your quality estimation method.
DOCUMENT ID: the containing folder, as in the MQM subtask.
LINES: a list of the lines containing the error spans, starting from 0 and separated by white space.
SPAN START POSITIONS: a list of the character offsets at which the spans begin, separated by white space, also starting from 0. The number of start positions must match the number of lines.
SPAN LENGTHS: a list of the lengths, in characters, of the error spans. The number of lengths must match the number of start positions. Spans must not overlap.
SEVERITY: either minor, major or critical.
Each field should be delimited by a single tab character.
Note that while the training data includes the error category (such as missing words or word order), this field is not necessary in the system output.
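As an illustration, here is a minimal Python sketch that writes annotations in the required tab-separated format. The annotation tuples, the method name "MY-QE-SYSTEM" and the document ID "doc0001" are hypothetical placeholders chosen for this example; only the line format follows the description above.

```python
# Minimal sketch: write fine-grained error annotations to predictions.txt.
# Each annotation is a hypothetical tuple:
# (document id, lines, span start positions, span lengths, severity).
annotations = [
    ("doc0001", [0], [12], [5], "minor"),
    ("doc0001", [2, 3], [0, 7], [4, 6], "major"),
]

method_name = "MY-QE-SYSTEM"  # replace with your method's name

with open("predictions.txt", "w", encoding="utf-8") as f:
    for doc_id, lines, starts, lengths, severity in annotations:
        fields = [
            method_name,
            doc_id,
            " ".join(str(x) for x in lines),
            " ".join(str(x) for x in starts),
            " ".join(str(x) for x in lengths),
            severity,
        ]
        # Fields are delimited by a single tab character, as required.
        f.write("\t".join(fields) + "\n")
```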
To allow the automatic evaluation of your predictions, please submit them in a single file named predictions.txt, packaged in a zipped file (.zip).
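The packaging step could look like the following sketch, assuming predictions.txt has already been written; the archive name "submission.zip" is our choice, as only the predictions.txt file name is prescribed.

```python
import zipfile

# Package predictions.txt into a zip archive for upload.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("predictions.txt")
```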
Submissions will be evaluated in terms of their F1 scores with respect to the gold annotations.
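As a rough illustration of span-level F1 (a simplification of ours, not the official scorer, which may handle severities and partial overlaps differently), assuming predicted and gold spans are expanded into sets of (document, line, character offset) positions:

```python
def span_f1(pred_positions, gold_positions):
    """Character-level F1 between predicted and gold error positions.

    Both arguments are sets of (doc_id, line, char_offset) tuples covering
    every character inside an annotated span. Simplified sketch only.
    """
    tp = len(pred_positions & gold_positions)
    precision = tp / len(pred_positions) if pred_positions else 0.0
    recall = tp / len(gold_positions) if gold_positions else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```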
The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
Participants are allowed to explore any additional data and resources deemed relevant.
Each participating team can submit at most 30 systems for each of the language pairs of each subtask (max 5 a day).
Start: April 19, 2020, midnight