Video-guided Machine Translation Challenge 2020

Organized by xwang


Test (English-to-Chinese)
April 12, 2020, midnight UTC


Competition Ends
Jan. 1, 2099, 11:59 p.m. UTC


This Video-guided Machine Translation (VMT) Challenge aims to benchmark progress toward models that utilize video information to help align different languages and thus improve machine translation. Winners will be announced and awarded at the first Workshop on Advances in Language and Vision Research (ALVR), co-located with ACL 2020.

We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in English and Chinese. Among the captions are over 206,000 English-Chinese parallel translation pairs. Compared to the widely used MSRVTT dataset, VATEX is multilingual, larger, more linguistically complex, and more diverse in terms of both videos and natural language descriptions. See our ICCV oral paper for more details.



The aim of the VMT challenge is to benchmark progress toward models that can better utilize video information to help align two languages for the task of machine translation. To facilitate a consistent evaluation protocol, we put forth the following guidelines for the VMT challenge:

  1. Do NOT use any external corpora or pretrained MT models. The participants are not allowed to build upon any existing pretrained machine translation models for this challenge. The VMT model must be trained on our VATEX dataset from scratch.


As with existing MT benchmarks, we rely on an automatic evaluation metric (corpus-level BLEU-4) to evaluate the translated results. In addition, individual BLEU scores for n-grams (1-4) are also reported. Submissions are ranked by their corpus-level BLEU-4 score.
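For reference, corpus-level BLEU-4 can be computed locally before submitting. The sketch below is a minimal self-contained implementation (single reference per hypothesis, no smoothing); the function name `corpus_bleu4` and the tokenization are our own assumptions, not part of the official scoring code, which may differ in details such as smoothing and multi-reference handling.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu4(references, hypotheses):
    """Corpus-level BLEU-4 with one reference per hypothesis.

    references, hypotheses: parallel lists of token lists.
    """
    match = [0] * 4   # clipped n-gram matches, n = 1..4
    total = [0] * 4   # hypothesis n-gram counts, n = 1..4
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, 5):
            ref_counts = Counter(ngrams(ref, n))
            hyp_counts = Counter(ngrams(hyp, n))
            total[n - 1] += max(len(hyp) - n + 1, 0)
            # Clip each hypothesis n-gram count by its count in the reference.
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    if min(match) == 0:
        return 0.0
    # Geometric mean of the four modified n-gram precisions.
    log_avg = sum(math.log(m / t) for m, t in zip(match, total)) / 4
    # Brevity penalty for hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_avg)
```

Since the reference Chinese captions are segmented into space-separated characters, splitting on whitespace is usually sufficient tokenization for a local sanity check.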


The annotations provided by this benchmark are licensed under a Creative Commons Attribution 4.0 International License.

If you use this dataset, please cite:

@InProceedings{Wang_2019_ICCV,
  author    = {Wang, Xin and Wu, Jiawei and Chen, Junkun and Li, Lei and Wang, Yuan-Fang and Wang, William Yang},
  title     = {VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
  booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2019}
}

Submission Procedure

To enter the competition, you will need to run the English-to-Chinese translation task locally on your machine. The submission procedure requires you to run your trained VMT model on the test set and then submit the resulting test-set translations in the appropriate JSON format.
To submit your JSON file, name it "submission.json", zip it, and submit the zip file.
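The packaging step can be sketched as follows; this is a minimal example using only the standard library, and the `translations` dict here is a hypothetical placeholder for your model's actual test-set output.

```python
import json
import zipfile

# Hypothetical model output: "videoID&sentenceID" -> segmented Chinese translation.
translations = {
    "G9zN5TTuGO4_000179_000189&0": "在 野 外 , 一 个 人 正 在 悬 吊 在 冰 地 里 面 进 行 作 业 。",
}

# Write the submission file; ensure_ascii=False keeps the Chinese characters readable.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(translations, f, ensure_ascii=False)

# Zip it for upload.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json")
```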

Submission Format

Each submission must be a valid JSON dictionary containing one entry per caption and covering all the videos in the test set. Each entry is a (key=videoID&sentenceID, value=Chinese_translation) pair, where sentenceID = 0,1,2,3,4, since each video has 5 paired EN-ZH translations.

Example submission format for the Chinese corpus:

    {
      "G9zN5TTuGO4_000179_000189&0": "在 野 外 , 一 个 人 正 在 悬 吊 在 冰 地 里 面 进 行 作 业 。",
      "G9zN5TTuGO4_000179_000189&5": "一 个 男 人 正 在 露 天 的 冰 地 上 施 工 。"
    }
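Before zipping, it is worth checking that the file parses and that every key follows the videoID&sentenceID pattern. The validator below is a sketch under our own assumptions (the function name, the key regex, and the optional `expected_keys` argument are all hypothetical, not part of the official tooling).

```python
import json
import re

# Assumed key shape: YouTubeID_startsec_endsec&sentenceID, e.g. "G9zN5TTuGO4_000179_000189&0".
KEY_RE = re.compile(r"^[A-Za-z0-9_\-]+_\d{6}_\d{6}&\d+$")

def validate_submission(path, expected_keys=None):
    """Load a submission JSON file and check its basic structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data, dict), "submission must be a JSON dictionary"
    for key, value in data.items():
        assert KEY_RE.match(key), f"malformed key: {key}"
        assert isinstance(value, str) and value.strip(), f"empty translation for {key}"
    if expected_keys is not None:
        missing = set(expected_keys) - set(data)
        assert not missing, f"missing {len(missing)} required entries"
    return data
```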


Leaderboard
# Username Score
1 tosho 0.366
2 wzy977 0.365
3 zsyzsx1823 0.363