VATEX Video Captioning Challenge

Organized by xwang


The task of video captioning/description aims to describe video content in natural language. Across the variants of this task, the fundamental challenge is to accurately depict the important activities in a video clip, which requires high-quality, diverse captions describing a wide variety of videos at scale. Moreover, existing large-scale video captioning datasets are mostly monolingual (English only), so the development of video captioning models has been restricted to English corpora. However, the study of multilingual video captioning is essential for the large population who cannot speak English.

The VATEX Captioning Challenge 2020 aims to benchmark progress towards models that can describe videos in various languages such as English and Chinese. Compared to the VATEX Captioning Challenge 2019, we release a private test set with 6,280 more unique videos for testing, and the original training and validation sets are combined into a larger training set. For more details, see our ICCV paper.

Winners will be announced at the Workshop on Language & Vision with Applications to Video Understanding (LVVU), co-located with CVPR 2020.



The aim of VATEX is to provide a high-quality video captioning dataset and benchmark progress towards models that can describe the videos in various languages such as English and Chinese. To facilitate consistent evaluation protocol, we put forth these guidelines for using VATEX:

  1. Do not use additional paired video-caption data. Improving evaluation scores by leveraging additional paired data is antithetical to this benchmark – the only paired video-caption dataset that should be used is the VATEX dataset. However, other datasets such as external text corpora, knowledge bases, and additional object detection datasets may be used during training or inference.
  2. Participants are encouraged to use both the English and Chinese corpora to assist caption generation in either or both languages. Developing models that handle only a single language, as is typical, is also acceptable.


As with existing captioning benchmarks, we rely on automatic metrics to evaluate the quality of model-generated captions. The evaluation script is a modified version of the MS COCO Caption Evaluation API. The script takes both candidate and reference captions, applies sentence tokenization, and outputs several performance metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr.
Note that for the Chinese corpus, we run the evaluation on segmented words rather than raw characters. We use Jieba for Chinese word segmentation. Please read the paper for more details.
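To illustrate what these n-gram metrics measure, here is a minimal, stdlib-only sketch of clipped unigram precision, the building block of BLEU. It is a toy illustration, not the official evaluation script; for Chinese, the token lists would come from a word segmenter such as Jieba rather than `str.split`.

```python
from collections import Counter

def ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision over tokenized captions.
    Each candidate n-gram counts only up to the maximum number of
    times it appears in any single reference (BLEU's clipping)."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        for gram, count in ref_ngrams.items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

cand = "a boy is playing basketball".split()
refs = ["a boy plays basketball in the backyard".split()]
print(ngram_precision(cand, refs, n=1))  # 3 of 5 unigrams match -> 0.6
```

The official script additionally combines multiple n-gram orders with a brevity penalty (BLEU-4) and computes the other metrics, but the clipping idea above is the common core.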


The ranking for this year's competition is based on the automatic evaluation. Specifically, a ranked list of teams is produced by sorting their scores on each objective evaluation metric. The final rank of a team combines its positions in the four ranking lists and is defined as:

        R(team) = R(team)@BLEU-4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr.

where R(team) is the rank position of the team; e.g., if a team achieves the best BLEU-4 performance, then R(team)@BLEU-4 is 1. The smaller the final ranking score, the better the performance.
Participants are ranked separately for the two corpora, English and Chinese.
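The ranking rule above can be sketched in a few lines of Python. The team names and metric scores below are hypothetical, purely for illustration; ties in the combined score would need a tie-breaking policy not specified here.

```python
# Hypothetical per-metric scores for three teams (illustrative numbers).
scores = {
    "TeamA": {"BLEU-4": 0.32, "METEOR": 0.23, "ROUGE-L": 0.49, "CIDEr": 0.55},
    "TeamB": {"BLEU-4": 0.30, "METEOR": 0.24, "ROUGE-L": 0.50, "CIDEr": 0.52},
    "TeamC": {"BLEU-4": 0.28, "METEOR": 0.21, "ROUGE-L": 0.45, "CIDEr": 0.40},
}
METRICS = ["BLEU-4", "METEOR", "ROUGE-L", "CIDEr"]

def final_rank(scores):
    """Sum each team's rank position across the four metric rankings."""
    total = {team: 0 for team in scores}
    for metric in METRICS:
        # Higher metric score -> better (smaller) rank position.
        order = sorted(scores, key=lambda t: scores[t][metric], reverse=True)
        for position, team in enumerate(order, start=1):
            total[team] += position
    return total

print(final_rank(scores))  # TeamC is last on every metric: 3*4 = 12
```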

The annotations provided by this benchmark are licensed under a Creative Commons Attribution 4.0 International License.

If you use this dataset, please cite:

        @InProceedings{Wang_2019_ICCV,
          author    = {Wang, Xin and Wu, Jiawei and Chen, Junkun and Li, Lei and Wang, Yuan-Fang and Wang, William Yang},
          title     = {VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
          booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
          month     = {October},
          year      = {2019}
        }

Submission Procedure

To enter the competition, you will need to run the caption generation locally on your machine. The submission procedure requires you to run your trained model on the test set videos, and then submit the resulting test set captions in the appropriate JSON format.
To submit your JSON file, name it "submission.json" and zip it, then submit the zip file to the corresponding track, Test (English) or Test (Chinese).
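The packaging step can be done with the Python standard library. The video ID below is taken from the format examples; the caption content is illustrative.

```python
import json
import zipfile

# Captions generated by your model, keyed by test-set video ID.
captions = {
    "G9zN5TTuGO4_000179_000189": "a boy is playing basketball in the backyard .",
}

# The file inside the zip must be named exactly "submission.json".
with open("submission.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps Chinese characters readable in the file.
    json.dump(captions, f, ensure_ascii=False)

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json")
```

The resulting submission.zip is what you upload to the Test (English) or Test (Chinese) track.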

Submission Format

Submissions must be a valid JSON dictionary containing one entry for each video. The submission file must cover all the videos in the test set. Each entry is a (key=videoID, value=generated_caption) pair.

Example submission format for the English corpus:

                  "G9zN5TTuGO4_000179_000189": "a boy is playing basketball in the backyard .",

Example submission format for the Chinese corpus:

                  "G9zN5TTuGO4_000179_000189": "在 野外 , 一个 人 正在 悬吊 在 冰地 里面 进行 作业 。"

Public Test (Chinese)

Start: April 11, 2020, midnight UTC

Public Test (English)

Start: April 11, 2020, midnight UTC

Private Test (Chinese)

Start: April 11, 2020, midnight UTC

Private Test (English)

Start: April 12, 2020, midnight UTC

Competition Ends

Jan. 1, 2099, 11:59 p.m. UTC

# Username Score
1 IVA-CASIA 0.81
2 SRCB_ML_Lab 0.76
3 acdart 0.44