VATEX Captioning Challenge 2019 - Multilingual Video Captioning

Organized by xwang

First phase

Dev (Chinese)
Aug. 6, 2019, midnight UTC

End

Competition Ends
Oct. 1, 2019, midnight UTC

Project website: vatex.org
Video captioning (or video description) aims to describe video content in natural language. Across the variants of this task, the fundamental challenge is to accurately depict the important activities in a video clip, which requires high-quality, diverse captions that describe a wide variety of videos at scale. Moreover, existing large-scale video captioning datasets are mostly monolingual (English only), so the development of video captioning models has been restricted to English corpora. Multilingual video captioning is nevertheless essential for the large part of the world's population that does not speak English.

We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely used MSR-VTT dataset, VATEX is multilingual, larger, more linguistically complex, and more diverse in terms of both videos and natural language descriptions. See our ICCV paper for more details.

This VATEX Captioning Challenge aims to benchmark progress towards models that can describe videos in multiple languages, such as English and Chinese. We have a few awards for the winners, which will be announced at the third workshop on Closing the Loop Between Vision and Language (CLVL), co-located with ICCV 2019.

Guidelines

The aim of VATEX is to provide a high-quality video captioning dataset and to benchmark progress towards models that can describe videos in multiple languages, such as English and Chinese. To facilitate a consistent evaluation protocol, we put forth these guidelines for using VATEX:

  1. Do not use additional paired video-caption data. Improving evaluation scores by leveraging additional paired data is antithetical to this benchmark – the only paired video-caption dataset that should be used is the VATEX dataset. However, other datasets such as external text corpora, knowledge bases, and additional object detection datasets may be used during training or inference.
  2. Participants are encouraged to use both the English and Chinese corpora to assist caption generation in either or both languages. Developing more advanced models that handle only a single language (as is usual) is also acceptable.

Metrics

As with existing captioning benchmarks, we rely on automatic metrics to evaluate the quality of model-generated captions. The evaluation script is a modified version of the MS COCO Caption Evaluation API. The script takes both candidate and reference captions, applies sentence tokenization, and outputs several performance metrics, including BLEU-4, ROUGE-L, METEOR, and CIDEr.
Note that for the Chinese corpus, we run the evaluation on segmented words rather than raw characters. We use Jieba for Chinese word segmentation. Please read the paper for more details.
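To illustrate why the choice of unit matters, here is a minimal sketch that compares a clipped bigram precision (the building block of BLEU) on word tokens versus character tokens. It is not the official evaluation script: the captions are hypothetical, only a single reference is used, and the text is assumed to be already segmented into space-separated words (in the challenge pipeline, Jieba performs that segmentation).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference,
    with each n-gram's credit capped by its count in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical captions, already segmented into space-separated words.
reference = "一个 人 正在 打 篮球"
candidate = "一个 男孩 正在 打 篮球"

# Word-level units (the unit the challenge evaluates on).
word_p2 = clipped_precision(candidate.split(), reference.split(), 2)

# Character-level units (shown only for contrast).
char_p2 = clipped_precision(
    list(candidate.replace(" ", "")), list(reference.replace(" ", "")), 2
)

print(word_p2, char_p2)  # 0.5 0.625
```

The same candidate scores differently depending on the unit, so scores computed on raw characters are not comparable with the leaderboard numbers.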

Ranking

The ranking for this year's competition is based on the automatic evaluation. Specifically, a ranked list of teams is produced for each objective evaluation metric by sorting the teams' scores on that metric. The final rank of a team is obtained by combining its positions in the four ranked lists and is defined as:


        R(team) = R(team)@BLEU-4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr.
    

where R(team)@metric is the rank position of the team on that metric; e.g., if the team achieves the best performance in terms of BLEU-4, then R(team)@BLEU-4 is "1". The smaller the final rank value, the better the performance.
Final rankings are produced separately for the two corpora, English and Chinese.
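The rank-sum rule above can be sketched as follows. This is an illustration rather than the organizers' code: the team names and scores are made up, and the handling of ties (tied teams share the better position) is an assumption, since the rules do not specify it.

```python
from typing import Dict, List, Tuple

METRICS = ["BLEU-4", "METEOR", "ROUGE-L", "CIDEr"]


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map team -> 1-based rank on one metric (higher score ranks first).

    Assumption: tied teams share the better position.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {team: ordered.index(score) + 1 for team, score in scores.items()}


def final_ranking(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, int]]:
    """results maps team -> metric -> score.

    Returns (team, rank_sum) pairs sorted so the smallest sum comes first.
    """
    totals = {team: 0 for team in results}
    for metric in METRICS:
        per_metric = rank_positions(
            {team: scores[metric] for team, scores in results.items()}
        )
        for team, position in per_metric.items():
            totals[team] += position
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical scores for three hypothetical teams.
example = {
    "team_a": {"BLEU-4": 0.30, "METEOR": 0.22, "ROUGE-L": 0.48, "CIDEr": 0.50},
    "team_b": {"BLEU-4": 0.28, "METEOR": 0.23, "ROUGE-L": 0.49, "CIDEr": 0.55},
    "team_c": {"BLEU-4": 0.25, "METEOR": 0.20, "ROUGE-L": 0.45, "CIDEr": 0.40},
}
ranking = final_ranking(example)
print(ranking)  # [('team_b', 5), ('team_a', 7), ('team_c', 12)]
```

Here team_b places first on three metrics and second on BLEU-4, giving the smallest sum (1 + 1 + 1 + 2 = 5) even though team_a has the best BLEU-4 score.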

The annotations provided by this benchmark are licensed under a Creative Commons Attribution 4.0 International License.

If you use this dataset, please cite


   @article{wang2019vatex,
   title={VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
   author={Wang, Xin and Wu, Jiawei and Chen, Junkun and Li, Lei and Wang, Yuan-Fang and Wang, William Yang},
   journal={arXiv preprint arXiv:1904.03493},
   year={2019}
   } 
    

Submission Procedure

To enter the competition, you will need to run the caption generation locally on your machine. The submission procedure requires you to run your trained model on the test set videos, and then submit the resulting test set captions in the appropriate JSON format.
To submit your JSON file, name it "submission.json" and zip it, then submit the zip file to the corresponding track, Test (English) or Test (Chinese).
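A minimal sketch of the packaging step, using only the Python standard library. The function name `write_submission` and the archive name `submission.zip` are assumptions made for illustration; the rules only require that the JSON file inside the archive be named "submission.json".

```python
import json
import tempfile
import zipfile
from pathlib import Path


def write_submission(captions: dict, out_dir: str) -> Path:
    """Write captions to submission.json and zip it; return the zip path."""
    out = Path(out_dir)
    json_path = out / "submission.json"
    zip_path = out / "submission.zip"  # archive name is an assumption

    # ensure_ascii=False keeps Chinese captions human-readable in the file.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(captions, f, ensure_ascii=False)

    # Store the file at the archive root under the required name.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(json_path, arcname="submission.json")
    return zip_path


# Usage with a single caption, written to a temporary directory.
captions = {
    "G9zN5TTuGO4_000179_000189": "a boy is playing basketball in the backyard ."
}
with tempfile.TemporaryDirectory() as tmp:
    archive = write_submission(captions, tmp)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        loaded = json.loads(zf.read("submission.json").decode("utf-8"))
```

A real submission would contain one entry per test video rather than the single entry used here.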

Submission Format

Submissions must be a valid JSON dictionary containing one entry for each video. The submission file must cover all the videos in the test set. Each entry is a (key=videoID, value=generated_caption) pair.

Example submission format for the English corpus:


            {
                  "G9zN5TTuGO4_000179_000189": "a boy is playing basketball in the backyard .",
                  ... 
            } 
        

Example submission format for the Chinese corpus:


            {
                  "G9zN5TTuGO4_000179_000189": "在 野外 , 一个 人 正在 悬吊 在 冰地 里面 进行 作业 。",
                  ... 
            }
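
Before uploading, it can help to check a submission against the format rules above. The following is a sketch of such a check, not the server's validator: the function name, the error messages, and the second video ID are hypothetical, and the list of expected IDs is assumed to come from the test set you downloaded.

```python
import json


def validate_submission(submission, expected_ids):
    """Return a list of problems; an empty list means these checks passed."""
    if not isinstance(submission, dict):
        return ["top-level JSON value must be a dictionary"]

    problems = []
    expected = set(expected_ids)
    missing = sorted(expected - set(submission))
    extra = sorted(set(submission) - expected)
    if missing:
        problems.append(f"missing {len(missing)} video(s): {missing}")
    if extra:
        problems.append(f"unexpected {len(extra)} video(s): {extra}")

    # Each value should be a single generated caption, given as a string.
    for video_id, caption in submission.items():
        if not isinstance(caption, str):
            problems.append(f"caption for {video_id} is not a string")
    return problems


submission = json.loads(
    '{"G9zN5TTuGO4_000179_000189": '
    '"a boy is playing basketball in the backyard ."}'
)
# The second ID is a made-up placeholder for a video the submission omits.
expected_ids = ["G9zN5TTuGO4_000179_000189", "hypothetical_id_000000_000010"]
problems = validate_submission(submission, expected_ids)
print(problems)
```

With the inputs shown, the check reports one problem: the placeholder video has no caption.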
        

Dev (Chinese)

Start: Aug. 6, 2019, midnight

Description: This phase evaluates algorithms on the VATEX validation set. We recommend using this Dev phase for algorithm validation. It should not be used for reporting results in papers. A submission must contain results for the entire validation set to be considered valid.

Dev (English)

Start: Aug. 6, 2019, midnight

Description: This phase evaluates algorithms on the VATEX validation set. We recommend using this Dev phase for algorithm validation. It should not be used for reporting results in papers. A submission must contain results for the entire validation set to be considered valid.

Test (Chinese)

Start: Aug. 6, 2019, midnight

Description: This phase evaluates algorithms on the VATEX public test set. We recommend using this phase for reporting comparison numbers in academic papers. A submission must contain results for the entire test set to be considered valid. This phase is intended for the final evaluation of a model, and participants are not allowed to create multiple submissions by using multiple teams.

Test (English)

Start: Aug. 6, 2019, midnight

Description: This phase evaluates algorithms on the VATEX public test set. We recommend using this phase for reporting comparison numbers in academic papers. A submission must contain results for the entire test set to be considered valid. This phase is intended for the final evaluation of a model, and participants are not allowed to create multiple submissions by using multiple teams.

Competition Ends

Oct. 1, 2019, midnight
