The 1st Video Panoptic Segmentation Challenge

Organized by dhkim


First phase
Aug. 23, 2020, midnight UTC


Competition Ends

The 1st Video Panoptic Segmentation Challenge: Cityscapes-VPS test split eval.


Panoptic segmentation has become a new standard visual recognition task by unifying the previous semantic segmentation and instance segmentation tasks. In CVPR 2020, we presented and explored a new video extension of this task, named Video Panoptic Segmentation. The task requires generating consistent panoptic segmentation as well as associating instance ids across video frames. To invigorate research on this new task, we presented a temporal extension of the Cityscapes val. set by providing new video panoptic annotations: Cityscapes-VPS. Moreover, we proposed a novel video panoptic segmentation network (VPSNet), which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames. To provide appropriate metrics for this task, we proposed a video panoptic quality (VPQ) metric and evaluated our method and several other baselines with it. We achieve state-of-the-art results in image PQ on Cityscapes as well as in VPQ on Cityscapes-VPS. The datasets and code are available at our website.


Video panoptic segmentation extends panoptic segmentation from the image domain to the video domain. The new problem aims at simultaneous prediction of object classes, bounding boxes, masks, instance id associations, and semantic segmentation, while assigning a unique answer to each pixel in a video. A detailed explanation of the new task can be found in our paper.
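As a minimal illustration of what "a unique answer per pixel" means here, the sketch below (illustrative names, not from the official codebase) represents a video panoptic prediction as a per-pixel semantic class map plus a per-pixel instance id map for each frame, where the same object keeps the same instance id across frames:

```python
import numpy as np

def make_frame_prediction(h, w):
    """One frame of a video panoptic prediction (toy representation)."""
    sem = np.zeros((h, w), dtype=np.int32)   # semantic class per pixel
    inst = np.zeros((h, w), dtype=np.int32)  # instance id per pixel (0 = stuff/no instance)
    return sem, inst

# Two frames of a toy video: instance id 7 labels the same object in both
# frames, even though it moves; class label 13 is an arbitrary "thing" class.
sem0, inst0 = make_frame_prediction(4, 4)
sem1, inst1 = make_frame_prediction(4, 4)
sem0[1:3, 1:3] = 13; inst0[1:3, 1:3] = 7   # object in frame 0
sem1[1:3, 2:4] = 13; inst1[1:3, 2:4] = 7   # same object, shifted, in frame 1

# Each pixel's answer is the unique (class, instance id) pair.
assert (sem0[inst0 == 7] == 13).all()
```

The key constraint that distinguishes VPS from per-frame panoptic segmentation is the cross-frame consistency of the instance ids, not the per-frame representation itself.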


We collect a new video panoptic segmentation dataset, named Cityscapes-VPS, which extends the popular Cityscapes dataset to the video level by providing pixel-level panoptic labels for every fifth video frame, temporally associated with the public image-level annotations. It not only supports the video panoptic segmentation (VPS) task, but also provides super-set annotations for the video semantic segmentation (VSS) and video instance segmentation (VIS) tasks. Specifically, our new dataset has the following features.

  • A 19-class label set, including 8 thing and 11 stuff classes.
  • 1024 x 2048 resolution dense panoptic annotations.
  • Every 5th frame of 500 Cityscapes video clips, i.e., the 5th, 10th, 15th, 20th, 25th, and 30th frames, where the 20th frame already has the original Cityscapes panoptic annotation.
  • We further split the Cityscapes-VPS dataset into 400 training, 50 validation, and 50 test videos.
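The numbers above can be sanity-checked with a short sketch. Note that the split rule below is hypothetical; it only matches the stated 400/50/50 counts, and the actual assignment of clips to splits may differ:

```python
# Illustrative layout of Cityscapes-VPS: 500 clips of 30 frames, with
# panoptic labels on every 5th frame (the 20th frame reuses the original
# Cityscapes annotation).
NUM_CLIPS = 500
ANNOTATED_FRAMES = [5, 10, 15, 20, 25, 30]

def split_of(clip_idx):
    # Hypothetical split rule reproducing the stated 400/50/50 counts.
    if clip_idx < 400:
        return "train"
    return "val" if clip_idx < 450 else "test"

counts = {"train": 0, "val": 0, "test": 0}
for i in range(NUM_CLIPS):
    counts[split_of(i)] += 1

# 6 annotated frames per clip -> 3000 annotated frames in total.
total_annotated = NUM_CLIPS * len(ANNOTATED_FRAMES)
```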


Please submit a file with the following structure.
├── pred.json
│   ├── 0010_0055_frankfurt_000000_003905.png
│   ├── 0010_0056_frankfurt_000000_003910.png
│   ├── ...
│   └── 0500_3000_munster_000173_000029.png
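The prediction filenames above appear to follow a fixed pattern (clip index, frame index, then the original Cityscapes city/sequence/frame identifiers). The field meanings below are inferred from the two examples shown, not from an official specification:

```python
import re

# Parse a prediction filename like "0010_0055_frankfurt_000000_003905.png".
# Group names are illustrative guesses at each field's meaning.
PATTERN = re.compile(
    r"^(?P<clip>\d{4})_(?P<frame>\d{4})_(?P<city>[a-z]+)_"
    r"(?P<seq>\d{6})_(?P<cs_frame>\d{6})\.png$"
)

m = PATTERN.match("0010_0055_frankfurt_000000_003905.png")
fields = m.groupdict()
```

A check like this can catch malformed filenames before submission, since a mis-named file would silently fail to match its ground-truth frame during evaluation.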

Evaluation Criteria

We extend the image panoptic quality (PQ) metric to a video panoptic quality (VPQ) metric. We introduce multiple window sizes k ∈ {0, 5, 10, 15}, which represent segment predictions over consecutive video frames as "tube" predictions. When k = 0, the VPQ metric is equivalent to the image PQ metric. Please refer to our paper for more details.
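The tube-matching idea can be sketched as follows. This is a toy, single-class version for intuition only: the official metric additionally averages over classes and window sizes, and handles void regions, which are omitted here.

```python
# Simplified VPQ sketch: segments spanning a window of frames form "tubes",
# tubes are matched by IoU > 0.5 (as in PQ, this matching is unique), and
# quality follows the PQ formula sum(IoU of TP) / (TP + 0.5*FP + 0.5*FN).
def tube_iou(pred_pixels, gt_pixels):
    # Each tube is a set of (frame, y, x) pixel coordinates.
    inter = len(pred_pixels & gt_pixels)
    union = len(pred_pixels | gt_pixels)
    return inter / union if union else 0.0

def vpq(pred_tubes, gt_tubes):
    matched_ious, used_gt = [], set()
    for p in pred_tubes:
        for gi, g in enumerate(gt_tubes):
            if gi in used_gt:
                continue
            iou = tube_iou(p, g)
            if iou > 0.5:             # PQ-style matching threshold
                matched_ious.append(iou)
                used_gt.add(gi)
                break
    tp = len(matched_ious)
    fp = len(pred_tubes) - tp
    fn = len(gt_tubes) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0
```

Because a tube spans k consecutive frames, an id switch mid-window splits one ground-truth tube across two predicted tubes, lowering tube IoU; this is how VPQ penalizes inconsistent tracking that per-frame PQ cannot see.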

Terms and Conditions

Our Cityscapes-VPS dataset builds upon the Cityscapes dataset; therefore, permission is granted to use the data provided that you agree both to our license terms, Attribution-NonCommercial-ShareAlike (CC BY-NC-SA), and to those of the original Cityscapes dataset.

Specifically, you should cite our work:

    @inproceedings{kim2020video,
      title={Video Panoptic Segmentation},
      author={Dahun Kim and Sanghyun Woo and Joon-Young Lee and In So Kweon},
      booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
      year={2020},
    }



#  Username     Score
1  dhkim        57.38
2  ViP-DeepLab  62.47