ICDAR 2021 Competition on Scene Video Text Spotting

Organized by Embers


End-to-end Video Text Spotting: March 1, 2021, midnight UTC

Competition Ends: April 11, 2021, 11 p.m. UTC


Scene video text spotting (SVTS) is the task of localizing and recognizing text from continuous video frames; a typical system contains multiple modules: video text detection, text tracking and final recognition. It is an important research topic, because many real-world applications work in video mode, such as license plate recognition in intelligent transportation systems, road sign recognition in advanced driver assistance systems, or even online handwritten character recognition.

With the rapid development of deep learning techniques, great progress has been made in scene text spotting in static images. However, for many applications, spotting text from video frames imposes much higher robustness requirements than static OCR tasks. That is, SVTS becomes much more challenging when facing various environmental interferences (e.g., camera shaking, motion blur and sudden illumination changes) as well as real-time response requirements. Therefore, it is necessary to research efficient and robust SVTS systems that meet the requirements of many practical applications.

The SVTS 2021 competition provides 129 video clips from 21 real-life scenarios. By establishing such a large-scale dataset with diversified scenarios, we hope more inspiring works will be proposed to solve this problem, which has very promising application prospects.


The competition has three tasks:

  • Video Text Detection, where the objective is to localize text regions in the video frames
  • Video Text Tracking, where the objective is to track all the text streams in the video frames
  • End-to-end Video Text Spotting, where the objective is to localize, track and recognize all the text in the video frames

Notice that we only evaluate the performance on 'alphanumeric text instances' in all three tasks. In the future, we intend to release a more challenging multilingual competition.


Task 1 - Video Text Detection

This task is to obtain the locations of words in each video frame in terms of their affine bounding boxes. The ground truth of each bounding box is composed of 4 coordinate points.

We evaluate results based on a single Intersection-over-Union (IoU) criterion with a threshold of 50%, similar to the standard practice in object recognition and the Pascal VOC challenge [1]. Recall, Precision and F_score will be used as the evaluation metrics.
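The IoU matching above can be sketched as follows. Note this is a simplified illustration using axis-aligned rectangles; the competition boxes are 4-point quadrilaterals, whose exact IoU requires polygon clipping (e.g., with a library such as shapely), and the helper below is not part of the official evaluation code.

```python
def iou_axis_aligned(box_a, box_b):
    """Rough IoU between two boxes given as [x1, y1, x2, y2].

    Axis-aligned simplification for illustration only; the competition
    uses quadrilaterals, whose exact IoU needs polygon intersection.
    """
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union = sum of areas minus intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# a detection counts as a true positive when IoU > 0.5
print(iou_axis_aligned([0, 0, 10, 10], [5, 0, 15, 10]))  # prints 0.3333333333333333
```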

There are several evaluation details that participants need to pay attention to:
  • If the quality of a text area is annotated as 'LOW' or the transcription is "###", the text in this bounding box is unreadable. Such areas will not be taken into account during evaluation (a method will not be penalised if it does not detect these words, and a method that detects them will not get a better score).
  • Words with fewer than 3 characters are not taken into account for evaluation.

The participants are required to submit a JSON file containing the predictions for all test videos. The JSON file in this task should be named "detection_predict.json", compressed and uploaded as "answer.zip". The JSON file should follow this format:


All the results are saved in a dict, and the field "bboxes" corresponds to all the detected bounding boxes in each frame. Each bounding box is represented by 4 points in clockwise order.
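A minimal sketch of how such a file could be assembled is shown below. Only the "bboxes" field and the 4-point clockwise box format come from the task description; the video key ("Video_10_3"), the frame key ("1") and the coordinate values are illustrative assumptions.

```python
import json

# Hypothetical layout of "detection_predict.json"; key names for videos
# and frames are assumptions -- only "bboxes" is specified by the task.
predictions = {
    "Video_10_3": {                # one entry per test video (name assumed)
        "1": {                     # frame number (format assumed)
            "bboxes": [
                # each box: 4 corner points (x, y) in clockwise order
                [[100, 50], [200, 50], [200, 80], [100, 80]],
            ]
        }
    }
}

with open("detection_predict.json", "w") as f:
    json.dump(predictions, f)
```

The resulting file would then be compressed into "answer.zip" for submission.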


Task 2 - Video Text Tracking

This task intends to track all the text streams in videos. 

We evaluate results based on an adaptation of the CLEAR-MOT [3] and VACE [4] evaluation frameworks. We adapt these metrics to the specificities of text tracking by following the protocols in [5], i.e., MOTA, MOTP and ATA will be used as the evaluation metrics.

The evaluation details are the same as in Task 1.

In order to perform the evaluation, the participants are required to submit a JSON file containing the predictions for all test videos. The JSON file in this task should be named "track_predict.json", compressed and uploaded as "answer.zip". The JSON format is as follows:

The tracking results of each video are saved in a single dict, which contains all tracked text sequences in that video. For instance, the first sequence in "Video_10_3" is represented by a unique ID "1304" (IDs of different sequences must be unique across the entire test set), and the field "tracks" contains the tracked bounding boxes of the text sequence. Each bounding box follows this format:

     "frame_id, x1_y1_x2_y2_x3_y3_x4_y4"

where "frame_id" is the frame number of the bounding box, and "x1_y1_x2_y2_x3_y3_x4_y4" gives the coordinates of the 4 points in clockwise order.
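As a hedged sketch, a tracking submission built from the described per-box string format could look like this. The "tracks" field, the sequence-ID uniqueness rule and the "frame_id, x1_y1_x2_y2_x3_y3_x4_y4" string come from the description above; the concrete video name, ID and coordinate values are illustrative.

```python
import json

# Hypothetical layout of "track_predict.json" under the stated assumptions.
track_predictions = {
    "Video_10_3": {
        "1304": {   # sequence ID, unique across the whole test set
            "tracks": [
                # "frame_id, x1_y1_x2_y2_x3_y3_x4_y4" (corners clockwise)
                "1, 100_50_200_50_200_80_100_80",
                "2, 102_51_202_51_202_81_102_81",  # same instance, next frame
            ]
        }
    }
}

with open("track_predict.json", "w") as f:
    json.dump(track_predictions, f)
```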


Task 3 - End-to-end Video Text Spotting

This task aims to evaluate the end-to-end performance of video text spotting methods. It requires that words are correctly localised in every frame, correctly tracked over the video frames, and correctly recognized at the sequence level.

The evaluation also adopts the CLEAR-MOT [3] and VACE [4] evaluation frameworks (similar to Task 2). In this case, a predicted word is considered a true positive if its intersection over union with a ground-truth word is larger than 0.5 and the word recognition is correct. Specifically, MOTA, MOTP and ATA are used as the evaluation metrics, taking recognition performance into account.

We also propose a sequence-level evaluation protocol by adapting the metrics used in [2]. A predicted text sequence is regarded as a true positive only if it satisfies 2 constraints:

1) spatial-temporal localisation constraint: assuming a predicted text sequence matches its ground-truth sequence, we first compute the spatial intersection over union (IoU) with the corresponding ground-truth bounding box in each frame, and a predicted bounding box is a valid match if its IoU is larger than 0.5. On this basis, if the number of matched boxes in the sequence exceeds half of the union of predicted and ground-truth frames, the predicted sequence is treated as a candidate true positive.

2) recognition constraint: for text sequences satisfying constraint 1), their recognition results should also match the text transcriptions of the corresponding ground-truth sequences.
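The two constraints above can be sketched as a single check. This is an illustrative reading of the protocol, not the official evaluation code: the frame-indexed dictionaries, the pluggable `iou_fn` and the case-insensitive text comparison (per the Task 3 conditions) are assumptions about how a matched prediction/ground-truth pair would be represented.

```python
def is_sequence_true_positive(pred_seq, gt_seq, pred_text, gt_text, iou_fn):
    """Hypothetical sequence-level true-positive test.

    pred_seq / gt_seq map frame_id -> bounding box; iou_fn computes
    the IoU between two boxes (any representation it understands).
    """
    # constraint 1: spatial-temporal localisation
    shared = set(pred_seq) & set(gt_seq)
    matched = sum(1 for f in shared if iou_fn(pred_seq[f], gt_seq[f]) > 0.5)
    union_frames = set(pred_seq) | set(gt_seq)
    if matched <= len(union_frames) / 2:
        return False
    # constraint 2: sequence-level recognition must match (case-insensitive)
    return pred_text.lower() == gt_text.lower()

# toy usage with a trivial IoU that only matches identical boxes
exact = lambda a, b: 1.0 if a == b else 0.0
print(is_sequence_true_positive({1: "A", 2: "A"}, {1: "A", 2: "A"},
                                "STOP", "stop", exact))  # prints True
```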

Specifically, Recall, Precision and F_score are used for the sequence-level evaluation (notice that they are different from the detection metrics). We propose the sequence-level metrics to explicitly evaluate the performance of directly extracting text sequence information from videos, since in many real-world applications only the sequence-level information is needed, and the correct recognition of words in every frame is somewhat redundant and unnecessary.

This task follows similar evaluation details to Task 1, with some extra conditions:

  • Word recognition evaluation is case-insensitive. 
  • All symbols contained in the recognition results should be removed before submission.
  • Word recognition transcriptions of the same word object may differ across frames due to the text entering the frame, exiting the frame, or occlusions. Despite these obstructions, the sequence-level recognition results should be COMPLETE words.

In order to perform the evaluation, the participants are required to submit a JSON file containing all the predictions for all test videos. The JSON file in this task should be named "e2e_predict.json", compressed and uploaded as "answer.zip".

The submission format (similar to 'Task 2') is organized as:

     "frame_id, x1_y1_x2_y2_x3_y3_x4_y4, recognition"

We add the "recognition" result of each bounding box to the end of each line, and add a "text" field to indicate the sequence-level recognition result of each sequence.
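Under those rules, an end-to-end submission could be sketched as below. The per-line "frame_id, x1_y1_x2_y2_x3_y3_x4_y4, recognition" format and the sequence-level "text" field come from the description; the video name, sequence ID, "tracks" field name (carried over from Task 2) and all values are illustrative assumptions.

```python
import json

# Hypothetical layout of "e2e_predict.json" under the stated assumptions.
e2e_predictions = {
    "Video_10_3": {
        "1304": {
            "text": "STOP",   # sequence-level recognition: a COMPLETE word
            "tracks": [
                # per-frame line: frame_id, clockwise corners, recognition
                "1, 100_50_200_50_200_80_100_80, STOP",
                "2, 102_51_202_51_202_81_102_81, ST0P",  # per-frame results may differ
            ]
        }
    }
}

with open("e2e_predict.json", "w") as f:
    json.dump(e2e_predictions, f)
```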



[1]. M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. 2014. The Pascal Visual Object Classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98-136.

[2]. Cheng, Z.; Lu, J.; Niu, Y.; Pu, S.; Wu, F.; and Zhou, S. 2019. You Only Recognize Once: Towards Fast Video Text Spotting. In ACM Multimedia 2019.

[3]. K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, May 2008.

[4]. R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 319–336, 2009.

[5]. Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. 2013. ICDAR 2013 robust reading competition. In ICDAR. 1484–1493.


The dataset (download link is available on the 'Participate/Downloads' page once registered) contains 129 video clips (ranging from several seconds to over 1 minute long) from 21 real-life scenes. It extends the LSVTD dataset [2] by adding 15 videos for the 'harbor surveillance' scenario and 14 videos for the 'train watch' scenario, for the purpose of solving the video text spotting problem in industrial transportation applications.

The dataset has 5 major characteristics: 

1) large scale: it contains 129 video clips, larger than most existing scene video text datasets

2) diversified scenes: it covers 21 indoor and outdoor real-life scenarios, which are:

Indoor scenes                    Outdoor scenes
Scene type            Scene ID   Scene type             Scene ID
reading books         4          outdoor shopping mall
digital screen        5          pedestrian
indoor shopping mall  6          fingerposts
inside shops          7          street view            15
supermarket           8          train watch            17
metro station         9          city road              18
restaurant            10         harbor surveillance    19
office building       11         highway                20
hotel                 12
bus/railway station   13
bookstore             14
inside train          16
shopping bags         21

   "Scene ID" denotes the scene type in the dataset, and examples of different scenarios are given below.