Scene video text spotting (SVTS) is the task of localizing and recognizing text from continuous video frames; a typical system contains multiple modules: video text detection, text tracking and final recognition. It is an important research topic, because many real-world applications work in video mode, such as license plate recognition in intelligent transportation systems, road sign recognition in advanced driver assistance systems, or even online handwritten character recognition.
With the rapid development of deep learning techniques, great progress has been made in scene text spotting in static images. However, spotting text from video frames imposes much higher robustness requirements than static OCR tasks for many applications. That is, SVTS becomes much more challenging when facing various environmental interferences (e.g., camera shaking, motion blur, sudden illumination changes, etc.) as well as real-time response requirements. Therefore, it is necessary to research efficient and robust SVTS systems to meet the demands of many practical applications.
The SVTS 2021 competition will provide 129 video clips from 21 real-life scenarios. By establishing such a large-scale dataset with diversified scenarios, we hope more inspiring works can be proposed to solve this problem, which has very promising application prospects.
The competition has three tasks:
Notice that we only evaluate the performance of alphanumeric text instances in all three tasks. In the future, we intend to release a more challenging multilingual competition.
This task is to obtain the locations of words in each video frame in terms of their affine bounding boxes. The ground truth of each bounding box consists of 4 coordinate points.
We evaluate results based on a single Intersection-over-Union (IoU) criterion with a threshold of 50%, similar to the standard practice in object recognition and the Pascal VOC challenge. Recall, Precision and F-score will be used as the evaluation metrics.
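The official evaluation script is not reproduced here, but the IoU criterion for two 4-point boxes can be sketched in stdlib-only Python using Sutherland-Hodgman polygon clipping. All helper names below are our own, not part of the competition toolkit:

```python
def polygon_area(pts):
    # Shoelace formula for the area of a simple polygon.
    s = 0.0
    n = len(pts)
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def _ccw(pts):
    # Normalise vertex order to counter-clockwise for the clipping step.
    s = sum(pts[i][0] * pts[(i + 1) % len(pts)][1]
            - pts[(i + 1) % len(pts)][0] * pts[i][1] for i in range(len(pts)))
    return list(pts) if s > 0 else list(reversed(pts))

def _clip(subject, clipper):
    # Sutherland-Hodgman clipping of `subject` by the convex CCW `clipper`.
    def inside(p, a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def line_intersect(p1, p2, a, b):
        d = (p1[0] - p2[0]) * (a[1] - b[1]) - (p1[1] - p2[1]) * (a[0] - b[0])
        c1 = p1[0] * p2[1] - p1[1] * p2[0]
        c2 = a[0] * b[1] - a[1] * b[0]
        return ((c1 * (a[0] - b[0]) - (p1[0] - p2[0]) * c2) / d,
                (c1 * (a[1] - b[1]) - (p1[1] - p2[1]) * c2) / d)

    out = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        src, out = out, []
        if not src:
            break
        s = src[-1]
        for e in src:
            if inside(e, a, b):
                if not inside(s, a, b):
                    out.append(line_intersect(s, e, a, b))
                out.append(e)
            elif inside(s, a, b):
                out.append(line_intersect(s, e, a, b))
            s = e
    return out

def quad_iou(quad_a, quad_b):
    # IoU of two convex quadrilaterals given as lists of four (x, y) points;
    # a detection counts as a match when this value exceeds 0.5.
    a, b = _ccw(quad_a), _ccw(quad_b)
    inter_pts = _clip(a, b)
    inter = polygon_area(inter_pts) if len(inter_pts) >= 3 else 0.0
    union = polygon_area(a) + polygon_area(b) - inter
    return inter / union if union > 0 else 0.0
```

Recall, Precision and F-score then follow from counting matched predictions against ground truth at the 0.5 threshold.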
The participants are required to submit a JSON file containing predictions for all test videos. The JSON file in this task should be named "detection_predict.json"; compress it and upload the archive as "answer.zip". The JSON file should follow the format below:
All results are saved in a dict, and the field "bboxes" corresponds to all detected bounding boxes in each frame. Each bounding box is represented by 4 points in clockwise order.
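As an illustration, a submission file with this structure could be produced as follows. Only the "bboxes" field name is given in the description; the video and frame key layout below is an assumption:

```python
import json

# Hedged sketch of "detection_predict.json". The text specifies a dict with a
# "bboxes" field per frame; the video-name and frame-index keys are assumed.
predictions = {
    "Video_10_3": {          # video name (assumed key format)
        "1": {               # frame index (assumed)
            "bboxes": [
                # one box = 4 (x, y) points in clockwise order, flattened
                [1133, 340, 1184, 336, 1185, 356, 1134, 360],
            ]
        }
    }
}

with open("detection_predict.json", "w") as f:
    json.dump(predictions, f)
```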
Task 2 - Video Text Tracking
This task intends to track all the text streams in videos.
We evaluate results based on an adaptation of the CLEAR-MOT and VACE evaluation frameworks. We adapt these metrics to the specificities of text tracking by following established protocols, i.e., MOTA, MOTP and ATA will be used as the evaluation metrics.
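For reference, the MOTA score in CLEAR-MOT is conventionally defined as 1 - (FN + FP + IDSW) / GT, with counts accumulated over all frames. A minimal sketch of that formula (the official script additionally performs per-frame matching and computes MOTP and ATA, omitted here):

```python
def mota(num_misses, num_false_positives, num_id_switches, num_gt_objects):
    # CLEAR-MOT multiple object tracking accuracy:
    #   MOTA = 1 - (FN + FP + IDSW) / GT
    # where FN = missed ground-truth boxes, FP = spurious predictions,
    # IDSW = identity switches, GT = total ground-truth boxes.
    return 1.0 - (num_misses + num_false_positives
                  + num_id_switches) / num_gt_objects
```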
The evaluation details are the same as in Task 1 above.
In order to perform the evaluation, the participants are required to submit a JSON file containing predictions for all test videos. The JSON file in this task should be named "track_predict.json"; compress it and upload the archive as "answer.zip". The JSON format is as follows:
The tracking results of each video are saved in a single dict, which contains all tracked text sequences in that video. For instance, the first sequence in "Video_10_3" is represented by a unique ID "1304" (IDs of different sequences must be unique across the entire test set), and the field "tracks" includes the tracked bounding boxes of the text sequence. Each bounding box follows this format:
where "frame_id" is the frame number of the bounding box, and "x1_y1_x2_y2_x3_y3_x4_y4" gives the coordinates of the 4 points in clockwise order.
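Putting the pieces together, a tracking submission could be assembled as below. The video → sequence ID → "tracks" nesting follows the description, while the exact pairing of "frame_id" with the underscore-joined coordinates is an assumption:

```python
import json

# Hedged sketch of "track_predict.json"; the per-entry encoding (frame_id
# mapped to the underscore-joined coordinate string) is assumed.
submission = {
    "Video_10_3": {
        "1304": {  # sequence ID, unique across the entire test set
            "tracks": [
                {"24": "1133_340_1184_336_1185_356_1134_360"},
                {"25": "1130_342_1181_338_1182_358_1131_362"},
            ]
        }
    }
}

with open("track_predict.json", "w") as f:
    json.dump(submission, f)
```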
This task aims to evaluate the end-to-end performance of video text spotting methods. It requires that words are correctly localised in every frame, correctly tracked over the video frames, and also correctly recognized at the sequence level.
The evaluation also adopts the CLEAR-MOT and VACE evaluation frameworks (similar to Task 2). In this case, a predicted word is considered a true positive if its intersection over union with a ground-truth word is larger than 0.5 and the word recognition is correct. Specifically, MOTA, MOTP and ATA are used as the evaluation metrics, taking recognition performance into account.
We also propose a sequence-level evaluation protocol by adapting existing metrics. A predicted text sequence is regarded as a true positive only if it satisfies two constraints:
1) spatial-temporal localisation constraint: assuming a predicted text sequence matches a ground-truth sequence, we first compute the spatial intersection over union (IoU) with the corresponding ground-truth bounding box in each frame; a predicted bounding box is a valid match if its IoU is larger than 0.5. On this basis, if the number of matched frames exceeds half of the union of predicted and ground-truth frames, the predicted sequence is treated as a candidate true positive.
2) recognition constraint: for text sequences satisfying constraint 1), their recognition results should also match the text transcriptions of the corresponding ground-truth sequences.
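The two constraints above can be sketched as a single check. The track representation (a dict mapping frame ID to bounding box) and all names below are our own assumptions, not the official evaluation code:

```python
def is_sequence_true_positive(pred_track, gt_track, pred_text, gt_text,
                              iou_fn, iou_thresh=0.5):
    """Sketch of the two sequence-level true-positive constraints.

    pred_track / gt_track: dicts mapping frame_id -> bounding box (assumed
    representation); iou_fn computes the spatial IoU of two boxes.
    """
    # Constraint 1: spatial-temporal localisation. Count frames where the
    # predicted box matches (IoU > threshold), and require the count to
    # exceed half of the union of predicted and ground-truth frames.
    union_frames = set(pred_track) | set(gt_track)
    matched = sum(
        1 for f in set(pred_track) & set(gt_track)
        if iou_fn(pred_track[f], gt_track[f]) > iou_thresh
    )
    if matched <= len(union_frames) / 2:
        return False
    # Constraint 2: the sequence-level recognition result must match the
    # ground-truth transcription.
    return pred_text == gt_text
```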
Specifically, Recall, Precision and F-score are used for the sequence-level evaluation (note that these differ from the detection metrics). We propose the sequence-level metrics to explicitly evaluate the performance of directly extracting text sequence information from videos: in many real-world applications, only the sequence-level information is needed, and correctly recognizing words in every frame is somewhat redundant and unnecessary.
This task follows similar evaluation details to Task 1, with some extra conditions:
In order to perform the evaluation, the participants are required to submit a JSON file containing all predictions for all test videos. The JSON file in this task should be named "e2e_predict.json"; compress it and upload the archive as "answer.zip". The JSON file should follow the format below:
The submission format (similar to 'Task 2') is organized as:
"frame_id, x1_y1_x2_y2_x3_y3_x4_y4, recognition"
We add the "recognition" result of each bounding box to the end of each line, and add a "text" field to indicate the sequence-level recognition result of each sequence.
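An end-to-end submission could then be assembled as below, extending the Task 2 layout with the per-box "recognition" string and the sequence-level "text" field described above. The exact nesting and the example transcription are assumptions:

```python
import json

# Hedged sketch of "e2e_predict.json": Task 2 layout plus per-box recognition
# and a sequence-level "text" field; nesting and values are assumed.
submission = {
    "Video_10_3": {
        "1304": {
            "text": "EXIT",  # sequence-level recognition result
            "tracks": [
                # "frame_id, x1_y1_x2_y2_x3_y3_x4_y4, recognition"
                "24, 1133_340_1184_336_1185_356_1134_360, EXIT",
                "25, 1130_342_1181_338_1182_358_1131_362, EXIT",
            ]
        }
    }
}

with open("e2e_predict.json", "w") as f:
    json.dump(submission, f)
```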
M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98-136, 2014.
Z. Cheng, J. Lu, Y. Niu, S. Pu, F. Wu, and S. Zhou. You Only Recognize Once: Towards Fast Video Text Spotting. In ACM Multimedia, 2019.
K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, vol. 2008, May 2008.
R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):319-336, 2009.
D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. Robles Mestre, J. Mas, D. Fernandez Mota, J. Almazan Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pp. 1484-1493, 2013.
The dataset (download link is available on the 'Participate/Downloads' page once registered) contains 129 video clips (ranging from several seconds to over 1 minute long) from 21 real-life scenes. It extends the LSVTD dataset by adding 15 videos for the 'harbor surveillance' scenario and 14 videos for the 'train watch' scenario, for the purpose of solving the video text spotting problem in industrial transportation applications.
The dataset has 5 major characteristics:
1) large scale: it contains 129 video clips, larger than most existing scene video text datasets
2) diversified scenes: it covers 21 indoor and outdoor real-life scenarios, which are:
| Indoor scene type | Scene ID | Outdoor scene type | Scene ID |
|---|---|---|---|
| reading books | 4 | outdoor shopping mall | 1 |
| indoor shopping mall | 6 | fingerposts | 3 |
| inside shops | 7 | street view | 15 |
| metro station | 9 | city road | 18 |
"Scene ID" denotes the identifier of each scene type in the dataset, and examples of the different scenarios are given below.