With the ever-increasing demand for the creation of a digital world, many Optical Character Recognition (OCR) algorithms have been developed over the years. A script can be defined as the graphic form of the writing system used to write a statement. The availability of large numbers of scripts makes the development of a universal OCR a challenging task. This is because the features needed for character recognition are usually a function of structural script properties and of the number of possible classes or characters. The extremely high number of available scripts makes the task daunting, and as a result, most OCR systems are script-dependent. Documents in a multi-script environment are therefore handled in two steps: first, the script of the document, block, line or word is estimated, and second, the appropriate OCR is applied. This approach requires a script identifier and a bank of OCRs, with one OCR per possible script.
Many script identification algorithms have been proposed in the literature. Script identification can be conducted either offline, from scanned documents, or online, if the writing sequence is available. Identification can also be classified as either printed or handwritten, with the latter being the more challenging. Script identification can be performed at different levels: page or document, paragraph, block, line, word, and character. As in any classical classification problem, the difficulty of script identification depends on the number of possible classes, that is, the scripts to be detected. Furthermore, any similarity in the structure of scripts represents an added challenge.
Benchmarking studies on script identification in the literature use different datasets with different script combinations, which makes a fair comparison of the approaches difficult. Moreover, the databases employed in related studies usually include two to four scripts; only a few include more. To alleviate this drawback, this competition offers a database for script identification that covers a wide variety of some of the most commonly used scripts, collected from real-life printed and handwritten documents. The printed documents in the database were obtained from local newspapers and magazines and therefore comprise different fonts and sizes as well as cursive and bold text. The handwritten part was obtained from volunteers from different parts of the world, who scanned and shared their manuscripts.
Each Task is evaluated separately: printed text, handwritten text, and mixed.
The evaluation metric considered will be the Average Classification Accuracy (%).
• Task 1: Script identification in handwritten samples.
• Task 2: Script identification in printed samples.
• Task 3: Mixed script identification: systems are trained and tested with both handwritten and printed samples. Note that each sample only includes one type of text (printed or handwritten).
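The evaluation metric can be illustrated with a short Python sketch. This page does not specify whether the average is taken over all samples or over scripts, so both variants are shown; the function names and the list-of-labels input format are assumptions made for illustration and are not the official scoring code.

```python
from collections import defaultdict


def classification_accuracy(y_true, y_pred):
    """Percentage of samples whose predicted script equals the true script."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def per_script_average_accuracy(y_true, y_pred):
    """Mean of the per-script accuracies, in percent (assumed variant)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    totals = defaultdict(int)  # number of samples per true script
    hits = defaultdict(int)    # correctly classified samples per true script
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return 100.0 * sum(hits[s] / totals[s] for s in totals) / len(totals)


# Toy labels, not taken from the competition data.
y_true = ["Arabic", "Arabic", "Thai", "Tamil"]
y_pred = ["Arabic", "Thai", "Thai", "Tamil"]
overall = classification_accuracy(y_true, y_pred)
per_script = per_script_average_accuracy(y_true, y_pred)
```

The two variants differ only when the scripts have unequal numbers of samples; the per-script average gives every script the same weight regardless of how many samples it has.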
The platform used in the competition is CodaLab. Participants need to register to take part in the competition. Please follow these instructions:
1) Sign up in CodaLab using your team name. Important: submissions are limited per team. Therefore, only one member of the team should submit the results.
2) In CodaLab, join the SIW 2021: 1ST COMPETITION ON SCRIPT IDENTIFICATION IN THE WILD. Just click on the “Participate” tab to register.
Anonymous participants: participants are allowed to decide at the end of the competition whether they would like to include their names and affiliations in the competition report or not. Nevertheless, the organizers might include the results and some information about the systems but always completely anonymized.
About the Dataset: In this competition we introduce a database, which contains both printed and handwritten documents obtained from a wide variety of scripts, such as Arabic, Bengali, Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Thai. The dataset consists of 1137 documents scanned from local newspapers, as well as handwritten letters and notes.
Training Data: the organizers will provide the participants with the MDIW-13 database once they have sent us the corresponding license agreement (see the Get Data section for details). In addition, the participants can freely use other databases to train their systems. The training data include a total of 30,861 samples, divided into 21,974 printed and 8,887 handwritten samples.
Final Evaluation (Test Data): the organizers will ask the participants to submit their final evaluation results using CodaLab. The test dataset (together with the submission file example) will be published on 11th April 2021 (without ground truth) and again at the end of the competition (with ground truth). Participants are allowed to submit the results achieved by multiple approaches/models.
Final Ranking: a ranking will be provided for each of the tasks (see the Tasks section).
For more information regarding the submission process with CodaLab, please read the Submission Format tab.
The organizers might need to verify the truthfulness of the scores submitted if necessary.
As commented before, this is an on-going competition. Participants can test their developed systems using the validation dataset (without ground-truth) and corresponding signature comparison files through the CodaLab platform of the competition. Validation results are updated in CodaLab in real-time.
• 15th December 2020: Registration starts and train dataset available.
• 11th April 2021: Test dataset available.
• 27th April 2021: Registration closes.
• 27th April 2021: Algorithm submission deadline.
• 3rd May 2021: Results and report announcement.
The data are available only after the license agreement has been signed. The License Agreement must be signed by a permanent faculty member.
1) Download the License Agreement: Link
2) Send the signed License Agreement to firstname.lastname@example.org with the subject line "SIW 2021 registration" and the following information:
3) You will receive an email with the link to download the training data.
The test data will be available on 11th April 2021. Information on how to download the data, together with instructions for preparing the submission file, will be announced on this page.
Start: Dec. 31, 2020, midnight
Description: In the training phase, you can download the training data and train your model. We recommend splitting the training data into training and development subsets to train and test your models. There are no submissions during this phase.
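One way to make the recommended training/development split is sketched below in Python. It stratifies by script so that every script appears in both subsets. The `(sample_id, script)` pair format, the function name and the 20% development fraction are assumptions for illustration only; the organizers do not prescribe a particular split.

```python
import random
from collections import defaultdict


def stratified_split(samples, dev_fraction=0.2, seed=0):
    """Split (sample_id, script) pairs into train and development lists.

    Each script contributes roughly `dev_fraction` of its samples to the
    development list. A script with a single sample stays in training.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    by_script = defaultdict(list)
    for sample in samples:
        by_script[sample[1]].append(sample)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, dev = [], []
    for script in sorted(by_script):
        items = by_script[script]
        rng.shuffle(items)
        if len(items) > 1:
            n_dev = max(1, round(len(items) * dev_fraction))
            n_dev = min(n_dev, len(items) - 1)  # keep at least one for training
        else:
            n_dev = 0
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])
    return train, dev


# Toy sample list; the identifiers are hypothetical.
samples = [(f"a{i}", "Arabic") for i in range(10)] + [(f"t{i}", "Thai") for i in range(5)]
train, dev = stratified_split(samples, dev_fraction=0.2, seed=1)
```

For Task 3 it may also be worth stratifying by text type (printed or handwritten) in addition to script, so that the development subset reflects the mixed condition.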
Start: April 11, 2021, midnight
Description: In the test phase, you can make up to 10 submissions per day. Please read the evaluation protocol instructions for more details.
April 27, 2021, midnight