ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX

Organized by icdariitgn


Welcome to ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX!

Table recognition is a well-studied problem in document analysis, and many academic and commercial approaches have been developed to recognize tables in several document formats, including plain text, scanned page images, and born-digital, object-based formats such as PDF. Several existing systems can convert tables in text-based PDFs into structured representations; however, there is limited work on recognizing table content directly from images.

The challenge aims at assessing the ability of state-of-the-art methods to recognize scientific tables in LaTeX format. In particular, the problem is split into two subtasks:

  • Subtask I: Table structure reconstruction (S): Reconstructing the structure of a table in the form of LaTeX symbols and code

  • Subtask II: Table content reconstruction (C): Reconstructing and recognizing the content of a table in the form of LaTeX symbols and code

 

Tasks

Our shared task has two subtasks. Subtask-I and Subtask-II evaluate machine-learning models' performance on two broader table recognition tasks.

  • Subtask-I: Table structure reconstruction         

           In this subtask, you are given an image of a table along with its corresponding LaTeX code, and you need to reconstruct the LaTeX structural tokens that define the table.

           

  • Subtask-II: Table content reconstruction

           In this subtask, you are given an image of a table along with its corresponding LaTeX code, and you need to reconstruct the LaTeX content tokens that make up the table.

           

 

Timeline:

  • Registration Period: 15th Oct 2020 to 28th Feb 2021
  • Release of training and validation set: 20th Oct 2020
  • Release of test set: 01st Mar 2021
  • Submission Deadline: 31st Mar 2021
  • Post-Evaluation Phase Starts: 01st Apr 2021

 

Contact Us

Please contact at icdariitgn@gmail.com if you have any further queries.

Evaluation

For both subtasks, participants are required to submit prediction files in the specified submission format.

The tasks are scored with Exact Match Accuracy and Exact Match Accuracy @ 95% similarity as common evaluation metrics.

In addition, task-specific metrics include:

  • Row Prediction Accuracy and Column Prediction Accuracy for the Table structure reconstruction task
  • Alpha-Numeric Character Prediction Accuracy, LaTeX Token Accuracy, LaTeX Symbol Accuracy, and Non-LaTeX Symbol Prediction Accuracy for the Table content reconstruction task

The description of each metric is as follows:

  1. Exact Match Accuracy: Fraction of predictions that match the ground truth exactly
  2. Exact Match Accuracy @ 95% similarity: Fraction of predictions with at least 95% similarity to the ground truth
  3. Row Prediction Accuracy: Fraction of predictions whose row count equals the row count of the ground truth
  4. Column Prediction Accuracy: Fraction of predictions whose count of cell alignment ('c', 'r', 'l') tokens equals the count of cell alignment tokens in the ground truth
  5. Alpha-Numeric Character Prediction Accuracy: Fraction of predictions that have the same alphanumeric characters as the ground truth
  6. LaTeX Token Accuracy: Fraction of predictions that have the same LaTeX tokens as the ground truth
  7. LaTeX Symbol Accuracy: Fraction of predictions that have the same LaTeX symbols as the ground truth
  8. Non-LaTeX Symbol Prediction Accuracy: Fraction of predictions that have the same non-LaTeX symbols as the ground truth
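As an illustration, the row and column metrics above can be sketched as follows, assuming whitespace-tokenized LaTeX sequences. The token choices (counting "\\" for row breaks and 'c'/'r'/'l' for column alignment) are our reading of the metric descriptions, not the official evaluation script.

```python
# Sketch of Row Prediction Accuracy and Column Prediction Accuracy,
# assuming predictions and ground truths are whitespace-separated
# LaTeX token sequences. These token choices are assumptions.

def tokenize(code):
    return code.split()

def row_count(tokens):
    # Rows end with the two-character LaTeX row-break token "\\".
    return tokens.count("\\\\")

def column_count(tokens):
    # Alignment tokens appear in the column specification, e.g. { c | c r }.
    return sum(1 for t in tokens if t in ("c", "r", "l"))

def accuracy(preds, golds, count_fn):
    # Fraction of predictions whose count matches the ground truth's count.
    hits = sum(count_fn(tokenize(p)) == count_fn(tokenize(g))
               for p, g in zip(preds, golds))
    return hits / len(golds)
```

For the structure task, cell contents are placeholders (e.g. CELL), so counting bare 'c'/'r'/'l' tokens does not collide with cell text.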

 

Example:

To calculate Exact Match Accuracy @ 95% similarity between the ground truth target sequence and the predicted target sequence, we use the Longest Common Subsequence (LCS) algorithm to compute the percentage similarity, with the minimum similarity threshold set to 95%.

The ground truth target sequence (G) for the Table structure recognition task is { c | c c c } & \multicolumn { 3 } { c } \\ & & & \\ \hline \hline & & \\ & & & \\ \hline \multicolumn { 3 } { c } (No. of tokens = 37)

and the predicted target sequence (P) is { c | c c } & \multicolumn { 2 } { c } \\ & & & \\ \hline \hline & & \\ & & & \\ \hline \multicolumn { 3 } { c } (No. of tokens = 36)

The longest common subsequence between G and P is } { c } \\ & & & \\ \hline \hline & & \\ & & & \\ \hline \multicolumn { 3 } { c }.

Thus, the percentage similarity is 70.27% (26/37). Since this falls below the 95% threshold, the prediction does not count as a match under Exact Match Accuracy @ 95% similarity.
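The similarity computation in this example can be sketched with a standard dynamic-programming LCS over token sequences. Whitespace tokenization and dividing by the ground-truth length follow the worked example above; this is an illustrative sketch, not the official evaluation script.

```python
# Token-level LCS similarity, as in the example above: similarity is
# LCS length divided by the number of ground-truth tokens.

def lcs_length(a, b):
    # Rolling one-row dynamic program; dp[j] is the LCS length of the
    # prefixes processed so far against b[:j].
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # old dp[j-1] from the previous row
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def similarity(gold, pred):
    g, p = gold.split(), pred.split()
    return lcs_length(g, p) / len(g)

def is_match_at_95(gold, pred):
    return similarity(gold, pred) >= 0.95
```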

 

Note: The final evaluation will be based on Exact Match Accuracy for both subtasks on the test dataset.

For more details visit the FAQ page in the "Learn the Details" tab.

 

Submission Format

To submit your results to the leaderboard, you must construct a submission.zip file containing a single prediction file in .txt format, named according to the convention given below.

This prediction file should follow the format detailed in the subsequent section.

File Submission format

The prediction file should contain the LaTeX code for the corresponding Image IDs.

Note: Each LaTeX code must be on a new line for its corresponding Image ID.

File Naming Convention

The naming conventions of the prediction file for the corresponding sub-tasks are given below:

  • Table Structure Reconstruction (Validation Phase): submission_tsr_val.txt
  • Table Content Reconstruction (Validation Phase): submission_tcr_val.txt
  • Table Structure Reconstruction (Testing & Post-Evaluation Phase): submission_tsr_test.txt
  • Table Content Reconstruction (Testing & Post-Evaluation Phase): submission_tcr_test.txt

 

Submission Archive

To upload your results to CodaLab, you have to zip the prediction file into a flat zip archive (the file can't be inside a folder within the archive).

You can create a flat archive with the following command, provided the .txt file is in your current directory:

$ zip -j submission.zip submission_tsr_test.txt
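If you prefer Python, the same flat archive can be produced with the standard library. `make_flat_submission` is a hypothetical helper name, and the sketch assumes the prediction file already exists:

```python
# A standard-library alternative to `zip -j`, for environments without
# the zip CLI. Writing the entry under its bare filename (no directory
# part) keeps the archive flat, as the submission checker requires.
import os
import zipfile

def make_flat_submission(prediction_path, archive_path="submission.zip"):
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any leading directories so the .txt sits at the root.
        zf.write(prediction_path, arcname=os.path.basename(prediction_path))
```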

Note: Team details must be updated on the CodaLab account settings page for leaderboard visibility and result evaluation.

Sample Submission Download: Link 

 

 

Submission Process

Making a submission involves several steps:

  • Navigate to 'Participate' and go to 'Submit/View Results'

  • Select the subtask for which you want to make a submission
  • Write a brief description of your model

  • Click the button 'Submit / View Results'

  • Upload your zipped submission

  • Wait until the evaluation status turns to 'Finished' or 'Failed'

If the submission status is 'Failed' (*), you can click 'View scoring output log' and 'View scoring error log' to see the debug logs; the system will also send you a message if there is an error with the submission file.

 

(*) If the error log raises an exception containing "File "/worker/worker.py", line 330, in run", the CodaLab workers may simply be busy at the moment. If you face this problem and have no submission chances left for the day, please contact us to delete your failed submissions so that you can submit again.

Terms and Conditions

  1. By submitting results to this competition, you consent to the public release of your scores at ICDAR 2021 and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value rests with the task organizers.
  2. You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  3. You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
  4. You agree not to redistribute the test data except in the manner prescribed by its license.
  5. You agree to us storing your submission results for evaluation purposes.
  6. You agree that if you place in the top-10 at the end of the challenge you will submit your code so that we can verify that you have not cheated.
  7. Each team must create and use exactly one CodaLab account.
  8. Team constitution (members of a team) cannot be changed after the evaluation phase has begun.
  9. The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  10. Winning teams are strongly encouraged to release their code to help the community to reproduce their results and develop better models. If code cannot be released, winning teams must at least allow the organizers to reproduce their method and results for legitimacy verification.

Dataset

The full dataset for both subtasks can be found in the "Participate" tab (only registered participants can access it).

Data description

The training data for each subtask contains the following files:

  • Training Images
  • src-train.txt: contains training Image Ids
  • tgt-train.txt: contains LaTeX code for training Image Ids

The validation data for each subtask contains the following files:

  • Validation Images
  • src-val.txt: contains validation Image Ids

The test data for each subtask contains the following files:

  • Test Images
  • src-test.txt: contains test Image Ids

Frequently Asked Questions

 

Q1.  What is the size of the dataset with specific numbers for each task (training set - test - validation set)?

A1. The size of the dataset for both subtasks is given as follows:

We abbreviate Table structure reconstruction task dataset as TSRD and Table content reconstruction task dataset as TCRD.

For the TSR dataset, we take tables with fewer than 250 tokens; for the TCR dataset, we take tables with fewer than 500 tokens.

 

Q2. Will the code of the competitors be available for the research community (reproducibility of the results)?

A2. It is mandatory for participants to make their code available for reproducibility. The dataset provided for this task is licensed under the CC BY-NC-SA 4.0 international license, and the evaluation script is provided under the MIT License.

 

Q3. Will there be an award for all the proposed tasks?

A3. Awards will be given for both of the proposed subtasks:

  • Table structure reconstruction task

  • Table content reconstruction task

 

Q4. What are some examples for the two tasks?

A4. Examples:

Table Structure Reconstruction:

{ | c c | } \hline \multicolumn { 2 } { | c | } CELL \\ \hline \multicolumn { 2 } { | c | } CELL \\ \multicolumn { 2 } { | c | } CELL \\ \multicolumn { 2 } { | c | } CELL \\ \multicolumn { 2 } { | c | } CELL \\ \multicolumn { 2 } { | c | } CELL \\ \hline

 

Table Content Reconstruction:

 

$ T _ { \mathbf { D } 1 } = p _ { 1 1 } \frac { t _ { \mathbf { A } } + \mathbf { p } - \frac { \mathbf { r } } { 2 } } { 2 t _ { \mathbf { D } } } + p _ { 1 2 } \frac { t _ { \mathbf { D } } + \mathbf { p - d - r } } { 2 t _ { \mathbf { D } } } + $ \\ $ p _ { 1 3 } \frac { t _ { \mathbf { A } } + t _ { \mathbf { D } } - 2 \mathbf { r + p - d } } { 4 t _ { \mathbf { D } } } . $

 

 

 

The organizers of the competition are:

 

Pratik Kayal, IIT Gandhinagar, India

Email: pratik.kayal@iitgn.ac.in

 

Mrinal Anand, IIT Gandhinagar, India

Email: mrinal.anand@iitgn.ac.in

 

Harsh Desai, IIT Gandhinagar, India

Email: hsd31196@gmail.com

 

Prof. Mayank Singh, IIT Gandhinagar, India

Email: singh.mayank@iitgn.ac.in

(S) Table Structure Reconstruction: Validation

Start: Oct. 14, 2020, midnight

Description: In the validation phase, you can make up to 50 successful submissions per day, and the online evaluation quota is refreshed at the stipulated time each week. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method.

(C) Table Content Reconstruction: Validation

Start: Oct. 14, 2020, midnight

Description: In the validation phase, you can make up to 50 successful submissions per day, and the online evaluation quota is refreshed at the stipulated time each week. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method.

(S) Table Structure Reconstruction: Testing

Start: March 1, 2021, midnight

Description: In the testing phase, you can make up to 50 successful submissions per day, and the online evaluation quota is refreshed at the stipulated time each week. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method. The results will be revealed after the final check.

(C) Table Content Reconstruction: Testing

Start: March 1, 2021, midnight

Description: In the testing phase, you can make up to 50 successful submissions per day, and the online evaluation quota is refreshed at the stipulated time each week. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method. The results will be revealed after the final check.

(S) Table Structure Reconstruction: Post-Evaluation

Start: April 1, 2021, midnight

Description: In the Post-Evaluation phase, you can make any number of successful submissions per day. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method.

(C) Table Content Reconstruction: Post-Evaluation

Start: April 1, 2021, midnight

Description: In the Post-Evaluation phase, you can make any number of successful submissions per day. Failed submissions on CodaLab (this website) caused by unexpected issues can be re-submitted, but please do not do this intentionally. Please enter a description of your method.

Competition Ends

Never
