SemEval Task 9: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACT)

Organized by semtabfacts



Tables are ubiquitous in documents and presentations, conveying important information in a concise manner. This is true across many domains, from scientific to government documents. In fact, the text surrounding tables in these articles often contains statements that summarize or highlight information derived from the primary source data in the tables. Describing all the information provided in a table in a readable manner would be lengthy and considerably more difficult to understand. We present a task for statement verification and evidence finding using tables from scientific articles. This important task promotes proper interpretation of the surrounding article.



The task will have two subtasks to explore table understanding:

A: Table Statement Support

Does the table support/refute the given statement?

B: Relevant Cell Selection

Which cells in the table provide evidence for supporting/refuting the statement?


For more information and to download the data, please visit our website at 

Please join our Google Groups at


Important notes

You are allowed to submit multiple runs on CodaLab (up to 10 in each phase), but only your last submission will be considered. Our intent is that each team submit only one model to the competition; the allowance for multiple submissions exists in case of submission bugs. You will have an opportunity to submit more models and experiments after the competition period has ended. Please do not create multiple CodaLab accounts. Each team should have only one account, and teams found with multiple accounts will be contacted by the organizers; again, only their final submission will be considered.

How will I know my score and the leaderboard?

After submission, each team should submit their system and team information to . We will email scores only to teams that have completed this form after the competition period has ended. Shortly after this period, we will also release the leaderboard to all participants who have responded to our questionnaire.


Evaluation Criteria

Task A

The goal of Task A is to determine whether a statement is entailed or refuted by the given table, or whether, as in some cases, this cannot be determined from the table.

There will be two evaluation methods:

  1. The first will be a standard precision/recall evaluation (Three Way) of a multi-class classification that evaluates whether each statement was classified correctly as Entailed / Refuted / Unknown. This will test whether the classification algorithm understands cases where there is insufficient information to make a determination.
  2. The second, simpler evaluation (Two Way) will remove statements with the "unknown" ground-truth label from the evaluation. However, this metric will still penalize misclassifying a Refuted/Entailed statement as unknown. The score used for ranking is the F1 score for both evaluation methods.

For score averaging, we first average the scores over all statements in each table, then average these per-table scores across all tables to obtain the final F1 score.
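To illustrate, the two evaluation methods and the two-level averaging can be sketched as follows. This is a minimal sketch of our reading of the scheme: the exact F1 variant (here, macro-averaged over labels within each table) is an assumption, and the official evaluation script on the website is authoritative.

```python
def f1_per_table(pairs, two_way=False):
    """Macro F1 over one table's (predicted, gold) label pairs.

    In the Two Way setting, statements whose gold label is "unknown"
    are dropped, but predicting "unknown" for an entailed/refuted
    statement still counts as an error.
    """
    if two_way:
        pairs = [(p, g) for p, g in pairs if g != "unknown"]
    if not pairs:
        return None  # table contributes nothing in this setting
    labels = {g for _, g in pairs} | {p for p, _ in pairs}
    scores = []
    for lab in labels:
        tp = sum(1 for p, g in pairs if p == lab and g == lab)
        fp = sum(1 for p, g in pairs if p == lab and g != lab)
        fn = sum(1 for p, g in pairs if p != lab and g == lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def overall_f1(tables, two_way=False):
    """Average per-table F1 across all tables (list of pair lists)."""
    per_table = [s for t in tables
                 if (s := f1_per_table(t, two_way)) is not None]
    return sum(per_table) / len(per_table)
```

For example, a table with one correctly classified "entailed" statement and one "unknown" statement mislabeled "refuted" scores 1/3 under Three Way but 1.0 under Two Way only if the mislabeled statement is dropped, which is exactly what the "unknown"-gold filter does.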

Please note that in the actual competition, only one submission will be accepted per team, so a team may not enter two systems that each specialize in the two-way and three-way evaluation metrics. We will, however, show rankings for both evaluation schemes at the end of the evaluation period.

Task B

In Task B, the goal is to determine, for each cell and each statement, whether the cell is within the minimum set of cells needed to provide evidence for the statement ("relevant") or not ("irrelevant"). In other words, if the table were shown with all other cells blurred out, would this be enough for a human to reasonably determine that the table entails or refutes the statement?
The evaluation will calculate the recall and precision over cells, with "relevant" cells as the positive category. As in Task A, the score will be averaged over all statements in each table first, before averaging across all tables.
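A per-statement cell score under this definition might look like the following sketch, where cells are represented as (row, column) pairs and any cell not in a set is implicitly "irrelevant". The cell representation is an assumption for illustration; consult the released evaluation script for the exact computation.

```python
def cell_f1(pred_relevant, gold_relevant):
    """F1 for one statement, with "relevant" as the positive class.

    pred_relevant / gold_relevant: sets of (row, col) cells marked
    relevant; all other cells count as irrelevant.
    """
    tp = len(pred_relevant & gold_relevant)
    precision = tp / len(pred_relevant) if pred_relevant else 0.0
    recall = tp / len(gold_relevant) if gold_relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that marking every cell irrelevant yields an empty predicted set and therefore a score of zero whenever any gold cell is relevant.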

For some statements, there may be multiple minimal sets of cells that can be used to determine statement entailment or refutation. In such cases, our ground truth will contain all of these versions. We compare the submission against each ground-truth version and record only the highest score. Participants should submit only one version per cell per statement, as all versions besides the highest-scoring one will be discounted.
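The max-over-versions rule can be sketched as below (self-contained, with cells again assumed to be (row, column) pairs; the function name is illustrative):

```python
def best_version_f1(pred_relevant, gold_versions):
    """Highest F1 over the alternative minimal evidence sets.

    pred_relevant: set of cells the submission marked "relevant".
    gold_versions: list of sets, one per minimal ground-truth version.
    """
    best = 0.0
    for gold in gold_versions:
        tp = len(pred_relevant & gold)
        p = tp / len(pred_relevant) if pred_relevant else 0.0
        r = tp / len(gold) if gold else 0.0
        best = max(best, 2 * p * r / (p + r) if p + r else 0.0)
    return best
```

So a submission that exactly matches any one minimal evidence set receives full credit for that statement, regardless of the other versions.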

The statements in Task B will be the subset of statements from Task A that are refuted or entailed (unknown statements will be excluded). As this reveals the ground truth for Task A, the Task B phase only begins after the Task A phase has ended. See Important Dates for more information.


** Please note the very short evaluation periods for both tasks. We provide the dev-set submission phase so that all participants can test their submissions ahead of time and ensure that they comply with our evaluation code and the CodaLab submission system. We also provide the evaluation script on our website. **

Important Dates

Development dataset ready: December 3, 2020
Task A Evaluation Period: Jan 20 - Jan 22, 2021 (Noon UTC)
Task B Evaluation Period: Jan 27 - Jan 29, 2021 (Noon UTC)
Paper submission due: February 23, 2021
Notification to authors: March 29, 2021
Camera ready due: April 5, 2021
SemEval workshop: Summer 2021

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

By downloading the data or by accessing it in any manner, you agree to abide by the CC BY-4.0 license, as described here

Submission format

Your submission zip should contain one solution file for each input file, with the same name.

For Task A, in the file itself, ensure that each statement relationship is specified as one of the following strings: "entailed", "refuted", or "unknown". Leaving this empty or entering a different value will cause an error.

Similarly, for Task B, ensure that the evidence for each cell, for each statement, is specified as either the string "relevant" or the string "irrelevant". Leaving this empty will cause all cells to be considered irrelevant, resulting in a very poor score. Double-check your spelling!
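The zip layout described above can be produced as in the sketch below. The per-file content format is defined by the task's released data and evaluation script, so only the archive structure is shown here; file names and contents are placeholders.

```python
import zipfile

def write_submission(solutions, zip_path):
    """Bundle one solution file per input file, mirroring input names.

    solutions: dict mapping an input file's name to the text of its
    solution file. Each entry is written at the archive root under
    the same name, as the submission rules require.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in solutions.items():
            zf.writestr(name, content)
```

A quick sanity check before uploading is to reopen the zip and confirm that its name list matches the input file names exactly.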


Development phase

Start: Nov. 1, 2020, noon

Description: In this phase, submit your predictions on the dev set.

SubTask A: Statement verification

Start: Jan. 19, 2021, noon

Description: !!! PHASE NOW CLOSED !!! In this phase, submit your predictions (entailed, refuted, unknown) on the test set.

SubTask B: Evidence finding

Start: Jan. 27, 2021, noon

Description: In this phase, given the ground-truth fact verification (entailed or refuted), determine the relevance of each cell in the table to the statement.

Competition Ends

Jan. 29, 2021, noon
