SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals

Organized by Ariel_yang


Practice (Training data ready)
Sept. 1, 2019, midnight UTC


Evaluation (Test data ready)
Jan. 10, 2020, midnight UTC


Competition Ends

SemEval-2020 Task 5: Detecting Counterfactual Statements

To contact our competition organizer, please email us. Our email address:

To model counterfactual semantics and reasoning in natural language, our shared task aims to provide a benchmark for two basic problems.

Subtask1: Detecting counterfactual statements

In this task, you are asked to determine whether a given statement is counterfactual. Counterfactual statements describe events that did not or cannot happen, together with the possible consequences had those events happened. More specifically, counterfactuals describe events counter to facts and hence naturally involve common sense, knowledge, and reasoning. Tackling this problem is the basis for all downstream counterfactual-related causal inference analysis in natural language. For example, the following two statements, one from the healthcare domain and one from the finance domain, are counterfactuals that need to be detected:

  • Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.
  • Finance Minister Jose Antonio Meade noted that if a jump in tomato prices had been factored out, inflation would have begun to drop.
The examples above were chosen for clarity; real statements are much harder for computers to judge.

Subtask2: Detecting antecedent and consequence

Conveying causal insight is an inherent characteristic of counterfactuals. To further detect the causal knowledge conveyed in counterfactual statements, subtask 2 aims to locate the antecedent and consequent in counterfactuals.
According to Goodman (1947), a counterfactual statement can be converted to a contrapositive with a true antecedent and consequent. Consider the “post-traumatic stress” example discussed above; it can be transposed into “because her post-traumatic stress was not avoided, (we know) a combination of paroxetine and exposure therapy was not prescribed”. Such knowledge can be used not only to analyze the specific statement but also, accumulated across corpora, to develop domain causal knowledge (e.g., a combination of paroxetine and exposure therapy may help cure post-traumatic stress).

Evaluation Criteria

A valid submission file for CodaLab is a zip-compressed file containing the following files:

  • subtask1.csv
  • subtask2.csv
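
As a rough illustration, the two prediction files could be packaged with Python's standard `zipfile` module. This is only a sketch: the helper name `build_submission` and the directory layout are hypothetical, and placing both files at the top level of the archive is an assumption, since this page does not state where inside the zip the files must sit.

```python
import zipfile
from pathlib import Path


def build_submission(csv_dir, zip_path):
    """Zip subtask1.csv and subtask2.csv from csv_dir into zip_path.

    Hypothetical helper: returns the names stored in the archive.
    """
    csv_dir = Path(csv_dir)
    required = ["subtask1.csv", "subtask2.csv"]
    missing = [name for name in required if not (csv_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing prediction files: {missing}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in required:
            # arcname stores the file at the top level of the archive (assumption).
            archive.write(csv_dir / name, arcname=name)
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()
```

Calling `build_submission("predictions", "submission.zip")` would then produce a single archive to upload, and raise an error early if either file is absent.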

Example of subtask1.csv


"6001846","0","Then there's the hyperbole: "If Congress does away with or reduces mandatory minimum sentences, they may as well fold the tent on drug prosecution as a whole." And "The current movement has no statistical support for revising the mandatorys; we are headed for a crime-ridden future." Hear that? Either give prosecutors whatever they say they need, without regard to justice or fairness or cost-effectiveness or any of those other namby-pamby ideals, or we might as well leave the prison doors open and let the crackheads come for you and your family."

"6000627","1","Had Russia possessed such warships in 2008, boasted its naval chief, Admiral Vladimir Vysotsky, it would have won its war against Georgia in 40 minutes instead of 26 hours."

  • ID: the identifier of the sentence you are labeling
  • label: 1 if you judge the sentence to be counterfactual, otherwise 0
  • sentence: the original sentence, as given in the provided dataset
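
A minimal sketch of writing rows in this column order (ID, label, sentence) with Python's `csv` module follows. The function name `write_subtask1` and the sample sentence are hypothetical. Note one assumption in particular: the `csv` module escapes a double quote inside a field by doubling it, whereas the first example row above shows inner quotes written as-is, so check what the scorer accepts before relying on this.

```python
import csv
import io


def write_subtask1(rows, handle):
    """Write (ID, label, sentence) triples with every field quoted."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for sentence_id, label, sentence in rows:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([sentence_id, label, sentence])


# Hypothetical ID and sentence, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_subtask1(
    [("0000001", 1, "If it had rained, the match would have been cancelled.")],
    buffer,
)
print(buffer.getvalue(), end="")
```

To produce the actual file, pass a handle opened with `open("subtask1.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.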

Example of subtask2.csv


"9000457","Mr Gladwell cites the six forms of address in Korean, based on seniority, and the high score of Koreans on a psycho-cultural ranking called the Power Distance Index; he says that in a series of Korean Air crashes in the 1980s and 1990s, flight officers were reluctant to indicate to the captain that anything was wrong except in elliptical and confusing ways, because it would have been viewed as criticism of a higher-up.","it would have been viewed as criticism of a higher-up",""

"8000079","He says that number would have been much higher were it not for the work of organizations such as the Mentor Initiative, which provides disease control, and the Syria Relief Network, a group of humanitarian organizations working inside Syria and neighbouring countries.","were it not for the work of organizations such as the Mentor Initiative, which provides disease control, and the Syria Relief Network, a group of humanitarian organizations","that number would have been much higher"

  • ID: the identifier of the sentence you are labeling
  • sentence: the original sentence, as given in the provided dataset
  • antecedent: the assumption that contradicts the facts
  • consequence: the outcome that would follow if the antecedent were true
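
In both example rows above, the antecedent and consequence are copied verbatim from the sentence, and the first row leaves the last field empty. Assuming that pattern holds in general (this page does not state it as a rule), a row could be checked before it is written, as in the sketch below. The helper `subtask2_row`, the ID, and the sentence are all hypothetical.

```python
import csv
import io


def subtask2_row(sentence_id, sentence, antecedent, consequence=""):
    """Return one row in the column order ID, sentence, antecedent, consequence."""
    for name, span in (("antecedent", antecedent), ("consequence", consequence)):
        # An empty span is allowed; a non-empty one must be copied from the sentence.
        if span and span not in sentence:
            raise ValueError(f"{name} is not a verbatim span of the sentence")
    return [sentence_id, sentence, antecedent, consequence]


sentence = "If the bridge had been inspected, the collapse would have been avoided."
row = subtask2_row(
    "0000001",  # hypothetical ID
    sentence,
    antecedent="If the bridge had been inspected",
    consequence="the collapse would have been avoided",
)

buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(row)
print(buffer.getvalue(), end="")
```

Rejecting spans that are not substrings catches the common mistake of paraphrasing or re-tokenizing the text before writing it out.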


Evaluation Method

Participants must take part in both subtasks. The evaluation metrics that will be applied are:

Subtask 1: precision, recall, and F1

Subtask 2: Exact Match and F1
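
This page names the metrics but does not define them, so the following is only a sketch of one common reading, not the official scorer. It assumes that label 1 is the positive class for the classification metrics, and that the span F1 is computed over whitespace-separated tokens, as is often done for span extraction. All function names are hypothetical.

```python
from collections import Counter


def precision_recall_f1(gold, predicted, positive=1):
    """Binary precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def exact_match(gold_span, predicted_span):
    """1.0 if the two spans are identical after trimming whitespace, else 0.0."""
    return float(gold_span.strip() == predicted_span.strip())


def token_f1(gold_span, predicted_span):
    """Token-overlap F1 between two spans (whitespace tokenization, an assumption)."""
    gold_tokens = gold_span.split()
    predicted_tokens = predicted_span.split()
    if not gold_tokens or not predicted_tokens:
        # Two empty spans count as a match; one empty span counts as a miss.
        return float(gold_tokens == predicted_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(predicted_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Two of three predicted positives are correct, and two of three gold positives are found.
print(precision_recall_f1([1, 0, 1, 1], [1, 1, 0, 1]))
# A prediction that drops the first word of the gold consequent is not an exact match.
print(exact_match("that number would have been much higher",
                  "number would have been much higher"))
print(token_f1("that number would have been much higher",
               "number would have been much higher"))
```

Exact Match rewards only a perfect span, while the token-level F1 gives partial credit for overlapping predictions, which is why the two are usually reported together.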

Terms & Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to redistribute the test data except in the manner prescribed by its license.

Baseline (not available yet)

The hybrid SVM baseline (subtask1)

Instructions for the subtask1 baseline.

The NER model baseline (subtask2)

Instructions for the subtask2 baseline.

You are free to build a system from scratch using any available software packages and resources, as long as they do not violate the spirit of fair competition. To help you test ideas, we also provide the hybrid SVM baseline, which you can build on. Use of this system is completely optional. The system is available.

Practice (Training data ready)

Start: Sept. 1, 2019, midnight

Evaluation (Test data ready)

Start: Jan. 10, 2020, midnight


Start: Jan. 31, 2020, midnight

Competition Ends

