SemEval-2020 Task5: Modelling Causal Reasoning in Language: Detecting Counterfactuals

Organized by Ariel_yang


Sept. 1, 2019, midnight UTC


Sept. 1, 2019, midnight UTC


Jan. 10, 2020, midnight UTC

SemEval-2020 Task 5: Detecting Counterfactual Statements

To contact our competition organizer, please email us. Our email address is

Or you could also contact our organizers by emailing to

To model counterfactual semantics and reasoning in natural language, our shared task aims to provide a benchmark for two basic problems.

Subtask1: Detecting counterfactual statements

In this task, you are asked to determine whether a given statement is counterfactual or not. Counterfactual statements describe events that did not actually happen or cannot happen, as well as the possible consequences if those events had happened. More specifically, counterfactuals describe events counter to facts and hence naturally involve common sense, knowledge, and reasoning. Tackling this problem is the basis for all downstream counterfactual-related causal inference analysis in natural language. For example, the following statements, one from the healthcare domain and one from the finance domain, are counterfactuals that need to be detected:

  • Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.
  • Finance Minister Jose Antonio Meade noted that if a jump in tomato prices had been factored out, inflation would have begun to drop.
While the above examples were chosen for clarity, real statements are harder for computers to judge.

Subtask2: Detecting antecedent and consequent

Indicating causal insight is an inherent characteristic of counterfactuals. To further detect the causal knowledge conveyed in counterfactual statements, subtask 2 aims to locate the antecedent and consequent in counterfactuals.
According to Goodman (Nelson Goodman, 1947. The problem of counterfactual conditionals), a counterfactual statement can be converted to a contrapositive with a true antecedent and consequent. Consider the “post-traumatic stress” example discussed above; it can be transposed into “because her post-traumatic stress was not avoided, (we know) a combination of paroxetine and exposure therapy was not prescribed”. Such knowledge can not only be used for analyzing the specific statement but also be accumulated across corpora to develop domain causal knowledge (e.g., a combination of paroxetine and exposure therapy may help cure post-traumatic stress).
Please note that in some cases a counterfactual statement has only an antecedent part and no consequent part. For example, in the sentence "Frankly, I wish he had issued this order two years ago instead of this year", only the antecedent part can be identified. In subtask 2, when locating the antecedent and consequent parts, please set '-1' as the consequent starting index (character index) and ending index (character index) to indicate that there is no consequent part in the sentence. For details, please refer to the 'Evaluation' section on this website.

Evaluation Criteria

We provide datasets for task-1 and task-2 respectively, and both will include train.csv and test.csv. 

Please note that, to ensure fairness, you may only use the task-1 dataset to build models for task-1 and the task-2 dataset to build models for task-2.

A valid submission zip file for CodaLab contains one of the following files:

  • subtask1.csv (only submitted to "xxx-Subtask1" section)
  • subtask2.csv (only submitted to "xxx-Subtask2" section)

* A .csv file with an incorrect file name will not be accepted (file names are case-sensitive).

* A zip file containing both files will not be accepted.

* Neither .csv nor .rar files will be accepted; only .zip files are accepted.

* Please zip your results file (e.g. subtask1.csv) directly without putting it into a folder and zipping the folder.
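The packaging rules above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the helper name make_submission_zip and the archive name submission.zip are assumptions chosen for the example.

```python
import zipfile
from pathlib import Path

# File names accepted by the task page (case-sensitive).
ALLOWED_NAMES = {"subtask1.csv", "subtask2.csv"}


def make_submission_zip(csv_path, zip_path):
    """Zip one results file at the archive root (hypothetical helper)."""
    csv_path = Path(csv_path)
    if csv_path.name not in ALLOWED_NAMES:
        raise ValueError(f"unexpected file name: {csv_path.name}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname drops directory components, so the file is not nested in a folder.
        archive.write(csv_path, arcname=csv_path.name)
    return zip_path
```

Calling make_submission_zip("results/subtask1.csv", "submission.zip") would produce an archive whose only entry is subtask1.csv, regardless of the folder the file was stored in.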

Submission format for task1

For the pred_label, '1' refers to counterfactual while '0' refers to non-counterfactual. The 'sentenceID' should be in the same order as in 'test.csv' for subtask-1 (in evaluation phase).

sentenceID pred_label
322893 1
322892 0
... ... 
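
A results file in this layout could be written as sketched below. The page does not state the delimiter or whether a header row is required; this sketch assumes a comma-separated file with a header, and the helper name write_subtask1 is hypothetical.

```python
import csv
import io


def write_subtask1(predictions, out_file):
    """Write (sentenceID, pred_label) pairs in the order given (hypothetical helper)."""
    writer = csv.writer(out_file, lineterminator="\n")
    # Assumption: a header row with the two column names shown on the task page.
    writer.writerow(["sentenceID", "pred_label"])
    for sentence_id, label in predictions:
        if label not in (0, 1):
            raise ValueError(f"pred_label must be 0 or 1, got {label!r}")
        writer.writerow([sentence_id, label])


# Predictions must follow the sentence order of test.csv.
buffer = io.StringIO()
write_subtask1([("322893", 1), ("322892", 0)], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("subtask1.csv", "w", newline="") in place of the in-memory buffer.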





Submission format for task2

If there is no consequent part in a sentence (a counterfactual statement does not always have one), please put '-1' in both 'consequent_startid' and 'consequent_endid'. The 'sentenceID' should be in the same order as in 'test.csv' for subtask-2 (in evaluation phase).

sentenceID antecedent_startid antecedent_endid consequent_startid consequent_endid
104975 15 72 88 100
104976 18 38 -1 -1
... ... ... ... ...





Example of subtask1.csv


"6001846","0","Then there's the hyperbole: "If Congress does away with or reduces mandatory minimum sentences, they may as well fold the tent on drug prosecution as a whole." And "The current movement has no statistical support for revising the mandatorys; we are headed for a crime-ridden future." Hear that? Either give prosecutors whatever they say they need, without regard to justice or fairness or cost-effectiveness or any of those other namby-pamby ideals, or we might as well leave the prison doors open and let the crackheads come for you and your family."

"6000627","1","Had Russia possessed such warships in 2008, boasted its naval chief, Admiral Vladimir Vysotsky, it would have won its war against Georgia in 40 minutes instead of 26 hours."

  • sentenceID: indicating which sentence you are labeling
  • gold_label: put 1 if you judge the sentence to be counterfactual; otherwise put 0
  • sentence: the original sentence, as in the provided dataset

Example of subtask2.csv

sentenceID,sentence,domain,antecedent_startid,antecedent_endid,consequent_startid,consequent_endid

3S0001,"For someone who's so emotionally complicated, who could have given up many times if he was made of straw - he hasn't.",Health,83,105,48,81

  • sentenceID: indicating which sentence you are labeling
  • sentence: the original sentence, as in the provided dataset
  • domain: the domain the sentence belongs to
  • antecedent_startid: the character index in the original sentence where your predicted antecedent starts
  • antecedent_endid: the character index in the original sentence where your predicted antecedent ends
  • consequent_startid: the character index in the original sentence where your predicted consequent starts (if there is no consequent part, put -1 here)
  • consequent_endid: the character index in the original sentence where your predicted consequent ends (if there is no consequent part, put -1 here)
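
The index fields above can be illustrated with a short Python sketch. It assumes 0-based character indices with an inclusive end index; the page does not state this explicitly, although the span lengths in the example row above are consistent with an inclusive end (83 to 105 covers 23 characters, the length of "if he was made of straw"). The helper names and the sample sentence are hypothetical.

```python
def locate_span(sentence, span_text):
    """Return (start, end) character indices of span_text, or (-1, -1) if absent."""
    if span_text is None:
        return -1, -1
    start = sentence.find(span_text)
    if start == -1:
        return -1, -1
    # Assumption: the end index is inclusive.
    return start, start + len(span_text) - 1


def extract_span(sentence, start, end):
    """Return the text covered by [start, end], or None when the span is marked -1."""
    if start == -1 or end == -1:
        return None
    return sentence[start:end + 1]


sentence = "If it had rained, the match would have been cancelled."
antecedent_ids = locate_span(sentence, "If it had rained")
consequent_ids = locate_span(sentence, None)  # no consequent part
print(antecedent_ids, consequent_ids)
print(extract_span(sentence, *antecedent_ids))
```

Round-tripping a span through locate_span and extract_span is a cheap way to catch off-by-one errors before submitting.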


Evaluation Method

Participants must take part in both subtasks. The evaluation metrics that will be applied are:

  • Subtask1: Precision, Recall, and F1

The evaluation script will check whether each predicted binary label matches the gold label annotated by human workers, and then calculate precision, recall, and F1 scores.
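
For local checks, these three scores can be computed as in the sketch below. It is not the official evaluation script; it assumes that label 1 (counterfactual) is the positive class and that undefined ratios are reported as 0.0.

```python
def precision_recall_f1(gold, pred, positive=1):
    """Binary precision, recall, and F1 for the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Assumption: a ratio with a zero denominator is scored as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_labels = [1, 0, 1, 1, 0]
pred_labels = [1, 1, 0, 1, 0]
print(precision_recall_f1(gold_labels, pred_labels))
```

In this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.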

  • Subtask2: Exact Match, Precision, Recall, and F1

Exact Match is the percentage of your predictions in which both the antecedent and the consequent exactly match the desired output annotated by human workers.

The F1 score is a token-level metric calculated from the submitted antecedent_startid, antecedent_endid, consequent_startid, and consequent_endid. Please refer to our baseline model for evaluation details.
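
The sketch below shows one plausible reading of these two metrics; the authoritative definition is the one in the baseline's evaluation code, which may differ. It assumes whitespace tokenization, inclusive end indices, and that a token counts as part of a span if it overlaps it. All function names are hypothetical.

```python
import re


def tokens_in_span(sentence, start, end):
    """Indices of whitespace-delimited tokens overlapping [start, end] (inclusive)."""
    if start == -1 or end == -1:
        return set()
    return {
        i
        for i, match in enumerate(re.finditer(r"\S+", sentence))
        if match.start() <= end and match.end() - 1 >= start
    }


def span_f1(sentence, gold, pred):
    """Token-overlap F1 between a gold and a predicted (start, end) span."""
    gold_tokens = tokens_in_span(sentence, *gold)
    pred_tokens = tokens_in_span(sentence, *pred)
    if not gold_tokens and not pred_tokens:
        return 1.0  # assumption: two empty spans count as a perfect match
    overlap = len(gold_tokens & pred_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_rate(gold_rows, pred_rows):
    """Share of rows whose four indices all equal the gold indices."""
    if not gold_rows:
        return 0.0
    matches = sum(1 for g, p in zip(gold_rows, pred_rows) if tuple(g) == tuple(p))
    return matches / len(gold_rows)
```

For instance, if the gold antecedent covers four tokens and the prediction covers three of them and nothing else, span_f1 gives precision 1.0 and recall 0.75.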

Terms & Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to redistribute the test data except in the manner prescribed by its license.


  • Task 1: SVM model

  • Task 2: Sequence labeling model (with eval method)

You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. 

Xiaodan Zhu, Queen's University

Xiaoyu Yang, Queen's University

Huasha Zhao, Alibaba Group

Qiong Zhang, Alibaba Group

Stan Matwin, Dalhousie University


We also kindly thank Jiaqi Li, Qianyu Zhang, Stephen Obadinma, Xiao Chu and Rohan for their help and effort in this project. 


Start: Sept. 1, 2019, midnight

Description: In the practice stage, you can upload your results to confirm the submission format. The labels in our reference data come from 'train.csv' of subtask-1.


Start: Sept. 1, 2019, midnight


Start: Jan. 10, 2020, midnight


Start: Jan. 10, 2020, midnight


Start: Jan. 31, 2020, midnight


Start: Jan. 31, 2020, midnight

Competition Ends

