SemEval-2020 Task5: Modelling Causal Reasoning in Language: Detecting Counterfactuals

Organized by Ariel_yang



Contact Us

To contact our competition organizers, please email us. Our email address:

(1)  (2)


Task Description

To model counterfactual semantics and reasoning in natural language, our shared task aims to provide a benchmark for two basic problems.

  • Subtask1: Detecting counterfactual statements

In this task, you are asked to determine whether a given statement is counterfactual or not. Counterfactual statements describe events that did not actually happen or cannot happen, as well as the possible consequences if those events had happened. More specifically, counterfactuals describe events counter to facts and hence naturally involve common sense, knowledge, and reasoning. Tackling this problem is the basis for all downstream counterfactual-related causal inference analysis in natural language. For example, the following statements, one from the healthcare domain and one from the finance domain, are counterfactuals that need to be detected:

  • Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.
  • Finance Minister Jose Antonio Meade noted that if a jump in tomato prices had been factored out, inflation would have begun to drop.
While the above examples were chosen for clarity of demonstration, real statements are harder for computers to judge.
  • Subtask2: Detecting antecedent and consequence

Conveying causal insight is an inherent characteristic of counterfactuals. To further detect the causal knowledge conveyed in counterfactual statements, subtask 2 aims to locate the antecedent and consequent in counterfactuals.
According to (Nelson Goodman, 1947. The problem of counterfactual conditionals), a counterfactual statement can be converted to a contrapositive with a true antecedent and consequent. Consider the “post-traumatic stress” example discussed above; it can be transposed into “because her post-traumatic stress was not avoided, (we know) a combination of paroxetine and exposure therapy was not prescribed”. Such knowledge can be used not only for analyzing the specific statement but can also be accumulated across corpora to develop domain causal knowledge (e.g., a combination of paroxetine and exposure therapy may help cure post-traumatic stress).
Please note that in some cases a counterfactual statement contains only an antecedent and no consequent. For example, in the sentence "Frankly, I wish he had issued this order two years ago instead of this year", only the antecedent can be identified. In subtask 2, when a sentence has no consequent, please set both the consequent starting index and the consequent ending index (character indices) to '-1'. For details, please refer to the 'Evaluation' section on this website.
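The index convention described above can be illustrated with a minimal Python sketch. It assumes 0-based character indices and an inclusive end index; both are assumptions, so please confirm them against the released data. The helper name `find_span` and the choice of antecedent text in the usage note are hypothetical and not part of the task materials.

```python
from typing import Optional, Tuple


def find_span(sentence: str, fragment: Optional[str]) -> Tuple[int, int]:
    """Return (start, end) character indices of `fragment` inside `sentence`.

    Assumptions (not confirmed by the task page): indices are 0-based and
    the end index is inclusive. When there is no fragment, or it cannot be
    found, (-1, -1) is returned, matching the '-1' convention for a
    missing consequent.
    """
    if not fragment:
        return (-1, -1)
    start = sentence.find(fragment)
    if start == -1:
        return (-1, -1)
    return (start, start + len(fragment) - 1)
```

For instance, calling `find_span(sentence, "he had issued this order two years ago")` on the "Frankly, I wish ..." sentence yields the antecedent indices under these assumptions, while `find_span(sentence, None)` yields `(-1, -1)` for the missing consequent.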

Important Dates

The important dates have been updated as below according to the updated SemEval-2020 schedule. For details, please refer to the official SemEval-2020 website.

19 February 2020: Evaluation start (will release the test data then)*
  • Test data for subtask-1 will be released on Feb 19, 2020.   (Subtask-1 Evaluation: Feb 19 to Feb 29 in 2020)
  • Test data for subtask-2 will be released on Mar 1, 2020.     (Subtask-2 Evaluation: Mar 1 to Mar 11 in 2020)
  • Note: we plan to limit submissions in the evaluation phase: each participant may submit at most 10 times per subtask. Please make sure your file format is correct beforehand by submitting to 'Practise-Subtask1' and 'Practise-Subtask2' on CodaLab. To practice, go to 'Participate -> Submit/View Results -> Practise-Subtask1'.
11 March 2020: Evaluation end*
18 March 2020: Results posted
17 April 2020: System description paper submissions due
24 April 2020: Task description paper submissions due
10 Jun 2020: Author notifications
1 Jul 2020: Camera-ready submissions due
13-14 September 2020:  SemEval 2020

Submission Details & Evaluation Criteria

We provide separate datasets for task-1 and task-2; each includes train.csv and test.csv.

Please note that, to ensure fairness, you may only use the task-1 dataset to build models for task-1 and the task-2 dataset to build models for task-2.

Here we provide two example zip files to show the submission format. In 'Participate -> Submit/View Results -> Practise-Subtask1' or '...->Practise-Subtask2', you can also submit your own results to verify the format.

A valid submission zip file for CodaLab contains exactly one of the following files:

  • subtask1.csv (only submitted to "xxx-Subtask1" section)
  • subtask2.csv (only submitted to "xxx-Subtask2" section)

* A .csv file with an incorrect file name (names are case-sensitive) will not be accepted.

* A zip file containing both files will not be accepted.

* Only .zip files are accepted; bare .csv files and .rar archives are not.

* Please zip your results file (e.g. subtask1.csv) directly, rather than putting it into a folder and zipping the folder.
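The packaging rules above can be followed with Python's standard `zipfile` module. This is only a sketch of one way to do it: the function name `make_submission_zip` is hypothetical, and the name check simply mirrors the two file names listed above.

```python
import zipfile
from pathlib import Path

# File names accepted by the submission rules (case-sensitive).
ALLOWED_NAMES = {"subtask1.csv", "subtask2.csv"}


def make_submission_zip(csv_path, zip_path):
    """Zip a single results file at the archive root (no enclosing folder)."""
    csv_path = Path(csv_path)
    if csv_path.name not in ALLOWED_NAMES:
        raise ValueError(
            "results file must be named subtask1.csv or subtask2.csv, "
            "got %r" % csv_path.name
        )
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname drops any directory part, so the CSV sits at the top level.
        archive.write(csv_path, arcname=csv_path.name)
    return zip_path
```

Because each archive is built from one file only, a zip containing both subtask files cannot be produced by accident.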

Submission format for task1

For the pred_label, '1' refers to counterfactual while '0' refers to non-counterfactual. The 'sentenceID' should be in the same order as in 'test.csv' for subtask-1 (in evaluation phase).

sentenceID pred_label
322893 1
322892 0
... ... 
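A results file in this layout could be written as sketched below. The comma delimiter and the header row `sentenceID,pred_label` are assumptions based on the .csv extension and the column names shown above; the example zip files provided by the organizers are the authoritative reference. The function name `write_subtask1` is hypothetical.

```python
import csv


def write_subtask1(test_ids, predictions, out):
    """Write subtask-1 predictions in the order given by `test_ids`.

    test_ids:    sentence IDs in the same order as test.csv
    predictions: mapping from sentence ID to 0 or 1
    out:         an open text file (or any file-like object)
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sentenceID", "pred_label"])  # assumed header row
    for sentence_id in test_ids:
        label = predictions[sentence_id]
        if label not in (0, 1):
            raise ValueError("pred_label must be 0 or 1, got %r" % (label,))
        writer.writerow([sentence_id, label])
```

Iterating over `test_ids` rather than over the prediction mapping keeps the rows in test.csv order, as required above.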





Submission format for task2

If a sentence has no consequent part (a counterfactual statement does not always contain one), please put '-1' in both 'consequent_startid' and 'consequent_endid'. The 'sentenceID' values should be in the same order as in 'test.csv' for subtask-2 (in the evaluation phase).

sentenceID antecedent_startid antecedent_endid consequent_startid consequent_endid
104975 15 72 88 100
104976 18 38 -1 -1
... ... ... ... ...
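The two sample rows above can be produced with a small helper like the following sketch. The function name `subtask2_row` and its sanity checks are illustrative assumptions, not part of the official tooling.

```python
def subtask2_row(sentence_id, antecedent, consequent=None):
    """Build one subtask-2 submission row.

    antecedent: (start, end) character indices of the antecedent
    consequent: (start, end) character indices, or None when the
                sentence has no consequent (written as -1, -1)
    """
    a_start, a_end = antecedent
    c_start, c_end = consequent if consequent is not None else (-1, -1)
    if not 0 <= a_start <= a_end:
        raise ValueError("invalid antecedent span: %r" % (antecedent,))
    if (c_start, c_end) != (-1, -1) and not 0 <= c_start <= c_end:
        raise ValueError("invalid consequent span: %r" % (consequent,))
    return [sentence_id, a_start, a_end, c_start, c_end]


rows = [
    subtask2_row("104975", (15, 72), (88, 100)),
    subtask2_row("104976", (18, 38)),  # no consequent part
]
```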





Example of train.csv for subtask1

sentenceID,gold_label,sentence

"6000627","1","Had Russia possessed such warships in 2008, boasted its naval chief, Admiral Vladimir Vysotsky, it would have won its war against Georgia in 40 minutes instead of 26 hours."

  • sentenceID: the identifier of the sentence
  • gold_label: 1 if the sentence is counterfactual, 0 otherwise
  • sentence: the original sentence as provided in the dataset
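A row like the one above can be read with Python's `csv` module, which handles the quoted sentence and its embedded commas. The header line in `SAMPLE` is an assumption inferred from the field list above, and `read_subtask1` is a hypothetical helper name.

```python
import csv
import io

# One training row from the example above, preceded by an assumed header.
SAMPLE = (
    'sentenceID,gold_label,sentence\n'
    '"6000627","1","Had Russia possessed such warships in 2008, boasted its '
    'naval chief, Admiral Vladimir Vysotsky, it would have won its war '
    'against Georgia in 40 minutes instead of 26 hours."\n'
)


def read_subtask1(fileobj):
    """Return a list of (sentenceID, gold_label, sentence) tuples."""
    rows = []
    for record in csv.DictReader(fileobj):
        rows.append(
            (record["sentenceID"], int(record["gold_label"]), record["sentence"])
        )
    return rows


examples = read_subtask1(io.StringIO(SAMPLE))
```

With a real file, `open("train.csv", newline="", encoding="utf-8")` would take the place of the `io.StringIO` wrapper; the encoding is also an assumption.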

Example of train.csv for subtask2

sentenceID,sentence,domain,antecedent_startid,antecedent_endid,consequent_startid,consequent_endid

3S0001,"For someone who's so emotionally complicated, who could have given up many times if he was made of straw - he hasn't.",Health,83,105,48,81

  • sentenceID: the identifier of the sentence
  • sentence: the original sentence as provided in the dataset
  • domain: the domain the sentence relates to
  • antecedent_startid: the character index in the original sentence where the antecedent starts
  • antecedent_endid: the character index in the original sentence where the antecedent ends
  • consequent_startid: the character index in the original sentence where the consequent starts (-1 if the sentence has no consequent)
  • consequent_endid: the character index in the original sentence where the consequent ends (-1 if the sentence has no consequent)
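Given a parsed row with the fields listed above, the annotated text can be recovered by slicing the sentence. The sketch below assumes 0-based indices with an inclusive end, and uses a made-up sentence with hand-computed indices rather than a row from the dataset; `extract_spans` is a hypothetical helper.

```python
def extract_spans(row):
    """Return (antecedent_text, consequent_text) for one parsed row.

    `row` is a dict keyed by the column names in the field list above.
    The consequent is None when its indices are -1.
    Assumes 0-based, end-inclusive character indices.
    """
    sentence = row["sentence"]

    def cut(start_key, end_key):
        start, end = int(row[start_key]), int(row[end_key])
        if start == -1 or end == -1:
            return None
        return sentence[start:end + 1]

    antecedent = cut("antecedent_startid", "antecedent_endid")
    consequent = cut("consequent_startid", "consequent_endid")
    return antecedent, consequent


# Illustrative row (not taken from the dataset).
demo_row = {
    "sentence": "If it had rained, we would have stayed.",
    "antecedent_startid": "0",
    "antecedent_endid": "15",
    "consequent_startid": "18",
    "consequent_endid": "37",
}
```

Printing the extracted spans for a few training rows is a quick way to check which index convention the released data actually uses.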


Evaluation Method

Participants must take part in both subtasks. The following evaluation metrics will be applied:

  • Subtask1: Precision, Recall, and F1

The evaluation script will verify whether the predicted binary "label" is the same as the desired "label" which is annotated by human workers, and then calculate its precision, recall, and F1 scores.
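For reference, precision, recall, and F1 for binary labels can be computed as in the sketch below. It assumes the counterfactual class (label 1) is treated as the positive class; the official evaluation script may differ in details such as averaging, and `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(gold, pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```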

  • Subtask2: Exact Match, Precision, Recall, and F1

Exact Match is the percentage of your predictions in which both the antecedent and the consequent exactly match the desired output annotated by human workers.
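One possible reading of Exact Match is sketched below: a sentence counts as a hit only when all four indices agree with the annotation. This interpretation, and the function name `exact_match`, are assumptions; the baseline's evaluation code is the reference.

```python
def exact_match(gold_rows, pred_rows):
    """Percentage of sentences whose four indices all match the gold ones.

    Each row is (antecedent_startid, antecedent_endid,
                 consequent_startid, consequent_endid),
    with gold and predicted rows listed in the same sentence order.
    """
    if len(gold_rows) != len(pred_rows):
        raise ValueError("gold and predicted rows must align one to one")
    if not gold_rows:
        return 0.0
    hits = sum(1 for g, p in zip(gold_rows, pred_rows) if tuple(g) == tuple(p))
    return 100.0 * hits / len(gold_rows)
```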

F1 score is a token level metric and will be calculated according to the submitted antecedent_startid, antecedent_endid, consequent_startid, consequent_endid. Please refer to our baseline model for evaluation details. 
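To give a rough idea of how a token-level F1 can be derived from character indices, here is a sketch for a single span. It assumes whitespace tokenization, end-inclusive indices, and that a token belongs to a span if it overlaps it at all; none of these choices is confirmed by this page, and the baseline's evaluation code remains the reference. The names `span_tokens` and `token_f1` are hypothetical.

```python
import re


def span_tokens(sentence, start, end):
    """Indices of whitespace-delimited tokens overlapping [start, end]."""
    if start == -1 or end == -1:
        return set()
    return {
        index
        for index, match in enumerate(re.finditer(r"\S+", sentence))
        if match.start() <= end and match.end() - 1 >= start
    }


def token_f1(sentence, gold_span, pred_span):
    """Token-level F1 between one gold span and one predicted span."""
    gold = span_tokens(sentence, *gold_span)
    pred = span_tokens(sentence, *pred_span)
    if not gold and not pred:
        return 1.0  # both agree that the span is absent
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that covers three of four gold tokens and nothing else has precision 1.0 and recall 0.75.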

Terms & Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to redistribute the test data except in the manner prescribed by its license.


Baselines

  • Task 1: SVM model
  • Task 2: Sequence Labeling model (with eval method)

You are free to build a system from scratch using any available software packages and resources, as long as they are not against the spirit of fair competition. 

Xiaodan Zhu, Queen's University

Xiaoyu Yang, Queen's University

Huasha Zhao, Alibaba Group

Qiong Zhang, Alibaba Group

Stan Matwin, Dalhousie University


We also kindly thank Jiaqi Li, Qianyu Zhang, Stephen Obadinma, Xiao Chu and Rohan for their help and effort in this project. 


Start: Sept. 1, 2019, midnight

Description: In the practice stage, you can simply upload your results to confirm the submission format. The labels in our reference data come from 'train.csv' of subtask-1.


Start: Sept. 1, 2019, midnight


Start: Feb. 19, 2020, midnight


Start: March 1, 2020, midnight


Start: March 12, 2020, midnight


Start: March 12, 2020, midnight

Competition Ends

Sept. 14, 2020, midnight
