SemEval 2020 Task 4 - Commonsense Validation and Explanation (ComVE)

Organized by Shuailong




Welcome to the Commonsense Validation and Explanation (ComVE) Challenge!

The task directly tests whether a system can differentiate natural language statements that make sense from those that do not. We designed three subtasks. The first is to choose, from two natural language statements with similar wording, which one makes sense and which one does not. The second is to find the key reason, from three options, why a given statement does not make sense. The third asks the machine to generate the reasons, which we evaluate with BLEU.

Formally, each instance in our dataset is composed of 8 sentences: {s1, s2, o1, o2, o3, r1, r2, r3}. s1 and s2 are two similar statements that share the same syntactic structure and differ by only a few words, but only one of them makes sense. They are used in our first subtask, Validation, which requires the model to identify which one makes sense. For the against-common-sense statement (s1 or s2), we have three optional sentences o1, o2 and o3 to explain why the statement does not make sense. Our second subtask, Explanation (Multi-Choice), requires the single correct reason to be identified among two confusing ones. For the same against-common-sense statement, our third subtask, Explanation (Generation), asks participants to generate the reason why it does not make sense. The three reference reasons r1, r2 and r3 are used for evaluating the third subtask.
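As a rough illustration of the instance structure described above, the sketch below models one instance in Python using the elephant example from this page. The class name, the field names and the two label fields (false_index, correct_option) are hypothetical and are not taken from the released data format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ComVEInstance:
    """One instance: two statements, three options, three reference reasons."""
    s1: str
    s2: str
    options: Tuple[str, str, str]      # o1, o2, o3 (Subtask B candidates)
    references: Tuple[str, str, str]   # r1, r2, r3 (Subtask C references)
    false_index: int      # hypothetical label: 0 if s1 is nonsensical, 1 if s2 is
    correct_option: int   # hypothetical label: index of the correct option

    @property
    def false_statement(self) -> str:
        """The statement that is against common sense."""
        return (self.s1, self.s2)[self.false_index]

    def sentences(self) -> Tuple[str, ...]:
        """All sentences of the instance, in the order s1, s2, o1..o3, r1..r3."""
        return (self.s1, self.s2, *self.options, *self.references)


example = ComVEInstance(
    s1="He put a turkey into the fridge.",
    s2="He put an elephant into the fridge.",
    options=(
        "An elephant is much bigger than a fridge.",
        "Elephants are usually gray while fridges are usually white.",
        "An elephant cannot eat a fridge.",
    ),
    references=(
        "An elephant is much bigger than a fridge.",
        "A fridge is much smaller than an elephant.",
        "Most of the fridges aren’t large enough to contain an elephant.",
    ),
    false_index=1,
    correct_option=0,
)
```

Subtask A only needs s1 and s2, Subtask B needs the nonsensical statement plus the options, and Subtask C needs the nonsensical statement plus the references for scoring.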


Task A: Validation
Task: Which statement of the two is against common sense?
Statement1: He put a turkey into the fridge.
Statement2: He put an elephant into the fridge.
Task B: Explanation (Multi-Choice)
Task: Select the most appropriate reason why this statement is against common sense.
Statement: He put an elephant into the fridge.
A: An elephant is much bigger than a fridge.
B: Elephants are usually gray while fridges are usually white.
C: An elephant cannot eat a fridge.
Task C: Explanation (Generation)
Task: Generate the reason why this statement is against common sense. We will use BLEU to evaluate it.
Statement: He put an elephant into the fridge.
Referential Reasons:
1. An elephant is much bigger than a fridge.
2. A fridge is much smaller than an elephant.
3. Most of the fridges aren’t large enough to contain an elephant.


For more detailed information, please refer to this link.

Please contact the task organisers or post on the competition forum if you have any further queries.

You can use the following if you want to use our dataset or cite our work:

Cunxiang Wang, Shuailong Liang, Yili Jin, Yi-long Wang, Xiaodan Zhu, and Yue Zhang. 2020. SemEval-2020 Task 4: Commonsense Validation and Explanation. In Proceedings of The 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics.

@inproceedings{wang-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 4: Commonsense Validation and Explanation",
    author = "Wang, Cunxiang  and
      Liang, Shuailong  and
      Jin, Yili  and
      Wang, Yilong  and
      Zhu, Xiaodan  and
      Zhang, Yue",
    booktitle = "Proceedings of The 14th International Workshop on Semantic Evaluation",
    year = "2020",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{wang-etal-2019-make,
    title = "Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation",
    author = "Wang, Cunxiang  and
      Liang, Shuailong  and
      Zhang, Yue  and
      Li, Xiaonan  and
      Gao, Tian",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "4020--4026",
    abstract = "Introducing common sense to natural language understanding systems has received increasing research attention. It remains a fundamental question on how to evaluate whether a system has the sense-making capability. Existing benchmarks measure common sense knowledge indirectly or without reasoning. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense-making.",
}


The sense-making task consists of three subtasks. Participating teams should take part in at least one of them. Relevant scripts and datasets are available on GitHub.

Task A and Task B are evaluated by accuracy, and Task C is evaluated using BLEU. To make the evaluation of Task C more reliable, we will also run a human evaluation on a random subset of the test set for the systems with relatively high BLEU scores.
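To give a feel for how a generated reason can be scored against the three reference reasons, here is a minimal sentence-level BLEU sketch in plain Python. It is not the official evaluation script: the whitespace tokenization, lowercasing and add-one smoothing are all assumptions made for this illustration, so its numbers will not match the official scores.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, references: List[str], max_n: int = 4) -> float:
    """Smoothed sentence-level BLEU against several references (a sketch)."""
    hyp = hypothesis.lower().split()            # assumed tokenization
    refs = [r.lower().split() for r in references]
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref: Counter = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on short sentences.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the reference closest in length to the hypothesis.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


references = [
    "An elephant is much bigger than a fridge.",
    "A fridge is much smaller than an elephant.",
    "Most of the fridges aren’t large enough to contain an elephant.",
]
good = sentence_bleu("An elephant is much bigger than a fridge.", references)
poor = sentence_bleu("Elephants cannot eat fridges", references)
```

A hypothesis identical to one of the references gets the maximum score, while one that shares few n-grams with any reference scores much lower. This sensitivity to surface overlap is one reason a human evaluation is used alongside BLEU.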


The Evaluation Tool will evaluate each sample by its id rather than its order. Therefore it is important to use the original sample id when you submit.
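The id-based matching described above can be sketched as follows. This is only an illustration, not the official Evaluation Tool: the function name and the sample ids are made up, and counting a missing prediction as wrong is an assumption.

```python
from typing import Dict


def accuracy_by_id(gold: Dict[str, str], pred: Dict[str, str]) -> float:
    """Accuracy where predictions are matched to references by sample id.

    Assumption: a gold id with no prediction counts as an error.
    """
    if not gold:
        raise ValueError("gold labels are empty")
    correct = sum(
        1 for sample_id, label in gold.items() if pred.get(sample_id) == label
    )
    return correct / len(gold)


# Hypothetical ids and labels; the order of the predictions does not matter.
gold = {"1175": "0", "452": "1", "275": "1"}
pred = {"275": "1", "1175": "0", "452": "0"}
score = accuracy_by_id(gold, pred)
```

Because the lookup goes through the id, shuffling the rows of a submission leaves the score unchanged, whereas renumbering the rows breaks the match.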

Submitted systems

  • Teams are allowed to use the development set for training.
  • Teams can use additional resources such as pretrained language models, knowledge bases etc.
  • Only one final submission will be recorded per team. The CodaLab website will only show an updated submission if its results are higher.


  • All data for this task is released under the CC BY-SA 4.0 License (the license can also be found with the data).
  • Organizers of the competition might choose to publicize, analyze and change in any way any content submitted as a part of this task. Wherever appropriate, an academic citation for the submitting group will be added (e.g. in a paper summarizing the task).

The teams wishing to participate in SemEval 2020 should strictly adhere to the following deadlines.

Task Schedule for SemEval2020 (Updated April 1st 2020)

  • 19 February 2020: Evaluation start*
  • 11 March 2020: Evaluation end*
  • 18 March 2020: Results posted
  • 15 May 2020, 23:59 UTC-12: System description paper submissions due
  • 22 May 2020: Task description paper submissions due
  • 24 Jun 2020: Author notifications
  • 8 Jul 2020: Camera ready submissions due
  • 12-13 December 2020:  SemEval 2020

Competitors should comply with the general rules of SemEval.

The organizers are free to penalize or disqualify participants for any violation of the above rules, or for misuse, unethical behaviour or other behaviour they agree is not acceptable in a scientific competition in general and in this one in particular.


Submission phases

Practice Phase

In this phase, feel free to familiarize yourself with the task, the input data format, the submission data format, and the submission process.

Evaluation Phase

For the formal evaluation phase, train your models on our provided training set, use our dev set if you need it, and make predictions on the formal test set. You are also welcome to use any external resources or pretrained models. Results will not be shown on the leaderboard until the end of the evaluation period. To avoid data leakage between subtasks, each subtask has its own phase. The evaluation of Subtask A, which is to choose the sensical statement, is released first. Subtask C, which is to generate the reason why the nonsensical statement does not make sense, is released after Subtask A. Then Subtask B, which is to choose the correct reason out of the three candidate reasons, is released. You are not required to take part in every subtask. The evaluation for each subtask lasts one week. To take part in a particular subtask, simply wait for its evaluation phase to begin.

Submission format

Please refer to Participate -> Files -> Starting Kit for submission file format as well as everything you need to know to make a valid submission.


For the scores, if the score shows "0.0" in your submission list below the submission box, do not be alarmed. Click "Download output from scoring step"; the download contains a file named "scores.txt" that shows your score for all three subtasks. If you click "Submit to Leaderboard" for this submission, you can also check the score under the "Results" tab.

Q: I cannot see my scores during the formal evaluation period, why?

You are not allowed to see your results during the evaluation period; they will be available after the evaluation. This is to prevent model tuning on the test set. See the rules here.

Q: I got "ERROR:root:Found 1021 extra predictions, for example: 1, 3, 4", why?

This probably happens because you ignored the sample id and used the row number as a new id. Please note that the original sample id is essential for the evaluation to match each prediction to the correct reference. The trial data just uses sequential ids, but for the test data you need to use the original sample id.
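Before submitting, a quick local check can reveal this kind of id mismatch. The sketch below compares the ids in a prediction file with the ids of the test data; the function name and the sample ids are hypothetical, and it does not reproduce the official scoring code or its exact messages.

```python
from typing import Iterable, List, Tuple


def check_ids(
    reference_ids: Iterable[str], predicted_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (extra, missing): ids only in the predictions, ids only in the references."""
    ref = set(reference_ids)
    pred = set(predicted_ids)
    return sorted(pred - ref), sorted(ref - pred)


# Hypothetical test ids, and a submission that wrongly used row numbers as ids.
test_ids = ["1175", "452", "275"]
row_number_ids = ["1", "2", "3"]
extra, missing = check_ids(test_ids, row_number_ids)
```

If both lists come back empty, every prediction carries an id that exists in the test data and no test sample was left without a prediction.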

Q: How will you evaluate the three tasks?

A: The evaluation will last for 3 weeks. In the first week, we will provide the test data of Subtask A (a pair of statements per case). In the second week, we will close the evaluation of Subtask A and provide the test data of Subtask C (the nonsensical statements in Subtask A). In the last week, we will close the evaluation of Subtask C and provide the test data of Subtask B (the nonsensical statements and the three possible options to explain why each does not make sense).

Q: Can I use all subtasks to do multitask learning?

A: In the training period, you can use any subtask to help train any other subtask as long as you think it is beneficial. In the evaluation period, you can use the data of Subtask A to help Subtask B and Subtask C. However, you cannot use Subtask B or Subtask C to help Subtask A, since we will close the evaluation of Subtask A before releasing the test data of Subtasks B and C. You cannot use Subtask B to help Subtask C for the same reason.

Q: Can I use external databases?

A: External data, including knowledge graphs and raw texts, can be used if you think they will help.

Q: I have found some annotation errors in your dataset, what shall I do?

A: We have opened an issue on GitHub about errors in the training data. Everyone is welcome to submit issues on the dataset there.

Q: My system has achieved really high performance on the trial data. Will the test set include the trial data?

A: Most of the trial data has been included in the training set, so it is natural to achieve good performance. However, the test set will not overlap with the trial data or the training data. The test data will be more carefully written and checked, and may be a bit more difficult than the training data.

Q: I missed the deadline of one subtask / I update my model and get better results after the competition etc., can I evaluate the result and write it in the system description paper?

A: Short answer: yes. After the competition, we will open the evaluation of all subtasks, and you can submit your results then. You can describe any updated models or results in the system description paper, but you must make it clear that they were not submitted to the SemEval competition.

Q: Will you choose the 'last' result or the 'best' result of my submissions?

A: After some internal discussion among the organizers, we decided that the 'best' result of your submissions will be used as the final valid entry.

Q: Could you tell me the acceptance rate of past SemEval system description papers?

A: As far as we know, SemEval is very inclusive. Unless a paper is badly written, SemEval does not often reject papers. Therefore, we encourage you to submit a paper even if you did not get a very high score, as long as your work reflects novel thinking and may inspire others.


Start: Aug. 15, 2019, midnight

Description: Practice phase: submit results on the trial data to get a feel for the data and the task.

Evaluation - Subtask A

Start: Feb. 19, 2020, midnight

Description: Evaluation phase: train your model on the official training set; you may also use the official validation set during training. Feel free to use additional resources such as knowledge bases. Submit results on the official test data to take part in the competition. Note that only the final valid submission on CodaLab will be taken as the official submission to the competition.

Evaluation - Subtask C

Start: Feb. 26, 2020, midnight

Evaluation - Subtask B

Start: March 4, 2020, midnight


Start: March 11, 2020, 11:59 p.m.

Competition Ends

