RumourEval 2019 (SemEval 2019 Task 7)

Organized by ggorrell



Welcome to RumourEval 2019!

The core mission is to automatically determine the veracity of rumours. The task falls into two parts: task A, in which responses to a rumourous post are classified according to stance, and task B, in which the statements themselves are classified for veracity. Each is described in more detail below.


Latest news! The competition has ended. Thank you to all teams who showed interest and made submissions. The final leaderboard is below.

User               Verif       RMSE    SDQC
quanzhi            0.5765 (1)  0.6078  0.5776
ukob-west          0.2856 (2)  0.7642  0.3740
sardar             0.2620 (3)  0.8012  0.4352
BLCU-nlp           0.2525      0.8179  0.6187
shaheyu            0.2284      0.8081  0.3053
ShivaliGoel        0.2244      0.8623  0.3625
mukundyr           0.2244      0.8623  0.3404
Xinthl             0.2238      0.8623  0.2297
lzr                0.2238      0.8678  0.3404
eebism             0.1845      0.7857  0.2530
Bilal.ghanem       0.1996      0.8264  0.4895
NimbusTwoThousand  0.0950      0.9148  0.1272
deanjjones         0.0000      0.0000  0.3267
jurebb             0.0000      0.0000  0.3537
z.zojaji           0.0000      0.0000  0.3875
lec-unifor         0.0000      0.0000  0.4384
magc               0.0000      0.0000  0.3927
Martin             0.0000      0.0000  0.6067
jacobvan           0.0000      0.0000  0.4792
wshuyi             0.0000      0.0000  0.3699
cjliux             0.0000      0.0000  0.4298

You can still submit for your own experimentation purposes.

You can also join the Google group for the task, where you will find answers to your questions.

Task A (SDQC)

Related to the objective of predicting a rumour's veracity, the first subtask will deal with the complementary objective of tracking how other sources orient to the accuracy of the rumourous story. A key step in the analysis of the surrounding discourse is to determine how other users in social media regard the rumour. We propose to tackle this analysis by looking at the replies to the post that presented the rumourous statement, i.e. the originating rumourous (source) post. We will provide participants with a tree-structured conversation formed of posts replying to the originating rumourous post, where each post presents its own type of support with regard to the rumour. We frame this in terms of supporting, denying, querying or commenting on (SDQC) the claim. Therefore, we introduce a subtask where the goal is to label the type of interaction between a given statement (rumourous post) and a reply post (the latter can be either direct or nested replies). Each tweet in the tree-structured thread will have to be categorised into one of the following four categories:

  • Support: the author of the response supports the veracity of the rumour they are responding to.
  • Deny: the author of the response denies the veracity of the rumour they are responding to.
  • Query: the author of the response asks for additional evidence in relation to the veracity of the rumour they are responding to.
  • Comment: the author of the response makes their own comment without a clear contribution to assessing the veracity of the rumour they are responding to.
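As a rough illustration of working with a tree-structured conversation, the sketch below walks a thread and pairs every reply with the post it responds to. The nested-dictionary layout, the post IDs and the helper name iter_replies are assumptions made for this example; they are not taken from the task data.

```python
from typing import Dict, Iterator, Optional, Tuple

# Hypothetical thread layout: each key is a post ID and each value is a dict
# of its direct replies, nested to any depth. The IDs below are made up.
Thread = Dict[str, "Thread"]


def iter_replies(
    tree: Thread, parent: Optional[str] = None, depth: int = 0
) -> Iterator[Tuple[str, str, int]]:
    """Yield (post_id, parent_id, depth) for every reply below the source post."""
    for post_id, children in tree.items():
        if parent is not None:  # the source post itself has no parent
            yield post_id, parent, depth
        yield from iter_replies(children, post_id, depth + 1)


thread = {"source": {"reply1": {"reply1a": {}}, "reply2": {}}}
pairs = list(iter_replies(thread))
for post_id, parent_id, depth in pairs:
    print(f"{post_id} replies to {parent_id} at depth {depth}")
```

Each yielded reply would then receive one of the four SDQC labels; direct replies have depth 1 and nested replies have a greater depth.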

Task B (verification)

The goal of the second subtask is to predict the veracity of a given rumour. The rumour is presented as a post reporting or querying a claim but deemed unsubstantiated at the time of release. Given such a claim, and a set of other resources provided, systems should return a label describing the anticipated veracity of the rumour as true or false. The ground truth of this task is manually established by journalist and expert members of the team who identify official statements or other trustworthy sources of evidence that resolve the veracity of the given rumour.

Additional context will be provided as input to veracity prediction systems; this context will consist of snapshots of relevant sources retrieved immediately before the rumour was reported, including a snapshot of an associated Wikipedia article, a Wikipedia dump, news articles from digital news outlets retrieved from NewsDiffs, as well as preceding tweets from the same event. Critically, no external resources may be used that contain information from after the rumour's resolution. To control this, we will specify precise versions of external information that participants may use. This is important to make sure we introduce time sensitivity into the task of veracity prediction.

We take a simple approach to this task, using only true/false labels for rumours. In practice, however, many claims are hard to verify; for example, there were many rumours concerning Vladimir Putin's activities in early 2015, many wholly unsubstantiable. Therefore, we also expect systems to return a confidence value in the range of 0-1 for each rumour; if the rumour is unverifiable, a confidence of 0 should be returned.


  • Codalab lead and Reddit data: Genevieve Gorrell
  • Twitter new (test) data: Ahmet Aker
  • Danish and Russian data: Leon Derczynski
  • Baseline: Elena Kochkina
  • Advice and support from the rest of the team: Arkaitz Zubiaga, Maria Liakata, Kalina Bontcheva

Evaluation Criteria

A submission should be a JSON format file, called "answer.json", containing one field per subtask and language (task A and task B for each of English, Danish and Russian), like so:

    {
        "subtaskaenglish": {
            "tweetid1": "comment",
            "tweetid2": "query",
            "tweetid3": "support",
            "redditid1": "deny"
        },
        "subtaskbenglish": {
            "tweetthreadid1": ["true", 1.0],
            "redditthreadid2": ["false", 0.0]
        },
        "subtaskadanish": {
            "tweetid4": "comment",
            "tweetid5": "query",
            "tweetid6": "comment",
            "redditid2": "deny"
        },
        "subtaskbdanish": {
            "tweetthreadid3": ["false", 1.0],
            "redditthreadid4": ["false", 0.0]
        },
        "subtaskarussian": {
            "tweetid7": "comment",
            "tweetid8": "comment",
            "tweetid9": "support",
            "redditid3": "comment"
        },
        "subtaskbrussian": {
            "tweetthreadid5": ["true", 1.0],
            "redditthreadid6": ["true", 0.0]
        }
    }

E.g. a Twitter tweet/thread ID might be something like 514957228327907328. A Reddit post/thread ID might be something like dbmdk4o.

Subtask A (SDQC) takes one element per comment. Comments should be classified into four categories: support, deny, query and comment. Performance is evaluated using macro F1. Although training threads for task B (verification) come with three labels (true, false and unverified), you should classify into two classes: true and false. Again, macro F1 will be calculated. Classes are followed by a confidence score, which will be used to calculate an RMSE, in order to give a more nuanced view of performance on task B. Unverified items should receive a confidence score of zero. For the F1 calculation, confidences below 0.5 will be considered a classification of unverified.
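To make the scoring description more concrete, here is a minimal sketch of macro F1 and of the 0.5 confidence threshold for task B. It is not the official scorer: the function names, the toy labels and the choice to score a class with no gold or predicted items as 0 are assumptions of this example. The RMSE is left out because its exact formula is not given here.

```python
from typing import List, Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores over the given labels."""
    scores: List[float] = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs in gold or pred contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def effective_label(label: str, confidence: float) -> str:
    """Treat a prediction with confidence below 0.5 as 'unverified'."""
    return "unverified" if confidence < 0.5 else label


# Task A toy example (labels are invented for illustration).
gold_a = ["support", "deny", "query", "comment", "comment"]
pred_a = ["support", "comment", "query", "comment", "comment"]
score_a = macro_f1(gold_a, pred_a, ["support", "deny", "query", "comment"])

# Task B toy example: [label, confidence] pairs as in the submission format.
pred_b = [["true", 0.9], ["false", 0.3]]
labels_b = [effective_label(label, conf) for label, conf in pred_b]
```

Because macro F1 weights every class equally, missing the single "deny" item in the toy data costs a full quarter of the task A score even though four of the five posts are labelled correctly.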

You will need to zip up your answer file to submit it. It should be at the top level of the archive, and should be called "answer.json".
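One possible way to produce such an archive is sketched below using only the Python standard library. The archive name "submission.zip", the helper name write_submission and the placeholder predictions are assumptions of this example; only the file name "answer.json" and its top-level position come from the instructions above.

```python
import json
import zipfile

# Placeholder predictions using the illustrative IDs from the format example.
answer = {
    "subtaskaenglish": {"tweetid1": "comment", "redditid1": "deny"},
    "subtaskbenglish": {"tweetthreadid1": ["true", 1.0]},
}


def write_submission(answer: dict, zip_path: str = "submission.zip") -> None:
    """Write the predictions as answer.json at the top level of a zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # An archive name without a directory part keeps the file at top level.
        archive.writestr("answer.json", json.dumps(answer, indent=2))


write_submission(answer)
```

Writing the JSON straight into the archive avoids accidentally zipping a containing folder, which would place "answer.json" one level too deep.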

If you are completing only one subtask, or not all languages, you can omit or leave empty the other fields.

Terms and Conditions

Use of the data indicates acceptance of the Twitter and Reddit terms of service.


Start: Aug. 6, 2018, midnight


Start: Jan. 21, 2019, midnight


Start: Feb. 1, 2019, midnight

Competition Ends


# Username Score
1 ukob-west 0.3799
2 AndrejJan 0.3326
3 kochkinael 0.3089