Fact Extraction and VERification (FEVER) Challenge


Shared Task Track

Participants will be invited to develop systems that identify evidence and reason about the truthfulness of a given claim that we have generated. Our dataset currently contains 200,000 true and false claims. The true claims were written by human annotators extracting information from Wikipedia.

Task Definition

The purpose of the FEVER challenge is to evaluate the ability of a system to verify information using evidence from Wikipedia.

  • Given a factual claim involving one or more entities (resolvable to Wikipedia pages), the system must extract textual evidence (sets of sentences from Wikipedia pages) that support or refute the claim.
  • Using this evidence, label the claim as Supported or Refuted given the evidence, or NotEnoughInfo (if there isn’t sufficient evidence to either support or refute it).
  • One piece of evidence can consist of multiple sentences that provide the stated label only if examined together (e.g. for the claim “Oliver Reed was a film actor.”, one piece of evidence can be the set {“Oliver Reed starred in Gladiator”, “Gladiator is a film released in 2000”}).

Find out more about the challenge on our website http://fever.ai and submit system descriptions via Softconf.

Key Dates

  • Challenge Launch: 3rd April 2018
  • Testing Begins (test set released): 24th July 2018
  • Submission Closes: 27th July 2018
  • Results Announced: 30th July 2018
  • System Descriptions Due for Workshop: 10th August 2018
  • Winners Announced: 31st of October or 1st of November (EMNLP)

All deadlines are calculated at 11:59 p.m. Pacific Daylight Time (UTC-7).

Scoring

Our scoring considers classification accuracy and evidence recall.

  • We will only award points for accuracy if the correct evidence is found.
  • For a claim, we consider the correct evidence to be found if the complete set of annotated sentences is returned.
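The two scoring rules above can be sketched as a set-containment check. This is a simplified illustration only, not the official scorer: it represents evidence as [page, sentence ID] pairs and ignores the NOT ENOUGH INFO case.

```python
def evidence_found(annotated_sets, predicted):
    """True if at least one complete annotated evidence set is
    contained in the predicted evidence (illustrative sketch)."""
    predicted = {tuple(e) for e in predicted}
    return any(all(tuple(e) in predicted for e in ev_set)
               for ev_set in annotated_sets)

def instance_score(gold_label, annotated_sets, pred_label, predicted):
    """Award a point only when the label is correct AND a complete
    annotated evidence set was returned."""
    return int(gold_label == pred_label
               and evidence_found(annotated_sets, predicted))
```

Note that returning a superset of an annotated evidence set still counts: the requirement is that some complete annotated set is covered, not that nothing else is returned.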

Baseline system

For a detailed description of the data annotation process and baseline results, see the FEVER paper (BibTeX below).

Data Format

The data will be distributed in JSONL format with one example per line (see http://jsonlines.org/ for more details).

In addition to the task-specific dataset, the full set of Wikipedia pages (segmented at the sentence level) will be distributed on the data tab or on our website https://sheffieldnlp.github.io/fever.
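Since each line of a JSONL file is an independent JSON object, the files can be read with a few lines of standard-library Python (the filename train.jsonl is illustrative):

```python
import json

def read_jsonl(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# examples = read_jsonl("train.jsonl")
```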

Training/Development Data format

The training and development data will contain 4 fields:

  • id: The ID of the claim
  • label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
  • claim: The text of the claim.
  • evidence: A list of evidence sets (lists of [Annotation ID, Evidence ID, Wikipedia URL, sentence ID] tuples), or a single [Annotation ID, Evidence ID, null, null] tuple if the label is NOT ENOUGH INFO.

Below are examples of the data structures for each of the three labels.

Supports Example

{
    "id": 62037,
    "label": "SUPPORTS",
    "claim": "Oliver Reed was a film actor.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 3],
            [<annotation_id>, <evidence_id>, "Gladiator_-LRB-2000_film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 2],
            [<annotation_id>, <evidence_id>, "Castaway_-LRB-film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 1]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 6]
        ]
    ]
}

Refutes Example

{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}

NotEnoughInfo Example

{
    "id": 137637,
    "label": "NOT ENOUGH INFO",
    "claim": "Henri Christophe is recognized for building a palace in Milot.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, null, null]
        ]
    ]
}
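Given a parsed training example, the Wikipedia pages and sentence IDs can be pulled out of the nested evidence field as follows (a sketch; it skips the null entries that appear in NOT ENOUGH INFO examples):

```python
def evidence_pairs(example):
    """Collect (page, sentence_id) pairs from all evidence sets,
    skipping [.., .., null, null] entries."""
    pairs = set()
    for ev_set in example["evidence"]:
        for _ann_id, _ev_id, page, sent_id in ev_set:
            if page is not None:
                pairs.add((page, sent_id))
    return pairs
```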

Test Data format

The test data will follow the same format as the training/development examples, with the label and evidence fields removed.

{
    "id": 78526,
    "claim": "Lorelai Gilmore's father is named Robert."
}

Answer Submission Instructions

  • Go to Codalab
  • Create a team/system account
  • Submit the answers file as a JSONL document (the file name must be predictions.jsonl), one claim object per line, with predicted evidence given as a list of [Page, Line ID] pairs. Each JSON object should adhere to the following format (with line breaks removed), and the order of the claims must be preserved.
{
    "id": 78526,
    "predicted_label": "REFUTES",
    "predicted_evidence": [
        ["Lorelai_Gilmore", 3]
    ]
}
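Writing the submission file can be sketched as below, where predictions is a hypothetical list of dicts already in the required claim order:

```python
import json

def write_predictions(predictions, path="predictions.jsonl"):
    """Write one prediction object per line, preserving claim order."""
    with open(path, "w", encoding="utf-8") as f:
        for p in predictions:
            f.write(json.dumps(p) + "\n")
```

Because json.dumps never emits newlines by default, each object is guaranteed to occupy exactly one line, as the submission format requires.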

Evaluation

In the first instance, submissions will be measured on the correctness of the label assigned to each claim. If a claim is labelled as Supported or Refuted, the submitted evidence will additionally be checked against the list of annotated evidence sets. The label will be considered correct only if at least one submitted evidence set matches an annotated one.

At a later date (before the workshop) we will manually assess evidence marked as false-positive and release updated scores and an update to the corpus.

Best practice

  1. We strongly discourage using the submission server for parameter tuning. Such submissions will be disqualified from the task.
  2. If you are using any additional data beyond Wikipedia to build your system, please mention it clearly in your submission notes.

BibTeX

@inproceedings{Thorne18Fever,
    author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {{FEVER}: a Large-scale Dataset for Fact Extraction and VERification},
    booktitle = {NAACL-HLT},
    year = {2018}
}

Competition Phases

  • Before Competition (Development Set Evaluation): starts April 3, 2018, midnight UTC
  • Competition (Blind Test Set Evaluation): starts July 24, 2018, midnight UTC
  • After Competition: Perpetual Evaluation (Test Set): starts July 28, 2018, midnight UTC
  • Competition Ends: never
