Assessing the Funniness of Edited News Headlines (SemEval-2020)


Overview

SemEval-2020 Task 7: Assessing Humor in Edited News Headlines

Join the task mailing list: semeval-2020-task-7-all@googlegroups.com

Background and Significance: Nearly all existing humor datasets are annotated only for whether a chunk of text is funny. It is also interesting, however, to study how short edits applied to a text can turn it from non-funny to funny. Such a dataset lets us focus on the humorous effect of atomic changes and on the tipping point between regular and humorous text. The goal of our task is to determine how well machines can understand humor generated by such short edits.

In addition, almost all humor datasets are annotated categorically, and the corresponding shared tasks focus on humor classification. Humor, however, occurs in varying intensities: certain jokes are much funnier than others. A system that can assess the intensity of humor is useful in various applications, for example humor generation, where such a system can be used in a generate-and-test scheme to produce many potentially humorous texts and rank them by funniness.

Tasks: In this competition, participants will estimate the funniness of news headlines that have been modified by humans using a micro-edit to make them funny. We define a headline micro-edit as any of the following replacements:

Replaced   Replacement
entity     noun
noun       noun
verb       verb
Each edited headline is scored by five judges, each of whom assigns a grade from the following scale:

Grade Meaning
0    Not Funny
1    Slightly Funny
2    Moderately Funny
3    Funny

The ground truth funniness of each headline is the mean of its five funniness grades. Sample data points from the training set are shown below:

Original Headline                                                              Substitute   Mean Grade
Kushner to visit Mexico following latest Trump tirades                        therapist    2.8
Hillary Clinton Staffers Considered Campaign Slogan `Because It's Her Turn'   fault        2.8
The Latest: BBC cuts ties with Myanmar TV station                             pies         1.8
Oklahoma isn't working. Can anyone fix this failing American state?           okay         0.0
4 soldiers killed in Nagorno-Karabakh fighting: Officials                     rabbits      0.0
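For illustration, a minimal sketch of loading the training data with pandas. The file name and column names used here are assumptions (not taken from this page), so check the released data files before relying on them:

```python
import pandas as pd

# Hypothetical file name -- the released training file may be named differently.
train = pd.read_csv("train.csv")

# Assumed columns: id, original, edit, grades, meanGrade,
# where meanGrade is the mean of the five judge grades.
print(train.head())
```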

 

There will be two sub-tasks that you can participate in:

  1. Regression: Given the original and the edited headline, the participant is required to predict the mean funniness of the edited headline.
  2. Predict the funnier of the two edited headlines: Given the original headline and two edited versions, the participant has to predict which edited version is the funnier of the two.

This dataset was introduced in the following publication:

Evaluation Criteria

Note: Evaluation on the test set will take place during January 10-31, 2020, according to SemEval rules.

Sub-Task 1: Regression.

Systems will be ranked using the Root Mean Squared Error (RMSE) on the overall test set. The file uploaded for evaluation must be a zip file containing a CSV file called "task-1-output.csv" with two columns in the following order:

  • id: the ID of the edited headline as provided in the dataset
  • pred: the estimated funniness of the headline, a real number in the [0, 3] funniness interval.

Please include the column headers, named exactly as above and in the order given. A sample output (for the baseline system) can be found here.
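As a sketch of producing the submission, assuming predictions are held in a dict mapping headline IDs to estimated funniness (the variable names and zip file name are illustrative, not prescribed by the task):

```python
import csv
import zipfile

# Hypothetical predictions: headline id -> estimated mean funniness in [0, 3].
predictions = {1: 0.9, 2: 1.7, 3: 0.3}

# Write the required CSV with the exact header names and column order.
with open("task-1-output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "pred"])
    for headline_id, funniness in predictions.items():
        writer.writerow([headline_id, funniness])

# Package the CSV into a zip file for upload.
with zipfile.ZipFile("task-1-submission.zip", "w") as z:
    z.write("task-1-output.csv")
```

The file for Sub-Task 2 ("task-2-output.csv", described below) can be produced the same way.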

We will additionally report RMSE over the N% funniest and N% least funny headlines in the test set, for N ∈ {10, 20, 30, 40}. For example, N = 30 means sorting the test set from most to least funny and using the top 30% and the bottom 30% of this sorted data, for a total of 60% of the test set, to compute the RMSE. These are additional evaluation metrics and will not be used to rank systems.
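A minimal sketch of these metrics, assuming parallel arrays of gold mean grades and predictions aligned by headline (the function names are illustrative; this is not the official evaluation script):

```python
import numpy as np

def rmse(gold, pred):
    """Root Mean Squared Error over paired gold/predicted funniness scores."""
    gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((gold - pred) ** 2)))

def rmse_at_n(gold, pred, n_percent):
    """RMSE restricted to the n% funniest and n% least funny test headlines."""
    gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
    order = np.argsort(-gold)                       # most to least funny
    k = int(len(gold) * n_percent / 100)
    keep = np.concatenate([order[:k], order[-k:]])  # top n% and bottom n%
    return rmse(gold[keep], pred[keep])
```

For N = 30 and a test set of 1,000 headlines, rmse_at_n keeps the 300 funniest and the 300 least funny headlines, i.e., 600 in total.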

 

Sub-Task 2: Predict the funnier of the two edited versions of an original headline.

Systems will be ranked based on the accuracy in predicting the funnier of the two edited versions of the same original headline according to the ground truth mean funniness on the test set. System outputs will be ignored for cases where the two edited headlines have the same ground truth mean funniness.

The file uploaded for evaluation must be a zip file containing a CSV file called "task-2-output.csv" with two columns in the following order:

  • id: the IDs of the two edited headlines, separated by "-", as provided in the dataset.
  • pred: which edited headline is predicted to be the funnier of the two:
    • 1 implies headline 1 is predicted funnier.
    • 2 implies headline 2 is predicted funnier.

Please include the column headers, named exactly as above and in the order given. A sample output (for the baseline system) can be found here.

We will also report another evaluation metric called the reward, calculated as follows:

  • For a correct prediction, the pair-wise reward is the (positive) absolute difference between the mean grades of the two headlines.
  • For a wrong prediction, the pair-wise reward is the negative of that difference.

The overall reward is the mean of these pair-wise rewards (cases where the two edited headlines have the same ground truth mean funniness are again ignored). This metric will not be used to rank teams.
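A minimal sketch of both Sub-Task 2 metrics, assuming three parallel lists holding the gold mean grades of headline 1, the gold mean grades of headline 2, and the predicted label (1 or 2) for each pair; ties in the gold grades are skipped, as described above. The names are illustrative, not the official scorer:

```python
def accuracy_and_reward(gold1, gold2, pred_labels):
    """Accuracy and mean reward over headline pairs, ignoring gold-grade ties."""
    correct, rewards, n_scored = 0, [], 0
    for g1, g2, pred in zip(gold1, gold2, pred_labels):
        if g1 == g2:                      # ties are ignored entirely
            continue
        n_scored += 1
        truth = 1 if g1 > g2 else 2
        diff = abs(g1 - g2)
        if pred == truth:
            correct += 1
            rewards.append(diff)          # correct: positive difference
        else:
            rewards.append(-diff)         # wrong: negative difference
    return correct / n_scored, sum(rewards) / n_scored
```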

Terms and Conditions

SemEval 2020 terms and conditions coming soon!

Task Organizers

 

Nabil Hossain (nhossain@cs.rochester.edu), University of Rochester
John Krumm (jckrumm@microsoft.com), Microsoft Research AI
Michael Gamon (mgamon@microsoft.com), Microsoft Research AI
Henry Kautz (kautz@cs.rochester.edu), University of Rochester

Resources

Participants are encouraged to look into the following paper, which introduces the dataset:

Baseline systems and evaluation scripts are available on GitHub.

Baseline for Task 1: This baseline always predicts the overall mean funniness grade in the training set.

Baseline for Task 2: This baseline always predicts the most frequent label in the training set (i.e., headline 2).
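A minimal sketch of both baselines, assuming the training mean grades and training pair labels are already loaded into lists (names are illustrative; the official baseline scripts are the ones on GitHub):

```python
from collections import Counter

def task1_baseline(train_mean_grades, n_test):
    """Predict the overall mean funniness grade of the training set for every test headline."""
    overall_mean = sum(train_mean_grades) / len(train_mean_grades)
    return [overall_mean] * n_test

def task2_baseline(train_labels, n_test):
    """Predict the most frequent training label (headline 2, per the task description) for every test pair."""
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    return [most_frequent] * n_test
```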

Competition Phases

Development-Task-1: starts May 28, 2019, midnight UTC
Development-Task-2: starts May 28, 2019, midnight UTC
Evaluation-Task-1: starts Jan. 10, 2020, midnight UTC
Evaluation-Task-2: starts Jan. 10, 2020, midnight UTC
Post-Evaluation-Task-1: starts Feb. 1, 2020, midnight UTC
Post-Evaluation-Task-2: starts Feb. 1, 2020, midnight UTC

Competition Ends: Never

Leaderboard

#   Username    Score
1   Ferryman    0.52473
2   Pramodith   0.52487
3   will_go     0.52763