Assessing the Funniness of Edited News Headlines (SemEval-2020)

Organized by nabilhossain

Overview

SemEval-2020 Task 7: Assessing Humor in Edited News Headlines

Join the task mailing list: semeval-2020-task-7-all@googlegroups.com

Background and Significance: Nearly all existing humor datasets are annotated to study whether a piece of text is funny. However, it is also interesting to study how short edits can turn a text from non-funny to funny. Such a dataset lets us focus on the humorous effect of atomic changes and on the tipping point between regular and humorous text. The goal of our task is to determine how well machines can understand the humor generated by such short edits.

In addition, almost all humor datasets are annotated categorically, and the corresponding shared tasks are framed as humor classification. However, humor occurs in varying intensities; that is, certain jokes are much funnier than others. A system that can assess the intensity of humor is useful in various applications, for example humor generation, where such a system can be used in a generate-and-test scheme to produce many potentially humorous texts and rank them by funniness.

Tasks: In this competition, participants will estimate the funniness of news headlines that have been modified by humans using a micro-edit to make them funny. We define a headline micro-edit as any of the following replacements:

  Replaced    Replacement
  entity      noun
  noun        noun
  verb        verb

Each edited headline was scored by five judges, each of whom assigned one of the following grades:

Grade Meaning
0    Not Funny
1    Slightly Funny
2    Moderately Funny
3    Funny

The ground truth funniness of each headline is the mean of its five grades. Sample data points from the training set are shown below (the substitute word replaces one word in the original headline):

  Original Headline                                                              Substitute   Mean Grade
  Kushner to visit Mexico following latest Trump tirades                         therapist    2.8
  Hillary Clinton Staffers Considered Campaign Slogan 'Because It's Her Turn'    fault        2.8
  The Latest: BBC cuts ties with Myanmar TV station                              pies         1.8
  Oklahoma isn't working. Can anyone fix this failing American state?            okay         0.0
  4 soldiers killed in Nagorno-Karabakh fighting: Officials                      rabbits      0.0


There will be two sub-tasks that you can participate in:

  1. Regression: Given the original and the edited headline, the participant is required to predict the mean funniness of the edited headline.
  2. Predict the funnier of the two edited headlines: Given the original headline and two edited versions, the participant has to predict which edited version is the funnier of the two.

This dataset was introduced in the following publication:

Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 133-142.

Evaluation Criteria

Note: Evaluations on the test set will take place during Jan. 10-31, 2020, in accordance with SemEval rules.

Sub-Task 1: Regression.

Systems will be ranked using the Root Mean Squared Error (RMSE) on the overall test set. The file uploaded for evaluation must be a zip file containing a csv file called "task-1-output.csv" having two columns in the following order:

  • id: the ID of the edited headline as provided in the dataset
  • pred: the estimated funniness for the headline, a real number in the 0-3 funniness interval.

Please include the column headers, named exactly as above and in the given order. A sample output (for the baseline system) can be found here; a sketch of how such a file might be assembled is shown below.
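For concreteness, here is a minimal sketch, in Python with pandas, of how such a submission archive could be assembled. The headline ids and scores are placeholders, not real data:

    import zipfile

    import pandas as pd

    # Placeholder predictions: headline id -> estimated mean funniness in [0, 3].
    predictions = {1183: 1.2, 2057: 0.4, 3391: 2.1}

    # Two columns, with the required names and in the required order.
    submission = pd.DataFrame(
        {"id": list(predictions.keys()), "pred": list(predictions.values())}
    )
    submission.to_csv("task-1-output.csv", index=False)

    # The evaluation server expects a zip file containing the csv.
    with zipfile.ZipFile("task-1-submission.zip", "w") as zf:
        zf.write("task-1-output.csv")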

We will additionally report RMSE by taking the N% most funny headlines and N% least funny headlines in the test set, for N ∈ {10,20,30,40}. For example, N=30 implies sorting the test set from most funny to least funny and using the top 30% and the bottom 30% of this sorted data, for a total of 60% of the test set, to calculate the RMSE. These are meant to be additional evaluation metrics, and they will not be used to rank systems.
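The sketch below restates these two metrics in Python/NumPy; the official evaluation scripts on GitHub remain authoritative:

    import numpy as np

    def rmse(pred, gold):
        """Root Mean Squared Error between predicted and gold mean funniness."""
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        return float(np.sqrt(np.mean((pred - gold) ** 2)))

    def rmse_at_n(pred, gold, n_percent):
        """RMSE on the n% most funny and n% least funny test headlines,
        ranked by the gold mean funniness."""
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        order = np.argsort(-gold)                       # most funny first
        k = int(round(len(gold) * n_percent / 100.0))
        keep = np.concatenate([order[:k], order[-k:]])  # top n% plus bottom n%
        return rmse(pred[keep], gold[keep])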


Sub-Task 2: Predict the funnier of two edited versions of an original headline.

Systems will be ranked by their accuracy in predicting the funnier of the two edited versions of the same original headline, according to the ground truth mean funniness on the test set. System outputs will be ignored for cases where the two edited headlines have the same ground truth mean funniness.

The file uploaded for evaluation must be a zip file containing a csv file called "task-2-output.csv" having two columns in the following order:

  • id: the ID of the two edited headlines separated by "-" as provided in the dataset.
  • pred: the edited headline which is predicted the funnier of the two.
    • 1 implies headline 1 is predicted funnier.
    • 2 implies headline 2 is predicted funnier.

Please include the column headers, named exactly as above and in the given order. A sample output (for the baseline system) can be found here.

We will also report another evaluation metric called the reward, calculated as follows:

  • For a correct prediction, the pair-wise reward is the absolute difference between the mean grades of the two headlines.
  • For a wrong prediction, the pair-wise reward is the negative of that difference.

Overall reward is the mean of these pair-wise rewards (here we also ignore cases where the two edited headlines have the same ground truth mean funniness). This metric will not be used to rank teams.
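The following Python/NumPy sketch restates the Task 2 accuracy and reward as defined above; again, the official evaluation scripts on GitHub remain authoritative:

    import numpy as np

    def task2_metrics(pred, gold_mean1, gold_mean2):
        """Accuracy and mean reward for Task 2.

        pred: 1 or 2 for each pair (the edited headline predicted funnier).
        gold_mean1, gold_mean2: gold mean funniness of edited headlines 1 and 2.
        """
        pred = np.asarray(pred)
        m1 = np.asarray(gold_mean1, float)
        m2 = np.asarray(gold_mean2, float)

        keep = m1 != m2                    # ignore pairs with equal gold means
        pred, m1, m2 = pred[keep], m1[keep], m2[keep]

        gold = np.where(m1 > m2, 1, 2)     # which headline is actually funnier
        correct = pred == gold

        accuracy = float(np.mean(correct))
        diff = np.abs(m1 - m2)             # pair-wise reward magnitude
        reward = float(np.mean(np.where(correct, diff, -diff)))
        return accuracy, reward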

Terms and Conditions

By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, send us an email and we will consider whether an exception can be made.

  • By submitting results to this competition, you consent to the public release of your scores on this website, at the SemEval-2020 workshop, and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision on metric choice and score values rests with the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • A participant can be involved in exactly one team (no more). If there are reasons why it makes sense for you to be on more than one team, then email us before the evaluation period begins. In special circumstances this may be allowed.
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation period has begun.
  • During the evaluation period:
    • Each team can make as many as 50 submissions. However, only the final submission will be considered the official submission to the competition.
    • You will not be able to see results of your submissions on the test set.
    • You will be able to see any warnings and errors for each of your submissions.
    • The leaderboard is disabled.
  • Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.
  • We will make the final submissions of the teams public at some point after the evaluation period.
  • The organizers and their affiliated institutions make no warranties regarding the datasets provided, including but not limited to warranties of correctness or completeness. They cannot be held liable for providing access to the datasets or for the use of the datasets.
  • Each task participant will be assigned other teams' system description papers to review, using the START system. The papers will thus be peer reviewed.
  • The dataset should only be used for scientific or research purposes. Any other use is explicitly prohibited.
  • The datasets must not be redistributed or shared in part or full with any third party. Redirect interested parties to this website.
  • If you use any of the datasets provided here, cite these papers: 
    • Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 133-142.
    • Nabil Hossain, John Krumm, and Michael Gamon. 2020. SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, September 2020.

Task Organizers

 

Nabil Hossain

nhossain@cs.rochester.edu

University of Rochester

 

John Krumm

jckrumm@microsoft.com

Microsoft Research AI

 

Michael Gamon

mgamon@microsoft.com

Microsoft Research AI

 

Henry Kautz

kautz@cs.rochester.edu

University of Rochester

Resources

Participants are encouraged to read the following paper, which introduces the dataset:

Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 133-142.

Baseline systems and evaluation scripts are available on GitHub.

Baseline for Task 1: This baseline always predicts the overall mean funniness grade in the training set.

Baseline for Task 2: This baseline always predicts the most frequent label in the training set (i.e., headline 2).
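In code, the two baselines amount to something like the following sketch; the file paths and the column names "meanGrade" and "label" are assumptions about the released training data, so check the data files and the GitHub scripts for the exact format:

    import pandas as pd

    # File and column names here are assumptions; adjust to the released data.
    train1 = pd.read_csv("task-1/train.csv")   # assumed "meanGrade" column
    train2 = pd.read_csv("task-2/train.csv")   # assumed "label" column (1 or 2)

    # Task 1 baseline: always predict the overall mean funniness grade.
    constant_grade = train1["meanGrade"].mean()

    # Task 2 baseline: always predict the most frequent label (headline 2).
    majority_label = train2["label"].mode()[0]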

Participants will be given the opportunity to write a system description paper that describes their system, resources used, results, and analysis. This paper will be part of the official SemEval-2020 proceedings. The paper is to be four pages long plus two pages at most for references. The papers are to follow the format and style files provided by ACL/NAACL/EMNLP 2020.

Development-Task-2

Start: May 28, 2019, midnight

Development-Task-1

Start: May 28, 2019, midnight

Evaluation-Task-2

Start: Feb. 20, 2020, midnight

Evaluation-Task-1

Start: Feb. 20, 2020, midnight

Post-Evaluation-Task-2

Start: March 8, 2020, midnight

Post-Evaluation-Task-1

Start: March 8, 2020, midnight

Competition Ends

Never

# Username Score
1 alonzorz 0.51568
2 vgtomahawk 0.51622
3 BramVanroy 0.51800