Assessing the Funniness of Edited News Headlines (SemEval-2020)

Organized by nabilhossain


Overview

SemEval-2020 Task 7: Assessing Humor in Edited News Headlines

Join the task mailing list: semeval-2020-task-7-all@googlegroups.com

Background and Significance: Nearly all existing humor datasets are annotated to study whether a chunk of text is funny. However, it is interesting to study how short edits applied to a text can turn it from non-funny to funny. Such a dataset helps us focus on the humorous effects of atomic changes and the tipping point between regular and humorous text. The goal of our task is to determine how machines can understand humor generated by such short edits.

In addition, almost all humor datasets are annotated categorically, with the associated shared tasks framed as humor classification. However, humor occurs in varying intensities; that is, certain jokes are much funnier than others. A system that can assess the intensity of humor is useful in various applications, for example humor generation, where it can be used in a generate-and-test scheme to produce many potentially humorous texts and rank them by funniness.

Tasks: In this competition, participants will estimate the funniness of news headlines that have been modified by humans using a micro-edit to make them funny. We define a headline micro-edit as any of the following replacements:

Replaced  Replacement
entity    noun
noun      noun
verb      verb

Each edited headline is scored by five judges, each of whom assigns one of the following grades:

Grade Meaning
0    Not Funny
1    Slightly Funny
2    Moderately Funny
3    Funny

The ground truth funniness of each headline is the mean of its five funniness grades. Sample data points from the training set are shown below:

 Original Headline  Substitute  Mean Grade
 Kushner to visit Mexico following latest Trump tirades  therapist  2.8
 Hillary Clinton Staffers Considered Campaign Slogan `Because It's Her Turn'  fault  2.8
 The Latest: BBC cuts ties with Myanmar TV station  pies  1.8
 Oklahoma isn't working. Can anyone fix this failing American state?  okay  0.0
 4 soldiers killed in Nagorno-Karabakh fighting: Officials  rabbits  0.0

There will be two sub-tasks that you can participate in:

  1. Regression: Given the original and the edited headline, the participant is required to predict the mean funniness of the edited headline.
  2. Predict the funnier of the two edited headlines: Given the original headline and two edited versions, the participant has to predict which edited version is the funnier of the two.

This dataset was introduced in the following publication:

  • Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 133-142.

Evaluation Criteria

Note: Evaluations on the test set will happen during Jan 10-31, 2020, according to SemEval rules.

Sub-Task 1: Regression.

Systems will be ranked using the Root Mean Squared Error (RMSE) on the overall test set. The file uploaded for evaluation must be a zip file containing a csv file called "task-1-output.csv" having two columns in the following order:

  • id: the ID of the edited headline as provided in the dataset
  • pred: the estimated funniness of the headline, a real number in the 0-3 interval.

Please include the column headers, named exactly as above and in the order given. A sample output (for the baseline system) can be found here.
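For illustration, here is a minimal Python sketch of producing such a submission file, assuming pandas is available; the ids and scores below are hypothetical, and only the column names and the zip packaging are mandated above:

    import zipfile
    import pandas as pd

    # Hypothetical predictions: (headline id, predicted mean funniness in [0, 3]).
    predictions = [(101, 1.2), (102, 0.4), (103, 2.7)]

    # Write the required two-column csv with the exact header names, in order.
    pd.DataFrame(predictions, columns=["id", "pred"]).to_csv("task-1-output.csv", index=False)

    # Package the csv into a zip file for upload.
    with zipfile.ZipFile("task-1-submission.zip", "w") as zf:
        zf.write("task-1-output.csv")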

We will additionally report RMSE by taking the N% most funny headlines and N% least funny headlines in the test set, for N ∈ {10,20,30,40}. For example, N=30 implies sorting the test set from most funny to least funny and using the top 30% and the bottom 30% of this sorted data, for a total of 60% of the test set, to calculate the RMSE. These are meant to be additional evaluation metrics, and they will not be used to rank systems.
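The following is a minimal sketch of these metrics, assuming the gold mean grades and the system predictions are aligned NumPy arrays; the official evaluation scripts on GitHub (see Resources) are authoritative:

    import numpy as np

    def rmse(gold, pred):
        """Root Mean Squared Error between aligned gold and predicted scores."""
        gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
        return float(np.sqrt(np.mean((gold - pred) ** 2)))

    def rmse_top_bottom(gold, pred, n_percent):
        """RMSE restricted to the n% most funny and n% least funny gold headlines."""
        gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
        k = int(round(len(gold) * n_percent / 100))
        order = np.argsort(gold)                        # least funny -> most funny
        keep = np.concatenate([order[:k], order[-k:]])  # bottom n% and top n%
        return rmse(gold[keep], pred[keep])

    # Example usage on random scores, reporting the additional metrics for N in {10, 20, 30, 40}.
    gold = np.random.uniform(0, 3, size=1000)
    pred = np.random.uniform(0, 3, size=1000)
    print(rmse(gold, pred))
    print({n: rmse_top_bottom(gold, pred, n) for n in (10, 20, 30, 40)})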

Sub-Task 2: Predict the funnier of the two edited versions of an original headline.

Systems will be ranked based on the accuracy in predicting the funnier of the two edited versions of the same original headline according to the ground truth mean funniness on the test set. System outputs will be ignored for cases where the two edited headlines have the same ground truth mean funniness.

The file uploaded for evaluation must be a zip file containing a csv file called "task-2-output.csv" having two columns in the following order:

  • id: the IDs of the two edited headlines, separated by "-", as provided in the dataset.
  • pred: the edited headline predicted to be the funnier of the two.
    • 1 implies headline 1 is predicted funnier.
    • 2 implies headline 2 is predicted funnier.

Please include the column headers, named exactly as above and in the order given. A sample output (for the baseline system) can be found here.
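Analogous to Sub-Task 1, a minimal sketch of producing this file (the pair ids and labels below are hypothetical):

    import zipfile
    import pandas as pd

    # Hypothetical rows: id is "<id1>-<id2>" as provided in the dataset,
    # pred is 1 or 2 depending on which edited headline is predicted funnier.
    rows = [("101-102", 1), ("103-104", 2)]

    pd.DataFrame(rows, columns=["id", "pred"]).to_csv("task-2-output.csv", index=False)

    with zipfile.ZipFile("task-2-submission.zip", "w") as zf:
        zf.write("task-2-output.csv")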

We will also report another evaluation metric called the reward, calculated as follows:

  • For a correct prediction, the pair-wise reward is the absolute difference between the mean grades of the two headlines.
  • For a wrong prediction, the pair-wise reward is the negative of that absolute difference.

Overall reward is the mean of these pair-wise rewards (here we also ignore cases where the two edited headlines have the same ground truth mean funniness). This metric will not be used to rank teams.
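A minimal sketch of both the accuracy and the reward computations, assuming each test pair is given as the two gold mean grades plus the predicted label (1 or 2); ties in the gold means are skipped, as stated above:

    def accuracy_and_reward(pairs):
        """pairs: iterable of (gold_mean_1, gold_mean_2, predicted_label), label in {1, 2}."""
        hits, rewards = [], []
        for g1, g2, pred in pairs:
            if g1 == g2:                      # equal gold mean funniness: ignored
                continue
            gold_label = 1 if g1 > g2 else 2
            diff = abs(g1 - g2)
            hits.append(pred == gold_label)
            rewards.append(diff if pred == gold_label else -diff)
        return sum(hits) / len(hits), sum(rewards) / len(rewards)

    # Example with three hypothetical pairs; the second pair is a tie and is skipped.
    print(accuracy_and_reward([(2.8, 1.8, 1), (1.0, 1.0, 2), (0.4, 2.2, 1)]))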

Terms and Conditions

By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, send us an email and we will consider whether an exception can be made.

  • By submitting results to this competition, you consent to the public release of your scores at this website and at SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • A participant can be involved in exactly one team (no more). If there are reasons why it makes sense for you to be on more than one team, then email us before the evaluation period begins. In special circumstances this may be allowed.
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation period has begun.
  • During the evaluation period:
    • Each team can make up to 50 submissions. However, only the final submission will be considered the official submission to the competition.
    • You will not be able to see the results of your submission on the test set.
    • You will be able to see any warnings and errors for each of your submissions.
    • The leaderboard is disabled.
  • Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.
  • We will make the final submissions of the teams public at some point after the evaluation period.
  • The organizers and their affiliated institutions make no warranties regarding the datasets provided, including but not limited to their correctness or completeness. They cannot be held liable for providing access to the datasets or for the usage of the datasets.
  • Each participating team will be assigned other teams’ system description papers for review, using the START system. The papers will thus be peer reviewed.
  • The dataset should only be used for scientific or research purposes. Any other use is explicitly prohibited.
  • The datasets must not be redistributed or shared in part or full with any third party. Redirect interested parties to this website.
  • If you use any of the datasets provided here, cite these papers: 
    • Nabil Hossain, John Krumm, and Michael Gamon. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 133-142).
    • Nabil Hossain, John Krumm, Michael Gamon, and Henry Kautz. 2020. SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, September 2020.

Task Organizers

 

Nabil Hossain

nhossain@cs.rochester.edu

University of Rochester

 

John Krumm

jckrumm@microsoft.com

Microsoft Research AI

 

Michael Gamon

mgamon@microsoft.com

Microsoft Research AI

 

Henry Kautz

kautz@cs.rochester.edu

University of Rochester

Resources

Participants are encouraged to look into the following paper, which introduces the dataset:

  • Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 133-142.

Baseline systems and evaluation scripts are available on GitHub.

Baseline for Task 1: This baseline always predicts the overall mean funniness grade in the training set.

Baseline for Task 2: This baseline always predicts the most frequent label in the training set (i.e., headline 2).
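A minimal sketch of both baselines, assuming the training data is loaded into pandas DataFrames with a meanGrade column (Task 1) and a label column with values 1 or 2 (Task 2); the file and column names here are assumptions, and the official baseline scripts on GitHub are authoritative:

    import pandas as pd

    # Hypothetical file names for the released training csv files.
    train1 = pd.read_csv("task-1-train.csv")   # assumed column: meanGrade
    train2 = pd.read_csv("task-2-train.csv")   # assumed column: label (1 or 2)

    # Task 1 baseline: always predict the overall mean funniness grade of the training set.
    task1_prediction = train1["meanGrade"].mean()

    # Task 2 baseline: always predict the most frequent label in the training set (headline 2).
    task2_prediction = train2["label"].mode()[0]

    print(task1_prediction, task2_prediction)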

Note: Instructions updated on April 3, 2020 after COLING 2020 was postponed to December 2020 and SemEval revised its submission rules.

Participants who made a submission on the CodaLab website during the official evaluation period are given the opportunity to submit a system-description paper that describes their system, resources used, results, and analysis. This paper will be part of the official SemEval-2020 proceedings.

Here are important details regarding system description paper submissions. There you will find the submission site, the best system description paper award criteria, and the paper formatting template. Papers should follow the COLING camera-ready formatting. SemEval guidelines for writing system description papers are here.

Papers are due Friday, May 15, 2020 by 23:59 UTC-12h ("Anywhere on earth"). Check out other important dates here.

Page limits: If describing only one sub-task, then up to 5 pages + unlimited references. If you took part in both sub-tasks then you can go up to 8 pages + references if your models for both tasks are significantly different. For example, simply using the model for sub-task 1 in sub-task 2 to assign funniness scores to the two headlines for a test example and then choosing the headline with the max score does not qualify for an 8 page paper. If you believe that your two models are significantly different, send us an email requesting approval for an 8 page paper. An extra page will be given for the camera-ready version to incorporate reviewer suggestions.

You do not have to repeat details of the task and data. Just cite the task paper (details below), briefly summarize the tasks you submitted to, and then go into the details of related work, your submissions, experiments, and results.

  • Nabil Hossain, John Krumm, Michael Gamon, and Henry Kautz. 2020. SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, September 2020.
    @InProceedings{SemEval2020Task7,
        author = {Hossain, Nabil and Krumm, John and Gamon, Michael and Kautz, Henry},
        title = {SemEval-2020 {T}ask 7: {A}ssessing Humor in Edited News Headlines},
        booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2020)},
        address = {Barcelona, Spain},
        year = {2020}
    }

A copy of this paper will be made available in early May. This is after the deadline for your paper submission, but you will be able to see the task paper well before the camera-ready deadline, so you can still update your paper as you see fit.

The paper below describes how the dataset for the task was created:

  • Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 133-142.

The following paper describes how the additional training data (FunLines) was created:

  • Nabil Hossain, John Krumm, Tanvir Sajed, and Henry Kautz. 2020. Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines. In Proceedings of ACL 2020, System Demonstrations.
  • @inproceedings{hossain-etal-2020-funlines,
        title = "{S}timulating Creativity With FunLines: A Case Study of Humor Generation in Headlines",
        author = "Hossain, Nabil  and Krumm, John and Sajed, Tanvir and Kautz, Henry",
        booktitle = "Proceedings of {ACL} 2020, System Demonstrations",
        month = jul,
        year = "2020",
        address = "Seattle, Washington",
        publisher = "Association for Computational Linguistics"
    }

Important Notes:

  • You are not obligated to submit a system-description paper; however, we strongly encourage all participating teams to do so.
  • SemEval seeks to have all participants publish a paper, unless the paper does a poor job of describing their system. Your system rank and scores will not impact whether the paper is accepted or not.
  • Note that SemEval submission is not anonymous; author names should be included.
  • Later, each participating team will be assigned up to 2 other teams’ system description papers for review, using the START system. We request that the papers be reviewed by the most experienced reviewer(s) on your team.
  • Assuming we have a physical conference, all task participant teams should prepare a poster for display at SemEval. One selected team will be asked to prepare a short talk. Details will be provided at a later date.
  • Please do not dwell too much on system rankings. Focus instead on analysis and the research questions that your system can help address.
  • It may also be helpful to look at some of the papers from past SemEval competitions, e.g., from here.

What to include in a system-description paper? Here are some key pointers specific to this task:

  • Replicability: Present all details that will allow someone else to replicate your system.
  • Analysis: Focus more on results and analysis and less on discussing rankings. Report results on several variants of the system (even beyond the official submission); present sensitivity analysis of your system's parameters, network architecture, etc.; present ablation experiments showing usefulness of different features and techniques; show comparisons with baselines. Use the gold test labels for the extra analysis. However, clearly mark what the official submission results were and what the ranks were. Discuss where your models failed, and why they failed, with qualitative and quantitative analysis where possible.
  • Data: Discuss any quirks in the data. What interesting things did you discover about the data that contributed to your experiments in a positive or negative way?
  • Humor Theories: Discuss any theories of humor that you used as part of your approach. For example, modeling surprise, setup-and-punchline, etc.
  • Justification: Include justifications for model choices. Explain why you chose one model over others, why you felt this was the right model to use, and persuade readers that this is a good approach.
  • Related work: Place your work in context of previously published related work. Cite all data and resources used in your submission.

FAQ

  • Q. My system did not get a good rank. Should I still write a system-description paper?
    Ans. We encourage all participants to submit a system description paper. The goal is to record all the approaches that were used and how effective they were. Do not dwell too much on system rankings. Focus instead on analysis and the research questions that your system can help address. What has not worked is also useful information. You can also write a paper with a focus on testing a hypothesis that your system and this task allow you to explore.
  • Q. Can we describe results of new techniques that we haven't submitted to the eval phase?
    Ans. Yes, you are allowed, and even encouraged. But: clearly mark what the official submission results were and what the ranks were.
  • Q. What should the title prefix look like?
    Ans. Your title should be something like this: "<team name> at SemEval-2020 Task 7: [Some More Title Text]"
  • Q. How do I cite the task?
    Ans. All system papers must cite the task paper. Additionally, we will be grateful if you also cite our NAACL 2019 paper above which describes how the data was created, and also cite the FunLines paper above that provided the additional training data (see above for all 3 citations).

Development-Task-2

Start: May 28, 2019, midnight

Development-Task-1

Start: May 28, 2019, midnight

Evaluation-Task-2

Start: Feb. 20, 2020, midnight

Evaluation-Task-1

Start: Feb. 20, 2020, midnight

Post-Evaluation-Task-2

Start: March 9, 2020, midnight

Post-Evaluation-Task-1

Start: March 9, 2020, midnight

Competition Ends

Never
