HaHackathon: Detecting and Rating Humor and Offense


Task 7: HaHackathon: Linking Humor and Offense Across Different Age Groups


Join our mailing list: hahackathon@googlegroups.com 


Background and Motivation


Humor, like most figurative language, poses interesting linguistic challenges to NLP due to its reliance on multiple word senses, cultural knowledge, and pragmatic competence. Humor appreciation is also a highly subjective phenomenon, with age, gender and socio-economic status known to affect the perception of a joke. For this task, we collected labels and ratings from a balanced set of age groups ranging from 18 to 70. Our annotators also represented a variety of genders, political stances and income levels.


We asked annotators:

  • Is the intention of this text to be humorous? (0 or 1)
  • [If it is intended to be humorous] How humorous do you find it? (1-5)


With the above questions, we obtain both a classification of whether the text belongs to the humorous genre and a humor score for it: we take the majority label assigned by annotators, and the average of the ratings. Notably, we also allowed annotators to label a text as intended to be humorous (e.g. due to its content or structure) but to answer "I don't get it" instead of giving a rating. In this case, the humor rating for that annotator is counted as 0.
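Purely as an illustration (the organizers' aggregation code is not part of this page), a minimal Python sketch of this aggregation, assuming each annotation arrives as a (label, rating) pair and that a missing rating encodes "I don't get it":

    from statistics import mean

    def aggregate(annotations):
        """annotations: list of (is_humor, rating) pairs, one per annotator.
        rating is None when the annotator answered "I don't get it"."""
        # Majority vote on the binary "intended to be humorous" label.
        is_humor = int(sum(h for h, _ in annotations) > len(annotations) / 2)
        # Average the ratings of annotators who saw humorous intent,
        # counting "I don't get it" as a rating of 0.
        ratings = [r if r is not None else 0 for h, r in annotations if h == 1]
        humor_rating = round(mean(ratings), 3) if ratings else None
        return is_humor, humor_rating

    print(aggregate([(1, 4), (1, None), (1, 3), (0, None)]))  # -> (1, 2.333)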


We represent the subjectivity of humor appreciation with a controversy score, based on the variance of the humor ratings for each text. If the variance of a text's ratings is higher than the median variance across all texts, we label the humor of the text as controversial. Predicting this label is a binary classification task.
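A hypothetical sketch of this labelling rule (not the official preprocessing code), assuming the per-text humor ratings are available as lists:

    import numpy as np

    def controversy_labels(ratings_per_text):
        """ratings_per_text: dict mapping text id -> list of humor ratings."""
        variances = {tid: np.var(r) for tid, r in ratings_per_text.items()}
        median_var = np.median(list(variances.values()))
        # A text is controversial if its rating variance exceeds the median variance.
        return {tid: int(v > median_var) for tid, v in variances.items()}

    print(controversy_labels({1: [1, 5, 1, 5], 2: [3, 3, 3, 4], 3: [2, 3, 2, 3]}))
    # -> {1: 1, 2: 0, 3: 0}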


This is also the first task to combine humor and offense detection. This stems from the observation that what is humorous to one user may be offensive to another. To explore this, we add a further layer of annotation by asking raters:


  • Is this text generally offensive? (0 or 1)
  • [If the rater considers the text to be generally offensive] How generally offensive is the text? (1-5)

By generally offensive, we mean that the text targets a person or group simply for belonging to a specific group, and we ask raters whether they think a significant number of people would find the text offensive. As we saw much more variety in the offensiveness ratings, we calculate an offensiveness score for each text: we average the ratings 1-5, treating the absence of a rating as 0.
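Again only as an illustration, and assuming the same (label, rating) representation as in the humor sketch above, the offense score could be computed like this:

    from statistics import mean

    def offense_score(annotations):
        """annotations: list of (is_offensive, rating) pairs; rating is None
        when the rater did not consider the text generally offensive."""
        # Raters who did not flag the text contribute a rating of 0.
        return round(mean(r if o == 1 and r is not None else 0 for o, r in annotations), 3)

    print(offense_score([(1, 4), (0, None), (1, 2), (0, None)]))  # -> 1.5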


Tasks


Task 1 emulates previous humor detection tasks in which all ratings were averaged to provide mean classification and rating scores. 

  • Task 1a: predict if the text would be considered humorous (for an average user). This is a binary task.
  • Task 1b: if the text is classed as humorous, predict how humorous it is (for an average user). The values vary between 0 and 5.
  • Task 1c: if the text is classed as humorous, predict if the humor rating would be considered controversial, i.e. the variance of the ratings across annotators is higher than the median variance. This is a binary task.


Task 2 aims to predict how offensive a text would be (for an average user) with values between 0 and 5. 


  • Task 2a: predict how generally offensive a text is for users. This score was calculated regardless of whether the text is classed as humorous or offensive overall. 
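For orientation, a hedged sketch of inspecting the four prediction targets with pandas, assuming the released training file is named train.csv and uses the same column names as the submission format below (the file name and layout are assumptions, not stated on this page):

    import pandas as pd

    # Assumed file name and columns; adjust to the actual release.
    train = pd.read_csv("train.csv")

    print(train["is_humor"].value_counts())              # Task 1a: binary humor label
    humorous = train[train["is_humor"] == 1]
    print(humorous["humor_rating"].describe())           # Task 1b: mean humor rating (0-5)
    print(humorous["humor_controversy"].value_counts())  # Task 1c: controversy label
    print(train["offense_rating"].describe())            # Task 2a: offense score (0-5)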


Evaluation criteria


The main metric for the classification tasks (1a and 1c) will be the F1-measure, and the metric for the regression tasks (1b and 2a) will be the root mean squared error (RMSE).
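The official scoring is performed by the organizers' evaluation script; the following is only a sanity-check sketch of the two metrics using scikit-learn, on toy values:

    from sklearn.metrics import f1_score, mean_squared_error

    # Toy gold labels and predictions, for illustration only.
    y_true_cls, y_pred_cls = [1, 0, 1, 1], [1, 0, 0, 1]
    y_true_reg, y_pred_reg = [2.4, 0.0, 3.1], [2.0, 0.5, 3.3]

    print("F1:", f1_score(y_true_cls, y_pred_cls))                     # classification tasks (1a, 1c)
    print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)  # regression tasks (1b, 2a)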


For all tasks, please submit a zipped CSV file with a row for each text and a column for each subtask you are participating in. The CSV file should be formatted like the following:


id,is_humor,humor_rating,humor_controversy,offense_rating
1,1,1.126,0,3.098
2,0,4.527,1,1.282
3,1,3.983,1,1.644


Your CSV file should always include the 'id' column, and can include one or more of the other columns corresponding to the different subtasks. The columns for the different subtasks are the following:


  • Task 1a: is_humor (binary classification, 0 or 1)
  • Task 1b: humor_rating (regression between 0 and 5)
  • Task 1c: humor_controversy (binary classification, 0 or 1)
  • Task 2a: offense_rating (regression between 0 and 5)

IMPORTANT: Note that, if you include the humor_rating or humor_controversy columns, you must provide a value for all rows (whether or not your system considers them humorous); the scoring system will only take into consideration the values for the rows that are humorous according to the gold standard.
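As a hedged sketch (the file names below are placeholders, not prescribed by the organizers), a submission archive in this format could be produced with pandas and the standard library:

    import zipfile
    import pandas as pd

    # Example predictions; include 'id' plus the columns for the subtasks you enter,
    # with a value in every row (see the note above).
    predictions = pd.DataFrame({
        "id": [1, 2, 3],
        "is_humor": [1, 0, 1],
        "humor_rating": [1.126, 4.527, 3.983],
        "humor_controversy": [0, 1, 1],
        "offense_rating": [3.098, 1.282, 1.644],
    })

    predictions.to_csv("submission.csv", index=False)
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("submission.csv")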


Terms

  • By submitting results to this competition, you consent to the public release of your scores at this website and at the SemEval 2021 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • This task has a single evaluation phase. To be considered a valid participation/submission in the task's evaluation, you agree to submit predictions, in the CSV format described above, for every test text.
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation phase has begun.
  • During the evaluation phase, each team can submit as many as ten submissions; the top-scoring submission will be considered as the official submission to the competition.
  • The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  • Each task participant will be assigned at least one other team's system description paper for review, using the START system. The papers will thus be peer reviewed.

Schedule

  • Trial data ready: July 31, 2020
  • Task website ready: August 14, 2020
  • Training data ready: October 1, 2020
  • Training and development data ready: October 31, 2020
  • Test data ready: December 3, 2020
  • Evaluation start: January 10, 2021
  • Evaluation end: January 31, 2021
  • Paper submission due: February 23, 2021
  • Notification to authors: March 29, 2021
  • Camera ready due: April 5, 2021
  • SemEval workshop: Summer 2021


Organizers





Development

Start: Oct. 1, 2020, midnight

Description: Development phase for all tasks.

Evaluation

Start: Jan. 10, 2021, midnight

Description: Evaluate your trained system on our test data.

Post-Evaluation

Start: Feb. 1, 2021, midnight

Description: Open Post-Evaluation phase that lasts forever.

Competition Ends

Never

Leaderboard

#  Username      Score
1  DeepBlueAI    0.9676
2  dalya         0.9675
3  ThisIstheEnd  0.9655