HaHackathon: Detecting and Rating Humor and Offense

Organized by jam



Task 7: Hahackathon: Linking Humor and Offense Across Different Age Groups

Join our mailing list: hahackathon@googlegroups.com 

Background and Motivation

Humor, like most figurative language, poses interesting linguistic challenges to NLP, due to its emphasis on multiple word senses, cultural knowledge, and pragmatic competence. Humor appreciation is also highly subjective: age, gender and socio-economic status are known to affect how a joke is perceived. In this task, we collected labels and ratings from a balanced set of age groups spanning ages 18 to 70. Our annotators also represented a variety of genders, political stances and income levels.

We asked annotators:

  • Is the intention of this text to be humorous? (0 or 1)
  • [If it is intended to be humorous] How humorous do you find it? (1-5)

With the above questions, we classify the genre of the text (humorous or not) and assign it a humor score. We take the majority label assigned by annotators, and the average of the ratings. Notably, we also allowed annotators to label a text as intended to be humorous (e.g. due to its content or structure) while giving "I don't get it" as a rating. In this case, the humor rating for this annotator is 0.

We represent the subjectivity of humor appreciation with a controversy score. This examines the variance in humor ratings for each text. If the variance of a text was higher than the median variance of all texts, we labelled the humor of the text as controversial. Prediction of this value is a binary classification task. 
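As a rough illustration of the aggregation described above, the sketch below derives a majority label, a mean rating and a controversy flag from per-annotator votes. The function name, the input layout, the use of population variance, the treatment of ties as "not humorous", and the restriction of the median to texts with at least one rating are all assumptions made for this example; they are not the organizers' documented procedure.

```python
from statistics import mean, median, pvariance


def aggregate_humor(annotations):
    """Aggregate per-annotator votes into per-text labels (hypothetical helper).

    annotations maps a text id to a list of (is_humor, rating) pairs,
    one per annotator. rating is None when the annotator marked the text
    as not humorous, and 0 when they answered "I don't get it".
    """
    per_text = {}
    for text_id, votes in annotations.items():
        labels = [is_humor for is_humor, _ in votes]
        ratings = [r for is_humor, r in votes if is_humor == 1]
        per_text[text_id] = {
            # Majority label; a tie counts as not humorous (assumption).
            "is_humor": int(sum(labels) * 2 > len(labels)),
            "humor_rating": mean(ratings) if ratings else None,
            # Population variance is an assumption; sample variance is equally plausible.
            "variance": pvariance(ratings) if ratings else None,
        }

    variances = [t["variance"] for t in per_text.values() if t["variance"] is not None]
    threshold = median(variances) if variances else None

    for t in per_text.values():
        if t["variance"] is None or threshold is None:
            t["humor_controversy"] = None
        else:
            # Controversial when the variance exceeds the median variance.
            t["humor_controversy"] = int(t["variance"] > threshold)
    return per_text


example = {
    1: [(1, 5), (1, 1), (1, 3)],
    2: [(1, 3), (1, 3), (0, None)],
    3: [(0, None), (0, None), (1, 2)],
}
result = aggregate_humor(example)
```

In this toy input, text 1 has widely spread ratings and is the only one flagged as controversial, while text 3 is labelled not humorous because only one of three annotators marked it as humorous.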

This is also the first task to combine humor and offense detection. This stems from the observation that what is humorous to one user may be offensive to another. To explore this, we add a further layer of annotation by asking raters:

  • Is this text generally offensive? (0 or 1)
  • [If the rater considers the text to be generally offensive] How generally offensive is the text? (1-5)

By generally offensive, we mean that the text targets a person or group simply for belonging to a specific group, and we ask users if they think that a significant number of people would find this offensive. As we saw much more variety in the offensiveness ratings, we calculate an offensiveness score for each text. In this case, we consider the ratings 1-5, and also consider a "no" answer (not offensive) to be a rating of 0.
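A minimal sketch of the offensiveness score, assuming it is the mean over all annotators with a "not offensive" answer counted as 0. The function name and the use of None for a "no" answer are hypothetical choices for this example.

```python
def offense_rating(ratings):
    """Mean offensiveness over all annotators (hypothetical helper).

    ratings holds one entry per annotator: an int from 1 to 5, or None
    when the annotator said the text is not generally offensive.
    """
    if not ratings:
        raise ValueError("at least one annotation is required")
    # A "no" answer contributes 0 to the average.
    return sum(0 if r is None else r for r in ratings) / len(ratings)


score = offense_rating([None, None, 4, 2])
```

Unlike the humor rating, this average includes every annotator, so a text that only a few raters find offensive still receives a small non-zero score.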


Task 1 emulates previous humor detection tasks in which all ratings were averaged to provide mean classification and rating scores. 

  • Task 1a: predict if the text would be considered humorous (for an average user). This is a binary task.
  • Task 1b: if the text is classed as humorous, predict how humorous it is (for an average user). The values vary between 0 and 5.
  • Task 1c: if the text is classed as humorous, predict if the humor rating would be considered controversial, i.e. the variance of the rating between annotators is higher than the median. This is a binary task.

Task 2 aims to predict how offensive a text would be (for an average user) with values between 0 and 5. 

  • Task 2a: predict how generally offensive a text is for users. This score was calculated regardless of whether the text is classed as humorous or offensive overall. 

Evaluation criteria

The main metric for the classification tasks will be the F1-measure, and the metric for the regression tasks will be root mean squared error (RMSE).
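For reference, the two metrics can be sketched in plain Python as follows. This assumes F1 is computed for the positive class (label 1) of a binary task; the official scorer may average differently, and the function names are chosen only for this example.

```python
import math


def f1_score(gold, pred, positive=1):
    """F1 for the positive class of a binary task (assumed variant)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(gold, pred):
    """Root mean squared error between two equal-length sequences."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


f1 = f1_score([1, 0, 1, 1], [1, 1, 0, 1])
err = rmse([1.0, 3.0], [2.0, 5.0])
```

Lower RMSE is better, while higher F1 is better, so the two kinds of subtask are ranked in opposite directions.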

For all tasks, please submit a zipped csv file with a row for each text and a column for each task you are participating in. The csv file format should be like the following:

id  is_humor  humor_rating  humor_controversy  offense_rating
1   1         1.126         0                  3.098
2   0         4.527         1                  1.282
3   1         3.983         1                  1.644

Your csv file should always include the 'id' column, and can include one or more of the other columns corresponding to the different subtasks. The columns for the different tasks are the following:

  • Task 1a: is_humor (binary classification 0-1)
  • Task 1b: humor_rating (regression between 0 and 5)
  • Task 1c: humor_controversy (binary classification 0-1)
  • Task 2: offense_rating (regression between 0 and 5)

IMPORTANT: If you include the humor_rating or humor_controversy columns, you must provide a value for all rows (whether your system considers them humorous or not). The system will only take into consideration the values for the rows that are humorous according to the gold standard.
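One possible way to produce such a file is sketched below. The column names come from the list above; the comma delimiter, the presence of a header row, the file name predictions.csv and the helper name are assumptions for this example, so check them against the official submission instructions.

```python
import csv
import io
import zipfile

COLUMNS = ["id", "is_humor", "humor_rating", "humor_controversy", "offense_rating"]


def build_submission_zip(rows, columns=COLUMNS, csv_name="predictions.csv"):
    """Return the bytes of a zip archive holding one CSV file (hypothetical helper).

    rows is a list of dicts keyed by column name. Every selected column
    must have a value in every row, including humor_rating and
    humor_controversy for texts predicted as not humorous.
    """
    if "id" not in columns:
        raise ValueError("the 'id' column is always required")

    text_buffer = io.StringIO()
    writer = csv.DictWriter(text_buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        missing = [c for c in columns if row.get(c) is None]
        if missing:
            raise ValueError(f"row {row.get('id')} is missing {missing}")
        writer.writerow(row)

    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, text_buffer.getvalue())
    return zip_buffer.getvalue()


predictions = [
    {"id": 1, "is_humor": 1, "humor_rating": 1.126, "humor_controversy": 0, "offense_rating": 3.098},
    {"id": 2, "is_humor": 0, "humor_rating": 4.527, "humor_controversy": 1, "offense_rating": 1.282},
]
archive_bytes = build_submission_zip(predictions)
```

To take part in only some subtasks, pass a shorter column list such as ["id", "is_humor"]; the resulting bytes can then be written to a .zip file for upload.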


  • By submitting results to this competition, you consent to the public release of your scores at this website and at the SemEval 2021 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • This task has a single evaluation phase. To be considered a valid participation/submission in the task's evaluation, you agree to submit a single (possibly empty) list of character offsets (as in the task overview) per test text (post), for every test text. 
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation phase has begun.
  • During the evaluation phase, each team can submit as many as ten submissions; the top-scoring submission will be considered as the official submission to the competition.
  • The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  • Each task participant will be assigned at least one other team's system description paper for review, using the START system. The papers will thus be peer reviewed.


  • Trial data ready: July 31, 2020
  • Task website ready: August 14, 2020
  • Training data ready: October 1, 2020
  • Training and development data ready: October 31, 2020
  • Test data ready: December 3, 2020
  • Evaluation start: January 10, 2021
  • Evaluation end: January 31, 2021
  • Paper submission due: February 23, 2021
  • Notification to authors: March 29, 2021
  • Camera ready due: April 5, 2021
  • SemEval workshop: Summer 2021



Start: Oct. 1, 2020, midnight

Description: Development phase for all tasks.


Start: Jan. 10, 2021, midnight

Description: Evaluate your trained system on our test data.


Start: Feb. 1, 2021, midnight

Description: Open Post-Evaluation phase that lasts forever.


