SemEval 2021 Task 5: Toxic Spans Detection

Organized by ipavlopoulos

Moderation is crucial to promoting healthy online discussions. Although several toxicity (abusive language) detection datasets and models have been released, most of them classify whole comments or documents and do not identify the spans that make a text toxic. Highlighting such toxic spans can assist human moderators (e.g., news portal moderators) who often deal with lengthy comments, and who prefer attribution over an unexplained, system-generated toxicity score per post. The evaluation of systems that can accurately locate toxic spans within a text is thus a crucial step towards successful semi-automated moderation.

The Shared Task

As a complete submission for the Shared Task, systems will have to extract a list of toxic spans, or an empty list, per text. We define a toxic span as a sequence of words that contributes to the text's toxicity. Consider, for example, the following text:

  • "This is a stupid example, so thank you for nothing a!@#!@."

It comprises two toxic spans, "stupid" and "a!@#!@", which have character offsets from 10 to 15 (counting starts from 0) and from 51 to 56 respectively. Systems are then expected to return the following list for this text:

  • [10,11,12,13,14,15,51,52,53,54,55,56]



To evaluate the responses of a system participating in the challenge, we employ the F1 score, as in [1]. Let system A_i return a set S_t^{A_i} of character offsets for the parts of post t found to be toxic, and let S_t^G be the set of character offsets of the ground truth annotations of t. We compute the F1 score of system A_i with respect to the ground truth G for post t as the harmonic mean of precision and recall, where |·| denotes set cardinality:

P_t(A_i, G) = |S_t^{A_i} ∩ S_t^G| / |S_t^{A_i}|
R_t(A_i, G) = |S_t^{A_i} ∩ S_t^G| / |S_t^G|
F1_t(A_i, G) = 2 · P_t(A_i, G) · R_t(A_i, G) / (P_t(A_i, G) + R_t(A_i, G))

If S_t^G is empty for some post t (no gold spans are given for t), we set F1_t(A_i, G) = 1 if S_t^{A_i} is also empty, and F1_t(A_i, G) = 0 otherwise. We finally average F1_t(A_i, G) over all the posts t of an evaluation dataset T to obtain a single score for system A_i.
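The per-post evaluation described above can be sketched as follows; this is our illustration of the set-based F1, not the official scorer, and the function names are ours:

```python
def f1_per_post(predicted, gold):
    """Set-based F1 between predicted and gold character offsets for one post."""
    pred, gold = set(predicted), set(gold)
    if not gold:
        # No gold spans: a system is correct only if it also predicts none.
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def system_score(predictions, golds):
    """Average the per-post F1 over all posts of an evaluation dataset."""
    return sum(f1_per_post(p, g) for p, g in zip(predictions, golds)) / len(golds)
```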



[1] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, and P. Nakov. 2019. Fine-grained analysis of propaganda in news articles. In EMNLP-IJCNLP, pages 5640–5650.


We used posts (comments) from the publicly available Civil Comments dataset, which already comprises post-level toxicity annotations, i.e., annotations indicating which (entire) posts are toxic, instead of annotations of particular toxic spans in toxic posts. We retained only posts that had been found toxic (or severely toxic) by at least half of the crowd-raters in Borkan et al.'s annotation [1]. This left approximately 30k toxic posts, out of a total of 1.2M posts in the original dataset.


We selected a random 10k subset of the 30k posts for toxic spans annotation. We used a crowd-annotation platform and employed three crowd-raters per post, all of whom were warned about explicit content (only coders who allowed adult content were selected). Coders were selected from the smallest group of the most experienced and accurate contributors (i.e., "highest quality"). The annotators were given the following instructions: "Extract the toxic word sequences (spans) of the comment, by highlighting each such span and then clicking the right button. If the comment is not toxic or if the whole comment should be annotated, check the appropriate box and do not highlight any span."


Note that we do not claim it is possible to annotate toxic spans in all toxic posts. For example, in some toxic posts the core message being conveyed may be inherently toxic (e.g., a sarcastic post indirectly claiming that people of a particular origin are inferior) and, hence, it may be difficult to attribute the toxicity of those posts to particular spans. In such cases, the corresponding posts may have no toxic span annotations (see, for example, the fourth post of the following table).

Table 1. Examples of toxic posts and their ground truth toxic spans (also shown in bold red). In the left column, toxic spans are shown as lists of character offsets. No toxic spans are included in the ground truth of the last post.


Inter-annotator agreement

In an initial experiment, we employed five crowd raters per post for a sample of 35 posts, to measure inter-annotator agreement. We computed the mean pairwise Cohen's Kappa per post (using character offsets as instances being classified in two classes, toxic and non-toxic) and averaged over the 35 posts, which yielded a Kappa score of 0.61.
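The agreement computation above can be sketched as follows: each character offset of a post is an instance labelled toxic (1) or non-toxic (0) by each rater, Cohen's Kappa is computed for every pair of raters, and the pairwise values are averaged. This is an illustrative reconstruction, not the organizers' script; the function names are ours:

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's Kappa for two binary label sequences of equal length."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                   # per-rater rate of label 1
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)              # chance agreement
    if pe == 1.0:
        return 1.0  # both raters use a single class; agreement is trivial
    return (po - pe) / (1 - pe)

def mean_pairwise_kappa(rater_spans, text_length):
    """Mean pairwise Kappa for one post; rater_spans is one offset set per rater."""
    labels = [[1 if i in spans else 0 for i in range(text_length)]
              for spans in rater_spans]
    pairs = list(combinations(labels, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Averaging `mean_pairwise_kappa` over all 35 posts of the sample yields the reported agreement figure.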


Ground truth

To obtain the ground truth of our dataset, we used the following process: for each post t, first we mapped each annotated span to its character offsets. Then we merged the annotated spans of each rater per post, to obtain a single set of character offsets per rater and post. We assigned a toxicity score to each character offset of t, computed as the fraction of raters who annotated that character offset as toxic (included it in their toxic spans). We then retained only character offsets with toxicity scores higher than 50%; i.e., at least two raters must have included each character offset in their toxic spans for the offset to be included in the ground truth. 
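The construction above (per-character toxicity scores from the raters' merged spans, thresholded at 50%) can be sketched as follows; the function name is ours, used only for illustration:

```python
from collections import Counter

def ground_truth(rater_spans):
    """rater_spans: one set of character offsets per rater (already merged
    per rater). Returns the sorted offsets retained in the ground truth."""
    n_raters = len(rater_spans)
    # Toxicity score of an offset = fraction of raters who included it.
    votes = Counter(offset for spans in rater_spans for offset in spans)
    # Keep only offsets with toxicity score strictly above 50%.
    return sorted(o for o, v in votes.items() if v / n_raters > 0.5)
```

With three raters, an offset survives only if at least two of them included it, matching the description above.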


Trial dataset

A trial dataset of 690 texts has been released. Note that some texts do not include any annotations, while others include one or more toxic spans. Please find the respective CSV file in the "Participate" tab. We suggest using pandas for data loading/processing and ast.literal_eval for restoring the span lists after loading. Some useful lines of code follow:

>>> import pandas as pd
>>> from ast import literal_eval
>>> trial = pd.read_csv("tsd_trial.csv")              # load the trial CSV
>>> trial["spans"] = trial.spans.apply(literal_eval)  # strings -> offset lists
>>> trial.head(2)


Training dataset

The CSV file for the training set is available on GitHub and it can also be found here (Phase #2). 



[1] D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, pages 491–500, San Francisco, USA.

Terms and Conditions

  • By submitting results to this competition, you consent to the public release of your scores at this website and at the SemEval 2021 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • This task has a single evaluation phase. To be considered a valid participation/submission in the task's evaluation, you agree to submit a single (possibly empty) list of character offsets (as in the task overview) per test text (post), for every test text. 
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation phase has begun.
  • During the evaluation phase, each team may make up to ten submissions; the top-scoring one will be considered the team's official submission to the competition.
  • The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  • Each task participant will be assigned at least one other team's system description paper for review, using the START system. The papers will thus be peer reviewed.
  • Our datasets are released under CC0, similarly to the underlying comment texts.


Start: July 31, 2020, midnight

Description: Trial data released.


Start: Oct. 1, 2020, midnight

Description: Training data released.


Start: Jan. 10, 2021, 11 p.m.

Description: Evaluation starts.

Competition Ends

Jan. 31, 2021, 2 p.m.
