Moderation is crucial to promoting healthy online discussions. Although several toxicity (abusive language) detection datasets and models have been released, most of them classify whole comments or documents and do not identify the spans that make a text toxic. Highlighting such toxic spans can assist human moderators (e.g., moderators of news portals) who often deal with lengthy comments and who prefer attribution over an unexplained, system-generated toxicity score per post. Evaluating systems that can accurately locate toxic spans within a text is thus a crucial step towards successful semi-automated moderation.
The Shared Task
For a complete submission to the Shared Task, systems have to extract a list of toxic spans, or an empty list, per text. We define a toxic span as a sequence of words that contributes to the text's toxicity. Consider, for example, the following text:

"This is a stupid example, so thank you for nothing a!@#!@."

It comprises two toxic spans, "stupid" and "a!@#!@", which have character offsets from 10 to 15 (counting starts from 0) and from 51 to 56, respectively. Systems are then expected to return the following list of character offsets for this text:

[10, 11, 12, 13, 14, 15, 51, 52, 53, 54, 55, 56]
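As a quick sanity check, the offsets above can be reproduced with a few lines of Python (a minimal sketch; the variable names are ours and are not part of any official tooling):

    # Reproduce the character offsets of the two toxic spans in the example text.
    text = "This is a stupid example, so thank you for nothing a!@#!@."
    offsets = []
    for span in ["stupid", "a!@#!@"]:
        start = text.find(span)                      # first occurrence of the span
        offsets.extend(range(start, start + len(span)))
    print(offsets)  # [10, 11, 12, 13, 14, 15, 51, 52, 53, 54, 55, 56]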
To evaluate the responses of a system participating in the challenge, we employ the F1 score, as in [1]. Let system $A_i$ return a set $S^t_{A_i}$ of character offsets for the parts of post $t$ it found to be toxic, and let $S^t_G$ be the set of character offsets of the ground truth annotations of $t$. We compute the F1 score of system $A_i$ with respect to the ground truth $G$ for post $t$ as follows, where $|\cdot|$ denotes set cardinality:

$$P^t(A_i, G) = \frac{|S^t_{A_i} \cap S^t_G|}{|S^t_{A_i}|}, \qquad R^t(A_i, G) = \frac{|S^t_{A_i} \cap S^t_G|}{|S^t_G|}, \qquad F_1^t(A_i, G) = \frac{2 \cdot P^t(A_i, G) \cdot R^t(A_i, G)}{P^t(A_i, G) + R^t(A_i, G)}.$$
If $S^t_G$ is empty for some post $t$ (no gold spans are given for $t$), we set $F_1^t(A_i, G) = 1$ if $S^t_{A_i}$ is also empty, and $F_1^t(A_i, G) = 0$ otherwise. We finally average $F_1^t(A_i, G)$ over all posts $t$ of an evaluation dataset $T$ to obtain a single score for system $A_i$.
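The metric can be computed in a few lines of Python; the following is a minimal sketch of the definition above (function and variable names are ours, not part of the official scorer):

    from statistics import mean

    def f1_per_post(predicted_offsets, gold_offsets):
        """Character-offset F1 for a single post, as defined above."""
        pred, gold = set(predicted_offsets), set(gold_offsets)
        if not gold:                  # no gold spans for this post
            return 1.0 if not pred else 0.0
        if not pred:                  # gold spans exist but nothing was predicted
            return 0.0
        overlap = len(pred & gold)
        precision = overlap / len(pred)
        recall = overlap / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def system_score(predictions, gold):
        """Average the per-post F1 over all posts of an evaluation dataset."""
        return mean(f1_per_post(p, g) for p, g in zip(predictions, gold))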
References
[1] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, and P. Nakov. 2019. Fine-grained analysis of propaganda in news article. In EMNLP-IJCNLP, pages 5640–5650.
Overview
We used posts (comments) from the publicly available Civil Comments dataset, which already comprises post-level toxicity annotations, i.e., annotations indicating which (entire) posts are toxic, rather than annotations of particular toxic spans within toxic posts. We retained only posts that had been found toxic (or severely toxic) by at least half of the crowd-raters in the annotation of Borkan et al. [1]. This left approximately 30k toxic posts, out of a total of 1.2M posts in the original dataset.
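As an illustration of this filtering step, the following is a sketch only: the filename is a placeholder, and the column names follow the public Civil Comments release, so adjust them if your copy differs.

    import pandas as pd

    # Placeholder filename; point this at your copy of the Civil Comments data.
    civil = pd.read_csv("civil_comments.csv")

    # "toxicity" and "severe_toxicity" hold the fraction of crowd-raters who
    # marked each post as (severely) toxic; keep posts flagged by at least half.
    toxic_posts = civil[(civil["toxicity"] >= 0.5) | (civil["severe_toxicity"] >= 0.5)]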
We selected a random 10k subset of the 30k posts for toxic span annotation. We used a crowd-annotation platform and employed three crowd-raters per post, all of whom were warned about explicit content (only coders who allowed adult content were selected). Coders were selected from the smallest group of the most experienced and accurate contributors (i.e., "highest quality"). The annotators were given the following instructions: "Extract the toxic word sequences (spans) of the comment, by highlighting each such span and then clicking the right button. If the comment is not toxic or if the whole comment should be annotated, check the appropriate box and do not highlight any span."
Note that we do not claim it is possible to annotate toxic spans in all toxic posts. For example, in some toxic posts the core message being conveyed may be inherently toxic (e.g., a sarcastic post indirectly claiming that people of a particular origin are inferior) and, hence, it may be difficult to attribute the toxicity of those posts to particular spans. In such cases, the corresponding posts may have no toxic span annotations (see, for example, the fourth post of the following table).
Inter-annotator agreement
In an initial experiment, we employed five crowd-raters per post for a sample of 35 posts, in order to measure inter-annotator agreement. We computed the mean pairwise Cohen's Kappa per post (treating character offsets as instances classified into two classes, toxic and non-toxic) and averaged over the 35 posts, which yielded a Kappa score of 0.61.
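One way to compute this statistic is sketched below; the helper is ours, and it assumes scikit-learn's cohen_kappa_score and one set of toxic character offsets per rater.

    from itertools import combinations
    from statistics import mean
    from sklearn.metrics import cohen_kappa_score

    def mean_pairwise_kappa(text, rater_offsets):
        """Mean pairwise Cohen's Kappa for one post.

        rater_offsets: one set of toxic character offsets per rater. Every
        character position of the post is an instance labelled 1 (toxic) or 0.
        """
        labels = [[1 if i in offsets else 0 for i in range(len(text))]
                  for offsets in rater_offsets]
        return mean(cohen_kappa_score(a, b) for a, b in combinations(labels, 2))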
Ground truth
To obtain the ground truth of our dataset, we used the following process: for each post t, first we mapped each annotated span to its character offsets. Then we merged the annotated spans of each rater per post, to obtain a single set of character offsets per rater and post. We assigned a toxicity score to each character offset of t, computed as the fraction of raters who annotated that character offset as toxic (included it in their toxic spans). We then retained only character offsets with toxicity scores higher than 50%; i.e., at least two raters must have included each character offset in their toxic spans for the offset to be included in the ground truth.
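A minimal sketch of this aggregation, assuming each rater's annotations are given as (start, end) spans with inclusive ends (the function name and input format are ours):

    from collections import Counter

    def gold_offsets(per_rater_spans):
        """Aggregate per-rater toxic spans into ground-truth character offsets.

        per_rater_spans: one list of (start, end) spans per rater, ends inclusive.
        """
        counts = Counter()
        for spans in per_rater_spans:
            # Merge each rater's spans into a single set of offsets first,
            # so overlapping spans from the same rater are only counted once.
            offsets = set()
            for start, end in spans:
                offsets.update(range(start, end + 1))
            counts.update(offsets)
        # Keep offsets marked toxic by more than half of the raters
        # (with three raters, this means at least two of them).
        threshold = len(per_rater_spans) / 2
        return sorted(o for o, c in counts.items() if c > threshold)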
Trial dataset
A trial dataset of 690 texts has been released. Note that some texts do not include any annotations, while others include one or more toxic spans. Please find the respective CSV file in the "Participate" tab. We suggest using pandas for data loading/processing and ast.literal_eval for restoring the span lists after loading. Some useful lines of code follow:
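The snippet below is a minimal loading sketch: the filename is a placeholder for the CSV in the "Participate" tab, and we assume the annotated offsets are stored in a column named "spans".

    import ast
    import pandas as pd

    # Placeholder filename; use the CSV provided in the "Participate" tab.
    trial = pd.read_csv("tsd_trial.csv")

    # Each entry of the "spans" column is the string representation of a Python
    # list of toxic character offsets; restore it to an actual list.
    trial["spans"] = trial["spans"].apply(ast.literal_eval)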
Training dataset
The CSV file for the training set is available on GitHub and it can also be found here (Phase #2).
References
[1] D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, pages 491–500, San Francisco, USA.
Schedule
Trial data released: July 31, 2020, midnight.
Training data released: Oct. 1, 2020, midnight.
Evaluation starts: Jan. 10, 2021, 11 p.m.
End: Jan. 31, 2021, 2 p.m.