Detecting Toxic Spans with spaCy
Introduction
An expression is toxic if it uses rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. Toxic language can be short like “idiot” or longer like “your ‘posts’ are as usual ignorant of reality”.
We will use the SpanCategorizer from spaCy to detect toxic spans. For illustration we will use a well-researched dataset. The present article focuses on the spaCy configuration and usage. Part two, “Custom evaluation of spans in spaCy”, focuses on integrating the metric calculations presented here as a scorer in the spaCy pipeline.
You can find the full code for this article tagged as release 1.0.0 at: https://github.com/ceesroele/toxic_spans
To download this release:
git clone https://github.com/ceesroele/toxic_spans.git --branch v1.0.0
Categorizing spans
A span is a fragment of a text which we may label as belonging to some category. A text can consist of multiple spans, which may have different labels. Span categorization is similar to Named Entity Recognition (NER). However, unlike entities in NER, spans may overlap: a word may be part of two different spans at the same time.
spaCy has contained a SpanCategorizer since version 3.1. This is an element in the spaCy pipeline, named spancat for short, that can detect spans for which it has been trained.
Toxic spans dataset
We use the dataset from task 5 of SemEval-2021. You can find a description of the task and a discussion of the outcomes for the different participants in “SemEval-2021 Task 5: Toxic Spans Detection” (2021) by John Pavlopoulos et al¹. The dataset and baseline code are available on GitHub.
The dataset contains some ten thousand posts with annotations for toxic spans. Each annotation is a list of the character indexes of the post that are part of a toxic span.
Example from the CSV file containing the data:
“[11, 12, 13, 14, 16, 17, 18, 19, 20]”,you’re one sick puppy.
The character indexes 11 to 20 refer to “sick puppy”; index 15, the space between the two words, is not included.
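A quick way to verify this (a small illustrative snippet, not part of the project code):

```python
text = "you're one sick puppy."
toxic_offsets = [11, 12, 13, 14, 16, 17, 18, 19, 20]

# The space at offset 15 is not annotated, so the join yields "sickpuppy"
print("".join(text[i] for i in toxic_offsets))
```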
Generally, about the dataset:
- Many of the posts contain one span
- The vast majority of the spans consist of a single word, e.g. “idiot”.
Note that there is only one category (“toxic”) and the identification of spans is per character. So even though it would theoretically be possible to have two partially overlapping toxic spans, in practice this does not occur in our dataset. One consequence is that it is possible to treat the problem as an NER problem rather than as a span problem. In fact, this is how detecting toxic spans was implemented in a spaCy NER example script provided with the dataset.
Project
Our goal is to create a spaCy pipe that detects toxic spans in a text. To do this, we must train a model using our dataset. Once the model is trained we want to evaluate it. If it is to our satisfaction, we will deploy our model.
Here are our steps:
- Download the CSV data files (already divided into train, dev, and test sets)
- Convert the data into a format that can be used for training
- Train
- Evaluate
- Deploy the resulting model
spaCy provides standard functionality for training and evaluating models. This can be invoked from Python code, but also through a project. We will go the latter way and define a spaCy project for our steps. As a basis we take the project.yml from the experimental NER-spancat project, which provides NER for the Indonesian language using SpanCategorizer functionality.
Download data files
In our project file we define the location of our assets, that is, the data we need. Here we take all files from a path in a repository on GitHub and place them in the destination subdirectory “assets”.
Let’s download the assets:
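spaCy fetches the assets defined in project.yml with its built-in assets command, run from the project root:

```
python -m spacy project assets
```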
We see that the “1 asset(s)” consists of three CSV files. These are the files that were in the path SemEval2021/data in the repository defined above.
Convert
Now that we have our data files, we need to convert the data to a format suitable for training. Here we turn the lists of indexes and the text strings in the rows of our three CSV files into Doc objects with defined spans. These Doc objects we bundle into a single DocBin object, which we save to a file in spaCy’s custom binary .spacy format.
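The actual conversion code is in the repository; as a rough sketch (assuming each CSV file has a header row and two columns, spans and text, and using the custom spans_key txs introduced later in this article), it could look like this:

```python
import ast
import csv

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")


def offsets_to_ranges(offsets):
    """Turn a list of character offsets into contiguous (start, end) ranges."""
    ranges = []
    for i in sorted(offsets):
        if ranges and i == ranges[-1][1]:
            ranges[-1][1] = i + 1      # extend the current range
        else:
            ranges.append([i, i + 1])  # start a new range
    return [(start, end) for start, end in ranges]


def csv_to_docbin(csv_path, output_path, spans_key="txs"):
    """Convert one CSV file of (spans, text) rows into a binary .spacy file."""
    doc_bin = DocBin()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for spans_field, text in reader:
            doc = nlp.make_doc(text)
            spans = []
            for start, end in offsets_to_ranges(ast.literal_eval(spans_field)):
                # alignment_mode="expand" snaps character offsets to token boundaries
                span = doc.char_span(start, end, label="toxic", alignment_mode="expand")
                if span is not None:
                    spans.append(span)
            doc.spans[spans_key] = spans
            doc_bin.add(doc)
    doc_bin.to_disk(output_path)
```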
We call these methods from a file named make_corpus.py. Note that we rename the original data files to train, dev, and eval.
In our project definition we define a command for calling the above make_corpus.py.
We can now create the corpus of .spacy files from the CSV assets by running that command.
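Assuming the command in project.yml is named corpus, that is:

```
python -m spacy project run corpus
```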
Train
We now have train and dev datasets in a format we can process. In order to actually train a model to recognise toxic spans using this data, we first need to configure such a model and set some parameters for our training process.
spaCy provides an easy way to define such a configuration through a quickstart widget on the webpage documenting the training process.
The only change we make to the default settings in the widget above is that we check spancat to enable the SpanCategorizer pipe. Note that selecting accuracy for “optimize for” would lead to a large language model being used.
The configuration displayed in the widget is only part of the entire generated file.
We copy the contents of the configured widget and paste it into a file configs/base_config.cfg. We then fill it in with all default values and save the result as configs/config.cfg.
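spaCy’s built-in init fill-config command does this:

```
python -m spacy init fill-config configs/base_config.cfg configs/config.cfg
```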
Note on the next step: the default spans_key for SpanCategorizer is sc. You can use this default, which will spare you some of the extra steps I take below. However, I think it is good practice to let each application that adds spans, here toxic spans, set its own spans_key, as it allows you to mix spans from different sources. For example, spaCy’s SpanRuler uses spans_key=ruler by default.
We make a few modifications in our new configs/config.cfg:
- We set our spans_key to txs so as not to use the default sc value.
- We set the ngram sizes for the Suggester to everything from 1 to 8. (The Suggester is beyond the scope of this article; for more information see the spaCy SpanCategorizer documentation.)
- We define the columns for reporting scores during training by excluding those for the default sc spans_key and including those for our newly defined txs spans_key.
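A sketch of the relevant fragments of configs/config.cfg after these changes (the exact contents of your generated file may differ):

```ini
[components.spancat]
factory = "spancat"
spans_key = "txs"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3, 4, 5, 6, 7, 8]

[training.score_weights]
spans_txs_f = 1.0
spans_txs_p = 0.0
spans_txs_r = 0.0
```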
In our project file we define how to run our training. Most importantly, we refer to our config file (configs/${vars.config}.cfg, where vars.config was defined as “config”), use our created corpus/train.spacy for training, and corpus/dev.spacy for the development set.
Now, let’s start the training process and follow the results as the training is executed.
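Assuming the training command in project.yml is named train, that is:

```
python -m spacy project run train
```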
We see that the highest score, which equals spans_txs_f, is 0.57. The model that generated the best score is saved in ./training/model-best.
Evaluate
As we already saw during training, the standard metrics used for evaluating spans are precision, recall, and F1.
F1 combines precision and recall into a single measure that captures both properties.
Precision = TruePositives / (TruePositives + FalsePositives)
Recall = TruePositives / (TruePositives + FalseNegatives)
F1 = (2 * Precision * Recall) / (Precision + Recall)
By default, the scorer for the spaCy SpanCategorizer calculates these metrics over entire tokens. As we want to compare our outcome with the published scores of participants in the SemEval-2021 task, where a character-based score was used, we also want to generate a character-based F1 score for our evaluation. This means that in “Hello idiot world!” we count idiot not as 1 TruePositive, but as 5 TruePositives, that is, one for each character of the token idiot. There is a slight numerical difference between these two methods of calculation.
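As a small sketch (mirroring the formulas above, not the exact SemEval evaluation script), character-based scores can be computed from sets of character offsets:

```python
def char_f1(predicted, gold):
    """Character-based F1: predicted and gold are sets of character offsets."""
    if not predicted and not gold:
        return 1.0  # nothing predicted and nothing annotated counts as a perfect match
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "Hello idiot world!": the gold span "idiot" covers offsets 6..10,
# a prediction of "idiot world" covers offsets 6..16
gold = set(range(6, 11))
predicted = set(range(6, 17))
print(char_f1(predicted, gold))  # 0.625
```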
In our project file we define how to run our evaluation. Most importantly, we use our created corpus/eval.spacy to provide the data for evaluating the model, use training/model-best as the model, and invoke scripts for both the standard token-based and the custom character-based scores.
Let’s start the evaluation:
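Assuming the evaluation command in project.yml is named evaluate:

```
python -m spacy project run evaluate
```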
At the bottom we see our character-based F1 score: F1 = 63.01, which is about one percentage point lower than the token-based F1 score.
How does this compare to the results of the participants in SemEval-2021 Task 5: Toxic Spans Detection? Let’s see:
We see that our outcome trails at the bottom of the ranking, but not below it. Interpretation of the outcome will be done in a separate article; for now it suffices to say that we created the functionality without any optimization.
Deploy the resulting model
For illustration purposes it suffices to load the best model from the directory where the training process saved it, training/model-best.
The script below uses the trained model and displays predicted spans using displaCy.
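A minimal version of such a script (assuming the custom spans_key txs defined earlier):

```python
import spacy
from spacy import displacy

# Load the best model saved by the training run
nlp = spacy.load("training/model-best")

doc = nlp("you're one sick puppy.")

# Render the predicted spans stored under our custom spans_key "txs"
displacy.render(doc, style="span", options={"spans_key": "txs"})
```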
Running the above script in a Jupyter notebook we get:
Good, this works!
Conclusion
We implemented toxic span detection using spaCy and came to a usable but non-optimized result. Although our dataset covers only the simplest case, non-overlapping spans with a single category, our steps illustrate how to train and use the spaCy SpanCategorizer.
Interpretation of the results was not part of the scope of the article. It will be covered in a follow-up article.
References
[1] “SemEval-2021 Task 5: Toxic Spans Detection” (2021) by John Pavlopoulos et al.
Update history
8 May 2022: linked to part 2: “Custom evaluation of spans in spaCy”
25 Aug 2022:
1. Added a link to clone release v1.0.0 at the top of the article.
2. Explained why I use a custom spans_key, rather than SpanCategorizer’s default sc.
6 Sept 2022:
Displaying spans now uses displaCy. Removed reference to custom-code for displaying spans.