Detecting Toxic Spans with spaCy


An expression is toxic if it uses rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. Toxic language can be short like “idiot” or longer like “your ‘posts’ are as usual ignorant of reality”.

We will use the SpanCategorizer from spaCy to detect toxic spans. For illustration we will use a well-researched dataset. The present article focuses on the spaCy configuration and usage. Part two, “Custom evaluation of spans in spaCy”, focuses on integrating the metric calculations presented here as a scorer in the spaCy pipeline.

You can find the full code for this article tagged as release 1.0.0 at:

Categorizing spans

A span is a fragment of a text that we may label as belonging to some category. A text can consist of multiple spans, which may have different labels. Categorizing spans is similar to Named Entity Recognition (NER).

However, unlike the entities in NER, spans may overlap: a word may be part of two different spans at the same time.
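A minimal sketch of this difference, using a blank English pipeline and an illustrative sentence that is not from our dataset. The labels ORG and GPE and the span group key sc are chosen only for this example.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline is enough here: we only need the tokenizer.
nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Tokens: Welcome(0) to(1) the(2) Bank(3) of(4) China(5) .(6)
org = Span(doc, 3, 6, label="ORG")  # "Bank of China"
gpe = Span(doc, 5, 6, label="GPE")  # "China"

# doc.spans holds named groups of spans; spans within a group may overlap.
doc.spans["sc"] = [org, gpe]

# doc.ents is stricter: a token can belong to at most one entity.
try:
    doc.ents = [org, gpe]
    overlap_allowed_in_ents = True
except ValueError:
    overlap_allowed_in_ents = False

print([(span.text, span.label_) for span in doc.spans["sc"]])
print("doc.ents accepts overlapping spans:", overlap_allowed_in_ents)
```

The token "China" is part of both spans in the group, which is exactly what doc.ents refuses to store.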

SpaCy has included a SpanCategorizer since version 3.1. This is a component of the spaCy pipeline, named spancat for short, that can detect the kinds of spans it has been trained on.

Toxic spans dataset

We use the dataset from task 5 of SemEval2021. You can find a description of the task and a discussion of the participants' results in SemEval-2021 Task 5: Toxic Spans Detection (2021) by John Pavlopoulos et al¹. The dataset and baseline code are available on GitHub.

The dataset contains some ten thousand posts with annotations for toxic spans. These annotations consist of a list of all indexes of the characters of the post that are part of a toxic span.

Example from the CSV file containing the data:

“[11, 12, 13, 14, 16, 17, 18, 19, 20]”,you’re one sick puppy.

The character indexes 11 to 14 and 16 to 20 refer to “sick” and “puppy”; index 15, the space between the two words, is not annotated.
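As a small illustration of this annotation format, the following sketch parses the index list from the example row and groups it into contiguous character ranges. The helper name contiguous_ranges is made up for this example and is not part of the dataset's baseline code.

```python
import ast


def contiguous_ranges(offsets):
    """Group character offsets into half-open (start, end) ranges."""
    ranges = []
    for offset in sorted(set(offsets)):
        if ranges and offset == ranges[-1][1]:
            # The offset directly follows the current range: extend it.
            ranges[-1] = (ranges[-1][0], offset + 1)
        else:
            # There is a gap: start a new range.
            ranges.append((offset, offset + 1))
    return ranges


raw_offsets = "[11, 12, 13, 14, 16, 17, 18, 19, 20]"
text = "you're one sick puppy."

offsets = ast.literal_eval(raw_offsets)  # safely parse the list literal
ranges = contiguous_ranges(offsets)
fragments = [text[start:end] for start, end in ranges]

print(ranges)     # [(11, 15), (16, 21)]
print(fragments)  # ['sick', 'puppy']
```

Because the space at index 15 is not annotated, the row yields two ranges rather than one.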

Generally, about the dataset:

  • Many of the posts contain one span
  • The vast majority of the spans consist of a single word, e.g. “idiot”.

Note that there is only one category (“toxic”) and that spans are identified per character. Even though it would theoretically be possible to have two partially overlapping toxic spans, in practice this does not occur in our dataset. As a result, the problem can be treated as a NER problem rather than as a span problem. In fact, this is how detecting toxic spans was implemented in a spaCy NER example script provided with the dataset.


Our goal is to create a spaCy pipe that detects toxic spans in a text. To do this, we must train a model using our dataset. Once the model is trained we want to evaluate it. If it is to our satisfaction, we will deploy our model.

Here are our steps:

  1. Download the CSV data files (already divided into train, dev, and test sets)
  2. Convert the data into a format that can be used for training.
  3. Train
  4. Evaluate
  5. Deploy the resulting model

SpaCy provides standard functionality for training and evaluating models. It can be called from program code, or it can be run through a spaCy project.

We will go the latter way and define a spaCy project for our steps. As a basis we take the project.yml from the experimental NER-spancat project. This project provides NER for the Indonesian language using SpanCategorizer functionality.

Download datafiles

In our project file we define the location of our assets, that is, the data we need. Here we get all files from a path in a repository on GitHub and place them in the destination subdirectory “assets”.

Let’s download the assets:

We see that the “1 asset(s)” consists of three CSV files. These are the ones that were in the path SemEval2021/data in the repository defined above.


Now that we have our data files, we need to convert the data to a format suitable for training. Here we merge the list of indexes and text strings in the rows of our three CSV files into Doc objects with defined spans.

We bundle these Doc objects into a single DocBin object, which we save to a file in spaCy's binary .spacy format.
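A minimal sketch of such a conversion, not necessarily how the project's own script does it. It assumes the character indexes of each row have already been grouped into (start, end) ranges; the label name TOXIC and the function name records_to_docbin are hypothetical, and txs is the spans key we configure later in this article.

```python
from pathlib import Path
import tempfile

import spacy
from spacy.tokens import DocBin

SPANS_KEY = "txs"  # span group name used in this article's config
LABEL = "TOXIC"    # hypothetical label name, chosen for this sketch


def records_to_docbin(nlp, records):
    """Turn (text, [(start, end), ...]) records into a DocBin with span groups."""
    doc_bin = DocBin()
    for text, ranges in records:
        doc = nlp.make_doc(text)  # tokenize only, no pipeline components
        spans = []
        for start, end in ranges:
            # "expand" snaps offsets that fall inside a token to token boundaries.
            span = doc.char_span(start, end, label=LABEL, alignment_mode="expand")
            if span is not None:
                spans.append(span)
        doc.spans[SPANS_KEY] = spans
        doc_bin.add(doc)
    return doc_bin


nlp = spacy.blank("en")
records = [("you're one sick puppy.", [(11, 15), (16, 21)])]
doc_bin = records_to_docbin(nlp, records)

# Round-trip through a .spacy file to check that the spans are serialized.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.spacy"
    doc_bin.to_disk(path)
    loaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print([(span.text, span.label_) for span in loaded[0].spans[SPANS_KEY]])
```

Whether adjacent toxic words should become one span or several is a modelling choice; this sketch keeps each contiguous character range as its own span.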

We call these methods from a file named . Note that we transform the original names of the data files to train, dev, and eval.

In our project definition we define a command for calling the above

We create the corpus of .spacy files from the CSV assets by running:


We now have train and dev datasets in a format we can process. To train a model that recognises toxic spans using this data, we first need to configure the model and some parameters for the training process.

SpaCy provides an easy way to define such a configuration through a quickstart widget on its webpage documenting the training process.

Training configuration widget. Bottom left shows part of the generated configuration code.

The only change we make to the default settings in the widget above is that we check spancat to enable the SpanCategorizer pipe. Note that selecting accuracy for “optimize for” would lead to a larger model being used.

The displayed part of the generated configuration is only part of the entire file.

We copy the contents of the configured widget and paste them into a file configs/base_config.cfg. We then fill it in with all default values and save the result as configs/config.cfg with the following command:

Running init fill-config

We make a few modifications in our new configs/config.cfg:

  • We set our spans_key to txs so as not to use the default sc value.
  • We define the n-gram sizes for the Suggester as everything from 1 to 8. (The Suggester is beyond the scope of this article. For more information, see the spaCy SpanCategorizer documentation.)
  • We define the columns for representing scores during training by excluding those for the default sc spans_key and including those for our newly defined txs spans_key.
Changes in configs/config.cfg
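To give a rough idea of what sizes 1 to 8 mean, here is a plain Python sketch that enumerates every run of 1 to 8 consecutive tokens as a candidate span. It only illustrates the idea; it is not spaCy's implementation, and the function name ngram_candidates is made up for this example.

```python
def ngram_candidates(num_tokens, sizes):
    """Return half-open (start, end) token ranges for every n-gram of the given sizes."""
    candidates = []
    for size in sizes:
        # A size larger than the text produces an empty range and adds nothing.
        for start in range(num_tokens - size + 1):
            candidates.append((start, start + size))
    return candidates


tokens = ["you", "'re", "one", "sick", "puppy", "."]
sizes = list(range(1, 9))  # sizes 1 to 8

candidates = ngram_candidates(len(tokens), sizes)
print(len(candidates))  # 6 + 5 + 4 + 3 + 2 + 1 = 21 candidates for 6 tokens
print([" ".join(tokens[start:end]) for start, end in candidates if end - start == 2])
```

The number of candidates grows with both the text length and the largest size, which is one reason to cap the sizes.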

In our project file we define how to run our training. Most importantly, we refer to our config file (configs/${vars.config}.cfg, where vars.config was defined as “config”), use our created corpus/train.spacy for basic training, and corpus/dev.spacy for the development set.

Definition of the training command in project.yml

Now, let’s start the training process and follow the results as the training is executed.

Start training and follow the output

We see that the highest score, which equals spans_txs_f, is 0.57. The model that generated the best score is saved in ./training/model-best.


As we already saw during training, the standard metrics used for evaluating spans are precision, recall, and F1.

F1 combines precision and recall into a single measure that captures both properties.

Precision = TruePositives / (TruePositives + FalsePositives)

Recall = TruePositives / (TruePositives + FalseNegatives)

F1 = (2 * Precision * Recall) / (Precision + Recall)
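As a worked example of the three formulas above, here is a small Python helper. The counts passed in are invented for illustration and are not results from our model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts: 6 correct predictions, 2 spurious ones, 4 missed.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# precision=0.75 recall=0.60 f1=0.667
```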

By default, the scorer for the spaCy SpanCategorizer calculates these metrics over entire tokens. Because we want to compare our outcome with the published scores of participants in the SemEval2021 task, where a character-based score was used, we also want to generate a character-based F1 score for our evaluation. This means that in “Hello idiot world!” we count idiot not as 1 TruePositive but as 5 TruePositives, one for each character of the token idiot. There is a slight numerical difference between these two methods of calculation.
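The following sketch shows why the two ways of counting can give different numbers. It uses the sentence above with a hypothetical prediction that marks both “idiot” and “!” as toxic. It is a simplified calculation on a single text; the official SemEval scoring script may differ in details such as how scores are averaged over posts.

```python
def f1_from_sets(predicted, gold):
    """F1 between a set of predicted items and a set of gold items."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def char_offsets(ranges):
    """Expand half-open (start, end) character ranges into a set of offsets."""
    return {i for start, end in ranges for i in range(start, end)}


text = "Hello idiot world!"
# Character ranges of the tokens: Hello(0-5) idiot(6-11) world(12-17) !(17-18)
gold_tokens = {(6, 11)}            # "idiot"
pred_tokens = {(6, 11), (17, 18)}  # hypothetical prediction: "idiot" and "!"

token_f1 = f1_from_sets(pred_tokens, gold_tokens)
char_f1 = f1_from_sets(char_offsets(pred_tokens), char_offsets(gold_tokens))

print(f"token-based F1: {token_f1:.3f}")      # 0.667
print(f"character-based F1: {char_f1:.3f}")   # 0.909
```

The wrongly predicted “!” costs a full false positive at token level, but only one character out of six at character level.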

In our project file we define how to run our evaluation. Most importantly, we use our created corpus/eval.spacy to provide the data for evaluating the model, use training/model-best, and invoke scripts for both the standard token-based scores and the custom character-based scores.

Definition of the evaluate command in project.yml

Let’s start the evaluation:

Start evaluation and follow the output

At the bottom we see our character-based F1 score: F1 = 63.01, which is one percentage point lower than the token-based F1 score.

How does this compare to the results of participants of SemEval2021 Task 5: Toxic Spans Detection? Let’s see:

Participant results of SemEval2021 Task 5: Toxic Spans Detection. Table from Pavlopoulos et al (2021)¹.

We see that our outcome trails at the bottom of the ranking, but not below it. Interpretation of the outcome will be done in a separate article, but for now let it suffice that we created the functionality without any optimization.

Deploy the resulting model

For illustration purposes it suffices to just load the best model from the directory where the training process has saved it, training/model-best.
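A minimal sketch of loading the trained pipeline and reading its predictions. It assumes the spans key txs from our config; the helper names load_best_model and predicted_spans are made up for this example.

```python
import spacy

SPANS_KEY = "txs"                  # span group key configured in configs/config.cfg
MODEL_DIR = "training/model-best"  # directory written by the training step


def load_best_model(model_dir=MODEL_DIR):
    """Load the trained pipeline from disk (requires that training has finished)."""
    return spacy.load(model_dir)


def predicted_spans(nlp, text, spans_key=SPANS_KEY):
    """Run the pipeline and return (token_start, token_end, text, label) per predicted span."""
    doc = nlp(text)
    if spans_key not in doc.spans:
        # The pipeline has no component that writes to this span group.
        return []
    return [(span.start, span.end, span.text, span.label_) for span in doc.spans[spans_key]]


# Usage, once training has produced the model directory:
#   nlp = load_best_model()
#   for start, end, fragment, label in predicted_spans(nlp, "you're one sick puppy."):
#       print(start, end, fragment, label)
```

The start and end values are token indexes, not character indexes.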

To actually use it, we will use a short script that will mark up toxic spans in sentences provided on the command line. Here is the script:

Script saved as ‘’

The script will display any predicted toxic spans with their token indexes and will mark the toxic spans in the sentences with a different color. Here are two results of running the script:

Running our best model on example sentences

Good, this works!


We implemented toxic span detection using spaCy and came to a usable but non-optimized result. Although our dataset covers only the simplest case, with only non-overlapping spans and only one category, our steps illustrate how to train and use the spaCy SpanCategorizer.

Interpretation of the results was not part of the scope of the article. It will be covered in a follow-up article.


[1] John Pavlopoulos et al., “SemEval-2021 Task 5: Toxic Spans Detection” (2021).

Update history

8 May 2022: linked to part 2: “Custom evaluation of spans in spaCy”


