Custom Evaluation of Spans in spaCy

Cees Roele
8 min read · May 6, 2022
Toxic fragments in the ‘Toxic Spans Detection’ dataset [1]

Introduction

Spans are fragments of a text belonging to a particular class. To evaluate the prediction of spans we chop a text into atomic pieces and determine for each of them whether it has been correctly classified. Out of the box, spaCy provides token-based metrics for span prediction. In my earlier article “Detecting Toxic Spans with spaCy” I provided a script to create character-based metrics. The present article will go into more depth on the different methods of calculation and will replace the standard spaCy token-based scorer with a character-based one in the created spaCy pipeline.

You can find the full code for this article tagged as release 2.0.0 at: https://github.com/ceesroele/toxic_spans

To download this release:

git clone https://github.com/ceesroele/toxic_spans.git --branch v2.0.0

Precision, Recall, and F1

We evaluate predictions of spans for a given text by splitting the text into atomic entities and checking for each of these if they have been correctly labelled.

We quantify the outcome using the following measures:

  • Precision — fraction of all predicted labels that are correct
  • Recall — fraction of all correct (“gold”) labels that are predicted
  • F1 — harmonic mean of precision and recall

The actual calculation depends on how we split up the text. The next section shows spaCy’s standard way of doing this; the section after that shows our replacement for it.

SpaCy’s span metrics: token-based

SpaCy’s standard precision, recall, and F1 metrics are based on atomisation through spaCy’s tokenisation.

Let’s identify spans in a text — “We are on the misty battlements of Elsinore Castle.” — and categorise them. We won’t worry about the meaning of categories, just say that “misty battlements” and “Elsinore Castle” are correct spans. The predicted spans are “on” and “Castle”.

In the table headers, the correct spans are labelled “gold” (in green), the predicted spans “pred” (in yellow), and under “compare” we see how the correct spans compare with the predicted ones.

Token-based atomisation

Let’s go through the comparison:

  • “on” is predicted, but not correct. It is a False Positive (FP)
  • “misty”, “battlements”, and “Elsinore” are correct, but not predicted. They are False Negatives (FN)
  • “Castle” is both correct and predicted. It is a True Positive (TP)
  • anything neither correct nor predicted is a True Negative (TN)

We calculate Precision, Recall, and F1 as follows:

Precision = TP / (TP + FP) (correct predictions divided by all predictions)

Recall = TP / (TP + FN) (correct predictions divided by all correctly labelled entities)

F1 = (2 * Precision * Recall) / (Precision + Recall) (harmonic mean)

Note that TN, the entities correctly left unlabelled, doesn’t influence the calculations. This means that the length of the total text doesn’t matter, only the predictions and the correct labels.

Now, let’s calculate the values for the above example. We find:

  • FN=3
  • FP=1
  • TP=1

Precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5

Recall = TP / (TP + FN) = 1 / (1 + 3) = 0.25

F1 = (2 * Precision * Recall) / (Precision + Recall) = 2 * 0.5 * 0.25 / (0.5 + 0.25) = 0.33

SemEval-2021 Task 5 metrics: character-based

For the “Toxic Spans Detection” task of SemEval-2021 [1] a character-based metric was used. As we want to compare the outcome of our SpanCategorizer-based system with the published performance of the participants of that task, we want to use the same metric. Note that in a contest where participants use different systems, it is best not to assume any specific method of tokenisation, hence the character-based metric.

Let’s see how the same prediction as in the previous section works out per character:

Character-based atomisation

When we calculate values for our metrics now, we find:

  • FP=2
  • FN=17+9=26
  • TP=6

Precision = TP / (TP + FP) = 6/ (6 + 2) = 0.75

Recall = TP / (TP + FN) = 6/ (6 + 26) = 0.19

F1 = (2 * Precision * Recall) / (Precision + Recall) = 2 * 0.75 * 0.19 / (0.75 + 0.19) = 0.30
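
We can check this arithmetic with a few lines of Python. This is a standalone sanity check of the numbers above, not part of the pipeline:

def prf(tp, fp, fn):
    # Precision, recall, and F1 from raw counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Token-based example: TP=1, FP=1, FN=3
print(prf(1, 1, 3))   # ≈ (0.5, 0.25, 0.33)

# Character-based example: TP=6, FP=2, FN=26
print(prf(6, 2, 26))  # ≈ (0.75, 0.19, 0.30)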

We see that the outcome for the F1-score differs: 0.30 for the character-based calculation versus 0.33 for the token-based one.

To evaluate a model we typically look at a set of examples, rather than just one. SpaCy effectively treats multiple records as one big text: it sums up all values of TP, FP, and FN and then calculates the precision, recall, and F1 values.

The SemEval calculation takes the precision, recall, and F1 values of each individual text in a set and then takes their mean value.
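
The difference between the two ways of aggregating can be sketched as follows. The per-text counts below are made-up numbers purely for illustration, and prf is the same helper as in the earlier snippet, now guarded against division by zero:

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

# Hypothetical per-text TP/FP/FN counts for a small evaluation set
per_text = [
    {"tp": 6, "fp": 2, "fn": 26},
    {"tp": 3, "fp": 0, "fn": 1},
]

# spaCy-style: sum the counts over all texts, then score once
tp, fp, fn = (sum(d[k] for d in per_text) for k in ("tp", "fp", "fn"))
print(prf(tp, fp, fn))

# SemEval-style: score each text separately, then take the mean
scores = [prf(d["tp"], d["fp"], d["fn"]) for d in per_text]
print(tuple(sum(s[i] for s in scores) / len(scores) for i in range(3)))

In general the two aggregations give different numbers for the same predictions.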

Integration

In the earlier article on “Detecting Toxic Spans with spaCy” we added a character-based evaluation as an additional script to our project, running after the standard evaluation.

What we will do now is integrate this method of evaluation into our training and evaluation so that the character-based method will be:

  1. Displayed during training
  2. The basis of evaluation
  3. Packaged with the created pipeline as the standard evaluation method

Rather than create a separate system of evaluation, which includes reporting, we will replace the method performing the actual calculation, the scorer.

Implementing the scoring method

We need to implement a method that takes a set of Example objects as its argument and returns a dictionary with Precision, Recall, and F1 values.

Examples here are spaCy objects that each consist of two spaCy Doc objects:

  • reference: the document with the correct or “gold” spans
  • prediction: the document with the spans predicted by the model

Here is a prototype of the method we seek to implement, for now with constants as values:

Prototype of scoring method: it takes examples and a spans_key and returns a dictionary with metric values.
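
A minimal sketch of what such a prototype can look like. The function name is illustrative and may differ from the code in the repository; the keys follow spaCy’s spans_{key}_{metric} naming convention for SpanCategorizer scores:

from typing import Any, Dict, Iterable

from spacy.training import Example


def score_toxic_spans_prototype(
    examples: Iterable[Example], spans_key: str = "txs", **kwargs
) -> Dict[str, Any]:
    # Prototype only: return constant values in the expected format
    return {
        f"spans_{spans_key}_p": 1.0,
        f"spans_{spans_key}_r": 1.0,
        f"spans_{spans_key}_f": 1.0,
    }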

Registering our new scorer

Now that we have implemented a scoring algorithm, we need to register it. In our configuration file we define the name we want to use for our scorer. Note that we use a custom spans_key value of txs to set toxic spans apart from any other spans you may want to attribute to a document. If you are not mixing spans from different sources, you can stick with SpanCategorizer’s default spans_key=sc.

Fragment of our adapted configuration file
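
A sketch of what this fragment can look like. The registered name toxic_spans_scorer.v1 is the illustrative name used throughout this article and may differ from the one in the repository:

[components.spancat]
factory = "spancat"
spans_key = "txs"

[components.spancat.scorer]
@scorers = "toxic_spans_scorer.v1"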

This name must match the name we use to register a method in our code:

Registration of method returning our actual scoring method
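
A sketch of that registration, again with illustrative names; score_toxic_spans is the scoring function implemented in the next step:

import spacy


@spacy.registry.scorers("toxic_spans_scorer.v1")
def make_toxic_spans_scorer():
    # The registered factory returns the scoring function itself;
    # spaCy calls that function with the Example objects at evaluation time.
    return score_toxic_spans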

We can now implement the method that does the actual calculation of the scores. The algorithm is based on the original code from the SemEval 2021 Toxic Spans Detection task.
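
A simplified reconstruction of what such a character-based scorer can look like, not the exact code from the repository. It macro-averages per-example precision, recall, and F1 over character offsets:

from statistics import mean
from typing import Any, Dict, Iterable

from spacy.tokens import Doc
from spacy.training import Example


def _char_offsets(doc: Doc, spans_key: str) -> set:
    # All character positions covered by the spans stored under spans_key
    offsets = set()
    for span in doc.spans.get(spans_key, []):
        offsets.update(range(span.start_char, span.end_char))
    return offsets


def score_toxic_spans(
    examples: Iterable[Example], spans_key: str = "txs", **kwargs
) -> Dict[str, Any]:
    precisions, recalls, f1s = [], [], []
    for example in examples:
        gold = _char_offsets(example.reference, spans_key)
        pred = _char_offsets(example.predicted, spans_key)
        if not gold and not pred:
            # Nothing to find and nothing predicted: count as a perfect score
            precisions.append(1.0)
            recalls.append(1.0)
            f1s.append(1.0)
            continue
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        f"spans_{spans_key}_p": mean(precisions),
        f"spans_{spans_key}_r": mean(recalls),
        f"spans_{spans_key}_f": mean(f1s),
    }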

Note that the scoring function we implemented here is not only different from the standard spaCy function but also more limited: it assumes there is only one label and it doesn’t provide a score per label.

To actually display our new metrics during training, we need to define output columns in our configuration for our spans_key with the value txs and silence the default sc. (No need to do this if you stick with the default spans_key=sc.)

Defining training score weights for our spans_key ‘txs’ and silencing default spans_key ‘sc’.
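
A sketch of the corresponding fragment. The score names follow the spans_{key}_{metric} pattern, and setting a weight to null removes that score from the output:

[training.score_weights]
spans_txs_f = 1.0
spans_txs_p = 0.0
spans_txs_r = 0.0
spans_sc_f = null
spans_sc_p = null
spans_sc_r = null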

Now that we have integrated our new scorer into the pipeline, the standard evaluation method will calculate the metrics according to the custom way.

Loading the generated model

In the previous article we loaded the model from ./training/model-best. When we try this now, we will get an error:

Last part of error message after: spacy.load(‘training/model-best’): RegistryError: Could not find function…

What is happening? Remember that we referred to our “registered” scorer method in our configuration file? That file ends up as part of our pipeline:

Configuration file as part of generated pipeline ‘model-best’

To be able to load the model, we need to import the registered scorer method into the scope of our code, e.g.:

Import the registered scorer method to load model without errors
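
For example, assuming the registration lives in a module called scripts/custom_functions.py (use whatever module your own project has):

import spacy

# Importing the module that performs the registration makes
# "toxic_spans_scorer.v1" available before the pipeline config is resolved.
from scripts.custom_functions import make_toxic_spans_scorer  # noqa: F401

nlp = spacy.load("training/model-best")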

Packaging

Once we are confident that the generated model is satisfactory, we can “package” it using spaCy’s package command. Basically, it takes a directory with a model, like our ./training/model-best, and transforms this into a single .tar.gz or .whl file. As the next step, we can use pip to install the file locally, after which it is available for loading as a model in spaCy.

First, we need to add the code with our customised scorer using the --code parameter, similar to the manual import we performed above.

To add our custom code and define a name and version, we can package our pipeline as follows, writing the output to the ./packages directory:
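
For example, again assuming the custom code lives in scripts/custom_functions.py:

python -m spacy package ./training/model-best ./packages \
    --code scripts/custom_functions.py \
    --name toxic_detector \
    --version 1.0.0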

Note that the name will be prefixed with the used language, in this case en. The result will be: ./packages/en_toxic_detector-1.0.0/dist/en_toxic_detector-1.0.0.tar.gz

You can find additional variables in ./training/model-best/meta.json. Here is the originally generated version for our project:

Generated ./training/model-best/meta.json
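
Roughly, a generated meta.json looks like the sketch below. The values shown are generic placeholders rather than the file actually generated for this project, and the real file contains more fields (labels, performance, requirements, and so on):

{
  "lang": "en",
  "name": "pipeline",
  "version": "0.0.0",
  "description": "",
  "author": "",
  "email": "",
  "url": "",
  "license": "",
  "spacy_version": ">=3.2.0,<3.3.0",
  "pipeline": ["tok2vec", "spancat"]
}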

You can run the package command with your own version of meta.json through --meta-path, or you can create a new one interactively through --create-meta. Check out the documentation of the package command.

As spaCy won’t load a pipeline directly from an archive file, we must install it so that it can be loaded by name. We do this using pip:
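
For the archive produced by the package command above, that is:

pip install packages/en_toxic_detector-1.0.0/dist/en_toxic_detector-1.0.0.tar.gz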

After this, we can load our model using:

nlp = spacy.load("en_toxic_detector")

That’s it! We created our span detecting pipeline with a customised evaluation scorer.

Conclusion

In this article you saw how to create a custom scorer function for the spaCy SpanCategorizer in order to provide your method of calculating metrics during training and evaluation. You also saw how to package the resulting model together with the custom code, so you can load and invoke it as a pipeline in spaCy.

References

[1] John Pavlopoulos et al., “SemEval-2021 Task 5: Toxic Spans Detection” (2021).

Update history

25 Aug 2022:
1. Added instruction to clone release v2.0.0 at the top of the article.

2. Explained why I use a custom spans_key, rather than SpanCategorizer’s default sc.

