Detecting Persuasion with spaCy


Persuasion techniques are shortcuts in the argumentation process, e.g. exploiting the emotions of the audience or using logical fallacies to influence it. In this article we will create a spaCy pipeline with a SpanCategorizer to detect and classify spans in which persuasion techniques are used in a text.

Our training data identifies 20 categories. Spans may overlap, that is, a word can be part of different spans.

Here is a partial example:

We will train different models with the dataset, use different spaCy configurations, and compare the results.

Table of Contents:

  • Persuasion techniques
  • Dataset
  • Creating the corpus
  • Training the model
  • Suggesting spans
  • Results
  • Conclusion

The spaCy configuration and Python code used to create the corpus for this article are available on GitHub.

Looking for a “HOWTO” on using spaCy for span categorisation?

If you are looking for a walk-through on how to use spaCy for span categorisation, check out my “Detecting Toxic Spans with spaCy”. There you will find the relevant spaCy project configuration, training configuration, additional Python code, and an interpretation of the evaluation metrics. Toxicity is only one category, but extending the approach to multiple categories and overlapping spans is technically trivial. You can find how it is done in the code accompanying the present article on GitHub.

Persuasion techniques

Unwarranted reasoning goes by different names. Philosophers talk of fallacies, psychologists focus on manipulation, political scientists speak of propaganda, and linguists interested in the venerable tradition of rhetoric address persuasion. Each domain has its own focus on what the relevant impact of unwarranted reasoning is.

Detecting and explaining unwarranted reasoning might require epistemology, logic, estimation of intention, psychological biases, knowledge of pre-existing narrative, and even physical context. As all this doesn’t fit a feasible machine learning problem description, we convert unwarranted reasoning into a problem of classification: given a set of categories and a dataset of texts with marked spans belonging to categories, we train a model to detect such spans and classify them. We call these categories persuasion techniques.

Different studies have come up with different sets of persuasion techniques, e.g. ranging from a single classification of a whole text as “propaganda” to distinguishing 69 different techniques.[4] Presently, we will not discuss whether some categorisation is “better” than another or whether any categorisation is fit for purpose. Here we adopt one set of twenty techniques, with the understanding that a different set would be possible.

Let’s look at the description of some of the techniques described in [1]:

Loaded language: Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience.

Slogans: A brief and striking phrase that may include labelling and stereotyping. Slogans tend to act as emotional appeals.

Causal oversimplification: Assuming a single cause or reason when there are actually multiple causes for an issue. This includes transferring blame to one person or group of people without investigating the complexities of the issue.

Presenting irrelevant data (Red Herring): Introducing irrelevant material to the issue being discussed, so that everyone’s attention is diverted away from the points made.

These techniques are described in the context of subtask 2 of SemEval-2021 Task 6: “Detection of Persuasion Techniques in Texts and Images” [1].

The present article focuses on the implementation of a system detecting and classifying spans using the spaCy SpanCategorizer and will pay only cursory attention to the meaning of techniques and to the used dataset. For detailed information on the dataset and on the meaning of the different techniques into which we classify spans, please check out the mentioned article.


Dataset

The dataset we use was created for SemEval-2021 Task 6 “Detection of Persuasion Techniques in Texts and Images”. It can be found on GitHub [3].

The dataset consists of a total of 951 “memes”, short texts taken from social media posts, in which 2083 spans with persuasion techniques were identified by a team of annotators. The texts are overlaid on an image, hence they contain numerous line breaks, and many are written in uppercase. We ignore the images.

Here is an example:

Our elders were called to war to save lives.\nWe are being called to sit on the couch to save theirs.\nWe can do this.\n

Persuasion spans in this text:

  • Exaggeration/Minimisation: “Our elders were called to war to save lives.\nWe are being called to sit on the couch to save theirs.”
  • Appeal to fear/prejudice: “war”
  • Appeal to fear/prejudice: “sit on the couch to save theirs”
  • Slogans: “We can do this.”

Below is an overview of the persuasion techniques distinguished in the dataset, with, for each, the number of occurrences in the dataset and the average number of tokens in a span. The standard spaCy tokeniser was used to determine the number of tokens in a span. Note that punctuation and whitespace tokens are included in the count.

In total there are 2083 spans.
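As an illustration of the counting, the sketch below tokenises one of the spans from the earlier example with a blank English pipeline; everything beyond the span text itself is illustrative:

```python
import spacy

# Count the tokens in a span with the standard spaCy tokeniser.
# Punctuation and newline tokens are counted as well, which is why
# the averages in the table include them.
nlp = spacy.blank("en")  # tokeniser only, no trained model required
span_text = "We can do this.\n"
tokens = [token.text for token in nlp(span_text)]
print(tokens)
print(len(tokens))
```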

Looking at the table, we see:

  • The first two categories make up more than half of the dataset and are on average under three words long, which is much shorter than almost all other categories.
  • Almost two thirds of the categories average ten or more tokens per span.


Two properties of the dataset stand out:

  • The dataset is rather small.
  • The distribution over classes is uneven. Most classes have fewer than 50 samples, which is little. Only Loaded Language, Name Calling/Labeling, and Smears have a sizeable number of instances.

Creating the corpus

Defining spans is like taking a coloured marker and highlighting a fragment of the original text. As we need exact fragments, we will not modify the original text by pre-processing it in any way.

Our dataset is already divided into train, dev, and test parts. We convert each of these files separately into a binary .spacy file which is used as input for training.
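A minimal sketch of such a conversion, assuming annotations as character offsets with a label (the function name, sample data, and file name are mine); overlapping spans are allowed because they are stored in doc.spans rather than in the token-exclusive doc.ents:

```python
import spacy
from spacy.tokens import DocBin

def build_corpus(samples, spans_key="sc", out_path="train.spacy"):
    nlp = spacy.blank("en")  # tokeniser only; the text is not modified
    db = DocBin()
    for text, annotations in samples:
        doc = nlp(text)
        spans = []
        for start_char, end_char, label in annotations:
            # alignment_mode="expand" snaps character offsets outward
            # to token boundaries instead of discarding the span
            span = doc.char_span(start_char, end_char, label=label,
                                 alignment_mode="expand")
            if span is not None:
                spans.append(span)
        doc.spans[spans_key] = spans  # overlapping spans are fine here
        db.add(doc)
    db.to_disk(out_path)

# Two overlapping spans with different labels in one text:
samples = [("We can do this.",
            [(0, 15, "Slogans"), (3, 14, "Loaded Language")])]
build_corpus(samples)
```

The spans_key "sc" is the default key the SpanCategorizer reads from.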

Training the model

As it is an open question which base model best fits our requirements, we will try small, large, and transformer models and compare the results.

Once our corpus is defined we can start training using the spacy train command. For readability and repeatability we define this in a spaCy project. You can find an example of a regular pipeline consisting of corpus, train, and evaluate steps in the project.yml of my article Detecting Toxic Spans with spaCy.

Such a pipeline makes fine-tuning the training of one model easy. Here we will not tune, but instead use three base models and see how they perform. To define their configurations we use the “Quickstart” dialog in the spaCy documentation.

We select the spancat component and generate three configurations:

  1. CPU + efficiency (small)
  2. CPU + accuracy (large)
  3. GPU (doesn’t distinguish between efficiency and accuracy) (transformer)

This will result in small, large, and transformer models. The default spaCy transformer model is RoBERTa-base.
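The same configurations can also be generated from the command line with spacy init config; the file names here are my own choice:

```shell
# Generate base configurations for a spancat pipeline.
python -m spacy init config config_sm.cfg --lang en --pipeline spancat --optimize efficiency --force
python -m spacy init config config_lg.cfg --lang en --pipeline spancat --optimize accuracy --force
# For the transformer variant (requires the spacy-transformers package):
# python -m spacy init config config_trf.cfg --lang en --pipeline spancat --gpu --force
```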

Using default configuration values, we will train each of these models and compare the results.

Suggesting spans

To detect spans, spaCy first generates a set of possible spans for a document. This is done by a component named Suggester.

Spans can only be detected if they are first generated or “suggested”.

spaCy 3.3 comes with two suggester implementations, both based on generating n-grams, that is, spans of n tokens. The ngram_suggester is configured with a list of n-gram lengths, e.g. [1, 2, 3, 4]. The ngram_range_suggester is configured with the minimum and maximum of a range of lengths, e.g. min_size=1, max_size=4.
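In the training config, the suggester is a block on the spancat component along these lines (the max_size value is one of those we will compare later):

```ini
[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 16
```

The ngram_suggester variant instead uses @misc = "spacy.ngram_suggester.v1" with a sizes list.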

Corollary: spans longer than the maximum length of the suggested n-grams will not be detected.

Named entities typically consist of only a few tokens. With a token length of 5, the named entity “Berlin Brandenburg Airport Willy Brandt” is relatively long. In our current dataset, however, we deal with spans that may even range across sentences. Here is an example fragment from our dataset of the category Causal Oversimplification: “Childish Trump Won’t Meet With Pelosi On Coronavirus\nBecause He Doesn’t Like Her”. spaCy tokenises this fragment into 15 tokens, including one for the newline, with “won’t” broken up into [“wo”, “n’t”].

As training, evaluation, and prediction of any span can only succeed if it doesn’t contain more tokens than generated by the Suggester, we must look into our dataset and see how many samples are cut off.
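The check itself is simple; the helper below is a sketch of mine, fed with illustrative token lengths rather than the real dataset statistics:

```python
# Given the token lengths of annotated spans, compute the fraction
# that a suggester with a given maximum n-gram size can never
# propose, and hence can never train on or predict.
def cut_rate(span_token_lengths, max_ngram):
    too_long = sum(1 for n in span_token_lengths if n > max_ngram)
    return too_long / len(span_token_lengths)

# Illustrative lengths, not the real dataset statistics:
lengths = [2, 3, 15, 18, 33, 40]
print(cut_rate(lengths, 16))  # -> 0.5: spans of 18, 33, and 40 tokens are cut off
```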

In the table below, we take 8-grams, 16-grams, and 32-grams as maximums for the ngram_range_suggester.

We apply some markup based on arbitrary limits:

  • < 10%: satisfactory (greyed)
  • > 10% & < 30%: cause for concern (black)
  • > 30% : bad (red)

What we see:

  • Loaded Language and Name Calling/Labeling satisfy all chosen n-gram maxima
  • For a maximum of 8-grams, only 6 techniques have a cut rate below 30%
  • For a maximum of 16-grams, only 10 techniques have a cut rate below 30%
  • For a maximum of 32-grams, there are still 4 techniques with a cut rate between 10% and 30%

We will train with suggesters with maximums of 16-grams and 32-grams. We ignore the 8-gram maximum, as it looks unpromising given how many techniques it fails to cover satisfactorily.

Note: More suggester functions are currently experimentally available, e.g. subtree_suggester, (noun) chunk_suggester, and sentence_suggester.


Results

Looking at the F1 scores for the n-gram suggester set to maximums of 16 and of 32 tokens, we find:

As expected, the large (lg) model does better than the small (sm) model, and the transformer (trf) model does better than the large model. Surprisingly, however, the small and large models do worse with 32-grams than with 16-grams, while the transformer model does significantly better with 32-grams than with 16-grams.

The table below shows the data for the F1 scores used in the diagram above, plus the precision and recall measures. Values that decreased for the 32-gram suggester are marked in red.

What we see is that for the small and large models the precision decreases with the 32-gram suggester, while the recall increases.

This means that with the 32-gram suggester the small and large models label too many tokens that shouldn’t be labelled, although they do cover more of those that should be labelled. Let’s call this over-optimistic labelling. The transformer models don’t suffer from that defect.

It would be interesting to determine why this is happening, but that would require further research.

Let’s look at the F1 scores for each individual persuasion technique:

We see:

  • Name Calling/Labeling and Loaded Language are predicted best. This is understandable, given that for these categories we have the largest number of samples, while they also have on average among the fewest tokens per span.
  • Prediction for 32-grams for small and large models gets worse or remains on a par for these two techniques compared to 16-grams.
  • Remarkably, the small model picks up two techniques with 16-grams which the others don’t, but it doesn’t pick them up with 32-grams.
  • Transformer models pick up additional techniques in 32-grams suggesters, but the other models don’t.
  • Transformer models do better even for the short spans of Name Calling/Labeling and Loaded Language when using a 32-gram suggester than when using a 16-gram suggester.

Duplicate predictions

The created models should be able to predict spans of different classes that overlap each other. What also happens, however, and what should not happen, is that predicted spans with the same label overlap each other. The following image illustrates the predictions of a 32-gram transformer-based model.

I consider this an error of the SpanCategorizer.
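Until this is resolved upstream, such duplicates can be filtered in post-processing. The sketch below is mine: it works on (start, end, label) tuples as a simplified stand-in for spaCy Span objects, and the keep-the-longest policy is an assumption, not something the SpanCategorizer prescribes:

```python
def drop_same_label_overlaps(spans):
    """Keep the longest span when spans with the same label overlap.

    spans: list of (start_token, end_token, label) tuples.
    """
    # Sort longest-first, so kept spans suppress shorter duplicates.
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for start, end, label in spans:
        overlaps_kept = any(
            k_label == label and start < k_end and k_start < end
            for k_start, k_end, k_label in kept
        )
        if not overlaps_kept:
            kept.append((start, end, label))
    return kept

preds = [(0, 10, "Smears"), (2, 6, "Smears"), (2, 6, "Loaded Language")]
print(drop_same_label_overlaps(preds))
# -> [(0, 10, 'Smears'), (2, 6, 'Loaded Language')]
```

Note that overlaps between spans with different labels are deliberately left untouched, as those are legitimate.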

Comparison with other span detection systems

Metrics are merely numbers unless they serve to compare the systems producing them. As we effectively implemented subtask 2 of Task 6 of SemEval-2021, we can compare our outcome with the other systems on the leaderboard for that task. Note that spaCy uses token-based metrics, while the contest uses character-based metrics.

Using the best model to produce a prediction for the test dataset and having the result evaluated by the scorer method provided for the contest leads to a character-based F1 of 0.449. That would place it second in the Task 6 ranking published in [1]!
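Converting spaCy's token-based spans to the character offsets that a character-based scorer expects is straightforward, since Span objects expose start_char and end_char; a minimal illustration:

```python
import spacy

# spaCy spans are token-based, but expose character offsets,
# which is what a character-based scorer needs.
nlp = spacy.blank("en")
doc = nlp("We can do this.")
span = doc[1:4]  # tokens "can do this"
print(span.text, span.start_char, span.end_char)
# -> can do this 3 14
```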

This encourages future research comparing the architecture of the spaCy model-with-suggester with the models participating in Task 6.


Conclusion

We have seen that spaCy’s SpanCategorizer can be used to detect spans and classify them. As the dataset used for the present article contains spans of widely varying length, we needed to take into account the functionality and configuration of the spaCy suggester, the function that generates candidate spans. For this dataset, transformer models proved significantly more accurate than spaCy’s small and large models. The resulting model ranked well among alternative systems for detecting and classifying spans.


[1] “SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images” (2021) D. Dimitrov et al

[2] “WVOQ at SemEval-2021 Task 6: BART for Span Detection and Classification” (2021) Cees Roele

[3] “Data for SemEval-2021 Task 6: Detection of Persuasive Techniques in Texts and Images”, github

[4] “Fine-Grained Analysis of Propaganda in News Articles” (2019) G. Da San Martino et al


