Exploring “disinformation” at EUvsDisinfo

Abstract

“Disinformation” is a term used by governments to justify regulation of public communication. The present article presents a quantitative exploration of the actual use of the term “disinformation” based on data from the EUvsDisinfo website. Over the past six years this website published over nine hundred news articles and over thirteen thousand “disinformation cases”.

From this body of articles a set of sentences was extracted containing the word “disinformation”. Using a Wordcloud and TF-IDF analysis it was found that the word “narrative” is central to “disinformation”. We therefore changed the dataset under study to the set of sentences containing the word “narrative”.

We then tried to distinguish which narratives are mentioned in the dataset. K-means clustering was applied to a TF-IDF matrix of the dataset, resulting in sixteen clusters of sentences.

The challenge was to learn something about narratives from our data on clusters. For this purpose, we presented the clusters as summaries using BART-large-CNN, on a 2D plane using word vectors and PCA, and in a custom structured format using spaCy named-entity recognition and part-of-speech analysis.

As this is an exploration, we end with an overview of findings and suggestions for continuing research.

Goal, methodology, and dataset

A short definition of “disinformation” is “verifiably false or misleading information”.

EUvsDisinfo uses the definition of the European Commission, which defines disinformation as

“the creation, presentation and dissemination of verifiably false or misleading information for the purposes of economic gain or intentionally deceiving the public, and which may cause public harm. Such harm may include undermining democratic processes or threats to public goods such as health, the environment and security. As opposed to illegal content (which includes hate speech, terrorist content or child sexual abuse material), disinformation covers content that is legal. It therefore intersects with fundamental core European Union (EU) values of freedom of expression and the press. Under the Commission’s definition, disinformation does not include misleading advertising, reporting errors, satire and parody, or clearly identified partisan news and commentary.”

Source: “Action Plan Against Disinformation”, European Commission, 5 December 2018, https://ec.europa.eu/info/sites/default/files/eu-communication-disinformation-euco-05122018_en.pdf

Apart from “verifiably false or misleading information”, this definition contains numerous exceptions and complications, e.g. “disinformation does not include … clearly identified partisan news and commentary”.

Our goal is to get a better understanding of the notion of “disinformation”. We do this by quantitatively exploring a dataset and creating a number of proof-of-concept implementations to present the results. This article only sketches the technical steps in the exploration. The Python code is presented in the accompanying notebook on GitHub.

The original body of data comes from the euvsdisinfo.eu website. It consists of downloaded news & analysis articles and disinformation cases. The news set contains over nine hundred items, the cases set over thirteen thousand items. To focus on “disinformation”, this body of data is reduced to 13,190 sentences containing the word “disinformation” in news articles and in the disproofs of disinformation cases.
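How such a sentence-level dataset can be extracted is sketched below, assuming the downloaded news items and case disproofs are available as a list of strings called articles (a hypothetical name; the actual notebook may differ in details):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_sentences(texts, keyword):
    """Collect all sentences from `texts` that contain `keyword`."""
    hits = []
    for text in texts:
        for sent in nlp(text).sents:
            if keyword in sent.text.lower():
                hits.append(sent.text.strip())
    return hits

# `articles` is a hypothetical list holding the news articles and case disproofs.
disinfo_sentences = extract_sentences(articles, "disinformation")
```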

Some examples of sentences in our dataset:

- The latest disinformation from that show we have in our database was about the West pushing Poroshenko to power again and him using a military clash with Russia to postpone the election.

- The world of disinformation indeed is a dark and alienating one.

- This message is consistent with recurring pro-Kremlin disinformation narratives about the moral decay of the West and the propaganda of the morally corrupt West, aiming to encourage Russophobia.

- Most pro-Kremlin disinformation focusses on promoting the safety and efficacy of the Sputnik V vaccine while disparaging the safety of Western vaccines, in articles particular against the Pfizer/BioNTech vaccine.

- In other words, disinformation has claimed total dominance in the Kremlin’s current communications strategy, whereas even Soviet authorities were capable of seeing a long-term reason for not systematically lying.

A drawback of the dataset is that it comes from a single source. On the positive side, given that it represents the European Union’s primary output on the topic over a period of six years, we can assume that it reflects at least how the European Union uses and wishes to use the term “disinformation”.

Wordcloud

A Wordcloud is a visual presentation of the prevalence of words in a body of text. The more frequently a word occurs, the larger it is displayed. Let’s see what a Wordcloud of our dataset looks like:

Wordcloud visualising the prevalence of words in the dataset
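As an indication of how such a cloud can be produced, here is a minimal sketch using the wordcloud package; disinfo_sentences is the hypothetical list of sentences introduced above, and the notebook may differ in details such as lemmatisation:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = " ".join(disinfo_sentences)  # hypothetical list of dataset sentences
cloud = WordCloud(width=1200, height=600, stopwords=STOPWORDS,
                  background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```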

Obviously, the word “disinformation” is most prevalent, after all, our dataset consists of sentences that all contain it. Other prominent words are “pro-Kremlin”, “recur” (the lemma of “recurring”), “narrative”, “Russia”, and “Ukraine”.

As you see, the domain is relatively limited. Given the name “EU vs Disinformation” one might be misled about the scope of this website. It is published by the European Union’s East StratCom Task Force and is in practice and according to its mission statement exclusively focused on “revealing pro-Kremlin disinformation”. StratCom stands for Strategic Communication.

Note that the limitation of the domain is not an issue for the present exploration: our question is not “what disinformation is out there?” but “how is the word ‘disinformation’ used in practice?” For the latter question, this extensive dataset is still suitable.

TF-IDF

Term Frequency / Inverse Document Frequency (TF-IDF) is a metric for the prevalence of terms in a document in the context of other documents. Each occurrence of a term in a document raises its term frequency. But if the term also occurs in many other documents, it is less exceptional and its score is diminished. Conversely: if a term occurs in one document but hardly in any others, its score for that document is raised.

We use the TfidfVectorizer from sklearn to create a TF-IDF score for our dataset. We treat our dataset as a single document, so for now the “Inverse Document Frequency”, a calculation over the entire set of documents, is not used.
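A minimal sketch of this step, treating the joined sentences as one document (hypothetical variable names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
# A single document: the whole dataset joined together, so IDF plays no role yet.
tfidf = vectorizer.fit_transform([" ".join(disinfo_sentences)])

scores = dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]))
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print(f"{term:25s} {score:.3f}")
```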

TF-IDF score for dataset

We get basically the same information as in the Wordcloud, but now displayed with clear ranking and scale.

Ngrams / noun chunks

So far, we looked only at individual words. But in the Wordcloud we saw both “chemical” and “attack”, we saw “Alexei” and “Navalny”. Let’s look at combinations of words. We can look at strict combinations like bigrams and trigrams, respectively combinations of two words (like “chemical attack”) or three words (“Joint Investigation Team”). Here we did not use such n-grams, but the “noun chunks” from the spaCy tokeniser. These are basically nouns with a variable number of words describing the noun, e.g. “the lavish green grass”. Tokenising not into words, but into noun chunks of at least two words, we get the following Wordcloud for our dataset.
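A small sketch of how such noun chunks can be extracted and counted with spaCy (again using the hypothetical disinfo_sentences list):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

chunk_counts = Counter()
for sentence in disinfo_sentences:
    for chunk in nlp(sentence).noun_chunks:
        if len(chunk) >= 2:  # keep only noun chunks of at least two words
            chunk_counts[chunk.lemma_.lower()] += 1

print(chunk_counts.most_common(10))
```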

Prevalent noun chunks in the dataset

Here we see many compound terms worth making visible, e.g. “George Soros”, “Alexei Navalny”, “Baltic States”, “flight MH17”, and “White Helmets”.

But the biggest presence is for “recur(ring) pro-Kremlin disinformation narrative”. What is going on here? Let’s look at the actual occurrences.

Number of occurrences of noun chunks

We see that “recur(ring) pro-Kremlin disinformation narrative” occurs over three thousand times in the dataset of about thirteen thousand sentences. That comes down to about one in four sentences!

Additionally, we see a few other compounds:

  • disinformation message
  • disinformation case
  • disinformation campaign

Let’s look at the most frequently occurring compounds of the format “disinformation (term)”:

Occurrences of compounds like “disinformation (term)”

We see that when used as a modifier, “disinformation” mostly occurs in combination with “narrative”. In a future article I will explore how “disinformation” is used where it is not coupled with “narrative”, but for now I consider pursuing narratives as the most promising tack to continue the exploration.

Narratives

In the above exploration we saw that “disinformation” is most often used as a modifier of “narrative”. This is an important finding because it helps us narrow down how “disinformation” is actually used.

In the EU definition of “disinformation” we don’t find the word “narrative” and neither does “narrative” occur in the leading document “EU Action Plan Against Disinformation”.

Given the prevalence of the word “narrative” in the actual usage of the term “disinformation” as we found studying our dataset, this is puzzling.

Let’s look up the Oxford Languages definition of “narrative”:

narrative ~ a spoken or written account of connected events; a story.

Compare this with the European Union’s definition of “disinformation” which we saw above:

disinformation ~ the creation, presentation and dissemination of verifiably false or misleading information for the purposes of economic gain or intentionally deceiving the public.

Whereas information — and hence disinformation — can be about anything, we find that in practice disinformation is about connected events. As events are typically determined by people, locations, and actions, we have reason to focus on these concepts.

For our current exploration, this means that we have reason to narrow down our data in a different way than purely statistically on word occurrences. Specifically, named entities — like people and locations — take a central place, as they are defining for a “narrative”.

Let’s continue our exploration of “disinformation” by focusing on “narratives”. For that purpose we change our dataset: instead of taking sentences with the word “disinformation” from the original body of articles and disinformation cases, we extract sentences with the word “narrative”. Let’s compare the number of sentences in the two datasets.

Two datasets, partially overlapping: sentences with “disinformation” and those with “narrative”

We see that the number of sentences containing both “disinformation” and “narrative” is significantly larger than those containing only “narrative”.

Let’s look at the development over the period that EUvsDisinfo has been publishing. The numbers for “disinformation”, “narrative” and “pro-Kremlin disinformation narrative” are occurrences in our datasets. For reference, a column with the number of publications — news and disinformation cases — is added.

Occurrences over time during 2016–2021. Also with number of publications.
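A minimal sketch of how such yearly counts can be tabulated with pandas, assuming a hypothetical DataFrame df with one row per publication and columns "year" and "text":

```python
import pandas as pd

terms = ["disinformation", "narrative", "pro-Kremlin disinformation narrative"]
counts = pd.DataFrame({
    term: df[df["text"].str.contains(term, case=False)].groupby("year").size()
    for term in terms
})
counts["publications"] = df.groupby("year").size()
print(counts.fillna(0).astype(int))
```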

With regard to relative occurrences over time, we see:

  • the relative use of the term “narrative” has greatly increased since 2019
  • the phrase “pro-Kremlin disinformation narrative” has become increasingly prevalent among all uses of “narrative” in the last three years
  • we see a peak in the occurrences of both “disinformation” and “narrative” in 2020, then in 2021 the usage of these words becomes less frequent in the publications of EUvsDisinfo

Recurrent narratives

We found that recurrent disinformation narratives are mentioned thousands of times in our dataset. We will now try to identify such recurrent narratives. The process of identifying narratives serves to explore the structure of narratives, which might give us clues about the nature of “disinformation”.

Our steps:

  1. Extended tokenisation: not just splitting sentences in individual words, but taking compounds into account.
  2. TF-IDF scores for sentences in the dataset
  3. K-means clustering analysis: the “elbow method” to determine the optimal number of clusters.
  4. Actual K-means clustering for optimal number of clusters
  5. Analysing resulting clusters

Extended tokenisation

Standard tokenisation converts a sentence, e.g. “Hello, world!”, into a list of tokens, e.g. [‘Hello’, ‘,’, ‘world’, ‘!’].

In our exploration above, we saw that we want to recognise certain compounds as a single token.

We want to:

  1. Recognise a full name as a single token, rather than as separate tokens for first name and last name.
  2. Make sure that certain entities are recognised as entities, e.g. “Joint Investigation Team”.
  3. Recognise a number of domain specific compounds, e.g. “conspiracy theory” and “chemical attack”. Note that these choices are based on an understanding of the domain, not on any generic and explicit criteria.
  4. Normalise entities to a single formulation, e.g. “European Union” and “EU” both resolve to “EU”. Again, this is based on an understanding of our domain.
  5. Normalise names to only the last name, so that the compound token “George Soros” and the single token “Soros” refer to the same entity.

Technically, we do this by a combination of customising the spaCy tokeniser and post-processing its output in accordance with the requirements specified above.
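One way to implement part of this (a sketch, not the exact pipeline used in the notebook) is to let spaCy merge named entities into single tokens and then normalise them with a small, hand-made mapping:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge multi-word entities such as "Joint Investigation Team" into single tokens.
nlp.add_pipe("merge_entities")

# Hypothetical, domain-specific normalisation map (requirements 4 and 5 above).
NORMALISE = {
    "European Union": "EU",
    "George Soros": "Soros",
    "Alexei Navalny": "Navalny",
}

def tokenise(sentence):
    doc = nlp(sentence)
    return [NORMALISE.get(token.text, token.text)
            for token in doc if not (token.is_punct or token.is_space)]

print(tokenise("The European Union criticised remarks about George Soros."))
```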

TF-IDF scores for sentences and k-means clustering

Earlier we created a TF-IDF score for the entire dataset, which means that every word in the dataset gets a score. Now we do the same per sentence, treating each sentence as a separate document. That means that the scores for terms in a sentence are now partly determined by the frequency of those terms across all sentences: the “inverse document frequency” comes into play.

Each sentence is thus transformed into a vector of scores: one score for every token in the dataset’s vocabulary, specific to that sentence.
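In code this step can look as follows, reusing the custom tokenise function from the sketch above; narrative_sentences is the hypothetical list of sentences containing the word “narrative”:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each sentence is now its own "document", so the IDF part comes into play.
vectorizer = TfidfVectorizer(tokenizer=tokenise, lowercase=False)
X = vectorizer.fit_transform(narrative_sentences)  # shape: (n_sentences, n_terms)
```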

Once the sentences are transformed, we can divide them into groups based on their TF-IDF representation. We do this with the k-means algorithm. This algorithm starts out by setting arbitrary “centres” for the clusters, assigning every item (a sentence in our case) to its nearest centre, and then resetting each centre to the mean of the newly formed cluster. From that point, with the new cluster centres, the steps are repeated.

With our set of tokens with their scores, we have a complex dataset for which k-means might not give a division into clusters as good as e.g. spectral clustering. The reason to opt for k-means is that it gives us not only a notion of cluster membership, but also a notion of cluster centres. In the present case, this comes down to listing the most salient terms of a cluster.

Before starting to divide our dataset into clusters using k-means it is necessary to define the number of clusters we seek.

Elbow method

One way to estimate a good number of clusters is through the “elbow-method”. Here we iterate over values k for the number of clusters and determine for each of them the “sum of the squared distance”, which is a metric for the total distance of items to cluster centres. Obviously, with more clusters, for a larger k, this total distance decreases. (Think of the extreme: if there are as many clusters as there are items, the distance will be zero as each item is a cluster of one element.)

The idea behind the elbow method is that a good number of clusters lies where the initial steep improvement from adding more clusters changes into a more gradual improvement.
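A sketch of the elbow computation with scikit-learn, using the sentence matrix X from the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(5, 41)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest centre

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared distances")
plt.show()
```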

When trying this with the “sum of squared distance” measurement for our data and with k=5 to k=40 clusters, we get the following diagram:

Elbow method

Based on the diagram — using the guide lines — we select k=16 because this looks like the most plausible inflection point.

Performing k-means clustering with k=16 results in the following distribution of sentences over clusters:

We see that one cluster consists of almost half the dataset and the other clusters are at most about one tenth of its size. Let’s look closer at the result.

Top terms

We chose the k-means algorithm because this has a notion of the “centre” of a cluster. As we had transformed our original sentences into a matrix of TF-IDF scores for each term in the dataset, our cluster centres are not sentences, but terms. What this means is that out of the process we get not only a grouping of the sentences of our dataset, but also the terms most characteristic for each cluster.

As it is tedious to identify clusters by their id, we use the first three “top terms” of each cluster to generate a title for the cluster. You have already seen this in the diagram in the previous section.
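The following sketch shows how the fitted cluster centres can be turned into ranked top terms and a three-term title per cluster, with variable names carried over from the earlier sketches:

```python
import numpy as np
from sklearn.cluster import KMeans

k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
terms = vectorizer.get_feature_names_out()

def top_terms(cluster_id, n=20):
    """Return the n terms with the highest centre weight for a cluster."""
    centre = km.cluster_centers_[cluster_id]
    return [terms[i] for i in np.argsort(centre)[::-1][:n]]

for cluster_id in range(k):
    print(cluster_id, " / ".join(top_terms(cluster_id, n=3)))
```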

Here are the twenty ranked “top terms” for two clusters:

Ranked “top terms” (cluster centres) for two clusters

Bringing in domain knowledge, both of these lists look quite coherent for each of the named clusters.

Summarising narratives

The natural way to present a narrative is through stringing together words to form a story.

The outcome of our clustering process consists of:

  • a label for every sentence in the dataset indicating to what cluster it is assigned
  • a cluster centre for each cluster: the “top terms” that are the most salient terms for that cluster

We will now use the labelling of the sentences of the dataset to generate a brief description of each cluster.

Generating text

We will generate a brief description for each narrative that we have found by generating a summary of the sentences for each cluster.

We will use BART-large-CNN, a version of the sequence-to-sequence model BART (a bidirectional encoder combined with an autoregressive decoder) fine-tuned for summarisation on the CNN / DailyMail dataset. This is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail.

For a general introduction on generating text see Patrick von Platen’s “How to generate text: using different decoding methods for a generation with Transformers”

When generating text with Hugging Face transformers there are a couple of parameters relevant to us (a usage sketch follows the list):

  • max_length and min_length: these determine the size of the summary. We create a short summary as experiments showed that longer summaries led to incomprehensible results.
  • no_repeat_ngram_size: we saw earlier that phrases in our dataset are very repetitive (remember "recurrent pro-Kremlin disinformation narrative"?) We don't want our summary to be as repetitive as the dataset. We set this parameter to 2 as we don't even want bigrams to be repeated in the summary.
  • repetition_penalty: as above, we don't want terms consisting of single words to be repeated in our summary, so we set a high penalty on repetition.
  • do_sample and top_p: set together to let the algorithm pick the next word by sampling from the smallest set of words whose cumulative probability exceeds top_p (nucleus sampling).
  • num_beams: number of paths to consider
  • num_return_sequences: number of summaries to be returned. We will use this as we will not leave the generation fully to the language model: we will evaluate the different generated outcomes to pick the best.
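Put together, generating candidate summaries for one cluster might look like the following sketch with the Hugging Face pipeline; cluster_text is a hypothetical string joining the cluster’s sentences, and the parameter values are illustrative rather than the exact settings used:

```python
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

candidates = summariser(
    cluster_text,              # hypothetical: the joined sentences of one cluster
    max_length=80,
    min_length=30,
    no_repeat_ngram_size=2,    # no bigram may be repeated
    repetition_penalty=2.0,    # discourage repeating single words
    do_sample=True,
    top_p=0.9,                 # nucleus sampling
    num_beams=4,
    num_return_sequences=3,    # several candidates to choose from later
    truncation=True,
)
for candidate in candidates:
    print(candidate["summary_text"])
```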

Pruning results

The generated summaries tend to favour a journalistic story over just presenting the most relevant information in a few sentences. An example of this is that the generated summaries often contain quotes, as in “Mr. X said …”. That might result in lively summaries of journalistic articles, but it is not suitable for our present dataset of context-free sentences.

To mitigate this bias, we generate several summaries (“return sequences”) and then select the one whose TF-IDF vector has the highest cosine similarity to that of the cluster’s sentences as a whole.
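A sketch of this selection step, reusing the TF-IDF vectoriser from before; cluster_sentences and candidates are hypothetical names for the sentences of one cluster and the generated summaries:

```python
from sklearn.metrics.pairwise import cosine_similarity

cluster_vec = vectorizer.transform([" ".join(cluster_sentences)])
candidate_vecs = vectorizer.transform([c["summary_text"] for c in candidates])

# Keep the candidate summary closest to the cluster as a whole.
best = max(range(len(candidates)),
           key=lambda i: cosine_similarity(candidate_vecs[i], cluster_vec)[0, 0])
print(candidates[best]["summary_text"])
```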

Summarisation of “EU / collapse / imminent”:

The narrative of “imminent collapse” is used to spread demoralisation and self-doubt among target audiences. The claim that external actors, including the EU and the US are stirring up protests in Russia’s neighbourhood is a recurring pro-Kremlin disinformation tale. This story also aims to pit EU member states against each other in an attempt to undermine European solidarity.

The generated summaries are partly good and partly confusing. The latter shouldn’t surprise us, as what we summarise are not stories but sentences taken out of the context of articles on the grounds that they contain the word “narrative”.

We are looking for a representation of narratives. Summarisation as a way to arrive at such a representation is promising as a proof of concept, but the current quality is not satisfactory.

Word vectors, PCA, and projection onto the 2D plane

How do our clusters relate to each other? In a vector representation of terms in a vocabulary, a term doesn’t get a number, but gets represented as a vector, that is, a series of numbers. In such vectors we might find non-referential patterns, e.g. the 21st and 37th numbers of the vector are relatively high if a person or a role is female.

What we will look for is the following: the “top terms” of a cluster are different terms, but their vector representations may be more similar to each other than to the vector representations of the “top terms” of other clusters.

To explore this we will do the following (a code sketch of the first two steps follows the list):

  1. Determine the word vectors for the “top terms” of each cluster
  2. Use Principal Component Analysis to reduce the original 300 dimensions of the vector space to 2 dimensions.
  3. Display the 2 dimensional representation in a diagram.
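A sketch of the first two steps, using spaCy’s medium English model (which ships 300-dimensional vectors) and scikit-learn’s PCA; cluster_top_terms is a hypothetical mapping from cluster id to its list of top terms:

```python
import spacy
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")  # model with 300-dimensional word vectors

labels, vectors = [], []
for cluster_id, top in cluster_top_terms.items():
    for term in top:
        labels.append((cluster_id, term))
        vectors.append(nlp(term).vector)  # multi-word terms get an averaged vector

coords = PCA(n_components=2).fit_transform(vectors)  # one (x, y) point per term
```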

To display the diagram we add some extra steps:

  • Give each cluster its own colour
  • Set the size of the shape representing each term based on the TF-IDF score for it.
  • If a term appears in several clusters, change its shape to indicate an overlap. The legend for “duplicates” shows what shape is used for how many clusters have that term among their “top terms”.

Ideally, such a 2D-representation would be a diagram with a number of islands of terms. Looking at all clusters we see this:

We see no obvious division of clusters as mutually exclusive clouds of same-coloured dots.

For a clearer picture, let’s look at only the two clusters “MH17 / downing / flight” and “Crimea / illegal annexation / referendum”. We get the following diagram:

We see that entities are displayed mostly on the right in the diagram, meaning that our 2D projection moves entities belonging to a cluster far away from the other elements of a cluster. But we see no clear grouping of items using this method.

I conclude that the exploration of clusters by mapping them to the 2D plane is not helpful for understanding how clusters are constituted and related.

Cosine similarity and cluster distance

In the previous section we tried to compare cluster centres, that is, the “top terms” of clusters, by mapping them to a 2D plane. Now we will try to compare clusters as a whole by looking at their “distance”.

Earlier, we created TF-IDF scores for the entire dataset and later for individual sentences. Now we will do the same for clusters, resulting in a score per cluster.

On these scores we can apply a “cosine similarity”. We compare the scores of all terms in the vocabulary across the different clusters, which gives us a single metric for the “distance” between any two clusters.

We can display the result in a heatmap. Note that cosine similarity results in principle in values between -1 and 1, but in practice most values in our matrix are close to 0. To emphasise the differences we take the log of the values.
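A sketch of this cluster-level comparison, again with hypothetical variable names (sentences_per_cluster, cluster_titles) and seaborn for the heatmap:

```python
import numpy as np
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

# One TF-IDF vector per cluster: all of a cluster's sentences joined together.
cluster_docs = [" ".join(sentences) for sentences in sentences_per_cluster]
cluster_tfidf = vectorizer.transform(cluster_docs)

similarity = cosine_similarity(cluster_tfidf)
# Most off-diagonal values are close to 0; a log scale emphasises the differences.
sns.heatmap(np.log(similarity + 1e-9),
            xticklabels=cluster_titles, yticklabels=cluster_titles, cmap="viridis")
```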

In the diagram, lighter means closer. We see the lightest colour on the diagonal, meaning that clusters are closest to themselves. Darker means further removed. We see for instance that — as a cluster — “Belarus / colour revolution / attempt” is distant from “Crimea / illegal annexation / referendum” and “protest / Euromaidan / coup” is relatively close to “Ukraine / war / civil”.

Looking at similarities of one cluster with all other clusters — looking at an entire row or column — we see that the cluster “MH17 / downing / flight” is relatively far removed from other clusters. We also see that the largest cluster “West / Ukraine / Russia” is relatively close to all other clusters.

We can take the cumulative cosine similarity for any cluster:

Looking just at the first three “top terms” — the ones in the title — we can understand this distribution. “Ukraine” and “Russia” appear in numerous clusters so the “West / Ukraine / Russia” cluster resembles all others. But terms like “Navalny”, “Crimea”, “chemical attack”, and “MH17” are relatively unique. That means that the latter clusters are more distinctive than the former.

We have gained some insight into the relationship between clusters, so far only identified by their generated names, but we have not gotten much nearer to understanding how “narratives” figure in all this.

Structuring top terms

Earlier we saw that from clustering follows a ranked list of “top terms”, that is, terms most representing the centre of the cluster. We expect that these are the terms most central to the narratives we try to construct.

The division of names and non-names we see in the scatterplot suggests using entities to categorise terms as a further step to understanding clusters.

We will use spaCy’s Named-Entity Recognition (NER) to categorise tokens. spaCy distinguishes more entity types, but we keep only the ones relevant for narratives. Tokens not categorised as entities we divide into nouns, verbs, and adjectives.

We will now divide the terms for each cluster into a number of categories. Between parentheses in the first four items you find how spaCy classifies the term if it is an entity. The last three items show how a term is classified according to part-of-speech. A code sketch of this categorisation follows the list.

  • Geographical (GPE)
  • Group (NORP)
  • Organisation (ORG)
  • Person (PERSON)
  • Noun (NOUN, PROPN, NUM, DATE)
  • Verb (VERB)
  • Adjective (ADJ)
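A minimal sketch of such a categorisation function with spaCy; the mapping mirrors the list above, and terms that fit nowhere fall back to the Noun category:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

ENTITY_CATEGORY = {"GPE": "Geographical", "NORP": "Group",
                   "ORG": "Organisation", "PERSON": "Person", "DATE": "Noun"}
POS_CATEGORY = {"NOUN": "Noun", "PROPN": "Noun", "NUM": "Noun",
                "VERB": "Verb", "ADJ": "Adjective"}

def categorise(term):
    doc = nlp(term)
    if doc.ents:  # prefer the named-entity label if spaCy recognises one
        return ENTITY_CATEGORY.get(doc.ents[0].label_, "Noun")
    return POS_CATEGORY.get(doc[0].pos_, "Noun")

for term in ["Navalny", "Crimea", "poisoning", "illegal"]:
    print(term, "->", categorise(term))
```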

What we will do now is take the top terms, assign them to a category, and display these categories. Additionally:

  • The font size of the term is determined by the TF-IDF score of the term: a term with a high score is represented large
  • Terms that occur in other clusters too are presented in the same colour in all clusters.

A partial result is this:

Partial result of structured overview of clusters

At first sight, the above presentation does help to give a quick indication of what a cluster is about. E.g. in cluster 5 we see “Navalny”, “Skripal”, and “Litvinenko” in the PERSON category and “poisoning” in the VERB category. This gives a clear indication of a claim of a pattern of poisonings.

We find that none of our categories is the leading aspect of clustering, that is, our top terms can fall into the Geographical, Person, Noun, Verb, and even Adjective categories. This presentation gives us a quick glance at what matters most, but it doesn’t show us a pattern we might use to simplify what really matters about narratives in general.

Where from here?

We have engaged in an exploration of a specific dataset, consisting of a set of sentences with either the word ‘disinformation’ or ‘narrative’, extracted from a larger body of articles and disinformation cases.

Before coming to a conclusion, let’s consider what are possible follow-ups to our exploration.

  1. Fine-tune existing exploration.
    Examples:
    - For the goal of clustering, get rid of generic sentences in the dataset which definitely don’t add content to any narrative, e.g. the sentence “What seemed to be most effective, however, was to switch narratives according to the trending topics of discussion.”.
    - Use a better language model for summarisation, e.g. GPT-3.
    - Clustering treats each token equally. An adjective gets the same treatment as a person. But for narratives, persons are more important than adjectives. Let the tokeniser remove words that are unlikely to be important to narratives before creating TF-IDF scores and clustering.
  2. Extend exploration by enlarging the dataset.
    Examples:
    - Our dataset was extracted from a larger body of articles which will contain more information on the narratives we are trying to describe. Mine these texts for additional data to give more content to the narratives.
    - Go beyond the publications of the European Union and crawl news articles for other mentions of “disinformation”.
  3. Redefine goals.
    In the present article we were interested only in exploring one dataset. We can redefine goals.
    Examples:
    - Create a classification system for the narratives defined through clustering, e.g. to identify fragments of a text which match one of the defined narratives.
    - Fine-tune the definition of “disinformation” based on research into the actual use of the term. E.g. an outcome of our current exploration is that “narrative” is a central concept to “disinformation”, but it is not currently part of the official definition of the term.
    - Identify within the body of text other aspects of the EU’s definition of “disinformation”, e.g. verification, intention, and actual harm done.

Conclusion

We looked into the usage of “disinformation” in a body of about 14,000 publications (news articles and disinformation cases) from EUvsDisinfo by first reducing these publications to a dataset of all sentences containing the term “disinformation” and then exploring that dataset with a variety of approaches.

Using Wordcloud and TF-IDF analysis we found that “disinformation” is strongly correlated with the notion of “narrative”. Additionally, we found that such narratives are typically recurrent, that is, they are repeated.

Based on this finding we created a new dataset consisting of all sentences in our original body of text containing the word “narrative” and used that dataset for further exploration.

We used k-means clustering to determine which narratives we can identify in our dataset.

Based on the “elbow method” we distinguished 16 different clusters. The challenge is to transform those clusters into a presentation that gives us a grip on narratives.

Summarising the sentences of each cluster using a BART-large-CNN model did result in summaries that captured some of the point of each narrative, but none of them was good enough to be acceptable. This is understandable given that the sentences in a cluster do not form a developing story, but are individual sentences taken out of the original body of text.

“Top terms” are terms that designate the centres of clusters, that is, they are relatively specific to a cluster. Projecting these top terms onto a 2D plane using vector representations of tokens and PCA didn’t lead to visually separate clusters, so this approach did not serve our purpose.

Using cosine similarity on TF-IDF scores per cluster and presenting the result in a heat-map showed which clusters are relatively alike and which ones are relatively dissimilar. We found that the largest cluster, containing about half the items in the dataset, is similar to all others. We see that it contains many of the terms like “Russia”, “Russian”, “Ukraine”, and “Kremlin” that we find throughout our dataset, but no very specific terms. Also we found that several clusters are defined by very specific entities, e.g. “MH17”, “Crimea”, and “Navalny”.

Our last approach to displaying clusters is based on using spaCy to assign the top terms of each cluster to named-entity or part-of-speech categories. The result gives a quick insight into what is central to each cluster, and we see that the clustering agrees with narratives that a human might distinguish in the texts.

The main finding of our exploration is that studying “disinformation” through studying narratives gives promising results. Given that the set of narratives encompasses that of “disinformation narratives”, this can be seen as an encouragement to first study relevant narratives and only within that context try to pin-point where disinformation creeps in.
