Disinformation Topics over Time

Cees Roele
11 min read · Oct 20, 2022

Visualising the development of topics over time in the EUvsDisinfo database

Visualising topics over time in the EUvsDisinfo database (image by author)

Introduction

The EUvsDisinfo database is a treasure trove of the European Union’s strategic communication about Russia. It contains over 14,000 reports expressing the EU’s political contention of “pro-Kremlin” claims.

In this article we will model the topics in that database, scrutinise their nature, and track them over time.

Data

Since 2015 the East StratCom Task Force of the European External Action Service has created over 14,000 reports on “pro-Kremlin disinformation cases” at its website EU vs Disinformation.

Cases in the EUvsDisinfo database focus on messages in the international information space that are identified as providing a partial, distorted, or false depiction of reality and spread key pro-Kremlin messages. This does not necessarily imply, however, that a given outlet is linked to the Kremlin or editorially pro-Kremlin, or that it has intentionally sought to disinform.

Presently, we are interested only in the topics we can distinguish in the data and will not touch on the notion of disinformation.

Below is an image of what a case report in the database looks like:

Annotated case report (image by author)

For the purpose of identifying topics we are interested in:

  • title
  • summary: a representation of the claim that is disputed
  • disproof: disputation of the claim
  • publication date: date on which the report was published (not on which the claim was made in the “publication/media” source)

By extracting the data from the EUvsDisinfo webpages containing the reports, we arrive at a dataset of 14,382 records with publication dates between 6 Jan 2015 and 13 Sept 2022.

Distribution of publication dates in EUvsDisinfo database (image by author)

In tabular form, the data looks as follows:

Dataset in tabular format (image by author)

Topics

We are looking for patterns in what the cases are about. For this we don’t need the URLs in our data, and we concatenate the title, summary, and disproof of each case into a single text, or document. What a case is about we call a topic. We assume that each document has a single topic.
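Assuming the scraped data sits in a pandas DataFrame with columns named `title`, `summary`, and `disproof` (the column names are an assumption; the real table may differ), the concatenation can be sketched as:

```python
import pandas as pd

# Toy stand-in for the scraped dataset; the real one has 14,382 rows.
df = pd.DataFrame({
    "title": ["Claim about X", "Claim about Y"],
    "summary": ["The claim as reported.", "Another claim."],
    "disproof": ["Why the claim is false.", "Counter-evidence."],
})

# One document per case: title + summary + disproof joined into a single text.
docs = (df["title"] + " " + df["summary"] + " " + df["disproof"]).tolist()
```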

Our goal is to identify and characterise a set of topics which recur in the documents.

To automatically identify and visualise topics, we will use BERTopic which …

… is a topic modelling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

The present article is focused on analysis, not on techniques. Let me phrase what happens in a relatively untechnical way.

Using a general model of the English language, the words in the documents of our dataset are encoded as numbers in a space with hundreds of dimensions. To be manageable this multi-dimensional space is reduced to a few dimensions. Now an algorithm divides the documents into clusters based on the similarity of their encodings. For each cluster, we determine a list pairing the most representative terms with a score for how representative they are. The topmost terms are used to automatically create a name for the cluster. With this added information the cluster — which is just a set of documents — gains a characterisation and we call it a topic.

Finding topics

When we use default parameters for BERTopic we find 225 topics in our dataset, with each having a minimum of 10 matching documents.

This might be a desirable result, e.g. if we want to create a recommendation system to enrich the information on the documents: “If you like this report, you might also like the following related reports”. But right now we are interested in a general overview of the topics in our data, not in relations between individual documents, and we’d like to cut down the list of resulting topics.

We therefore set the minimum number of documents per topic to 100. Additionally, we remove stopwords (e.g. ‘and’, ‘or’, ‘the’) from the topic representations, as they don’t contribute to their meaning.
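A set-up along these lines could look as follows. The parameter choices mirror the text, but the exact configuration isn’t shown in the article, so treat this as an assumption. The function is only defined, not run, since fitting requires the `bertopic` package and downloads an embedding model:

```python
def fit_disinfo_topics(docs):
    """Sketch of the assumed BERTopic set-up: minimum 100 documents per
    topic, English stopwords stripped from topic representations.
    Running this requires the `bertopic` package to be installed."""
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")
    topic_model = BERTopic(min_topic_size=100, vectorizer_model=vectorizer)
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics
```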

The result is the following list of 27 topics with their respective numbers of publications. Topic is the id of a topic, Count the number of publications for that topic, and Name is generated from the Topic id and the most prevalent terms of the topic.

Note that -1 signifies the group of outliers. These are documents which the algorithm did not include in a topic.

Overview of topics (image by author)

For better insight into the number of publications per topic, we can display the Count in a bar chart:

Number of publications per topic (image by author)

About the numbers

Let’s look at some percentages based on the data presented above:

Vital statistics (image by author)

In words:

  • 22% of all documents are outliers, that is, they are not assigned to any topic
  • almost a quarter of all documents (24.3%) are allocated to the first topic (id=0), which is over 30% of all documents that are not outliers
  • a little over 50% of all documents are allocated to a topic other than the first
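These percentages can be reproduced with a few lines. The absolute counts below are approximations back-derived from the quoted percentages, not exact figures from the dataset:

```python
total = 14_382            # documents in the dataset
outliers = 3_164          # approx. count behind the "22%" figure (assumption)
topic_0 = 3_495           # approx. count behind the "24.3%" figure (assumption)

outlier_share = outliers / total                        # roughly 0.22
topic_0_share = topic_0 / total                         # roughly 0.243
topic_0_of_assigned = topic_0 / (total - outliers)      # over 0.30
other_topics_share = 1 - outlier_share - topic_0_share  # a little over 0.50
```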

We see that the topic named ukraine_ukrainian_donbas_war is by far the most frequent in our dataset. Based on a notion of similarity — more on that later in this article — we can find the document in the database that is most representative of the topic, and hence, I would argue, of the entire EUvsDisinfo database:

Most representative document of EUvsDisinfo database: https://euvsdisinfo.eu/report/ukraine-lost-its-independence-and-ability-to-determine-its-own-foreign-policy

It is from the 14th of September 2021.

Characterising topics

Remember that our dataset consists of 14,382 documents. Basic clustering would do nothing more than create a list of cluster ids and attribute one of them to each document.

Attributing a cluster id to each document (image by author)

Topic modelling adds to clustering a characterisation of each cluster: a list of representative terms paired with scores. A term’s score is based not just on the term occurring many times in the documents of the cluster, but also on it not being prevalent in the documents of other clusters. Together the terms indicate what makes the cluster special.
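The scoring idea can be illustrated with a small numpy sketch. This is a simplification in the spirit of BERTopic’s c-TF-IDF, not its exact formula; the term counts are toy numbers:

```python
import numpy as np

# Term counts per cluster (rows: clusters, cols: terms); toy numbers.
terms = ["ukraine", "war", "vaccine", "russia"]
tf = np.array([
    [40, 30,  1, 20],   # cluster 0
    [ 1,  2, 50, 15],   # cluster 1
], dtype=float)

# Frequency of a term within its cluster, damped by how common the term
# is across all clusters: frequent-here AND rare-elsewhere scores highest.
tf_norm = tf / tf.sum(axis=1, keepdims=True)
avg_words = tf.sum() / tf.shape[0]
idf = np.log(1 + avg_words / tf.sum(axis=0))
scores = tf_norm * idf

for c in range(2):
    order = np.argsort(scores[c])[::-1]
    print(c, [(terms[i], round(scores[c, i], 3)) for i in order[:2]])
```

Note how “russia”, which occurs in both clusters, ends up scoring below each cluster’s distinctive terms even though its raw count is substantial.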

Below you see the table of terms and scores for one topic. Note how the generated name for the topic is derived from the highest ranking terms.

Top terms and their scores for one topic (image by author)

The overview of topics with their generated names already gave us insight into what the different topics are about. Now we extend this to more terms and use the score to indicate how representative each term is. Bright red indicates a high score; a colour towards white indicates a low score.

Topic Term matrix (image by author)

Two examples:

  1. Topic 14_vaccine_sputnik_vaccines_ema displays a bright red vaccine, and 24_soros_george_foundation_society displays soros and george in bright red. This means that these topics can pretty much be identified by these terms alone.
  2. In 16_interference_election_2016_elections we see that none of the terms is marked very red. It means that to identify whether this is the topic for a document, the combination of a number of terms is important. If we find in a document interference, and election, and meddling, and trump, and presidential, then there is a good chance that this is the topic. Note that this topic is “competing” with the very similar topic 23_election_trump_biden_fraud.

Relating topics

Looking at the table showing term scores for all topics, we see that numerous terms occur in different topics. This leads to a thought:

If topics are similar, we might conflate them and end up with fewer topics.

Terms and scores are used to define topics and allocate documents to them. Additionally, a notion of distance can be defined as a measure of how similar two topics are.

We can project the topics onto a two-dimensional plane which makes such a distance visible.

Intertopic Distance Map (image by author)

In the diagram we see seven groups of circles. Each circle represents a topic — and the size of the circle the number of documents allocated to it.

This suggests we might reduce the number of topics to seven by conflating the topics in each group. Trying this, the result was seven topics that were mostly the same as the seven most frequent topics above, but now about half of all documents ended up as outliers. This approach therefore didn’t lead to a simplification but to a truncation.
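The reduction attempt can be sketched as follows. The function is only defined here, not executed: it needs a fitted BERTopic model, and the exact signature of `reduce_topics` has changed between BERTopic versions, so take this as an outline rather than exact code:

```python
def reduce_to_seven(topic_model, docs):
    """Sketch of the conflation attempt: show the intertopic distance map,
    then merge the fitted topics down to seven. Requires a fitted BERTopic
    model; `reduce_topics` arguments vary by BERTopic version."""
    # The intertopic distance map that suggested seven groups:
    topic_model.visualize_topics()
    # Merge the fitted topics down to seven:
    topic_model.reduce_topics(docs, nr_topics=7)
    return topic_model.get_topic_info()
```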

Hierarchically relating topics

Another approach is to calculate the distance between every pair of topics, pick the two topics with the shortest distance, and merge them. We then calculate the distances from all other topics to the newly merged topic, again pick the two topics with the shortest distance, and merge those. We continue until all topics have merged into one single topic.
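This procedure is agglomerative clustering, and scipy implements it directly. A minimal sketch over toy topic vectors (BERTopic’s `hierarchical_topics` performs an equivalent computation on the real topic representations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy topic representations (e.g. term-score vectors), one row per topic.
topic_vectors = np.array([
    [1.0, 0.9, 0.0, 0.1],   # topic A: covid-like
    [0.9, 1.0, 0.1, 0.0],   # topic B: vaccine-like (close to A)
    [0.0, 0.1, 1.0, 0.9],   # topic C: election-like
])

# Repeatedly merge the two closest topics until one remains.
# Each row of Z records one merge: the two indices joined and their distance.
Z = linkage(topic_vectors, method="ward")
print(Z)
```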

We can make this visible using a dendrogram:

Dendrogram: hierarchy of topics (image by author)

Note that the column with topic names is reordered so that the dendrogram can be drawn. The further to the left the dot connecting two topics lies, the smaller the distance between them.

In the marked green rectangle we see that 21_italy_coronavirus_eu, 7_coronavirus_virus_covid19, and 14_vaccine_sputnik_vaccines lie close together. The red arrows indicate the nodes where these topics are merged, in two steps, into a new topic.

As terms and scores can also be calculated for the topics created by merging, we can automatically generate names for them as well. In another view of the same information, the nodes connecting topics are named in the same way that the most salient words of a topic are used to create its name.

Note that the second dendrogram is ordered differently: the leaves are on the right side and the root is in the top-left corner. We see here that the three earlier mentioned topics are combined into one which we can name coronavirus_vaccine_sputnik_covid19_pandemic.

To distinguish merged topics from the (very similar) topics they are merged from, five terms are used in this diagram to generate topic names.

Merged topics in the hierarchy are “virtual” and have no topic id.

Topics over time

Above we looked at topics, their characterisation by top terms, and the similarity of topics. We will now look at the number of documents per time interval in the EUvsDisinfo database for each topic. Arbitrarily, we select time intervals of two months, giving forty-two intervals for the seven years EUvsDisinfo has been active. We place this in a line chart, with one line per topic.
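With publication dates and topic labels in a pandas DataFrame, the two-month counts can be computed directly; with a fitted model, BERTopic’s `topics_over_time(docs, timestamps, nr_bins=...)` produces comparable data. A toy sketch (column names are stand-ins):

```python
import pandas as pd

# Toy publication dates and topic labels.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-10", "2020-02-20", "2020-03-05", "2020-05-01"]),
    "topic": [0, 0, 7, 0],
})

# Two-month bins starting at month boundaries; one count per (interval, topic).
counts = (df.groupby([pd.Grouper(key="date", freq="2MS"), "topic"])
            .size()
            .rename("count")
            .reset_index())
print(counts)
```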

Top 10 topics over time in the EUvsDisinfo database (image by author)

We see that 0_ukraine_ukrainian_donbas_war has consistently been the topic with the most publications, except for early 2020, when 7_coronavirus_virus_covid19_china peaked.

In BERTopic the above diagram is interactive, making it possible to get a better view of the development of each topic over time.

Here we chose to display only the ten topics with the most documents published. If all topics were included, there would have been many more lines near the bottom of the diagram, which would make them difficult to distinguish and compare.

Let’s look at two topics that according to the hierarchical clustering diagram are closely related:

Navalny/Skripal cases over time (image by author)

Given that these cases are rooted in very specific events, we see clear peaks for each of them. As a word of caution: the interpolation in the diagram — which might be appropriate in other cases — is misleading here: the Skripal line makes a long ascent from early 2017 to early 2018, but the event occurred only in 2018.

In the Hierarchy Chart with Named Nodes we can track interesting nodes at which several topics are merged. Taking one such node, e.g. coronavirus_vaccine_sputnik_covid19_vaccines, we can look at how the different topics coming together in that node develop:

coronavirus_vaccine_sputnik_covid19_vaccines node over time (image by author)

Here we see clear peaks for the period in early 2020 when COVID19 required the most understanding and decision-making and another peak for vaccines around 2021.

By using our knowledge of the topic hierarchy we can find selections of topics for interesting comparisons of their development over time.

A third way of selecting topics — after taking the top 10 and taking nodes from the hierarchy — is to find topics for some keywords that interest us.

Just as the relative similarity of topics is based on term scores, individual terms or search strings can be scored, after which we can select the topics that most closely match them.
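BERTopic exposes this as `find_topics`; with a fitted model, `topic_model.find_topics("european union and sanctions", top_n=3)` returns topic ids with their similarities. The underlying idea can be sketched with TF-IDF cosine similarity — a simplification of BERTopic’s embedding-based matching, over toy topic descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy topic descriptions built from their top terms (for illustration only).
topic_names = [
    "sanctions russia eu economy",
    "eu european member states",
    "nato alliance russia military",
    "vaccine sputnik covid",
]

query = "european union and sanctions"
vec = TfidfVectorizer()
X = vec.fit_transform(topic_names + [query])
sims = cosine_similarity(X[-1], X[:-1]).ravel()

# Keep topics above a similarity threshold, best match first.
matches = sorted(
    ((s, n) for s, n in zip(sims, topic_names) if s > 0.10),
    reverse=True)
print(matches)
```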

Here we search for “european union and sanctions” and find three topics which relatively closely match it:

Similar topics for the word 'european union and sanctions' (similarity threshold=0.40):
- 6_sanctions_russia_eu (0.609)
- 12_eu_european_member (0.489)
- 5_nato_alliance_russia (0.438)

Let’s visualise them:

Topics similar to “european union and sanctions” over time (image by author)

Note that the above is not a filter on the search string: we simply see the frequencies for the topics. What is interesting is that these three topics show a relatively similar pattern of frequencies over time.

Conclusions

We defined 27 topics to which we assigned three-quarters of our dataset of 14,000 documents from the EUvsDisinfo database of “disinformation cases”, leaving the rest of the documents as outliers.

The topic ukraine_ukrainian_donbas_war dominates during almost the entire existence of EUvsDisinfo in terms of frequency, making up over 30% of the documents we assigned to topics. It shows that the focus of the East StratCom Task Force, the producer of EUvsDisinfo, has been very much on Ukraine.

Based on a notion of similarity, we found that the document most representative of the dominating topic, and hence of the entire EUvsDisinfo database, is “Ukraine lost its independence and ability to determine its own foreign policy” (careful: the report exists for the EU to disprove this claim).

Using term scores we characterised the different topics by the terms that are most prevalent for them and most outstanding compared to other topics.

We displayed what topics are similar using a hierarchical tree.

Based on relevance (most publications), on nodes in the hierarchical tree, and on searches for terms, we selected subsets of topics and displayed their document counts over time in order to find interesting patterns. We found that certain topics are strongly time- and event-bound, like Navalny/Skripal or Covid19, while ukraine_ukrainian_donbas_war kept dominating the output for most of the seven-year period.

Post scriptum

You can find the code I used for the above research here:

Great thanks to Maarten Grootendorst for the creation & publication of the splendid BERTopic topic modelling tool!


Cees Roele

Language Engineer, Python programmer, Scrum Master, Writer