A Term Score Matrix for BERTopic

Cees Roele
6 min read · Oct 25, 2022

An improved visualisation of term scores for topics with BERTopic

Term Score Matrix (image by author)

In short

This article demonstrates a Term Score Matrix, a visualisation of the term-score pairs that characterise topics in BERTopic. The focus is on functionality. A link to the notebook containing the implementation of the stylised dataframe for the Term Score Matrix is at the bottom of this article.

The dataset used is based on more than 14,000 reports on “pro-Kremlin disinformation cases” from the website EU vs Disinformation. It reflects the political contention between the EU and Russia over the past seven years. You can find more information on the dataset and on topic modelling of this dataset with BERTopic in the following article:

Topic modelling: terms and scores

Using a measure of distance between samples, clustering defines a number of clusters and assigns each sample of the dataset to one of them.

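Assuming we have assigned our dataset to a variable docs, we can model topics with BERTopic as follows:

from bertopic import BERTopic

# docs is a list of strings, one string per document
docs = [ ... list of documents/strings ...]
# Require at least 100 documents per topic and keep 20 ranked terms per topic
topic_model = BERTopic(min_topic_size=100, top_n_words=20)
# fit_transform returns the topic assigned to each document and the probabilities
topics, probs = topic_model.fit_transform(docs)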

The result is a set of thirty clusters with a size of at least 100 samples. Here are the resulting topics with the largest frequencies:

Top 8 of 30 topics after topic modelling with BERTopic (image by author)

Topic modelling applies clustering to linguistic samples and characterises each resulting cluster as a ranked list of terms with their c-TF-IDF scores. The table below shows an example. Note that the name of the topic, shown above the table, is generated from the topic id (here 2) and the highest-ranking terms.

Term Score table for topic 2 (image by author)
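In BERTopic, topic_model.get_topic(2) returns such a ranked list as (term, score) tuples. The sketch below uses invented terms and scores, not values from the EUvsDisinfo model, to show that structure and how a name can be derived from the topic id and the top terms. The helper topic_name and its n_terms parameter are assumptions made for illustration; the names shown in this article may have been produced differently.

```python
# Illustrative (term, c-TF-IDF score) pairs in the shape returned by
# topic_model.get_topic(topic_id). Terms and scores are invented.
ranked_terms = [
    ("alpha", 0.051),
    ("beta", 0.034),
    ("gamma", 0.027),
    ("delta", 0.025),
    ("epsilon", 0.024),
]


def topic_name(topic_id, ranked_terms, n_terms=4):
    """Join the topic id and the n_terms highest-ranked terms with underscores.

    Hypothetical helper, written for this example only.
    """
    top_terms = [term for term, _score in ranked_terms[:n_terms]]
    return "_".join([str(topic_id)] + top_terms)


name = topic_name(2, ranked_terms)
print(name)  # 2_alpha_beta_gamma_delta
```

The list is already sorted by score, so taking the first few entries is enough to get the highest-ranking terms.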

Term Score Decline

Having names for topics based on relevant terms helps us identify and understand the different topics. But what else are we to do with the list? Considering only the sequence of terms gives us limited understanding of how relevant these terms are to the topic.

The Term Score Decline diagram gives us insight into how the scores of terms decrease to a level where their influence is hardly distinguishable from that of other terms.

For the illustration below we have configured BERTopic to create term scores for twenty terms by initialising it with top_n_words=20. For most topics, we see a steep decline of the c-TF-IDF score from the term ranked first to the term ranked third. After the third rank the scores gradually flatten with increasing rank.

To get this diagram, we run topic_model.visualize_term_rank(). The annotations in red were added manually.

Term Score Decline (image by author)
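The shape of such a decline curve can also be inspected numerically. The following is a minimal sketch with invented scores for a single topic; the helpers relative_scores and flattening_rank, and the tolerance of 5% of the first score, are assumptions chosen for illustration and are not part of BERTopic.

```python
def relative_scores(scores):
    """Express each score as a fraction of the first-ranked score."""
    first = scores[0]
    return [score / first for score in scores]


def flattening_rank(scores, tolerance=0.05):
    """Return the first rank (1-based) whose drop to the next rank is small.

    "Small" means less than `tolerance` times the first-ranked score.
    If no such rank exists, the last rank is returned.
    """
    first = scores[0]
    for i in range(len(scores) - 1):
        drop = (scores[i] - scores[i + 1]) / first
        if drop < tolerance:
            return i + 1  # convert 0-based index to 1-based rank
    return len(scores)


# Invented c-TF-IDF scores for one topic, ordered by rank.
scores = [0.060, 0.036, 0.024, 0.022, 0.021, 0.020]

print([round(r, 2) for r in relative_scores(scores)])
print(flattening_rank(scores))  # 3
```

For these made-up values the curve flattens from the third rank on, which mirrors the pattern described above for most topics.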

Only by using tooltips in BERTopic's plotly-based diagram can we see which line in the diagram corresponds to which topic (see the little arrow on the left of the tooltip in the image). At the eleventh-ranked term we see that topics 29_kerch_strait and 12_MH17_JIT stand out by having higher c-TF-IDF values than the others. Their values at this rank are even higher than the scores of the first-ranked terms of several other topics.

The Term Score Decline diagram helps us decide where to cut off the number of terms we want to distinguish. For example, we might decide that only the top 11 terms are important enough to consider. But how should we deal with outliers, like the two topics that have relatively high c-TF-IDF values even at the eleventh rank?
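One way to find such outliers without hovering over every line is to compare each topic's score at a given rank with the first-ranked scores of the other topics. The sketch below does this on invented data with three ranks per topic; the function standout_topics and the topic labels are hypothetical.

```python
def standout_topics(scores_by_topic, rank):
    """Map each stand-out topic to the topics whose top score it exceeds.

    A topic stands out at `rank` (1-based) when its score at that rank is
    higher than the first-ranked score of at least one other topic.
    """
    result = {}
    for topic, scores in scores_by_topic.items():
        beaten = [
            other
            for other, other_scores in scores_by_topic.items()
            if other != topic and scores[rank - 1] > other_scores[0]
        ]
        if beaten:
            result[topic] = beaten
    return result


# Invented scores, ordered by rank, for three hypothetical topics.
scores_by_topic = {
    "A": [0.09, 0.07, 0.06],
    "B": [0.05, 0.03, 0.02],
    "C": [0.04, 0.02, 0.01],
}

outliers = standout_topics(scores_by_topic, rank=3)
print(outliers)  # {'A': ['B', 'C']}
```

With a fitted model, the per-topic score lists could be taken from the second element of each tuple returned by topic_model.get_topic(topic_id).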

Bar chart

Let’s take a closer look at these topics and their terms with BERTopic’s visualize_barchart(). In the following diagram we look at the 15 highest ranked terms and include two other topics as reference.

Ignoring arguments for setting the title, height, and custom labels, we run: topic_model.visualize_barchart(topics=[0,29,12,30], n_words=12)

Barchart for selected topics (image by author)

We see, with an added annotation and a tooltip, the ranked terms with their relative scores for four topics, including the two we saw standing out in the Term Score Decline diagram. Looking at the terms, we can understand why the scores for these topics decline relatively little, that is, why they keep standing out compared to other topics: both involve specific events.

Topic 29 is about the Kerch Strait waterway and we see terms like sea, waters, vessels, boats, crews, and ships. Topic 12 is about the downing of flight MH17 and we see terms like BUK (anti-aircraft missile), missile, JIT (Joint Investigation Team), flight, crash, and downing.

Term Score Matrix

Above we saw that the bar chart can help us answer the question: “What are the highest ranking terms of any topic and how do their scores decline?”

But it doesn’t let us compare in the other direction: the terms at a given rank across topics. First, because the scales of the individual bar charts differ, as indicated by the annotation in the image. Second, because the lengths of bars placed in sequence, rather than in parallel, are hard to compare visually. Comparison in the multiple bar chart layout works in only one dimension: between the scores of the terms of one topic within one bar chart.

Additionally, the bar chart uses space to represent a scalar value. If we were to represent all thirty topics in a layout of multiple bar charts, it would take eight rows of up to four bar charts each.
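Both drawbacks can be made concrete with a little arithmetic. In the sketch below, the scores are invented and bar_lengths is a hypothetical helper that converts scores to bar lengths in characters. It shows that two topics with very different scores produce identical bars when each chart uses its own scale, and it computes the number of rows needed for thirty charts at four per row.

```python
import math


def bar_lengths(scores, scale_max, width=80):
    """Bar length in characters for each score, relative to scale_max."""
    return [round(width * score / scale_max) for score in scores]


# Invented scores: topic_y has the same shape as topic_x at a quarter of the size.
topic_x = [0.08, 0.04, 0.02]
topic_y = [0.02, 0.01, 0.005]

# Each chart scaled to its own maximum: the bars look identical.
own_x = bar_lengths(topic_x, max(topic_x))
own_y = bar_lengths(topic_y, max(topic_y))

# One shared scale: the difference between the topics becomes visible.
shared_max = max(topic_x + topic_y)
shared_x = bar_lengths(topic_x, shared_max)
shared_y = bar_lengths(topic_y, shared_max)

print(own_x, own_y)        # [80, 40, 20] [80, 40, 20]
print(shared_x, shared_y)  # [80, 40, 20] [20, 10, 5]

# Thirty topics at up to four bar charts per row.
rows_needed = math.ceil(30 / 4)
print(rows_needed)  # 8
```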

We can address these drawbacks of the bar chart representation by having colour represent the score, similar to a heatmap.

We can use a pandas DataFrame to do this for the same topics as we displayed in the bar chart above:

Term Score Matrix for topics 0, 12, 29, and 30 (image by author)

We see that our ability to compare is now two-dimensional: horizontally we can compare the relative values of the ranked terms of one topic, as in the bar chart, and vertically the relative values of all topics at a specific rank. As with BERTopic’s plotly-based diagrams, we display a tooltip with the c-TF-IDF value; the image shows the example for boat.

Note that the bar chart can be extended to any number of terms because it stacks them vertically, whereas here we are restricted by the width of the screen. We are displaying only eight ranks instead of fifteen.
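A styled DataFrame of this kind can be sketched with the pandas Styler. This is not necessarily the implementation in the notebook linked at the end of this article. It assumes pandas 1.3 or later, where background_gradient accepts a gmap argument and set_tooltips is available; rendering the colours also needs matplotlib and jinja2. The input data, topic labels, and the helpers term_score_frames and style_matrix are invented for illustration.

```python
import pandas as pd

# Invented (term, score) pairs per topic, in the shape returned by
# topic_model.get_topic(topic_id). With a fitted model this dict could be
# built as {t: topic_model.get_topic(t) for t in [0, 12, 29, 30]}.
topic_terms = {
    "topic A": [("alpha", 0.09), ("beta", 0.07), ("gamma", 0.06)],
    "topic B": [("delta", 0.05), ("epsilon", 0.03), ("zeta", 0.02)],
}


def term_score_frames(topic_terms, n_ranks):
    """Return aligned (terms, scores) DataFrames: topics as rows, ranks as columns."""
    ranks = list(range(1, n_ranks + 1))
    terms = pd.DataFrame.from_dict(
        {t: [term for term, _ in pairs[:n_ranks]] for t, pairs in topic_terms.items()},
        orient="index",
        columns=ranks,
    )
    scores = pd.DataFrame.from_dict(
        {t: [score for _, score in pairs[:n_ranks]] for t, pairs in topic_terms.items()},
        orient="index",
        columns=ranks,
    )
    return terms, scores


def style_matrix(terms, scores, cmap="YlOrRd"):
    """Show terms as cell text, colour each cell by its score, add score tooltips."""
    tooltips = scores.apply(lambda col: col.map("{:.4f}".format))
    return (
        terms.style
        # axis=None applies one colour scale to the whole table, so cells are
        # comparable both across ranks and across topics.
        .background_gradient(cmap=cmap, axis=None, gmap=scores)
        .set_tooltips(tooltips)
    )


terms, scores = term_score_frames(topic_terms, n_ranks=3)
print(terms)
# In a notebook, evaluating style_matrix(terms, scores) displays the coloured table.
```

Keeping terms and scores in two frames with identical labels is the key design choice here: the cell text comes from one frame and the colour from the other. The n_ranks argument is where the screen-width limit on the number of ranks would be applied.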

Let’s look at the Term Score Matrix for all topics now. Colours may vary as long as they indicate the scale. In the next image we use a colour scheme of yellow, orange, and red. As the whole image is zoomed out, we can now include fifteen ranks.

Term Score Matrix for EUvsDisinfo (image by author)

Much more information than we get from the multiple bar chart layout fits into this single matrix and it is much easier to compare values in all directions.

Conclusion

The Term Score Matrix is a space-efficient diagram representing c-TF-IDF scores for terms along the two dimensions of topics and ranks. It lets us quickly see how term scores decline within each topic and which terms are prevalent in which topics.

Programming code

A Jupyter notebook with the Python code used in this article is available on GitHub.

You can find a simple explanation of the styling technique behind the Term Score Matrix in another article of mine:

Cees Roele

Language Engineer, Python programmer, Scrum Master, Writer