Topic Modeling Using BERTopic

Import CrowdTangle Data

For this example we import a CrowdTangle dataframe that has been preprocessed using the OCR Notebook. We are only dealing with one image per post; there are no videos (and thus no transcriptions). Each post has up to two text columns: Description, which contains the caption, and ocr_text. When exploring the textual content of the posts, we treat each of these columns as a separate document. We therefore transform our table and create new_df as a text table that contains a reference to the post (shortcode), the actual Text, and a Text Type column.

In [1]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2023-11-30-Export-Posts-Crowd-Tangle.csv')

Next, we want to transform the DataFrame from one post per row to one text document per row (think tidy data!).

In [2]:
df.head()
Unnamed: 0 Account User Name Followers at Posting Post Created Post Created Date Post Created Time Type Total Interactions Likes ... Photo Title Description Image Text Sponsor Id Sponsor Name Overperforming Score (weighted β€” Likes 1x Comments 1x ) shortcode image_file ocr_text
0 0 FREIE WÄHLER Bayern fw_bayern 9138 2023-10-09 20:10:19 CEST 2023-10-09 20:10:19 Photo 566 561 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN #Landtagswahl23 πŸ€©πŸ§‘πŸ™ #FREIEWΓ„HLER #Aiwanger #Da... FREIE WAHLER 15,8 % NaN NaN 2.95 CyMAe_tufcR media/images/fw_bayern/CyMAe_tufcR.jpg FREIE WAHLER 15,8 %
1 1 Junge Liberale JuLis Bayern julisbayern 4902 2023-10-09 19:48:02 CEST 2023-10-09 19:48:02 Album 320 310 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN Die Landtagswahl war für uns als Liberale hart... NaN NaN NaN 1.41 CyL975vouHU media/images/julisbayern/CyL975vouHU.jpg Freie EDP Demokraten BDB FDP FB FDP DANKE FÜR ...
2 2 Junge Union Deutschlands junge_union 44414 2023-10-09 19:31:59 CEST 2023-10-09 19:31:59 Photo 929 925 ... https://scontent-sea1-1.cdninstagram.com/v/t39... NaN Nach einem starken Wahlkampf ein verdientes Er... HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris... NaN NaN 1.17 CyL8GWWJmci media/images/junge_union/CyL8GWWJmci.jpg HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris...
3 3 Katharina Schulze kathaschulze 37161 2023-10-09 19:29:02 CEST 2023-10-09 19:29:02 Photo 1,074 1009 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN So viele Menschen am Odeonsplatz heute mit ein... NaN NaN NaN 1.61 CyL7wyJtTV5 media/images/kathaschulze/CyL7wyJtTV5.jpg Juo I W
4 4 Junge Union Deutschlands junge_union 44414 2023-10-09 18:01:34 CEST 2023-10-09 18:01:34 Album 1,655 1644 ... https://scontent-sea1-1.cdninstagram.com/v/t39... NaN Herzlichen GlΓΌckwunsch zu diesem grandiosen Wa... NaN NaN NaN 2.34 CyLxwHuvR4Y media/images/junge_union/CyLxwHuvR4Y.jpg 12/12 der hessischen JU-Kandidaten ziehen in d...

5 rows Γ— 25 columns

We restructure df around its two key text columns, 'Description' and 'ocr_text', so that each row of the new DataFrame corresponds to a single text entry. To achieve this, we first split the original DataFrame into two separate DataFrames, one per column, and rename each text column to 'Text' for uniformity. A new column, 'Text Type', categorizes each entry as either 'Caption' (originating from 'Description') or 'OCR' (originating from 'ocr_text'), while the 'shortcode' column is retained as a unique identifier linking each entry to its original post. Finally, we concatenate the two DataFrames, drop any rows with empty or NaN values in the 'Text' column, and reset the index. The result segregates the text data by source while keeping every document traceable to its post.

In [3]:
import pandas as pd

# Creating two separate dataframes
df_description = df[['shortcode', 'Description']].copy()
df_ocr_text = df[['shortcode', 'ocr_text']].copy()

# Renaming columns
df_description.rename(columns={'Description': 'Text'}, inplace=True)
df_ocr_text.rename(columns={'ocr_text': 'Text'}, inplace=True)

# Adding 'Text Type' column
df_description['Text Type'] = 'Caption'
df_ocr_text['Text Type'] = 'OCR'

# Concatenating the dataframes
new_df = pd.concat([df_description, df_ocr_text])

# Dropping any rows where 'Text' is NaN or empty
new_df.dropna(subset=['Text'], inplace=True)
new_df = new_df[new_df['Text'].str.strip() != '']

# Resetting the index
new_df.reset_index(drop=True, inplace=True)
In [4]:
new_df.head()
shortcode Text Text Type
0 CyMAe_tufcR #Landtagswahl23 πŸ€©πŸ§‘πŸ™ #FREIEWΓ„HLER #Aiwanger #Da... Caption
1 CyL975vouHU Die Landtagswahl war fΓΌr uns als Liberale hart... Caption
2 CyL8GWWJmci Nach einem starken Wahlkampf ein verdientes Er... Caption
3 CyL7wyJtTV5 So viele Menschen am Odeonsplatz heute mit ein... Caption
4 CyLxwHuvR4Y Herzlichen GlΓΌckwunsch zu diesem grandiosen Wa... Caption
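Because the shortcode column is retained, any text row can be joined back to its post metadata at a later point. A minimal sketch, assuming the column names shown in the export above:

# Join text rows back to selected post metadata via the shared shortcode
merged = new_df.merge(
    df[['shortcode', 'Account', 'Post Created Date']],
    on='shortcode',
    how='left',
)
merged.head()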

BERTopic

At this stage, the data is ready for topic modeling. We use the BERTopic package and follow the tutorial notebook provided by its author.

In [5]:
!pip install -q bertopic

In the following cells we download a stopword dictionary for the German language and apply it according to the documentation.

In [10]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

STOPWORDS = stopwords.words('german')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=STOPWORDS)
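Should generic corpus words still dominate the topic representations later on, the stopword list can be extended before fitting. A hedged variant (the extra words are illustrative, not part of the original notebook):

# Illustrative only: add corpus-specific stopwords on top of the NLTK list
CUSTOM_STOPWORDS = STOPWORDS + ['mehr', 'heute', 'uhr']
vectorizer_model_custom = CountVectorizer(ngram_range=(1, 2), stop_words=CUSTOM_STOPWORDS)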

Now we’re ready to create our corpus in docs, a list of text documents to pass to BERTopic.

In [6]:
# We create our corpus as a plain list of strings, as expected by BERTopic
docs = new_df['Text'].to_list()
In [30]:
from bertopic import BERTopic

# We're dealing with German texts, therefore we choose 'multilingual'. When dealing with English texts exclusively, choose 'english'
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
2023-12-01 08:52:44,038 - BERTopic - Embedding - Transforming documents to embeddings.
2023-12-01 08:52:50,561 - BERTopic - Embedding - Completed βœ“
2023-12-01 08:52:50,563 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-01 08:53:04,597 - BERTopic - Dimensionality - Completed βœ“
2023-12-01 08:53:04,599 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-01 08:53:05,103 - BERTopic - Cluster - Completed βœ“
2023-12-01 08:53:05,109 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-01 08:53:05,639 - BERTopic - Representation - Completed βœ“
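Since we set calculate_probabilities=True, fit_transform also returns a topic-probability distribution per document in probs. A quick sketch for inspecting the distribution of a single document:

# Visualize the topic-probability distribution of the first document
topic_model.visualize_distribution(probs[0])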

The following cells have been copied from the BERTopic Tutorial. Please check the linked notebook for more functions and the documentation for more background information.

Extracting Topics

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [31]:
freq = topic_model.get_topic_info(); freq.head(5)
Topic Count Name Representation Representative_Docs
0 -1 860 -1_bayern_csu_uhr_mehr [bayern, csu, uhr, mehr, menschen, mΓΌnchen, te... [Wir gehen mit #herzstatthetze in den Wahlkamp...
1 0 137 0_wΓ€hlen_fdp_hessen_heute [wΓ€hlen, fdp, hessen, heute, stimme, stimmen, ... [Unser MinisterprΓ€sident @markus.soeder steigt...
2 1 104 1_energie_co2_klimaschutz_habeck [energie, co2, klimaschutz, habeck, wasserstof... [Habeck tΓ€uscht Γ–ffentlichkeit mit Zensur: RΓΌc...
3 2 103 2_zuwanderung_migration_grenzpolizei_migration... [zuwanderung, migration, grenzpolizei, migrati... [Wir sagen Ja zu #Hilfe und #Arbeitsmigration,...
4 3 89 3_uhr_starke mitte_bayerns starke_bayerns [uhr, starke mitte, bayerns starke, bayerns, b... ["Deutschland-Pakt" aus Scholz der Krise komme...

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topics that were generated:

In [32]:
len(freq)
52

We have a total of 52 topics (one of which is the -1 outlier topic).

In [33]:
topic_model.get_topic(0)  # Select the most frequent topic
[('wΓ€hlen', 0.01628736425293884),
 ('fdp', 0.01626632927971954),
 ('hessen', 0.013634118460503969),
 ('heute', 0.013441948777152065),
 ('stimme', 0.011907460231710654),
 ('stimmen', 0.011505832701270827),
 ('landtagswahl', 0.011272934711858047),
 ('wahlkampf', 0.01059385752962746),
 ('sonntag', 0.01057520846171656),
 ('bayern', 0.010322807358750668)]
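The topic assignments in topics are aligned with the rows of new_df, so they can be written back to the table, for instance to compare how topics distribute across captions and OCR text. A minimal sketch:

# Attach the assigned topic to each text row and count per source
new_df['Topic'] = topics
new_df.groupby(['Text Type', 'Topic']).size().sort_values(ascending=False).head(10)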

Visualize Topics

After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of those that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

In [12]:
topic_model.visualize_topics()
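The visualizations are returned as Plotly figures, so they can also be written to standalone HTML files for sharing outside the notebook (the path below is a placeholder):

# Save the interactive topic map as a standalone HTML file
fig = topic_model.visualize_topics()
fig.write_html('/content/drive/MyDrive/topic_map.html')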

Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [15]:
topic_model.visualize_barchart(top_n_topics=15)

Topic Reduction

We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so is that you can decide on the number of topics after knowing how many were actually created. It is difficult to predict before training your model how many topics are in your documents and how many will be extracted. Instead, we can decide afterwards how many topics seem realistic:

In [36]:
topic_model.reduce_topics(docs, nr_topics=15)
2023-12-01 08:53:07,148 - BERTopic - Topic reduction - Reducing number of topics
2023-12-01 08:53:07,642 - BERTopic - Topic reduction - Reduced number of topics from 52 to 15
<bertopic._bertopic.BERTopic at 0x794041658ca0>
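Note that reduce_topics() updates the model in place; in recent BERTopic versions the per-document assignments are refreshed via the topics_ attribute rather than returned:

# Refresh the per-document topic assignments after reduction
topics = topic_model.topics_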

Visualize Terms After Reduction

In [19]:
topic_model.visualize_barchart(top_n_topics=15)

Saving the model

The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved, while the fitted UMAP and HDBSCAN models will be.

In [38]:
# Save model
topic_model.save("/content/drive/MyDrive/2023-12-01-LTW23-CrowdTangle-Posts-model")
2023-12-01 08:53:54,135 - BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model,please make sure that the environments in which you saveand load the model are **exactly** the same. The version of BERTopic,its dependencies, and python need to remain the same.
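To reuse the model later, it can be loaded back in a matching environment (see the warning above). A minimal sketch:

from bertopic import BERTopic

# Load the saved model; BERTopic, its dependencies, and Python must
# match the versions used when saving (pickle serialization)
loaded_model = BERTopic.load("/content/drive/MyDrive/2023-12-01-LTW23-CrowdTangle-Posts-model")
loaded_model.get_topic_info().head()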