Import CrowdTangle Data

For this example we import a CrowdTangle dataframe, which has been preprocessed using the OCR Notebook. We are only dealing with one image per post; there are no videos (and thus no transcriptions). In this example, we have up to two text columns per post: Description, which contains the caption, and ocr_text. When exploring the textual content of the posts, we treat each of those columns as one document. Thus, we transform our table and create new_df as a text table that contains a reference to the post (shortcode), the actual Text, and a Text Type column.

In [1]:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2023-11-30-Export-Posts-Crowd-Tangle.csv')

Next, we want to transform the DataFrame from one post per row to one text document per row (think tidy data!)
In [2]:
df.head()
 | Unnamed: 0 | Account | User Name | Followers at Posting | Post Created | Post Created Date | Post Created Time | Type | Total Interactions | Likes | ... | Photo | Title | Description | Image Text | Sponsor Id | Sponsor Name | Overperforming Score (weighted - Likes 1x Comments 1x) | shortcode | image_file | ocr_text
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | FREIE WÄHLER Bayern | fw_bayern | 9138 | 2023-10-09 20:10:19 CEST | 2023-10-09 | 20:10:19 | Photo | 566 | 561 | ... | https://scontent-sea1-1.cdninstagram.com/v/t51... | NaN | #Landtagswahl23 #FREIEWÄHLER #Aiwanger #Da... | FREIE WAHLER 15,8 % | NaN | NaN | 2.95 | CyMAe_tufcR | media/images/fw_bayern/CyMAe_tufcR.jpg | FREIE WAHLER 15,8 % |
1 | 1 | Junge Liberale JuLis Bayern | julisbayern | 4902 | 2023-10-09 19:48:02 CEST | 2023-10-09 | 19:48:02 | Album | 320 | 310 | ... | https://scontent-sea1-1.cdninstagram.com/v/t51... | NaN | Die Landtagswahl war für uns als Liberale hart... | NaN | NaN | NaN | 1.41 | CyL975vouHU | media/images/julisbayern/CyL975vouHU.jpg | Freie EDP Demokraten BDB FDP FB FDP DANKE FÜR ... |
2 | 2 | Junge Union Deutschlands | junge_union | 44414 | 2023-10-09 19:31:59 CEST | 2023-10-09 | 19:31:59 | Photo | 929 | 925 | ... | https://scontent-sea1-1.cdninstagram.com/v/t39... | NaN | Nach einem starken Wahlkampf ein verdientes Er... | HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris... | NaN | NaN | 1.17 | CyL8GWWJmci | media/images/junge_union/CyL8GWWJmci.jpg | HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris... |
3 | 3 | Katharina Schulze | kathaschulze | 37161 | 2023-10-09 19:29:02 CEST | 2023-10-09 | 19:29:02 | Photo | 1,074 | 1009 | ... | https://scontent-sea1-1.cdninstagram.com/v/t51... | NaN | So viele Menschen am Odeonsplatz heute mit ein... | NaN | NaN | NaN | 1.61 | CyL7wyJtTV5 | media/images/kathaschulze/CyL7wyJtTV5.jpg | Juo I W |
4 | 4 | Junge Union Deutschlands | junge_union | 44414 | 2023-10-09 18:01:34 CEST | 2023-10-09 | 18:01:34 | Album | 1,655 | 1644 | ... | https://scontent-sea1-1.cdninstagram.com/v/t39... | NaN | Herzlichen Glückwunsch zu diesem grandiosen Wa... | NaN | NaN | NaN | 2.34 | CyLxwHuvR4Y | media/images/junge_union/CyLxwHuvR4Y.jpg | 12/12 der hessischen JU-Kandidaten ziehen in d... |
5 rows × 25 columns
We restructure df to focus on two key text-based columns: "Description" and "ocr_text". The goal is to create a streamlined DataFrame where each row corresponds to an individual text entry, from either the "Description" or the "ocr_text" field. To achieve this, we first split the original DataFrame into two separate DataFrames, one for each of these columns. We then rename these columns to "Text" for uniformity. Additionally, we introduce a new column, "Text Type", to categorize each text entry as either "Caption" (originating from "Description") or "OCR" (originating from "ocr_text"). The "shortcode" column is retained as a unique identifier for each entry. Finally, we concatenate these two DataFrames into a single DataFrame with a clean, uniform structure. This makes the text data easier to analyze and process, segregating it by source while maintaining a link to the original post via the "shortcode". The code also removes any rows with empty or NaN values in the "Text" column, ensuring data integrity.
In [3]:
import pandas as pd

# Creating two separate dataframes
df_description = df[['shortcode', 'Description']].copy()
df_ocr_text = df[['shortcode', 'ocr_text']].copy()

# Renaming columns
df_description.rename(columns={'Description': 'Text'}, inplace=True)
df_ocr_text.rename(columns={'ocr_text': 'Text'}, inplace=True)

# Adding 'Text Type' column
df_description['Text Type'] = 'Caption'
df_ocr_text['Text Type'] = 'OCR'

# Concatenating the dataframes
new_df = pd.concat([df_description, df_ocr_text])

# Dropping any rows where 'Text' is NaN or empty
new_df.dropna(subset=['Text'], inplace=True)
new_df = new_df[new_df['Text'].str.strip() != '']

# Resetting the index
new_df.reset_index(drop=True, inplace=True)
In [4]:
new_df.head()
 | shortcode | Text | Text Type
---|---|---|---|
0 | CyMAe_tufcR | #Landtagswahl23 #FREIEWÄHLER #Aiwanger #Da... | Caption |
1 | CyL975vouHU | Die Landtagswahl war für uns als Liberale hart... | Caption |
2 | CyL8GWWJmci | Nach einem starken Wahlkampf ein verdientes Er... | Caption |
3 | CyL7wyJtTV5 | So viele Menschen am Odeonsplatz heute mit ein... | Caption |
4 | CyLxwHuvR4Y | Herzlichen Glückwunsch zu diesem grandiosen Wa... | Caption |
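As a quick sanity check (not part of the original notebook), we can count how many documents of each type made it through the cleaning step:

# Number of caption vs. OCR documents after dropping empty texts
new_df['Text Type'].value_counts()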
BERTopic
At this stage, the data is ready for topic modeling. We use the BERTopic package and follow the tutorial notebook provided by the author.
In [5]:
!pip install -q bertopic
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.1/154.1 kB 3.6 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.2/5.2 MB 25.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90.9/90.9 kB 12.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.0/86.0 kB 11.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 36.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.8/55.8 kB 7.3 MB/s eta 0:00:00
  Building wheel for hdbscan (pyproject.toml) ... done
  Building wheel for sentence-transformers (setup.py) ... done
  Building wheel for umap-learn (setup.py) ... done
In the following cells we download a stopword dictionary for the German language and apply it according to the documentation.
In [10]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

STOPWORDS = stopwords.words('german')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
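As a quick check (not in the original notebook), we can confirm that the stopword list loaded correctly:

# STOPWORDS is a plain Python list of German stopwords
print(len(STOPWORDS), STOPWORDS[:10])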
In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=STOPWORDS)
Now we're ready to create our corpus in docs, a list of text documents to pass to BERTopic.
In [6]:
# We create our corpus
docs = new_df['Text']
In [30]:
from bertopic import BERTopic
# We're dealing with German texts, therefore we choose 'multilingual'. When dealing with English texts exclusively, choose 'english'
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
2023-12-01 08:52:44,038 - BERTopic - Embedding - Transforming documents to embeddings.
2023-12-01 08:52:50,561 - BERTopic - Embedding - Completed ✓
2023-12-01 08:52:50,563 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-01 08:53:04,597 - BERTopic - Dimensionality - Completed ✓
2023-12-01 08:53:04,599 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-01 08:53:05,103 - BERTopic - Cluster - Completed ✓
2023-12-01 08:53:05,109 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-01 08:53:05,639 - BERTopic - Representation - Completed ✓
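A useful follow-up (not part of the original tutorial; a minimal sketch): since topics is aligned with docs, and docs preserves the row order of new_df, we can attach each document's topic assignment back to the table, linking topics to posts via shortcode.

# Hypothetical step: map each document back to its assigned topic.
# Assumes `topics` has one entry per row of `new_df`, in the same order as `docs`.
new_df['Topic'] = topics
new_df[['shortcode', 'Text Type', 'Topic']].head()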
The following cells have been copied from the BERTopic Tutorial. Please check the linked notebook for more functions and the documentation for more background information.
Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.
In [31]:
freq = topic_model.get_topic_info(); freq.head(5)
 | Topic | Count | Name | Representation | Representative_Docs
---|---|---|---|---|---|
0 | -1 | 860 | -1_bayern_csu_uhr_mehr | [bayern, csu, uhr, mehr, menschen, münchen, te... | [Wir gehen mit #herzstatthetze in den Wahlkamp... |
1 | 0 | 137 | 0_wählen_fdp_hessen_heute | [wählen, fdp, hessen, heute, stimme, stimmen, ... | [Unser Ministerpräsident @markus.soeder steigt... |
2 | 1 | 104 | 1_energie_co2_klimaschutz_habeck | [energie, co2, klimaschutz, habeck, wasserstof... | [Habeck täuscht Öffentlichkeit mit Zensur: Rüc... |
3 | 2 | 103 | 2_zuwanderung_migration_grenzpolizei_migration... | [zuwanderung, migration, grenzpolizei, migrati... | [Wir sagen Ja zu #Hilfe und #Arbeitsmigration,... |
4 | 3 | 89 | 3_uhr_starke mitte_bayerns starke_bayerns | [uhr, starke mitte, bayerns starke, bayerns, b... | ["Deutschland-Pakt" aus Scholz der Krise komme... |
-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:
In [32]:
len(freq)
52
We have a total of 52 topics.
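Since topic -1 collects the outliers, it is worth knowing how many documents it absorbs (a quick check, not in the original notebook):

# Count documents assigned to the outlier topic -1
sum(t == -1 for t in topics)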
In [33]:
topic_model.get_topic(0)  # Select the most frequent topic
[('wählen', 0.01628736425293884),
('fdp', 0.01626632927971954),
('hessen', 0.013634118460503969),
('heute', 0.013441948777152065),
('stimme', 0.011907460231710654),
('stimmen', 0.011505832701270827),
('landtagswahl', 0.011272934711858047),
('wahlkampf', 0.01059385752962746),
('sonntag', 0.01057520846171656),
('bayern', 0.010322807358750668)]
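To read full example posts behind a topic rather than just its top terms, we can also pull its representative documents (assuming the installed BERTopic version provides get_representative_docs, as recent releases do):

# Show the documents that best represent topic 0
topic_model.get_representative_docs(0)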
Visualize Topics
After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:
In [12]:
topic_model.visualize_topics()
Visualize Terms
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.
In [15]:
topic_model.visualize_barchart(top_n_topics=15)
Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so is that you can decide the number of topics after knowing how many are actually created. It is difficult to predict before training your model how many topics are in your documents and how many will be extracted. Instead, we can decide afterwards how many topics seem realistic:
In [36]:
topic_model.reduce_topics(docs, nr_topics=15)
2023-12-01 08:53:07,148 - BERTopic - Topic reduction - Reducing number of topics
2023-12-01 08:53:07,642 - BERTopic - Topic reduction - Reduced number of topics from 52 to 15
<bertopic._bertopic.BERTopic at 0x794041658ca0>
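reduce_topics updates the fitted model in place, so we can verify the new topic count directly (a quick check, not in the original notebook):

# The topic info now reflects the reduced model
len(topic_model.get_topic_info())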
Visualize Terms After Reduction
In [19]:
topic_model.visualize_barchart(top_n_topics=15)
Saving the model
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved; however, the UMAP and HDBSCAN models will be.
In [38]:
# Save model
"/content/drive/MyDrive/2023-12-01-LTW23-CrowdTangle-Posts-model") topic_model.save(
2023-12-01 08:53:54,135 - BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model, please make sure that the environments in which you save and load the model are **exactly** the same. The version of BERTopic, its dependencies, and python need to remain the same.
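To reuse the model later, load it in a matching environment (see the warning above) with BERTopic.load:

from bertopic import BERTopic

# Load the previously saved model; the environment must match the one used for saving
loaded_model = BERTopic.load("/content/drive/MyDrive/2023-12-01-LTW23-CrowdTangle-Posts-model")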