Text Exploration


Michael Achmann-Denkler


December 4, 2023

In the previous session we talked about text as data and suggested that text offers a rich source of insights. This chapter concentrates on two advanced tools for exploring it: BERTopic and the Generative Pre-trained Transformer (GPT). Both stand at the forefront of computational text analysis and are interesting tools for uncovering the meanings and patterns hidden within the vast textual content of social media.

Topic Modeling with BERTopic

BERTopic (Grootendorst 2022) is a transformer-based topic modeling tool. It builds on BERT (Bidirectional Encoder Representations from Transformers), an advanced method for natural language processing (NLP) that takes the context of words in a text into account. BERTopic is adept at identifying and clustering topics within short text documents (Egger and Yu 2022), which makes it an interesting tool for analyzing and categorizing text data from social media. The author is actively working on the documentation and keeps improving the topic modeling technique to adapt to the latest advances in Large Language Models (LLMs); just recently, a zero-shot topic modeling approach was added. I have used BERTopic for a first exploration of stories and posts published by politicians and parties during the 2021 Federal Election in Germany (Achmann and Wolff 2023). Past research has used LDA, another topic modeling algorithm, to explore themes and topics in Instagram posts by politicians (Rodina and Dligach 2019).

For this example we import a CrowdTangle dataframe that has been preprocessed using the OCR Notebook. We are only dealing with one image per post, and there are no videos (= no transcriptions). In this example, we have up to two text columns per post: Description, which contains the caption, and ocr_text. When exploring the textual content of the posts, we treat each of those columns as one document. Thus, we transform our table and create new_df as a text table that contains a reference to the post (shortcode), the actual Text, and a Text Type column.

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2023-11-30-Export-Posts-Crowd-Tangle.csv')

Next, we want to transform the DataFrame from one post per row to one text document per row (think tidy data!).

Unnamed: 0 Account User Name Followers at Posting Post Created Post Created Date Post Created Time Type Total Interactions Likes ... Photo Title Description Image Text Sponsor Id Sponsor Name Overperforming Score (weighted — Likes 1x Comments 1x ) shortcode image_file ocr_text
0 0 FREIE WÄHLER Bayern fw_bayern 9138 2023-10-09 20:10:19 CEST 2023-10-09 20:10:19 Photo 566 561 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN #Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da... FREIE WAHLER 15,8 % NaN NaN 2.95 CyMAe_tufcR media/images/fw_bayern/CyMAe_tufcR.jpg FREIE WAHLER 15,8 %
1 1 Junge Liberale JuLis Bayern julisbayern 4902 2023-10-09 19:48:02 CEST 2023-10-09 19:48:02 Album 320 310 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN Die Landtagswahl war für uns als Liberale hart... NaN NaN NaN 1.41 CyL975vouHU media/images/julisbayern/CyL975vouHU.jpg Freie EDP Demokraten BDB FDP FB FDP DANKE FÜR ...
2 2 Junge Union Deutschlands junge_union 44414 2023-10-09 19:31:59 CEST 2023-10-09 19:31:59 Photo 929 925 ... https://scontent-sea1-1.cdninstagram.com/v/t39... NaN Nach einem starken Wahlkampf ein verdientes Er... HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris... NaN NaN 1.17 CyL8GWWJmci media/images/junge_union/CyL8GWWJmci.jpg HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris...
3 3 Katharina Schulze kathaschulze 37161 2023-10-09 19:29:02 CEST 2023-10-09 19:29:02 Photo 1,074 1009 ... https://scontent-sea1-1.cdninstagram.com/v/t51... NaN So viele Menschen am Odeonsplatz heute mit ein... NaN NaN NaN 1.61 CyL7wyJtTV5 media/images/kathaschulze/CyL7wyJtTV5.jpg Juo I W
4 4 Junge Union Deutschlands junge_union 44414 2023-10-09 18:01:34 CEST 2023-10-09 18:01:34 Album 1,655 1644 ... https://scontent-sea1-1.cdninstagram.com/v/t39... NaN Herzlichen Glückwunsch zu diesem grandiosen Wa... NaN NaN NaN 2.34 CyLxwHuvR4Y media/images/junge_union/CyLxwHuvR4Y.jpg 12/12 der hessischen JU-Kandidaten ziehen in d...

5 rows × 25 columns

We restructure df to focus on two key text-based columns: ‘Description’ and ‘ocr_text’. The goal is to create a streamlined DataFrame where each row corresponds to an individual text entry, either from the ‘Description’ or the ‘ocr_text’ fields. To achieve this, we first split the original DataFrame into two separate DataFrames, one for each of these columns. We then rename these columns to ‘Text’ for uniformity. Additionally, we introduce a new column, ‘Text Type’, to categorize each text entry as either ‘Caption’ (originating from ‘Description’) or ‘OCR’ (originating from ‘ocr_text’). The ‘shortcode’ column is retained as a unique identifier for each entry. Finally, we concatenate these two DataFrames into a single DataFrame, ensuring a clean and organized structure. This restructured DataFrame facilitates easier analysis and processing of the text data, segregating it by source while maintaining a link to its original post via the ‘shortcode’. The code also includes a step to remove any rows with empty or NaN values in the ‘Text’ column, ensuring data integrity and cleanliness.

import pandas as pd

# Creating two separate dataframes
df_description = df[['shortcode', 'Description']].copy()
df_ocr_text = df[['shortcode', 'ocr_text']].copy()

# Renaming columns
df_description.rename(columns={'Description': 'Text'}, inplace=True)
df_ocr_text.rename(columns={'ocr_text': 'Text'}, inplace=True)

# Adding 'Text Type' column
df_description['Text Type'] = 'Caption'
df_ocr_text['Text Type'] = 'OCR'

# Concatenating the dataframes
new_df = pd.concat([df_description, df_ocr_text])

# Dropping any rows where 'Text' is NaN or empty
new_df.dropna(subset=['Text'], inplace=True)
new_df = new_df[new_df['Text'].str.strip() != '']

# Resetting the index
new_df.reset_index(drop=True, inplace=True)
shortcode Text Text Type
0 CyMAe_tufcR #Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da... Caption
1 CyL975vouHU Die Landtagswahl war für uns als Liberale hart... Caption
2 CyL8GWWJmci Nach einem starken Wahlkampf ein verdientes Er... Caption
3 CyL7wyJtTV5 So viele Menschen am Odeonsplatz heute mit ein... Caption
4 CyLxwHuvR4Y Herzlichen Glückwunsch zu diesem grandiosen Wa... Caption


At this stage, the data is ready for topic modeling. We are using the BERTopic package and follow the tutorial notebook provided by the author.

In the following cells we download a stopword dictionary for the German language and apply it according to the documentation.

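import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Download the stopword lists (only needed once per runtime)
nltk.download('stopwords')

STOPWORDS = stopwords.words('german')

# Use uni- and bigrams and remove German stopwords from the topic representations
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=STOPWORDS)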

Now we’re ready to create our corpus in docs, a list of text documents to pass to BERTopic.

from bertopic import BERTopic

# We create our corpus as a plain list of strings
docs = new_df['Text'].tolist()

# We're dealing with German texts, therefore we choose 'multilingual'. When dealing with English texts exclusively, choose 'english'
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

The following cells have been copied from the BERTopic Tutorial. Please check the linked notebook for more functions and the documentation for more background information.

Extracting Topics

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

freq = topic_model.get_topic_info(); freq.head(5)
Topic Count Name Representation Representative_Docs
0 -1 860 -1_bayern_csu_uhr_mehr [bayern, csu, uhr, mehr, menschen, münchen, te... [Wir gehen mit #herzstatthetze in den Wahlkamp...
1 0 137 0_wählen_fdp_hessen_heute [wählen, fdp, hessen, heute, stimme, stimmen, ... [Unser Ministerpräsident @markus.soeder steigt...
2 1 104 1_energie_co2_klimaschutz_habeck [energie, co2, klimaschutz, habeck, wasserstof... [Habeck täuscht Öffentlichkeit mit Zensur: Rüc...
3 2 103 2_zuwanderung_migration_grenzpolizei_migration... [zuwanderung, migration, grenzpolizei, migrati... [Wir sagen Ja zu #Hilfe und #Arbeitsmigration,...
4 3 89 3_uhr_starke mitte_bayerns starke_bayerns [uhr, starke mitte, bayerns starke, bayerns, b... ["Deutschland-Pakt" aus Scholz der Krise komme...

-1 refers to all outliers and should typically be ignored. Next, let’s take a look at one of the frequent topics that were generated:


We have a total of 52 topics

topic_model.get_topic(0)  # Select the most frequent topic
[('wählen', 0.01628736425293884),
 ('fdp', 0.01626632927971954),
 ('hessen', 0.013634118460503969),
 ('heute', 0.013441948777152065),
 ('stimme', 0.011907460231710654),
 ('stimmen', 0.011505832701270827),
 ('landtagswahl', 0.011272934711858047),
 ('wahlkampf', 0.01059385752962746),
 ('sonntag', 0.01057520846171656),
 ('bayern', 0.010322807358750668)]

Visualize Topics

After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:
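The visualization cell itself is not shown on this page. As a minimal sketch, assuming topic_model is the fitted model from the cells above: BERTopic's visualize_topics() returns a Plotly figure with the intertopic distance map, which can be displayed inline or written to an HTML file. The helper name export_topic_map and the default file name are hypothetical and only serve as illustration.

```python
def export_topic_map(topic_model, path="intertopic_distance_map.html"):
    """Render the intertopic distance map and write it to an HTML file.

    `topic_model` is assumed to be a fitted BERTopic model. The helper name
    and the default file name are hypothetical, chosen for illustration.
    """
    fig = topic_model.visualize_topics()  # returns a Plotly figure
    fig.write_html(path)                  # standalone, interactive HTML file
    return fig
```

In a notebook, calling export_topic_map(topic_model) as the last expression of a cell should also display the returned figure inline.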


Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.
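To make the c-TF-IDF scores more tangible, here is a small pure-Python sketch of the class-based TF-IDF idea described by Grootendorst (2022): all documents of a topic are treated as one class, and a term's frequency within the class is weighted by log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The function name, the toy tokens, and the normalization of term frequencies by class length are assumptions of this sketch; the scores BERTopic reports may differ in detail.

```python
import math
from collections import Counter

def c_tf_idf(topic_docs):
    """Compute class-based TF-IDF scores per topic.

    topic_docs maps a topic id to the tokens of all documents assigned
    to that topic (one concatenated "class document" per topic).
    """
    term_counts = {topic: Counter(tokens) for topic, tokens in topic_docs.items()}

    # Frequency of each term across all topics
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per topic
    avg_words = sum(total_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, count in counts.items()
        }
    return scores

# Toy example with hypothetical tokens for two topics
toy_topics = {
    0: ["wählen", "stimme", "wählen", "sonntag"],
    1: ["klimaschutz", "energie", "klimaschutz", "sonntag"],
}
scores = c_tf_idf(toy_topics)
top_terms = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_terms)
```

Terms that are frequent within one topic but rare elsewhere receive the highest scores, while a term such as "sonntag" that appears in both toy topics is weighted down.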


Topic Reduction

We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so is that you can decide on the number of topics after knowing how many were actually created. It is difficult to predict before training your model how many topics are in your documents and how many will be extracted. Instead, we can decide afterwards how many topics seem realistic:

topic_model.reduce_topics(docs, nr_topics=15)
<bertopic._bertopic.BERTopic at 0x794041658ca0>

Visualize Terms After Reduction


Saving the model

The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.
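The save cell is empty on this page. A minimal sketch, assuming topic_model is the fitted model from above: BERTopic models provide a save() method that takes a target path. The helper name save_topic_model, the directory, and the file name are hypothetical.

```python
from pathlib import Path

def save_topic_model(topic_model, directory, name="bertopic_model"):
    """Save a fitted BERTopic model below `directory` and return the path.

    The helper name, directory layout, and file name are hypothetical.
    """
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    model_path = target_dir / name
    topic_model.save(str(model_path))  # BERTopic's own save method
    return model_path
```

The saved model can later be restored with BERTopic.load(), passing the same path.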

# Save model
Source: Topic Modeling Using BERTopic

Exploration Through Prompting

Data Import

Using GPT for Information Extraction

The focus of this chapter lies in demonstrating how GPT can be employed in a loop to analyze text documents. This methodology aligns with the principles of topic modeling but extends further by leveraging the advanced capabilities of the language model. Our approach involves the iterative processing of text, where GPT aids in identifying, categorizing, and interpreting the underlying themes and sentiments expressed in social media texts.

The GPT application presents a significant difference compared to traditional topic modeling. While topic modeling often aims to automatically uncover hidden thematic structures within a text corpus, our approach with GPT is based on a different assumption: We presuppose that there is already a specific theme or a particular question in mind according to which we want to organize and analyze the documents. This approach allows us to navigate through the vast amounts of text in social media in a targeted and efficient manner, identifying specific insights and patterns that are directly related to our predefined areas of interest.

The following workflow outlines how we could use this information extraction process to create a topic list. Using the list we can classify each document.

An example of a GPT-based “Topic Modeling” approach. I have used this approach in a current research project; the process is not perfect yet.
!pip install -q openai backoff gpt-cost-estimator

Setup for the OpenAI API

We’re using the new Colab feature to store keys safely within the Colab environment. Click on the key icon on the left to add your API key and enable it for this notebook. Enter the name of your API key in the api_key_name variable below.

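import openai
from openai import OpenAI
from google.colab import userdata
import backoff
from gpt_cost_estimator import CostEstimator

api_key_name = "openai-lehrstuhl-api"
api_key = userdata.get(api_key_name)

# Initialize OpenAI using the key
client = OpenAI(api_key=api_key)

# The CostEstimator decorator tracks the estimated cost of each call and skips
# the real request when mock=True (decorator usage assumed from the
# gpt-cost-estimator package).
@CostEstimator()
def query_openai(model, temperature, messages, mock=True, completion_tokens=10):
    return client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages
    )

# We define the run_request method to wrap it with the @backoff decorator
@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(system_prompt, user_prompt, mock):
  messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
  ]

  return query_openai(
    model="gpt-3.5-turbo",  # the model named in the summary of this chapter
    temperature=0.0,        # assumed value, chosen for more consistent output
    messages=messages,
    mock=mock
  )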

Next, we create a system prompt describing what we want to extract. For further examples of prompts and advice on prompt engineering see e.g. the prompting guide and further resources linked at the bottom of the page.

For the initial example we use social media content shared by politicians and parties. We know that some of these texts mention policy issues, so let’s try to extract these issues across all documents.

Note: The extracted issues are not going to be consistent, because each document is sent to the API as a separate request; the issues extracted from previous documents are therefore not available as context.

Modify the following system prompt to extract other types of information. What else could you extract?

  • Locations (based on names)
  • Names (of persons or places)
  • Mentions of Companies

Do not forget the Prompt Archive when experimenting. Share your successful prompts with us!

system_prompt = """
You are a helpful assistant, an expert for German politics.
**Objective:** Extract policy issues from German language social media texts. Policy issues refer to specific topics or subjects that are the focus of public or governmental debate, analysis, and decision-making. Elections themselves and party slogans or their performance are not policy issues.
**Instructions:** Return each policy issue referenced in the user message as a comma-separated list. Return 'None' if no policy issues are referenced.
**Formatting:** Return a comma-separated list.
"""

Running the request.

The following code snippet uses my gpt-cost-estimator package to simulate API requests and calculate a cost estimate. Please run the estimation whenever possible to assess the price tag before sending requests to OpenAI! Make sure ‘run_request’ and ‘system_prompt’ are defined before this block by running the two blocks above!

Set the following variables:

  • MOCK: Do you want to mock the OpenAI request (dry run) to calculate the estimated price?
  • RESET_COST: Do you want to reset the cost estimation when running the query?
  • COLUMN: What’s the column name to save the results of the data extraction task to?
  • SAMPLE_SIZE: Do you want to run the request on a smaller sample of the whole data? (Useful for testing). Enter 0 to run on the whole dataset.
from tqdm.auto import tqdm

MOCK = True
RESET_COST = True
COLUMN = 'Policy Issues'
SAMPLE_SIZE = 0  # assumed value; 0 runs on the whole dataset

# Initializing the empty column
if COLUMN not in new_df.columns:
  new_df[COLUMN] = None

# Reset Estimates
if RESET_COST:
  CostEstimator.reset()
  print("Reset Cost Estimation")

filtered_df = new_df.copy()

# Skip previously annotated rows
filtered_df = filtered_df[pd.isna(filtered_df[COLUMN])]

# Optionally work on a random sample only
if SAMPLE_SIZE > 0:
  filtered_df = filtered_df.sample(SAMPLE_SIZE)

for index, row in tqdm(filtered_df.iterrows(), total=len(filtered_df)):
    try:
        response = run_request(system_prompt, row['Text'], MOCK)

        if not MOCK:
          # Extract the response content
          # Adjust the following line according to the structure of the response
          r = response.choices[0].message.content

          # Convert the string 'r' to a list if it's not 'None', otherwise keep it as None
          if r != 'None':
              r = r.split(', ')
          else:
              r = None

          # Update the 'new_df' DataFrame
          new_df.at[index, COLUMN] = r

    except Exception as e:
        print(f"An error occurred: {e}")
        # Optionally, handle the error (e.g., by logging or by setting a default value)
Reset Cost Estimation
# Save Results
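The loop above splits the response on ', ' exactly. Model output does not always follow the requested format to the letter, so a slightly more forgiving parser may help. The following is a minimal sketch; the helper name parse_issue_list is hypothetical, and the variations it handles (missing spaces, quoted 'None', trailing commas) are assumptions about what the model might return.

```python
def parse_issue_list(response_text):
    """Turn a comma-separated model response into a list of issues.

    Returns None when the model answered 'None' or returned nothing.
    """
    if response_text is None:
        return None

    cleaned = response_text.strip()
    # Treat empty answers and (optionally quoted) 'None' as "no policy issues"
    if cleaned == "" or cleaned.strip("'\"").lower() == "none":
        return None

    # Split on commas, then drop surrounding whitespace and empty entries
    issues = [part.strip() for part in cleaned.split(",")]
    return [issue for issue in issues if issue]
```

Inside the loop, parse_issue_list(r) could replace the split on ', ' without changing anything else.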

Next we create a set of policy issues. Sets are similar to lists in that they store multiple items, but each unique item appears only once, regardless of how many times it is added. Unlike lists, sets are unordered, meaning they do not record element position or order of insertion. This makes sets highly efficient for checking membership and eliminating repeated entries. In addition, we keep the list policy_issues, which retains duplicates, to generate a word cloud.

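unique_policy_issues = set()
policy_issues = []

for issues in new_df['Policy Issues']:
    if issues is not None:
        # The set keeps each issue only once
        unique_policy_issues.update(issues)

        # The list keeps duplicates, so frequent issues appear more often
        for issue in issues:
            policy_issues.append(issue)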
Let’s quickly generate a wordcloud to check for patterns. See the Simple Corpus Analysis Notebook for more information.

!pip install -q wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import requests

# Retrieve stopwords from GitHub
r = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')
stop_words = r.json()

# Load the stopwords into the word cloud's default stopword set
STOPWORDS.update(stop_words)

def generate_wordcloud(text):
    text = ' '.join(list(text))

    # Generate a word cloud image
    wordcloud = WordCloud(background_color="white", width=1920, height=1080).generate(text)

    # Create the corresponding figure
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.figtext(0.5, 0.1, "Policy Issues", wrap=True, horizontalalignment='center', fontsize=12)
    plt.show()

generate_wordcloud(policy_issues)

Now we’re ready to pass the list to GPT to extract a manageable amount of topics. Note the list might be too long to fit into the GPT context window. In this case we have to split the list into several shorter lists and iterate over them.
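The splitting mentioned above could be sketched as follows. This is a minimal example under stated assumptions: the function name chunk_keywords is hypothetical, and the character budget max_chars is only a rough stand-in for the model's context limit, which is actually measured in tokens.

```python
def chunk_keywords(keywords, max_chars=6000, separator=", "):
    """Split keywords into chunks whose joined string stays within max_chars.

    max_chars is a rough, hypothetical stand-in for the context limit;
    a token-based limit would be more precise.
    """
    chunks, current, current_len = [], [], 0
    for keyword in keywords:
        # Length this keyword adds, including a separator if the chunk is not empty
        added = len(keyword) + (len(separator) if current else 0)
        if current and current_len + added > max_chars:
            # The current chunk is full: store it and start a new one
            chunks.append(current)
            current, current_len = [], 0
            added = len(keyword)
        current.append(keyword)
        current_len += added
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be joined with ", ".join(chunk) and sent as its own request. The topic lists returned per chunk would still need to be merged in a final step, for example with one more request.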

This time we are not interested in a specific formatting for the response. We want to print the result for human interpretation.

system_prompt = """
You are a helpful assistant, an expert for German politics. Derive 15 topics of policy issues from this list of keywords provided by the user. Concentrate on overarching topics and avoid overlapping topics. Provide a set of 10 keywords per topic.
"""

keyword_string = ", ".join(unique_policy_issues)
response = run_request(system_prompt, keyword_string, False)
print(response.choices[0].message.content)
Cost: $0.0010 | Total: $0.3431
1. Sicherheit
- Polizei
- Kriminalität
- Terrorismus
- Überwachung
- Grenzkontrollen

2. Wirtschaftswachstum
- Industrie
- Arbeitsplätze
- Investitionen
- Innovation
- Export

3. Arbeitslosenquote
- Arbeitsmarkt
- Arbeitslosengeld
- Arbeitsvermittlung
- Qualifikationen
- Arbeitslosenversicherung

4. Regierung
- Politik
- Parteien
- Regierungsbildung
- Koalitionen
- Opposition

5. Bayern-Power
- Regionalpolitik
- Infrastruktur
- Bildung
- Kultur
- Tourismus

6. Ampel-Frust
- Politikverdrossenheit
- Koalitionsstreitigkeiten
- Stillstand
- Kompromisse
- Unzufriedenheit

7. Familiengeld
- Familienpolitik
- Kinderbetreuung
- Elternzeit
- Kindergeld
- Unterstützung

8. Pflegegeld
- Pflegepolitik
- Altenpflege
- Pflegeversicherung
- Pflegeheim
- Angehörigenpflege

9. Meisterausbildung
- Berufsausbildung
- Fachkräftemangel
- Handwerk
- Aufstiegschancen
- Weiterbildung

10. Briefwahl
- Wahlrecht
- Wahlbeteiligung
- Demokratie
- Wahlkampf
- Stimmabgabe
Source: Text Exploration Using GPT


We have explored two transformer-based approaches for text exploration. BERTopic is an easy-to-use tool for topic modeling. With it we can quickly explore patterns in textual social media content, as long as a GPU is available (e.g. on Colab). We will come back to this tool in the future, when dealing with images, as we might be able to harness its abilities for visual media.

Text exploration using GPT, on the other hand, does not rely on special hardware, as we query the API and OpenAI takes care of the heavy computing. Prompting lets us explore our data along an almost endless range of questions, yet we need some form of question to get started. We have explored policy issues using the gpt-3.5-turbo model, and the results are mixed. Looking through the word cloud, we see issues that might have been at the centre of attention, like Education, Climate Protection, and Security. At this point, however, we should be cautious about generalizing: other issues might have been named differently between requests and thus disappear within the word cloud. Looking at the BERTopic results, we can spot similar topics, like Security and Migration (Topic 2), Climate Protection (Topic 1), and Education (Topic 12).

Today’s prompting marks the tip of the iceberg: over the course of the next weeks we will use more and more prompts, moving from exploration to classification. One prompting technique which we will not discuss this semester is Retrieval Augmented Generation (RAG), which might also be useful for text exploration. This technique combines information retrieval with text generation. LlamaIndex and LangChain are Python packages that may help to build RAG applications. How to integrate them into our research workflow will be a future project (or a topic for a future BA or MA thesis).

More Resources


Achmann, Michael, and Christian Wolff. 2023. “Policy issues vs. Documentation: Using BERTopic to gain insight in the political communication in Instagram stories and posts during the 2021 German Federal election campaign.” Digital Humanities in the Nordic and Baltic Countries Publications 5 (1): 11–28. https://doi.org/10.5617/dhnbpub.10647.
Borra, Erik. n.d. “ErikBorra/PromptCompass: Updated models.” https://doi.org/10.5281/zenodo.8359916.
Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models are Few-Shot Learners,” May. http://arxiv.org/abs/2005.14165.
Egger, Roman, and Joanne Yu. 2022. “A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts.” Frontiers in Sociology 7 (May): 886498. https://doi.org/10.3389/fsoc.2022.886498.
Grootendorst, Maarten. 2022. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” March. http://arxiv.org/abs/2203.05794.
Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” ACM Comput. Surv. 55 (9): 1–35. https://doi.org/10.1145/3560815.
Møller, Anders Giovanni, Jacob Aarup Dalsgaard, Arianna Pera, and Luca Maria Aiello. 2023. “Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks,” April. http://arxiv.org/abs/2304.13861.
Rodina, Elena, and Dmitriy Dligach. 2019. “Dictator’s Instagram: personal and political narratives in a Chechen leader’s social network.” Caucasus Survey 7 (2): 95–109. https://doi.org/10.1080/23761199.2019.1567145.



BibTeX citation:
  author = {Achmann-Denkler, Michael},
  title = {Text {Exploration}},
  date = {2023-12-04},
  url = {https://social-media-lab.net/processing/exploration.html},
  doi = {10.5281/zenodo.10039756},
  langid = {en}
For attribution, please cite this work as:
Achmann-Denkler, Michael. 2023. “Text Exploration.” December 4, 2023. https://doi.org/10.5281/zenodo.10039756.