Text as Data


Michael Achmann-Denkler


November 27, 2023

The analysis of textual data has a long tradition under the term Natural Language Processing (NLP). As noted by Bengfort, Bilbro, and Ojeda (2018), “Language is unstructured data that has been produced by people to be understood by other people”. This characterization of language as unstructured data highlights its contrast with structured or semi-structured data. Unlike structured data, which is organized in a way that computers can easily parse and analyze, unstructured data like language requires more complex methods to be processed and understood.

In the context of Instagram, for example, CrowdTangle exports contain structured data columns such as ‘User Name’, ‘Like Count’, or ‘Comment Count’. These pieces of data are quantifiable and can be easily sorted, filtered, or counted, e.g. using tools like Excel or Python’s pandas library. For instance, we can quickly determine the most active users by counting the number of rows associated with each username. In contrast, unstructured data is not organized in a predefined manner and is typically more challenging to process and analyze. The ‘Description’ column in our dataset, which contains the captions of Instagram posts, is a prime example of unstructured data. These captions, composed of paragraphs or sentences, require different analytical approaches to extract meaningful insights. Unlike structured data, we cannot simply count or sort these texts in a straightforward manner.

In our context, we refer to the collection of texts we analyze as a “corpus”. Each individual piece of text is called a “document”. Each document can be broken down into smaller units known as “features”. Features can be words, phrases, or even patterns of words, which we then use to quantify and analyze the text (compare Haim 2023, p. 230).

For the goal of our research seminar, we follow three technical perspectives inspired by Haim (2023): 1. Frequency Analysis, 2. Contextual Analysis, and 3. Content Analysis.


  1. In our first session, we begin with frequency analyses of our corpus, which involves counting words or phrases to identify the most common elements. This method provides a foundational understanding of the prominent themes or topics. Additionally, we learn to convert text embedded in images and videos into a machine-readable format, using OCR and automated audio transcription.
  2. Next, we will engage in explorative text analysis. This step enhances our understanding of the corpus and lays the groundwork for quantitative content analysis. We plan to utilize tools like GPT (and possibly BERTopic) for an in-depth exploration of our documents.
  3. Finally, we move towards more complex methods like classification or coding. These techniques allow us to categorize text into predefined groups or themes, enabling a more nuanced and quantitative understanding of the content. By applying these methods, we can, for example, classify Instagram captions into categories such as ‘promotional’, ‘personal’, ‘informative’, etc., based on their content and context.
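The contrast between structured and unstructured columns described above can be made concrete with a small sketch. The column names follow the CrowdTangle export mentioned earlier; the rows are invented for illustration, and the whitespace tokenization is only a crude stand-in for the methods introduced later.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for a CrowdTangle export; all values are made up.
posts = pd.DataFrame(
    {
        "User Name": ["bild", "news24", "bild"],
        "Like Count": [120, 45, 300],
        "Description": [
            "Dieses Auto ist einfach der Horror",
            "Weekend news",
            "Horror Diagnose für Barton",
        ],
    }
)

# Structured column: counting rows per user is a one-liner.
posts_per_user = posts["User Name"].value_counts()
print(posts_per_user)

# Unstructured column: the captions form the corpus, each caption is a
# document, and we first have to split it into features (here: words).
corpus = posts["Description"].tolist()
documents_as_features = [document.lower().split() for document in corpus]
feature_counts = Counter(
    word for document in documents_as_features for word in document
)
print(feature_counts.most_common(3))
```

The structured column can be counted directly, whereas the captions only become countable after they have been broken down into features.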


We are working with Python and pandas, so our data is structured in tables, also known as DataFrames. Each DataFrame (df) consists of rows and columns. These two dimensions allow us to store and structure data in different ways; one concept for storing research data in tables is Tidy Data (Wickham 2014). According to this standard:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Visualization of the tidy data components, source: R for Data Science.

What, in the context of social media data, is an observation? Is it a post? I suggest starting by treating posts as observations, i.e. rows. Thus, we have one table for our corpus, consisting of one row per post with multiple columns for different variables, including an ID, possibly a link, a reference to the image / video, and one or more text variables for each post. When dealing with Instagram or TikTok posts, we might have three text columns: caption / description, OCR, and transcription. When dealing with stories, we might have two: OCR and transcription.
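A minimal sketch of such a table, assuming invented IDs, file names, and texts. The column names `caption`, `ocr_text`, and `transcript` are illustrative choices, and the combined `all_text` column is a hypothetical convenience rather than part of the notebooks.

```python
import pandas as pd

# One row per post, one column per variable (all values are invented).
corpus = pd.DataFrame(
    [
        {
            "id": "post_001",
            "image_file": "media/images/post_001.jpg",
            "caption": "Hello",
            "ocr_text": "World",
            "transcript": None,  # image posts have no audio layer
        },
        {
            "id": "post_002",
            "image_file": None,
            "caption": "Video post",
            "ocr_text": None,
            "transcript": "Spoken words",
        },
    ]
)

text_columns = ["caption", "ocr_text", "transcript"]

# Join all available text variables of a post, skipping missing values.
corpus["all_text"] = corpus[text_columns].apply(
    lambda row: " ".join(t for t in row if isinstance(t, str) and t),
    axis=1,
)
print(corpus[["id", "all_text"]])
```

Keeping the text variables in separate columns preserves the information about where each text came from, while a combined column can still be derived whenever a single text per post is needed.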


When dealing with more complex data, e.g. Instagram albums that may contain multiple images per post, we will have to reconsider this choice. In this case we might consider each observation to be one image / video, which has variables like OCR and transcription. By keeping the post ID as a column for images and videos, we retain a fixed reference to the original post, so we can later re-merge the data with the post metadata or combine variables across media for one post.
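The following sketch illustrates this re-merging step under simple assumptions: two invented posts, one of which is an album with two images, and a shared `id` column as the reference to the original post.

```python
import pandas as pd

# Post-level metadata: one row per post (values are invented).
posts = pd.DataFrame({"id": ["A", "B"], "username": ["user_one", "user_two"]})

# Media-level table: one row per image; post A is an album with two images.
media = pd.DataFrame(
    {
        "id": ["A", "A", "B"],
        "image_file": [
            "media/images/A_1.jpg",
            "media/images/A_2.jpg",
            "media/images/B_1.jpg",
        ],
        "ocr_text": ["first slide", "second slide", "single image"],
    }
)

# Option 1: attach the post metadata to every image.
media_with_meta = media.merge(posts, on="id", how="left")

# Option 2: combine the OCR text of all images of a post and attach it
# to the post-level table.
ocr_per_post = media.groupby("id")["ocr_text"].agg(" ".join).reset_index()
posts_with_ocr = posts.merge(ocr_per_post, on="id", how="left")
print(posts_with_ocr)
```

Which direction to merge in depends on the unit of analysis: the first option keeps images as observations, the second returns to posts as observations.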

All data exported from CrowdTangle, 4CAT, and Zeeschuimer-F are saved as CSV files. Throughout the semester, we keep using this file format to save our progress. We work with multiple Jupyter notebooks, generally one notebook per task, which helps to keep our projects well structured. Each time we modify the df, we save the CSV file to our Google Drive or hard drive. In the two examples below we add an OCR and a transcription column to our DataFrame, using one notebook for each task. After completing each task, we store the results in a file. While Google Drive provides file versioning to mitigate data loss in certain scenarios, I recommend saving your results to a new file during the experimental phase. This practice keeps your data safe until you have fully verified that your code works. Additionally, I recommend naming your files in a YYYY-MM-DD-descriptive-name.csv fashion. When working with Colab notebooks, I recommend keeping track of them using notes or lists, e.g. with the Dataloom plugin for Obsidian.

Keeping track of Colab notebooks with Obsidian and the Dataloom plugin.
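The naming convention recommended above can be automated with a small helper. This is a sketch only: the function name `dated_filename` and the description string are hypothetical, and a temporary folder stands in for Google Drive.

```python
import tempfile
from datetime import date
from pathlib import Path

import pandas as pd


def dated_filename(description, day=None, folder="."):
    """Build a path of the form YYYY-MM-DD-descriptive-name.csv."""
    day = day or date.today()
    return Path(folder) / f"{day:%Y-%m-%d}-{description}.csv"


# Invented example data standing in for our post metadata.
df = pd.DataFrame({"id": [1, 2], "ocr_text": ["foo", "bar"]})

with tempfile.TemporaryDirectory() as tmp:
    # Write to a new, dated file instead of overwriting the original.
    out_path = dated_filename("posts-with-ocr", folder=tmp)
    df.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(out_path.name)
print(list(restored.columns))
```

Passing `index=False` keeps the row index out of the file, which avoids the extra `Unnamed: 0` columns that otherwise appear when a saved CSV is read back and saved again.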

The CSV files contain only metadata; the actual media files (images / videos) are saved to different locations. The OCR and Transcription notebooks below contain code to import media files from 4CAT and Zeeschuimer-F. I suggest saving the files to media/videos or media/images. Both notebooks introduce a column image_file or video_file, to which the relative location of each media file is written. Creating a new ZIP file from this folder structure and saving it to Google Drive allows us to use the media files in future notebooks (e.g. for image classification) without modifying the image_file or video_file columns again.
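A minimal sketch of this folder convention, using only the standard library. The helper `relative_media_path`, the archive name, and the empty placeholder file are assumptions made for illustration; in the notebooks the archive would be written to Google Drive rather than to a temporary folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def relative_media_path(filename, kind="images"):
    """Return the relative location to store in image_file / video_file."""
    return f"media/{kind}/{Path(filename).name}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Recreate the suggested folder structure.
    (root / "media" / "images").mkdir(parents=True)
    (root / "media" / "videos").mkdir(parents=True)
    # Empty placeholder standing in for a downloaded image.
    (root / "media" / "images" / "example.jpg").write_bytes(b"")

    # Compress the whole media folder into one ZIP file.
    archive = shutil.make_archive(
        str(root / "media-backup"), "zip", root_dir=str(root), base_dir="media"
    )
    with zipfile.ZipFile(archive) as zf:
        archived_names = zf.namelist()

print(relative_media_path("/some/download/folder/example.jpg"))
print(archived_names)
```

Because the paths inside the archive start with `media/`, unpacking it in a fresh notebook restores exactly the locations that the DataFrame columns point to.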


This page and all referenced notebooks deal with 4CAT and Zeeschuimer-F metadata and media files. In general, all of this information applies to instaloader as well. It is advisable to use the --filename-pattern command line parameter to control the filenames of the media files, which makes it easier to map JSON metadata to the actual media objects. Once all posts / stories have been loaded using instaloader, I recommend reading all JSON files in a loop and creating a DataFrame (see Data Collection / Posts / Instaloader for more information and code examples).
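Reading JSON files in a loop could look roughly like the sketch below. It assumes uncompressed `.json` files (instaloader compresses its metadata by default unless told otherwise, e.g. with `--no-compress-json`). The helper name `load_json_folder` and the contents of the two demo files are invented, so the resulting column names will differ for real exports.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_json_folder(folder):
    """Read every *.json file in `folder` into one DataFrame (one row per file)."""
    records = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        # Flatten nested dictionaries into dotted column names.
        flat = pd.json_normalize(data).iloc[0].to_dict()
        # Keep the file stem so each row can be matched to media files
        # that were saved with the same filename pattern.
        flat["file_stem"] = path.stem
        records.append(flat)
    return pd.DataFrame(records)


# Demonstration with two invented metadata files.
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [
        ("2023-11-12_post_a", "First caption"),
        ("2023-11-12_post_b", "Second caption"),
    ]
    for stem, caption in demo_files:
        payload = {"node": {"shortcode": stem[-1], "caption": caption}}
        (Path(tmp) / f"{stem}.json").write_text(json.dumps(payload), encoding="utf-8")
    df_posts = load_json_folder(tmp)

print(df_posts)
```

Storing the file stem as its own column is what makes a consistent filename pattern useful: the media file belonging to a row can then be located by name.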

Key Take-Aways

  • We organize our data following the Tidy Data principles
    • One row per post
    • One column per variable
  • We use one notebook per task
  • We save our progress to CSV files, either on our hard drive or Google Drive
  • We keep a reference to media files as a relative reference in our DataFrame
  • We keep our media files in the structure media/videos and media/images, which we compress to a ZIP file and keep on our Google Drive (or a central HDD location)
  • When working with experimental code, keep backups of your data file, do not overwrite the original file!

From Images / Videos to Text

Computational approaches to text analysis are established as part of computational social science research (Baden et al. 2022), and we may utilize them when dealing with visual and multimodal social media. Instagram posts often contain embedded text, and TikTok posts often contain an audio layer; both can be transformed into machine-readable text. For the former we are going to use OCR, for the latter we apply Whisper. The following subchapters demonstrate the application of these techniques to extract textual content from images and videos. In the third subchapter, I demonstrate a simple application of corpus analytics for a first analysis of the social media content based on word frequencies.


We’re using easyocr. See the documentation for more complex configurations. On CPU only, this process takes anywhere from minutes to hours, depending on the number of images. OCR may also be outsourced (e.g. using the Google Vision API); see future sessions (and Memespector) for this.

!pip -q install easyocr
# Imports for OCR
import easyocr
reader = easyocr.Reader(['de','en'])

We define a very simple function that returns all recognized text as one string. The readtext method returns a list of text areas; in this example we simply concatenate these strings, so the order of words is sometimes incorrect.

We also save the resulting file to Google Drive to preserve our results.

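def run_ocr(image_path):
    # detail=0 returns only the recognized strings, without bounding boxes or confidence scores
    ocr_result = reader.readtext(image_path, detail=0)
    # Concatenate all recognized text areas into one string
    ocr_text = " ".join(ocr_result)
    return ocr_text

# Run OCR on every image referenced in the DataFrame
df['ocr_text'] = df['image_file'].apply(run_ocr)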

# Saving Results to Drive
Unnamed: 0.1 Unnamed: 0 ID Time of Posting Type of Content video_url image_url Username Video Length (s) Expiration ... Is Verified Stickers Accessibility Caption Attribution URL video_file audio_file duration sampling_rate image_file ocr_text
0 0 0 3234500408402516260_1383567706 2023-11-12 15:21:53 Image NaN NaN news24 NaN 2023-11-13 15:21:53 ... True [] Photo by News24 on November 12, 2023. May be a... https://www.threads.net/t/CzjB80Zqme0 NaN NaN NaN NaN media/images/3234500408402516260_1383567706.jpg Keee WEEKEND NEWS24 PLUS: TESTING FORDS RANGER...
1 1 1 3234502795095897337_8537434 2023-11-12 15:26:39 Image NaN NaN bild NaN 2023-11-13 15:26:39 ... True [] Photo by BILD on November 12, 2023. May be an ... NaN NaN NaN NaN NaN media/images/3234502795095897337_8537434.jpg Dieses Auto ist einfach der Horror Du glaubst ...
2 2 2 3234503046678453705_8537434 2023-11-12 15:27:10 Image NaN NaN bild NaN 2023-11-13 15:27:10 ... True [] Photo by BILD on November 12, 2023. May be an ... NaN NaN NaN NaN NaN media/images/3234503046678453705_8537434.jpg Touchdown bei Taylor Swift und Travis Kelce De...
3 3 3 3234503930728728807_8537434 2023-11-12 15:28:55 Image NaN NaN bild NaN 2023-11-13 15:28:55 ... True [] Photo by BILD on November 12, 2023. May be an ... NaN NaN NaN NaN NaN media/images/3234503930728728807_8537434.jpg Horror-Diagnose für Barton Cowperthwaite Netfl...
4 4 4 3234504185910204562_8537434 2023-11-12 15:29:25 Image NaN NaN bild NaN 2023-11-13 15:29:25 ... True [] Photo by BILD on November 12, 2023. May be an ... NaN NaN NaN NaN NaN media/images/3234504185910204562_8537434.jpg 3v Bilde GG JJ Besorgniserregende Ufo-Aktivitä...

5 rows × 21 columns

Source: OCR using easyocr
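The word order problem mentioned above can be reduced by using the bounding boxes that readtext returns in its default detailed mode, where each result is a (bounding box, text, confidence) entry. The sketch below sorts such results roughly top-to-bottom and left-to-right. The function name `order_ocr_results`, the `line_tolerance` parameter, and the sample results are assumptions for illustration, not part of the notebook.

```python
def order_ocr_results(results, line_tolerance=10):
    """Join OCR results in approximate reading order.

    `results` is a list of (bbox, text, confidence) entries, where bbox is
    a list of four [x, y] corner points. Boxes whose top edges fall into the
    same band of `line_tolerance` pixels are treated as one line.
    """

    def reading_order(result):
        bbox = result[0]
        top = min(point[1] for point in bbox)
        left = min(point[0] for point in bbox)
        # First sort by line band, then by horizontal position.
        return (int(top // line_tolerance), left)

    return " ".join(result[1] for result in sorted(results, key=reading_order))


# Invented results in the shape of detailed OCR output.
sample_results = [
    ([[200, 12], [300, 12], [300, 30], [200, 30]], "WORLD", 0.91),
    ([[10, 14], [90, 14], [90, 32], [10, 32]], "HELLO", 0.95),
    ([[10, 100], [120, 100], [120, 120], [10, 120]], "below", 0.80),
]

ordered_text = order_ocr_results(sample_results)
print(ordered_text)
```

This is a heuristic: multi-column layouts or rotated text, both common in Instagram graphics, can still end up in the wrong order.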

Automated Audio Transcription (Whisper)


OpenAI offers Whisper transcriptions as a service, see their documentation. The notebook below takes you step-by-step through using the Whisper model on your own computer / colab.

Extract Audio from Video File

After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare for the automated transcription.

!pip install -q moviepy
import os

# Set audio directory path
audio_path = "media/audio/"

# Check if the directory exists
if not os.path.exists(audio_path):
    # Create the directory if it does not exist
    os.makedirs(audio_path)
from moviepy.editor import VideoFileClip

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]

        # Extract the audio from the video file
        audio = video.audio

        if audio is not None:
            sampling_rate = audio.fps
            # Swap the file extension for mp3
            new_filename = os.path.splitext(filename)[0] + ".mp3"

            # Save the audio to a file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            # Mark videos without an audio track
            new_filename = "No Audio"
            sampling_rate = -1

        # Update DataFrame inplace
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate

        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]

        # Close the video file
        video.close()
MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
MoviePy - Done.

We’ve extracted the audio content of each video file to an MP3 file in the media/audio folder. The files keep the name of the video file. We added new columns to the metadata for audio duration and sampling_rate. If a video does not include an audio track, sampling_rate is set to -1, which we use to filter the df when transcribing the files.

df[df['video_file'] != ""].head()
id thread_id parent_id body author author_fullname author_avatar_url timestamp type url ... num_comments num_media location_name location_latlong location_city unix_timestamp video_file audio_file duration sampling_rate
4 CzD93SEIi-E CzD93SEIi-E CzD93SEIi-E Mitzuarbeiten für unser Land, Bayern zu entwic... markus.soeder Markus Söder https://scontent-fra3-1.cdninstagram.com/v/t51... 2023-10-31 12:06:23 video https://www.instagram.com/p/CzD93SEIi-E ... 227 1 NaN NaN NaN 1698753983 CzD93SEIi-E.mp4 CzD93SEIi-E.mp3 67.89 44100.0

1 rows × 24 columns

Let’s update the zipped folder to include the audio files.

!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media
updating: media/ (stored 0%)
updating: media/videos/ (stored 0%)
updating: media/videos/CzD93SEIi-E.mp4 (deflated 0%)
  adding: media/audio/ (stored 0%)
  adding: media/audio/CzD93SEIi-E.mp3 (deflated 1%)

Finally, we save the updated metadata file. Change the filename here when importing stories!


Transcriptions using Whisper

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

The abstract from the paper is the following:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

– https://huggingface.co/docs/transformers/model_doc/whisper

!pip install -q transformers

The next code snippet initializes the Whisper model. The transcribe_audio function is applied to each row of the DataFrame where sampling_rate > 0, i.e. only to those rows with references to audio files. Each audio file is transcribed using Whisper, and the result, a single text string, is saved to the transcript column.

Adjust the language variable according to your needs! The model is also capable of automated translation, e.g. setting language to english when processing German content results in an English translation of the speech. (Additionally, the task variable accepts translate).

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Folder containing the audio files extracted in the previous step
audio_folder = "media/audio"

# Load model and processor for multilingual support
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german'):
    try:
        # Load and resample audio file (Whisper expects 16 kHz mono audio)
        audio_path = f"{audio_folder}/{filename}"
        waveform, original_sample_rate = librosa.load(audio_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)

        # Get forced decoder IDs for the specified language
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")

        # Process the audio file in chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):  # 30 seconds chunks
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features.to(device)
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription

        return transcription.strip()
    except Exception as e:
        # Report the problem and return an empty transcript for this file
        print(f"Error processing file {filename}: {e}")
        return ""

# Filter the DataFrame (sampling_rates < 0 identify items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)
df[df['video_file'] != ""].head()
id thread_id parent_id body author author_fullname author_avatar_url timestamp type url ... num_media location_name location_latlong location_city unix_timestamp video_file audio_file duration sampling_rate transcript
4 CzD93SEIi-E CzD93SEIi-E CzD93SEIi-E Mitzuarbeiten für unser Land, Bayern zu entwic... markus.soeder Markus Söder https://scontent-fra3-1.cdninstagram.com/v/t51... 2023-10-31 12:06:23 video https://www.instagram.com/p/CzD93SEIi-E ... 1 NaN NaN NaN 1698753983 CzD93SEIi-E.mp4 CzD93SEIi-E.mp3 67.89 44100.0 Ich bitte auf den abgelagerten Vortrag der Maa...

1 rows × 25 columns

df.loc[4, 'transcript']
'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein.  Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank.  Applaus'

Overall, the transcriptions work well. The first sentence above, however, shows that we can still expect misinterpretations.
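The transcription function processes longer recordings in 30-second chunks at a sampling rate of 16,000 Hz. The chunking step can be looked at in isolation. The sketch below uses a silent dummy waveform instead of real audio, and the helper name `split_into_chunks` is hypothetical.

```python
import numpy as np


def split_into_chunks(waveform, sampling_rate=16000, chunk_seconds=30):
    """Split a 1-D waveform into consecutive chunks of at most `chunk_seconds`."""
    chunk_size = sampling_rate * chunk_seconds
    return [
        waveform[start:start + chunk_size]
        for start in range(0, len(waveform), chunk_size)
    ]


# Dummy signal: 68 seconds of silence at 16 kHz.
waveform = np.zeros(68 * 16000, dtype=np.float32)
chunks = split_into_chunks(waveform)

for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number}: {len(chunk) / 16000:.1f} seconds")
```

Fixed-size chunks can cut through a word or sentence at the boundary, which may contribute to errors around the 30-second marks of a transcript.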

Source: Transcription using Whisper

Analyzing Corpus and Word Frequencies

Among a variety of possibilities, we can, for example, look at the frequencies of the words contained in the corpus or examine the corpus for recurring themes it contains.

First we need to import all the required libraries once again. The Natural Language Toolkit (NLTK) gives us access to a variety of natural language processing functions (e.g. tokenisation, stop word removal, part-of-speech tagging, …).

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import requests
import pandas as pd

When analysing word frequencies, we can use stop word lists to ignore words that occur frequently but are not relevant to us. We can easily download such a list. However, this can also be individually adapted to the purpose.

# Retrieve Stopwords from Github
sw_json = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')

Now we can tokenise the existing text, remove the stop words or punctuation marks they contain, convert the words to lower case, or use bi-grams in addition to single-word tokens.

We then sum up the occurrences of the individual words and make the results available in a DataFrame.

def word_freq(text, punctuation=False, stop_words=False, lowercasing=False, bigrams=False):

    if punctuation:
        # Tokenizing, removing punctuation
        tokens = RegexpTokenizer(r'\w+').tokenize(text) # https://regexr.com/
    else:
        # Tokenizing, w/o removing punctuation (requires NLTK's punkt tokenizer data)
        # tokens = text.split()
        tokens = word_tokenize(text)

    if stop_words:
        # Removing Stopwords (stop_words is expected to be a list of lower-cased words)
        tokens = [w for w in tokens if w.lower() not in stop_words]

    if lowercasing:
        # Lower-Casing
        tokens = [w.lower() for w in tokens]

    if bigrams:
        # Converting text tokens into bigrams
        tokens = nltk.bigrams(tokens)

    # Creating Data Frame
    freq = nltk.FreqDist(tokens) # display(freq)
    df = pd.DataFrame.from_dict(freq, orient='index')
    df.columns = ['Frequency']
    df.index.name = 'Term'

    # Here we calculate the total number of tokens in our Frequency List
    total_tokens = sum(freq.values()) # sum([2,3,4,5,6])

    # Here we add a new column `Relative` (*100 for percentage)
    df['Relative'] = (df['Frequency'] / total_tokens) * 100

    return df
from pathlib import Path
import os

#@markdown Do you want bigrams included?
bigrams = True #@param {type:"boolean"}

#@markdown Should all words be lower-cased before counting the occurrences?
lowercasing = True #@param {type:"boolean"}

#@markdown Do you want to exclude stopwords in your result list?
stopwords = True #@param {type:"boolean"}

#@markdown Do you want to remove punctuation before counting the occurrences?
punctuation = True #@param {type:"boolean"}
# Load stopwords file if necessary
if stopwords:
    stopwords = sw_json.json()

# Read source file and concat all texts
text = ' '.join(list(df[text_column]))

# Call word_freq() with specified parameters
df_freq = word_freq(text, punctuation = punctuation, stop_words = stopwords, lowercasing = lowercasing, bigrams = bigrams)

# Sort results for descending values
df_freq = df_freq.sort_values("Relative", ascending = False)

Term                            Frequency  Relative
(jüdisches, leben)                      5  1.259446
(allerheiligen, allerseelen)            4  1.007557
(ilse, aigner)                          3  0.755668
(bayerischer, landtag)                  3  0.755668
(klare, haltung)                        2  0.503778
(wünschen, einfach)                     2  0.503778
(vaters, freundschaftliche)             2  0.503778
(tod, vaters)                           2  0.503778
(günter, tod)                           2  0.503778
(schwiegervater, günter)                2  0.503778
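To see how the preprocessing options interact, the following dependency-free sketch reproduces the core steps on a single invented sentence. It is a simplified stand-in for word_freq, not a replacement: the function name `simple_word_freq` is hypothetical, and a regular expression takes the place of the NLTK tokenizers.

```python
import re
from collections import Counter


def simple_word_freq(text, stop_words=(), bigrams=False):
    """Return {term: (frequency, relative frequency in percent)}."""
    # Lower-case the text and keep only word characters (drops punctuation).
    tokens = re.findall(r"\w+", text.lower())
    # Remove stop words before any bigrams are built.
    tokens = [token for token in tokens if token not in stop_words]
    if bigrams:
        # Pair each token with its successor.
        tokens = list(zip(tokens, tokens[1:]))
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: (n, 100 * n / total) for term, n in counts.items()}


text = "Jüdisches Leben in Bayern. Jüdisches Leben schützen."
result = simple_word_freq(text, stop_words={"in"}, bigrams=True)

for term, (frequency, relative) in result.items():
    print(term, frequency, relative)
```

Because stop words are removed before the bigrams are formed, as in word_freq, a bigram can consist of two words that were not adjacent in the original text (here "leben" and "bayern"). This is worth keeping in mind when interpreting pairs in the table above.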


One way to visualise word frequencies and recurring themes of texts are word clouds. These basically show the most frequently occurring words in the text (similar to the table created earlier), but more frequently occurring words are depicted larger than less frequently occurring words.

First, we have to install the necessary library wordcloud.

!pip install -q wordcloud

The actual implementation of this approach is relatively simple. We need to combine all the texts into a single text, as we did in the previous step with the frequency analysis, and pass it to the imported library.

from wordcloud import WordCloud, STOPWORDS

def generate_wordcloud(text, path):

    text = ' '.join(list(text))

    # Generate a word cloud image
    wordcloud = WordCloud(background_color="white",width=1920, height=1080).generate(text)

    # Create the corresponding figure
    plt.imshow(wordcloud, interpolation="bilinear") # Interpolation used when rendering the image
    plt.figtext(0.5, 0.1, wordcloud_subcaption, wrap=True, horizontalalignment='center', fontsize=12)
    plt.savefig(path, dpi=300)

Once again, we have the option of adjusting various parameters. Remember to specify the right file path, file name and column of your text data!

#@markdown Input for additional stopwords; whitespace separated
stopwords_extension_wc = '' #@param {type: "string"}

#@markdown Subcaption for the wordcloud, leave blank to ignore
wordcloud_subcaption = 'Markus Söder' #@param {type: "string"}

Now all we have to do is load the stop word file, add our own additions and then trigger the creation of the word cloud using the function we created at the beginning.

The result image is saved in the defined data_path.

import matplotlib.pyplot as plt
import requests

# Retrieve Stopwords from Github
r = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')
stop_words = r.json()

# Convert input into list
stopwords_extension_wc_list = stopwords_extension_wc.split(' ')

# Load the stop words into the word cloud

generate_wordcloud(df[text_column], 'wordcloud.png')
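One possible way to make the word cloud respect the downloaded stop words and the manual additions is to pass them through the `stopwords` parameter of WordCloud, which accepts a set of strings. The sketch below is an assumption about how this step could look, not the notebook's own code. The helper names `build_stopword_set` and `make_wordcloud` are hypothetical, and the stop word lists are shortened toy examples.

```python
def build_stopword_set(base_words, extension=""):
    """Combine a stop word list with whitespace-separated user additions."""
    # str.split() without arguments ignores repeated or missing whitespace,
    # so an empty input field yields no additional stop words.
    extra_words = extension.split()
    return {word.lower() for word in base_words} | {word.lower() for word in extra_words}


def make_wordcloud(texts, stop_words):
    """Generate a word cloud from an iterable of texts, ignoring stop words."""
    # Imported here so the helper above can be used without the library.
    from wordcloud import WordCloud

    text = " ".join(str(t) for t in texts)
    return WordCloud(
        background_color="white",
        width=1920,
        height=1080,
        stopwords=stop_words,
    ).generate(text)


# Toy stand-ins for the downloaded list and the form field.
all_stop_words = build_stopword_set(["Und", "der", "die"], "bayern  heute")
print(sorted(all_stop_words))
```

With real data, `make_wordcloud(df[text_column], all_stop_words)` would return a WordCloud object that can be displayed with `plt.imshow` as in generate_wordcloud.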

Source: Introduction to Corpus Analysis


In summary, this session provides us with the practical skills to use Python, pandas, and Jupyter notebooks for the computational analysis of multimodal social media data. Our adherence to Tidy Data principles and the integration of technologies like OCR and Whisper are integral to extracting and analyzing textual content from multimedia sources. In the next session we will keep exploring the content through a textual lens. Further, we will use prompting as a technique to classify texts as part of a computational content analysis.

More Resources

Python & Computational Social Sciences

Python & NLP


Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A C G van der Velden. 2022. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 16 (1): 1–18. https://doi.org/10.1080/19312458.2021.2015574.
Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O’Reilly Media, Inc.
Haim, Mario. 2023. Computational Communication Science: Eine Einführung. Springer Fachmedien Wiesbaden.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.