Preprocessing is an important step in the computational analysis of social media data, especially when dealing with multimodal content such as images, videos, and audio. This chapter introduces techniques to transform visual and audio content into computer-readable text, allowing us to apply well-established text analysis methods (Baden et al. 2022) to platforms like Instagram and TikTok. Although these platforms are primarily visual, extracting text enables us to leverage advanced computational social science techniques to derive meaningful insights from embedded and spoken content.
The following sections provide a detailed overview of the preprocessing steps we use for text extraction. Instagram posts often contain embedded text, and videos posted on TikTok or Instagram frequently include audio layers, which we can convert into analyzable textual data. We use Optical Character Recognition (OCR) to extract text from images and the Whisper model to transcribe audio content. We conclude this chapter with a simple application of corpus analytics to explore word frequencies within the extracted content.
Important
Update 2024: I updated the notebook to use the OpenAI API for Whisper and included the new Tidal Tales file format. Additionally, the new notebook extracts the first frame from any video.
We use EasyOCR for this step; see its documentation for more complex configurations. On CPU only, this process takes anywhere from minutes to hours, depending on the number of images. OCR can also be outsourced (e.g. to the Google Vision API); see future sessions (and Memespector) for this.
!pip install -q easyocr
Now we extract text from all images using the EasyOCR library. Our goal is to systematically process all images within a folder and extract the text content.
Step-by-Step Explanation:
Setup and Libraries:
We use os to navigate the folder structure, tqdm to add a progress bar, and pandas to store the results if needed.
EasyOCR is initialized to recognize German text with reader = easyocr.Reader(['de']).
Define the Path to the Images:
The variable images_root_path specifies the root folder containing subfolders of images. Each subfolder may represent a different author or category.
Loop Through the Images:
We use os.walk() to iterate through each subfolder (root) and the files within it (files). This way, we can process each image individually.
We check if the file extension is .jpg, .jpeg, or .png to ensure we only process image files.
Extract Text Using EasyOCR:
The reader.readtext(image_path) function reads the image and returns a list of recognized text areas.
We concatenate the recognized text from all detected areas into a single string using ' '.join().
Store the OCR Results:
The extracted text for each image is stored in a dictionary called ocr_results. The key for each entry is a tuple (author, image_id), allowing us to easily identify where each piece of text came from.
import pandas as pd
import easyocr
import os
from tqdm.notebook import tqdm

# Define the path to the images folder
images_root_path = 'posts/images'

# Initialize the EasyOCR reader
reader = easyocr.Reader(['de'])

# Initialize a dictionary to store OCR results
ocr_results = {}

# Loop through each subfolder in the images folder
for root, dirs, files in os.walk(images_root_path):
    for file in tqdm(files, desc=f"Processing images in {root}"):
        if file.endswith(('.jpg', '.jpeg', '.png')):  # Add more image file extensions if needed
            image_path = os.path.join(root, file)
            author = os.path.basename(root)
            image_id, _ = os.path.splitext(file)

            # Read the image using EasyOCR
            text = reader.readtext(image_path)

            # Extracted text as a single string
            extracted_text = ' '.join([line[1] for line in text])

            # Store the result in the dictionary
            ocr_results[(author, image_id)] = extracted_text
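With its default settings, EasyOCR's readtext returns one entry per detected text region, each consisting of a bounding box, the recognized text, and a confidence score. That is why the loop picks line[1]. The sketch below runs no OCR; it uses hand-made results in that shape to show how low-confidence detections could be dropped before joining. The helper name join_ocr_text, the sample values, and the 0.4 threshold are illustrative assumptions, not part of the notebook.

```python
def join_ocr_text(results, min_confidence=0.0):
    """Join recognized text regions into one string, skipping weak detections."""
    kept = [text for _bbox, text, confidence in results if confidence >= min_confidence]
    return ' '.join(kept)


# Hand-made results in the (bounding box, text, confidence) shape; values are invented.
mock_results = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], 'DIE', 0.98),
    ([[12, 0], [40, 0], [40, 5], [12, 5]], 'CHAOS-AMPEL', 0.91),
    ([[0, 8], [6, 8], [6, 12], [0, 12]], 'HCSTFOIO', 0.12),
]

all_text = join_ocr_text(mock_results)
filtered_text = join_ocr_text(mock_results, min_confidence=0.4)
print(all_text)
print(filtered_text)
```

A threshold like this trades recall for cleaner text; whether that is desirable depends on how noisy the images are.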
After extracting text from the images, the next step is to add this information to your existing dataset.
We use pandas to add a new column to our df_posts DataFrame. The new column, named 'ocr_text', will contain the text extracted from each image. To achieve this, we use the apply() function to iterate over each row in the DataFrame. For each row, we look up the corresponding OCR text in our ocr_results dictionary using the (author, id) tuple as the key.
# Add a new column for OCR text in the dataframe
df_posts['ocr_text'] = df_posts.apply(lambda row: ocr_results.get((row['author'], row['id']), ''), axis=1)
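To see how the tuple lookup behaves, here is a small self-contained sketch with invented data: two posts, of which only one has an OCR result. The names toy_posts and toy_ocr_results and their contents are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_posts and ocr_results; all values are invented.
toy_posts = pd.DataFrame({
    'id': ['A1', 'B2'],
    'author': ['alice', 'bob'],
})
toy_ocr_results = {('alice', 'A1'): 'HELLO WORLD'}

# Same lookup as in the chapter: posts without an OCR result get an empty string.
toy_posts['ocr_text'] = toy_posts.apply(
    lambda row: toy_ocr_results.get((row['author'], row['id']), ''), axis=1
)
print(toy_posts)
```

The empty-string default keeps the column free of missing values, so later string operations do not need special handling.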
Note the new ocr_text column:
df_posts.head()
| | Unnamed: 0 | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | ... | media_url | hashtags | num_likes | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | ocr_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DBwPNDuNdAg | DBwPNDuNdAg | DBwPNDuNdAg | Hallo Heidelberg! Zum ersten Mal zu viert hier... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 15:42:29 | photo | ... | https://scontent.cdninstagram.com/v/t51.2885-1... | heidelberg,schlossheidelberg,badenwürttemberg,... | 3816 | 51 | 1 | Heidelberg | 49.4122,8.71 | NaN | 1730302949 | |
| 1 | 1 | DCOcihAOOfr | DCOcihAOOfr | DCOcihAOOfr | When the police at the Palm Ridge Magistrate's... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-11 09:16:17 | photo | ... | https://scontent.cdninstagram.com/v/t51.2885-1... | NaN | 358 | 13 | 1 | NaN | NaN | NaN | 1731316577 | news24 Unlucky escape: Alleged serial rapist's... |
| 2 | 2 | DCHWFYTta-b | DCHWFYTta-b | DCHWFYTta-b | Gemeinsam kämpfen wir für soziale Gerechtigkei... | bayernspd | BayernSPD | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-08 15:05:08 | photo | ... | https://scontent.cdninstagram.com/v/t51.29350-... | NaN | 3 | 3 | 1 | NaN | NaN | NaN | 1731078308 | DER BESTE MOMENT, MITGLIED ZU WERDEN, WAR GEST... |
| 3 | 3 | DCEHp67sb1U | DCEHp67sb1U | DCEHp67sb1U | Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb... | katrin.ebnersteiner | Katrin Ebner-Steiner, MdL | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-07 09:01:20 | photo | ... | https://scontent.cdninstagram.com/v/t39.30808-... | NaN | 108 | 7 | 1 | NaN | NaN | NaN | 1730970080 | DIE CHAOS-AMPEL IST ZERBROCHENI DEUTSCHLAND BR... |
| 4 | 4 | DCBmO3HOcxk | DCBmO3HOcxk | DCBmO3HOcxk | Die USA hat gewählt und sich für nationalistis... | gruenebayern | GRÜNE Bayern | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2024-11-06 09:30:48 | photo | ... | https://scontent-fra5-2.cdninstagram.com/v/t39... | USWahl,Trump,Feminismus,Frauen,Politik,Grüne | 1774 | 71 | 1 | NaN | NaN | NaN | 1730885448 | Wenn die Welt verrückt spielt, braucht es eine... |

5 rows × 22 columns
Once we’ve added the OCR text to our DataFrame, it’s important to save our work so that we can easily access it later without rerunning the entire OCR process.
To do this, we save the DataFrame to a CSV file:
df_posts.to_csv('2024-11-11-Posts.csv')
However, since we’re working in a Colab environment, it’s recommended to save the file to your Google Drive to ensure persistence. This way, your results won’t be lost when the Colab session ends. You can modify the save path like this:
# Save to Google Drive for persistence
df_posts.to_csv('/content/drive/MyDrive/2024-11-11-Posts.csv')
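One detail worth knowing when saving: by default, to_csv also writes the DataFrame index as an unnamed first column, which shows up as 'Unnamed: 0' when the file is read back (the df_posts.head() output earlier in this chapter contains such a column). The following minimal sketch, using invented data and a temporary folder, shows how index=False avoids this. The file names are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_posts; all values are invented.
toy_posts = pd.DataFrame({'id': ['A1', 'B2'], 'ocr_text': ['HELLO', 'WORLD']})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    toy_posts.to_csv(with_index)                  # index written as an unnamed first column
    toy_posts.to_csv(without_index, index=False)  # index left out

    columns_with_index = pd.read_csv(with_index).columns.tolist()
    columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Whether to keep the index is a matter of preference; leaving it out simply avoids an extra column accumulating with every save-and-reload cycle.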
Automated Audio Transcription Using Whisper
Next, we want to automatically transcribe audio files using OpenAI’s Whisper model. We’ll use the openai Python package to interact with the Whisper API for this purpose. The following code snippet shows how to set up the transcription function:
We use userdata.get('openai-forschung-mad') to securely retrieve our OpenAI API key from Google Colab’s user data storage.
OpenAI Client:
The OpenAI client is initialized using the retrieved api_key to authenticate our requests to the API.
Backoff for Rate Limiting:
The @backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError)) decorator is used to retry the transcription request in case of a rate limit or API error, following an exponential backoff strategy. This helps manage interruptions if the API hits rate limits or faces temporary issues.
Transcription Function:
The function run_request(audio_file) takes an audio file as input and returns the transcription generated by Whisper.
The model used is "whisper-1", the identifier under which the OpenAI API offers the Whisper automatic speech recognition model.
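The pieces described above could fit together roughly as follows. This is a sketch under assumptions rather than the notebook's code: the client is passed in as an argument so the example runs without an API key, FakeClient is a hypothetical stand-in that mimics only the client.audio.transcriptions.create(...) call of the openai package, and a small hand-written retry loop takes the place of the backoff decorator so that only the standard library is needed.

```python
import io
import time
from types import SimpleNamespace


def run_request(audio_file, client, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Send an audio file to Whisper, retrying with exponential backoff on errors."""
    for attempt in range(max_tries):
        try:
            return client.audio.transcriptions.create(model="whisper-1", file=audio_file)
        except Exception:  # the chapter narrows this to openai.RateLimitError / openai.APIError
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


class FakeClient:
    """Hypothetical stand-in for the OpenAI client: fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0
        self.audio = SimpleNamespace(transcriptions=SimpleNamespace(create=self._create))

    def _create(self, model, file):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("simulated rate limit")
        return SimpleNamespace(text="simulated transcript")


waits = []  # collects the delays instead of actually sleeping
fake_client = FakeClient()
response = run_request(io.BytesIO(b"fake audio"), fake_client, sleep=waits.append)
print(response.text, fake_client.calls, waits)
```

With the real package, the client would be created from the retrieved key via OpenAI(api_key=api_key), and the transcript is available as response.text, which is how the video loop later in this chapter reads it.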
Extracting Audio from Videos and Transcribing Using Whisper
Next, we automate audio extraction from video files and transcribe that audio using OpenAI’s Whisper model. Let’s go over how this is done:
Import Libraries:
We use several libraries here: os for file management, tqdm to show a progress bar, and moviepy for video processing.
We ensure that the directory to store extracted audio exists by using os.makedirs(audio_save_path, exist_ok=True).
Process Each Video File:
We loop through all the video files in the specified directory using os.walk(). This way, we can handle multiple videos, even if they’re located in different subfolders.
Extract Audio from Videos:
For each video, we create a VideoFileClip object with moviepy.
We then extract the audio from this video and save it as an MP3 file in the audio_save_path directory. The audio extraction is done using video_clip.audio.write_audiofile(audio_path, codec='libmp3lame').
Transcribe Audio Using Whisper:
After extracting the audio, we pass it to the run_request(audio_file) function defined earlier, which sends it to OpenAI’s Whisper model for transcription.
The result (transcription_text) is then stored in the transcription_results dictionary, using (author, video_id) as the key for easier reference.
Error Handling:
A try...except block is used to catch any exceptions during video processing or transcription. This way, if there is an issue with a particular file, the script will continue running for the remaining files.
import os
from tqdm.notebook import tqdm
import moviepy.editor as mp  # MoviePy 1.x import path; MoviePy 2.x removed moviepy.editor
import pandas as pd

# Define the paths
videos_root_path = 'posts/videos'
images_root_path = 'posts/images'
audio_save_path = 'posts/audio'

# Ensure the audio directory exists
os.makedirs(audio_save_path, exist_ok=True)

# Initialize a dictionary to store transcription results
transcription_results = {}

# Loop through each subfolder and video file in the videos folder
for root, dirs, files in os.walk(videos_root_path):
    for file in tqdm(files, desc=f"Processing videos in {root}"):
        if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Add more video file extensions if needed
            video_path = os.path.join(root, file)
            author = os.path.basename(root)
            video_id, _ = os.path.splitext(file)

            try:
                # Extract audio from the video and save as MP3
                video_clip = mp.VideoFileClip(video_path)
                audio_path = os.path.join(audio_save_path, f"{video_id}.mp3")
                video_clip.audio.write_audiofile(audio_path, codec='libmp3lame')
                video_clip.close()  # Release the file handles held by MoviePy

                # Transcribe the audio using OpenAI Whisper
                with open(audio_path, "rb") as audio_file:
                    response = run_request(audio_file)
                transcription_text = response.text

                # Store the result in the dictionary
                transcription_results[(author, video_id)] = transcription_text

            except Exception as e:
                # Report the problem and continue with the next file
                print(f"Error processing video {video_path}: {e}")
MoviePy - Writing audio in posts/audio/DCHYJgitebc.mp3
MoviePy - Done.
Adding Transcription Results to Your DataFrame
After transcribing the audio from your videos, the next step is to integrate these transcriptions into your existing DataFrame (df_posts). This puts both the visual data (OCR from images) and the auditory data (transcriptions from videos) in one place, which makes further analysis easier.
To achieve this, we add a new column called transcription_text to the DataFrame. Here’s how it’s done:
# Add a new column for transcription text in the dataframe
df_posts['transcription_text'] = df_posts.apply(lambda row: transcription_results.get((row['author'], row['id']), ''), axis=1)
Explanation:
New Column Creation:
We add the column 'transcription_text' to store the transcription corresponding to each post.
Using apply():
The apply() function is used to iterate over each row in the DataFrame.
For each row, we extract the (author, id) tuple to look up the corresponding transcription in the transcription_results dictionary.
If a transcription is found, it is added to the 'transcription_text' column; otherwise, an empty string is used as the default value.
Now, take a look at the result. Note the new transcription_text column.
df_posts.sample(10)
| | Unnamed: 0 | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | ... | hashtags | num_likes | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | ocr_text | transcription_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | DB8hUbWNcXS | DB8hUbWNcXS | DB8hUbWNcXS | Humusverlust auf Bayerns Feldern: Eine Gefahr ... | ludwighartmann | Ludwig Hartmann | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2024-11-04 10:11:40 | photo | ... | Landwirtschaft,Humus,Bodenschutz,Klimaschutz,H... | 805 | 50 | 5 | NaN | NaN | NaN | 1730715100 | ludwighartmannde CSU Lmndwirten wichtige Förde... | |
| 10 | 10 | Cl06_FgImCM | Cl06_FgImCM | Cl06_FgImCM | "Keepin' up with news from around the world! T... | dh.news.catcher | DH News Collector | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2022-12-06 12:42:59 | photo | ... | RobotReading,CoffeeAndNewspaper,LearningMoreEv... | 1 | 0 | 1 | NaN | NaN | NaN | 1670330579 | | |
| 4 | 4 | DCBmO3HOcxk | DCBmO3HOcxk | DCBmO3HOcxk | Die USA hat gewählt und sich für nationalistis... | gruenebayern | GRÜNE Bayern | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2024-11-06 09:30:48 | photo | ... | USWahl,Trump,Feminismus,Frauen,Politik,Grüne | 1774 | 71 | 1 | NaN | NaN | NaN | 1730885448 | Wenn die Welt verrückt spielt, braucht es eine... | |
| 0 | 0 | DBwPNDuNdAg | DBwPNDuNdAg | DBwPNDuNdAg | Hallo Heidelberg! Zum ersten Mal zu viert hier... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 15:42:29 | photo | ... | heidelberg,schlossheidelberg,badenwürttemberg,... | 3816 | 51 | 1 | Heidelberg | 49.4122,8.71 | NaN | 1730302949 | | |
| 6 | 6 | DCB1aieNF-o | DCB1aieNF-o | DCB1aieNF-o | Was für ein Horror. \n \nFühlt ihr euch auch, ... | kathaschulze | Katharina Schulze | https://scontent-fra5-1.cdninstagram.com/v/t51... | 2024-11-06 11:43:28 | photo | ... | NaN | 2622 | 140 | 1 | NaN | NaN | NaN | 1730893408 | | |
| 11 | 11 | DCHYJgitebc | DCHYJgitebc | DCHYJgitebc | #kanzlerera \nIch freu mich auf den Bundestags... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-08 15:26:11 | video | ... | kanzlerera | 3140 | 163 | 1 | Bayern, Germany | 48.894107570617,11.583000803261 | NaN | 1731079571 | | Are you ready for it? |
| 1 | 1 | DCOcihAOOfr | DCOcihAOOfr | DCOcihAOOfr | When the police at the Palm Ridge Magistrate's... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-11 09:16:17 | photo | ... | NaN | 358 | 13 | 1 | NaN | NaN | NaN | 1731316577 | news24 Unlucky escape: Alleged serial rapist's... | |
| 8 | 8 | CmG5tP3ohLS | CmG5tP3ohLS | CmG5tP3ohLS | Taking the time to appreciate the morning, one... | dh.news.catcher | DH News Collector | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2022-12-13 12:18:08 | photo | ... | RobotLife,UpliftingNews,aiart,stablediffusion | 4 | 0 | 1 | NaN | NaN | NaN | 1670933888 | 36 ELNE AK8 HCSTFOIO A 1a6 KFoB. HEA An; EPST ... | |
| 13 | 13 | DCBmLuTv_7C | DCBmLuTv_7C | DCBmLuTv_7C | #Klartext von @hubertaiwanger\n\n#FREIEWÄHLER ... | fw_bayern | FREIE WÄHLER Bayern | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-06 09:30:26 | photo | ... | Klartext,FREIEWÄHLER,Aiwanger,Trump,USAElectio... | 599 | 15 | 1 | NaN | NaN | NaN | 1730885426 | Hubert Aiwanger @HubertAiwanger #Trump #USWahl... | |
| 12 | 12 | DBwG6eEuPIg | DBwG6eEuPIg | DBwG6eEuPIg | Carel Benjamin Schoeman, the attorney accused ... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 14:30:06 | photo | ... | NaN | 4439 | 569 | 1 | NaN | NaN | NaN | 1730298606 | news24 Meet Carel Schoeman; the attorney accus... | |

10 rows × 23 columns
Finally, don’t forget to save your updated DataFrame so your changes are not lost:
df_posts.to_csv('2024-11-11-Posts.csv')
After extracting audio from the video files, it’s important to save those audio files so that you can access them later for further analysis without having to re-extract them from the videos.
To do this, we compress the posts/ folder into a ZIP file. This includes the extracted audio, as well as any other processed files. We use the following command in Colab:
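The compression step can be sketched with Python's standard library as shown below. This is one possible approach, not necessarily the command used in the notebook. The example builds a throwaway posts/ folder inside a temporary directory so that it runs anywhere; in Colab you would point root_dir at the directory that contains the real posts/ folder.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Throwaway stand-in for the posts/ folder with one fake audio file.
    audio_dir = os.path.join(tmp_dir, 'posts', 'audio')
    os.makedirs(audio_dir)
    with open(os.path.join(audio_dir, 'example.mp3'), 'wb') as f:
        f.write(b'not real audio')

    # Create posts.zip containing the posts/ folder and everything below it.
    archive_path = shutil.make_archive(
        base_name=os.path.join(tmp_dir, 'posts'),
        format='zip',
        root_dir=tmp_dir,
        base_dir='posts',
    )

    # List the archive contents to confirm what was packed.
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(os.path.basename(archive_path))
print(archived_names)
```

As with the CSV file, the resulting archive only persists beyond the Colab session if it is written to or copied into Google Drive.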
We start by transforming the df_posts DataFrame into a “long format” where each type of text (captions, OCR text, transcriptions) is represented as a separate row.
This is achieved using the pd.melt() function, where:
id_vars=['id'] indicates that the id column should remain unchanged.
value_vars=['body', 'ocr_text', 'transcription_text'] are the columns we want to melt, each representing a different type of text.
var_name='Text Type' assigns a name to the new column that identifies the type of text.
value_name='Text' names the column containing the text values.
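The melt step described above can be sketched on a toy DataFrame as follows. The two example posts are invented and toy_posts is a placeholder name; the parameter values are the ones listed above, and df_long is the name used in the steps that follow.

```python
import pandas as pd

# Toy stand-in for df_posts; all values are invented.
toy_posts = pd.DataFrame({
    'id': ['A1', 'B2'],
    'body': ['Caption one', 'Caption two'],
    'ocr_text': ['TEXT IN IMAGE', ''],
    'transcription_text': ['', 'Spoken words'],
})

# Wide to long: one row per combination of post and text type.
df_long = pd.melt(
    toy_posts,
    id_vars=['id'],
    value_vars=['body', 'ocr_text', 'transcription_text'],
    var_name='Text Type',
    value_name='Text',
)
print(df_long)
```

Two posts with three text columns each yield six rows, including rows with empty text, which is why a filtering step is needed afterwards.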
We map the values in the 'Text Type' column to more descriptive names for clarity:
'body' becomes 'Caption'
'ocr_text' becomes 'OCR'
'transcription_text' becomes 'Transcription'
# Map the Text Type to more descriptive names
df_long['Text Type'] = df_long['Text Type'].map({
    'body': 'Caption',
    'ocr_text': 'OCR',
    'transcription_text': 'Transcription'
})
Add Image File References:
We create a new column named 'Image' that contains the name of the image file associated with each post. This is useful for linking text data to the corresponding images.
Important
The line below works with the original Zeeschuimer import notebook, where we only download one image per post. When using the updated version with gallery posts, we need to use the column 'media_filename' instead, and we need to add that column to id_vars in line 30.
We add a new column called 'Post Type' and set it to 'Post' for every row. This can be helpful if you later want to differentiate between different types of content (e.g., posts vs. stories). The Preprocessing Notebook on GitHub shows how to process both posts and stories: there, we apply OCR and Whisper twice, once for the posts dataset and once for the stories. We then combine the datasets, and the 'Post Type' column helps during the analysis stage (e.g. when we want to compare posts to stories).
df_long['Post Type'] = 'Post'
To make sure our “Text Master” table only contains meaningful entries, we need to filter out any rows where the text is missing or empty. This is done by keeping only rows that contain valid strings in the 'Text' column.
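One way to express this filter, sketched on invented data. The exact condition used in the notebook is not shown here, and df_text_master is a placeholder name. This version keeps rows whose 'Text' value is a string with at least one non-whitespace character, which also drops the NaN values that empty cells turn into when a CSV file is read back.

```python
import pandas as pd

# Toy long-format table; all values are invented.
df_long = pd.DataFrame({
    'id': ['A1', 'A1', 'B2', 'B2'],
    'Text Type': ['Caption', 'OCR', 'Caption', 'Transcription'],
    'Text': ['Caption one', '', float('nan'), '  Spoken words '],
})

# True only for real strings that contain something besides whitespace.
is_valid_text = df_long['Text'].apply(
    lambda value: isinstance(value, str) and value.strip() != ''
)

df_text_master = df_long[is_valid_text].reset_index(drop=True)
print(df_text_master)
```

The filter only decides which rows to keep; it does not alter the text itself, so leading and trailing whitespace remains until a later cleaning step.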
Conclusion
In summary, this session provided the foundations for using Python, pandas, and Jupyter notebooks in the computational analysis of multimodal social media data. Adhering to Tidy Data principles and integrating technologies like OCR and Whisper are integral to extracting and analyzing textual content from multimedia sources. In the next session, we will keep exploring the content through a textual lens and use prompting as a technique to classify texts as part of a computational content analysis.
References
Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. 2022. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 16 (1): 1–18. https://doi.org/10.1080/19312458.2021.2015574.