We’re using EasyOCR; see its documentation for more complex configurations. On CPU only, this process takes from minutes to hours, depending on the number of images. OCR can also be outsourced (e.g. to the Google Vision API); see future sessions (and Memespector) for this.
!pip install -q easyocr
Now we extract text from all images using the EasyOCR library. Our goal is to systematically process all images within a folder and extract the text content.
Step-by-Step Explanation:
Setup and Libraries:
We use os to navigate the folder structure, tqdm to add a progress bar, and pandas to store the results if needed.
EasyOCR is initialized to recognize German text with reader = easyocr.Reader(['de']).
Define the Path to the Images:
The variable images_root_path specifies the root folder containing subfolders of images. Each subfolder may represent a different author or category.
Loop Through the Images:
We use os.walk() to iterate through each subfolder (root) and the files within it (files). This way, we can process each image individually.
We check if the file extension is .jpg, .jpeg, or .png to ensure we only process image files.
Extract Text Using EasyOCR:
The reader.readtext(image_path) function reads the image and returns a list of recognized text areas.
We concatenate the recognized text from all detected areas into a single string using ' '.join().
Store the OCR Results:
The extracted text for each image is stored in a dictionary called ocr_results. The key for each entry is a tuple (author, image_id), allowing us to easily identify where each piece of text came from.
import pandas as pd
import easyocr
import os
from tqdm.notebook import tqdm

# Define the path to the images folder
images_root_path = 'posts/images'

# Initialize the EasyOCR reader
reader = easyocr.Reader(['de'])

# Initialize a dictionary to store OCR results
ocr_results = {}

# Loop through each subfolder in the images folder
for root, dirs, files in os.walk(images_root_path):
    for file in tqdm(files, desc=f"Processing images in {root}"):
        if file.endswith(('.jpg', '.jpeg', '.png')):  # Add more image file extensions if needed
            image_path = os.path.join(root, file)
            author = os.path.basename(root)
            image_id, _ = os.path.splitext(file)

            # Read the image using EasyOCR
            text = reader.readtext(image_path)

            # Extracted text as a single string
            extracted_text = ' '.join([line[1] for line in text])

            # Store the result in the dictionary
            ocr_results[(author, image_id)] = extracted_text
After extracting text from the images, the next step is to add this information to your existing dataset.
We use pandas to add a new column to our df_posts DataFrame. The new column, named 'ocr_text', will contain the text extracted from each image. To achieve this, we use the apply() function to iterate over each row in the DataFrame. For each row, we look up the corresponding OCR text in our ocr_results dictionary using the (author, id) tuple as the key.
# Add a new column for OCR text in the dataframe
df_posts['ocr_text'] = df_posts.apply(lambda row: ocr_results.get((row['author'], row['id']), ''), axis=1)
Note the new ocr_text column:
df_posts.head()
| | Unnamed: 0 | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | ... | media_url | hashtags | num_likes | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | ocr_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DBwPNDuNdAg | DBwPNDuNdAg | DBwPNDuNdAg | Hallo Heidelberg! Zum ersten Mal zu viert hier... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 15:42:29 | photo | ... | https://scontent.cdninstagram.com/v/t51.2885-1... | heidelberg,schlossheidelberg,badenwürttemberg,... | 3816 | 51 | 1 | Heidelberg | 49.4122,8.71 | NaN | 1730302949 | |
| 1 | 1 | DCOcihAOOfr | DCOcihAOOfr | DCOcihAOOfr | When the police at the Palm Ridge Magistrate's... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-11 09:16:17 | photo | ... | https://scontent.cdninstagram.com/v/t51.2885-1... | NaN | 358 | 13 | 1 | NaN | NaN | NaN | 1731316577 | news24 Unlucky escape: Alleged serial rapist's... |
| 2 | 2 | DCHWFYTta-b | DCHWFYTta-b | DCHWFYTta-b | Gemeinsam kämpfen wir für soziale Gerechtigkei... | bayernspd | BayernSPD | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-08 15:05:08 | photo | ... | https://scontent.cdninstagram.com/v/t51.29350-... | NaN | 3 | 3 | 1 | NaN | NaN | NaN | 1731078308 | DER BESTE MOMENT, MITGLIED ZU WERDEN, WAR GEST... |
| 3 | 3 | DCEHp67sb1U | DCEHp67sb1U | DCEHp67sb1U | Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb... | katrin.ebnersteiner | Katrin Ebner-Steiner, MdL | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-07 09:01:20 | photo | ... | https://scontent.cdninstagram.com/v/t39.30808-... | NaN | 108 | 7 | 1 | NaN | NaN | NaN | 1730970080 | DIE CHAOS-AMPEL IST ZERBROCHENI DEUTSCHLAND BR... |
| 4 | 4 | DCBmO3HOcxk | DCBmO3HOcxk | DCBmO3HOcxk | Die USA hat gewählt und sich für nationalistis... | gruenebayern | GRÜNE Bayern | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2024-11-06 09:30:48 | photo | ... | https://scontent-fra5-2.cdninstagram.com/v/t39... | USWahl,Trump,Feminismus,Frauen,Politik,Grüne | 1774 | 71 | 1 | NaN | NaN | NaN | 1730885448 | Wenn die Welt verrückt spielt, braucht es eine... |

5 rows × 22 columns
Once we’ve added the OCR text to our DataFrame, it’s important to save our work so that we can easily access it later without rerunning the entire OCR process.
To do this, we save the DataFrame to a CSV file:
df_posts.to_csv('2024-11-11-Posts.csv')
However, since we’re working in a Colab environment, it’s recommended to save the file to your Google Drive to ensure persistence. This way, your results won’t be lost when the Colab session ends. You can modify the save path like this:
# Save to Google Drive for persistence
df_posts.to_csv('/content/drive/MyDrive/2024-11-11-Posts.csv')
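If Drive is not yet mounted in the session, mount it first using the standard Colab helper (skip this if you already mounted it earlier in the notebook):

# Mount Google Drive in Colab so files written to /content/drive persist
from google.colab import drive
drive.mount('/content/drive')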
Automated Audio Transcription Using Whisper
Next, we want to automatically transcribe audio files using OpenAI’s Whisper model. We’ll use the openai Python package to interact with the Whisper API. The transcription function is set up as follows:
API Key:
We use userdata.get('openai-forschung-mad') to securely retrieve our OpenAI API key from Google Colab’s user data (Secrets) storage.
OpenAI Client:
The OpenAI client is initialized using the retrieved api_key to authenticate our requests to the API.
Backoff for Rate Limiting:
The @backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError)) decorator is used to retry the transcription request in case of a rate limit or API error, following an exponential backoff strategy. This helps manage interruptions if the API hits rate limits or faces temporary issues.
Transcription Function:
The function run_request(audio_file) takes an audio file as input and returns the transcription generated by Whisper.
The model used is "whisper-1", OpenAI’s hosted Whisper speech recognition model.
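Putting these pieces together, a sketch of this setup might look as follows (the secret name 'openai-forschung-mad' follows the text above; replace it with the name of your own stored key):

import backoff
import openai
from google.colab import userdata
from openai import OpenAI

# Retrieve the OpenAI API key from Colab's user data (Secrets) storage
api_key = userdata.get('openai-forschung-mad')

# Initialize the OpenAI client with the retrieved key
client = OpenAI(api_key=api_key)

# Retry with exponential backoff on rate limits and transient API errors
@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(audio_file):
    # Send the audio file to the Whisper model and return the transcription response
    return client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )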
Extracting Audio from Videos and Transcribing Using Whisper
Next, we automate audio extraction from video files and transcribe that audio using OpenAI’s Whisper model. Let’s go over how this is done:
Import Libraries:
We use several libraries here: os for file management, tqdm to show a progress bar, and moviepy for video processing.
We ensure that the directory to store extracted audio exists by using os.makedirs(audio_save_path, exist_ok=True).
Process Each Video File:
We loop through all the video files in the specified directory using os.walk(). This way, we can handle multiple videos, even if they’re located in different subfolders.
Extract Audio from Videos:
For each video, we create a VideoFileClip object with moviepy.
We then extract the audio from this video and save it as an MP3 file in the audio_save_path directory. The audio extraction is done using video_clip.audio.write_audiofile(audio_path, codec='libmp3lame').
Transcribe Audio Using Whisper:
After extracting the audio, we pass it to the run_request(audio_file) function defined earlier, which sends it to OpenAI’s Whisper model for transcription.
The result (transcription_text) is then stored in the transcription_results dictionary, using (author, video_id) as the key for easier reference.
Error Handling:
A try...except block is used to catch any exceptions during video processing or transcription. This way, if there is an issue with a particular file, the script will continue running for the remaining files.
import os
from tqdm.notebook import tqdm
import moviepy.editor as mp
import pandas as pd

# Define the paths
videos_root_path = 'posts/videos'
images_root_path = 'posts/images'
audio_save_path = 'posts/audio'

# Ensure the audio directory exists
os.makedirs(audio_save_path, exist_ok=True)

# Initialize a dictionary to store transcription results
transcription_results = {}

# Loop through each subfolder and video file in the videos folder
for root, dirs, files in os.walk(videos_root_path):
    for file in tqdm(files, desc=f"Processing videos in {root}"):
        if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Add more video file extensions if needed
            video_path = os.path.join(root, file)
            author = os.path.basename(root)
            video_id, _ = os.path.splitext(file)

            # Extract audio from the video and save as MP3
            try:
                video_clip = mp.VideoFileClip(video_path)
                audio_path = os.path.join(audio_save_path, f"{video_id}.mp3")
                video_clip.audio.write_audiofile(audio_path, codec='libmp3lame')

                # Transcribe the audio using OpenAI Whisper
                audio_file = open(audio_path, "rb")
                response = run_request(audio_file)
                transcription_text = response.text

                # Store the result in the dictionary
                transcription_results[(author, video_id)] = transcription_text
            except Exception as e:
                print(f"Error processing video {video_path}: {e}")
MoviePy - Writing audio in posts/audio/DCHYJgitebc.mp3
After transcribing the audio from your videos, the next step is to integrate these transcriptions into your existing DataFrame (df_posts). This allows you to have both the visual (OCR from images) and auditory (transcriptions from videos) data all in one place, making it easier for further analysis.
To achieve this, we add a new column called transcription_text to the DataFrame. Here’s how it’s done:
# Add a new column for transcription text in the dataframe
df_posts['transcription_text'] = df_posts.apply(lambda row: transcription_results.get((row['author'], row['id']), ''), axis=1)
Explanation:
New Column Creation:
We add the column 'transcription_text' to store the transcription corresponding to each post.
Using apply():
The apply() function is used to iterate over each row in the DataFrame.
For each row, we extract the (author, id) tuple to look up the corresponding transcription in the transcription_results dictionary.
If a transcription is found, it is added to the 'transcription_text' column; otherwise, an empty string is used as the default value.
Now, take a look at the result. Note the new transcription_text column.
df_posts.sample(10)
| | Unnamed: 0 | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | ... | hashtags | num_likes | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | ocr_text | transcription_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | DB8hUbWNcXS | DB8hUbWNcXS | DB8hUbWNcXS | Humusverlust auf Bayerns Feldern: Eine Gefahr ... | ludwighartmann | Ludwig Hartmann | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2024-11-04 10:11:40 | photo | ... | Landwirtschaft,Humus,Bodenschutz,Klimaschutz,H... | 805 | 50 | 5 | NaN | NaN | NaN | 1730715100 | ludwighartmannde CSU Lmndwirten wichtige Förde... | |
| 10 | 10 | Cl06_FgImCM | Cl06_FgImCM | Cl06_FgImCM | "Keepin' up with news from around the world! T... | dh.news.catcher | DH News Collector | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2022-12-06 12:42:59 | photo | ... | RobotReading,CoffeeAndNewspaper,LearningMoreEv... | 1 | 0 | 1 | NaN | NaN | NaN | 1670330579 | | |
| 4 | 4 | DCBmO3HOcxk | DCBmO3HOcxk | DCBmO3HOcxk | Die USA hat gewählt und sich für nationalistis... | gruenebayern | GRÜNE Bayern | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2024-11-06 09:30:48 | photo | ... | USWahl,Trump,Feminismus,Frauen,Politik,Grüne | 1774 | 71 | 1 | NaN | NaN | NaN | 1730885448 | Wenn die Welt verrückt spielt, braucht es eine... | |
| 0 | 0 | DBwPNDuNdAg | DBwPNDuNdAg | DBwPNDuNdAg | Hallo Heidelberg! Zum ersten Mal zu viert hier... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 15:42:29 | photo | ... | heidelberg,schlossheidelberg,badenwürttemberg,... | 3816 | 51 | 1 | Heidelberg | 49.4122,8.71 | NaN | 1730302949 | | |
| 6 | 6 | DCB1aieNF-o | DCB1aieNF-o | DCB1aieNF-o | Was für ein Horror. \n \nFühlt ihr euch auch, ... | kathaschulze | Katharina Schulze | https://scontent-fra5-1.cdninstagram.com/v/t51... | 2024-11-06 11:43:28 | photo | ... | NaN | 2622 | 140 | 1 | NaN | NaN | NaN | 1730893408 | | |
| 11 | 11 | DCHYJgitebc | DCHYJgitebc | DCHYJgitebc | #kanzlerera \nIch freu mich auf den Bundestags... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-08 15:26:11 | video | ... | kanzlerera | 3140 | 163 | 1 | Bayern, Germany | 48.894107570617,11.583000803261 | NaN | 1731079571 | | Are you ready for it? |
| 1 | 1 | DCOcihAOOfr | DCOcihAOOfr | DCOcihAOOfr | When the police at the Palm Ridge Magistrate's... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-11 09:16:17 | photo | ... | NaN | 358 | 13 | 1 | NaN | NaN | NaN | 1731316577 | news24 Unlucky escape: Alleged serial rapist's... | |
| 8 | 8 | CmG5tP3ohLS | CmG5tP3ohLS | CmG5tP3ohLS | Taking the time to appreciate the morning, one... | dh.news.catcher | DH News Collector | https://scontent-fra3-2.cdninstagram.com/v/t51... | 2022-12-13 12:18:08 | photo | ... | RobotLife,UpliftingNews,aiart,stablediffusion | 4 | 0 | 1 | NaN | NaN | NaN | 1670933888 | 36 ELNE AK8 HCSTFOIO A 1a6 KFoB. HEA An; EPST ... | |
| 13 | 13 | DCBmLuTv_7C | DCBmLuTv_7C | DCBmLuTv_7C | #Klartext von @hubertaiwanger\n\n#FREIEWÄHLER ... | fw_bayern | FREIE WÄHLER Bayern | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-11-06 09:30:26 | photo | ... | Klartext,FREIEWÄHLER,Aiwanger,Trump,USAElectio... | 599 | 15 | 1 | NaN | NaN | NaN | 1730885426 | Hubert Aiwanger @HubertAiwanger #Trump #USWahl... | |
| 12 | 12 | DBwG6eEuPIg | DBwG6eEuPIg | DBwG6eEuPIg | Carel Benjamin Schoeman, the attorney accused ... | news24 | News24 | https://scontent.cdninstagram.com/v/t51.2885-1... | 2024-10-30 14:30:06 | photo | ... | NaN | 4439 | 569 | 1 | NaN | NaN | NaN | 1730298606 | news24 Meet Carel Schoeman; the attorney accus... | |

10 rows × 23 columns
Finally, don’t forget to save your updated DataFrame so your changes are not lost:
df_posts.to_csv('2024-11-11-Posts.csv')
After extracting audio from the video files, it’s important to save those audio files so that you can access them later for further analysis without having to re-extract them from the videos.
To do this, we compress the posts/ folder into a ZIP file. This includes the extracted audio, as well as any other processed files. We use the following command in Colab:
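For example (a minimal sketch; the archive name is arbitrary, and you may prefer to write it directly to a folder on your Drive):

# Compress the posts/ folder (images, videos and extracted audio) into a ZIP archive
!zip -r -q posts.zip posts/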
Next, we build a combined “Text Master” table. We start by transforming the df_posts DataFrame into a “long format” where each type of text (captions, OCR text, transcriptions) is represented as a separate row.
This is achieved using the pd.melt() function (see the sketch after this list), where:
id_vars=['id'] indicates that the id column should remain unchanged.
value_vars=['body', 'ocr_text', 'transcription_text'] are the columns we want to melt, each representing a different type of text.
var_name='Text Type' assigns a name to the new column that identifies the type of text.
value_name='Text' names the column containing the text values.
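A sketch of this step, using exactly the parameters listed above:

# Reshape df_posts from wide to long format: one row per (post, text type) combination
df_long = pd.melt(
    df_posts,
    id_vars=['id'],
    value_vars=['body', 'ocr_text', 'transcription_text'],
    var_name='Text Type',
    value_name='Text'
)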
We map the values in the 'Text Type' column to more descriptive names for clarity:
'body' becomes 'Caption'
'ocr_text' becomes 'OCR'
'transcription_text' becomes 'Transcription'
# Map the Text Type to more descriptive names
df_long['Text Type'] = df_long['Text Type'].map({
    'body': 'Caption',
    'ocr_text': 'OCR',
    'transcription_text': 'Transcription'
})
Add Image File References:
We create a new column named 'Image' that contains the name of the image file associated with each post. This is useful for linking text data to the corresponding images.
Important
The line below works with the original Zeeschuimer import notebook, where we only download one image per post. When using the updated version with gallery posts, use the 'media_filename' column instead and add it to id_vars in line 30.
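A minimal sketch of that line, assuming each post has exactly one image named after its post id with a .jpg extension (a hypothetical naming scheme; adapt the extension to your data, or switch to 'media_filename' for gallery posts as noted above):

# Link each row to its image file (assumes one image per post, named <id>.jpg)
df_long['Image'] = df_long['id'].astype(str) + '.jpg'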
We add a new column called 'Post Type' and set it to 'Post' for every row. This is helpful if you later want to differentiate between different types of content (e.g. posts vs. stories). The Preprocessing Notebook on GitHub shows how to process both posts and stories: there, OCR and Whisper are applied twice, once for the posts dataset and once for the stories, and the two datasets are then combined. The 'Post Type' column then helps during the analysis stage (e.g. when comparing posts to stories).
df_long['Post Type'] = 'Post'
To make sure our “Text Master” table only contains meaningful entries, we need to filter out any rows where the text is missing or empty. This is done by keeping only rows that contain valid strings in the 'Text' column.
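A sketch of this filtering step, assuming missing text can appear either as NaN or as an empty string:

# Keep only rows whose 'Text' value is a non-empty string
df_long = df_long[df_long['Text'].apply(lambda t: isinstance(t, str) and t.strip() != '')]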