Data Preprocessing

Author

Michael Achmann-Denkler

Published

November 18, 2024

Preprocessing is an important step in the computational analysis of social media data, especially when dealing with multimodal content such as images, videos, and audio. This chapter introduces techniques to transform visual and audio content into computer-readable text, allowing us to apply well-established text analysis methods (Baden et al. 2022) to platforms like Instagram and TikTok. Although these platforms are primarily visual, extracting text enables us to leverage advanced computational social science techniques to derive meaningful insights from embedded and spoken content.

The following sections provide a detailed overview of the preprocessing steps we use for text extraction. Instagram posts often contain embedded text, and videos posted on TikTok or Instagram frequently include audio layers, which we can convert into analyzable textual data. We use Optical Character Recognition (OCR) for text extraction from images, and for audio content, we apply the Whisper model for transcription. Additionally, we conclude this chapter with a simple application of corpus analytics to explore word frequencies within the extracted content.

Important

Update 2024: I updated the Notebook to use the OpenAI API for Whisper and included the new Tidal Tales file format. Additionally, the new notebook extracts the first frame from any video.

The original OCR Notebook and Whisper Notebook are still available.

OCR & Whisper

We’re using easyocr. See the documentation for more complex configurations. On CPU only, this process takes anywhere from minutes to hours, depending on the number of images. OCR may also be outsourced (e.g. to the Google Vision API); see future sessions (and Memespector) for this.

!pip install -q easyocr

Now we extract text from all images using the EasyOCR library. Our goal is to systematically process all images within a folder and extract the text content.

Step-by-Step Explanation:

  1. Setup and Libraries:
    • We use os to navigate the folder structure, tqdm to add a progress bar, and pandas to store the results if needed.
    • EasyOCR is initialized to recognize German text with reader = easyocr.Reader(['de']).
  2. Define the Path to the Images:
    • The variable images_root_path specifies the root folder containing subfolders of images. Each subfolder may represent a different author or category.
  3. Loop Through the Images:
    • We use os.walk() to iterate through each subfolder (root) and the files within it (files). This way, we can process each image individually.
    • We check if the file extension is .jpg, .jpeg, or .png to ensure we only process image files.
  4. Extract Text Using EasyOCR:
    • The reader.readtext(image_path) function reads the image and returns a list of recognized text areas.
    • We concatenate the recognized text from all detected areas into a single string using ' '.join().
  5. Store the OCR Results:
    • The extracted text for each image is stored in a dictionary called ocr_results. The key for each entry is a tuple (author, image_id), allowing us to easily identify where each piece of text came from.
import pandas as pd
import easyocr
import os
from tqdm.notebook import tqdm

# Define the path to the images folder
images_root_path = 'posts/images'

# Initialize the EasyOCR reader
reader = easyocr.Reader(['de'])

# Initialize a dictionary to store OCR results
ocr_results = {}

# Loop through each subfolder in the images folder
for root, dirs, files in os.walk(images_root_path):
    for file in tqdm(files, desc=f"Processing images in {root}"):
        if file.lower().endswith(('.jpg', '.jpeg', '.png')):  # Case-insensitive; add more image file extensions if needed
            image_path = os.path.join(root, file)
            author = os.path.basename(root)
            image_id, _ = os.path.splitext(file)

            # Read the image using EasyOCR
            text = reader.readtext(image_path)

            # Extracted text as a single string
            extracted_text = ' '.join([line[1] for line in text])

            # Store the result in the dictionary
            ocr_results[(author, image_id)] = extracted_text
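
Each entry returned by reader.readtext() is a (bounding box, text, confidence) tuple, which is why the loop above takes line[1]. The sketch below uses hand-written sample data in that shape, so it runs without EasyOCR. The sample values, the helper name join_ocr_text, and the min_confidence threshold are illustrative assumptions and not part of the original notebook.

```python
# Hand-written sample in the shape EasyOCR returns:
# (bounding box, recognized text, confidence between 0 and 1).
sample_result = [
    ([[10, 10], [200, 10], [200, 40], [10, 40]], "DER BESTE MOMENT", 0.93),
    ([[10, 50], [180, 50], [180, 80], [10, 80]], "MITGLIED ZU WERDEN", 0.88),
    ([[10, 90], [60, 90], [60, 110], [10, 110]], "1a6", 0.12),
]


def join_ocr_text(result, min_confidence=0.0):
    """Join recognized text areas into one string, skipping low-confidence areas."""
    return " ".join(text for _bbox, text, conf in result if conf >= min_confidence)


print(join_ocr_text(sample_result))                      # keeps all three areas
print(join_ocr_text(sample_result, min_confidence=0.5))  # drops the low-confidence area
```

Filtering by confidence is optional; whether a threshold helps, and which value is sensible, depends on the images at hand.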

After extracting text from the images, the next step is to add this information to your existing dataset.

We use pandas to add a new column to our df_posts DataFrame. The new column, named 'ocr_text', will contain the text extracted from each image. To achieve this, we use the apply() function to iterate over each row in the DataFrame. For each row, we look up the corresponding OCR text in our ocr_results dictionary using the (author, id) tuple as the key.

# Add a new column for OCR text in the dataframe
df_posts['ocr_text'] = df_posts.apply(lambda row: ocr_results.get((row['author'], row['id']), ''), axis=1)

Note the new ocr_text column:

df_posts.head()
Unnamed: 0 id thread_id parent_id body author author_fullname author_avatar_url timestamp type ... media_url hashtags num_likes num_comments num_media location_name location_latlong location_city unix_timestamp ocr_text
0 0 DBwPNDuNdAg DBwPNDuNdAg DBwPNDuNdAg Hallo Heidelberg! Zum ersten Mal zu viert hier... kathaschulze Katharina Schulze https://scontent.cdninstagram.com/v/t51.2885-1... 2024-10-30 15:42:29 photo ... https://scontent.cdninstagram.com/v/t51.2885-1... heidelberg,schlossheidelberg,badenwürttemberg,... 3816 51 1 Heidelberg 49.4122,8.71 NaN 1730302949
1 1 DCOcihAOOfr DCOcihAOOfr DCOcihAOOfr When the police at the Palm Ridge Magistrate's... news24 News24 https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-11 09:16:17 photo ... https://scontent.cdninstagram.com/v/t51.2885-1... NaN 358 13 1 NaN NaN NaN 1731316577 news24 Unlucky escape: Alleged serial rapist's...
2 2 DCHWFYTta-b DCHWFYTta-b DCHWFYTta-b Gemeinsam kämpfen wir für soziale Gerechtigkei... bayernspd BayernSPD https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-08 15:05:08 photo ... https://scontent.cdninstagram.com/v/t51.29350-... NaN 3 3 1 NaN NaN NaN 1731078308 DER BESTE MOMENT, MITGLIED ZU WERDEN, WAR GEST...
3 3 DCEHp67sb1U DCEHp67sb1U DCEHp67sb1U Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb... katrin.ebnersteiner Katrin Ebner-Steiner, MdL https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-07 09:01:20 photo ... https://scontent.cdninstagram.com/v/t39.30808-... NaN 108 7 1 NaN NaN NaN 1730970080 DIE CHAOS-AMPEL IST ZERBROCHENI DEUTSCHLAND BR...
4 4 DCBmO3HOcxk DCBmO3HOcxk DCBmO3HOcxk Die USA hat gewählt und sich für nationalistis... gruenebayern GRÜNE Bayern https://scontent-fra3-1.cdninstagram.com/v/t51... 2024-11-06 09:30:48 photo ... https://scontent-fra5-2.cdninstagram.com/v/t39... USWahl,Trump,Feminismus,Frauen,Politik,Grüne 1774 71 1 NaN NaN NaN 1730885448 Wenn die Welt verrückt spielt, braucht es eine...

5 rows × 22 columns
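
An empty ocr_text can mean two things: the image contains no recognizable text, or no entry was found for the (author, id) key, for example because a file name does not match the post id. One way to tell these apart is to compare the keys directly. The sketch below uses plain Python and made-up sample keys; the helper name match_report is hypothetical.

```python
def match_report(post_keys, results):
    """Split post keys into those with and without an entry in the results dict."""
    matched = [key for key in post_keys if key in results]
    missing = [key for key in post_keys if key not in results]
    return matched, missing


# Made-up sample data for illustration only
post_keys = [("author_a", "post1"), ("author_a", "post2"), ("author_b", "post3")]
ocr_results_sample = {
    ("author_a", "post1"): "SOME TEXT",
    ("author_b", "post3"): "",  # image was processed, but no text was detected
}

matched, missing = match_report(post_keys, ocr_results_sample)
print(f"{len(matched)} matched, {len(missing)} without OCR entry: {missing}")
```

With the real data, post_keys could be built with list(zip(df_posts['author'], df_posts['id'])) and compared against ocr_results.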

Once we’ve added the OCR text to our DataFrame, it’s important to save our work so that we can easily access it later without rerunning the entire OCR process.

To do this, we save the DataFrame to a CSV file:

df_posts.to_csv('2024-11-11-Posts.csv', index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload

However, since we’re working in a Colab environment, it’s recommended to save the file to your Google Drive to ensure persistence. This way, your results won’t be lost when the Colab session ends. You can modify the save path like this:

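# Save to Google Drive for persistence
# (requires a mounted Drive: from google.colab import drive; drive.mount('/content/drive'))
df_posts.to_csv('/content/drive/MyDrive/2024-11-11-Posts.csv', index=False)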

Automated Audio Transcription Using Whisper

Next, we want to automatically transcribe audio files using OpenAI’s Whisper model. We’ll use the openai Python package to interact with the Whisper API for this purpose. The following code snippet shows how to set up the transcription function:

import openai
from openai import OpenAI
from google.colab import userdata
import backoff

api_key = userdata.get('openai-forschung-mad')

client = OpenAI(api_key=api_key)


@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(audio_file):
    return client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

Explanation

  1. API Key:
    • We use userdata.get('openai-forschung-mad') to securely retrieve our OpenAI API key from Google Colab’s user data storage. Replace the name with that of your own stored key.
  2. OpenAI Client:
    • The OpenAI client is initialized using the retrieved api_key to authenticate our requests to the API.
  3. Backoff for Rate Limiting:
    • The @backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError)) decorator is used to retry the transcription request in case of a rate limit or API error, following an exponential backoff strategy. This helps manage interruptions if the API hits rate limits or faces temporary issues.
  4. Transcription Function:
    • The function run_request(audio_file) takes an audio file as input and returns the transcription generated by Whisper.
    • The model used is "whisper-1", the identifier under which the OpenAI API offers the Whisper speech recognition model.
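
To illustrate what exponential backoff means, here is a minimal plain-Python sketch of the idea. It is not how the backoff package is implemented (the package also adds random jitter to the waiting times by default, among other things). The function name retry_with_backoff and the parameters max_tries, base_delay, and sleep are assumptions chosen for this example.

```python
import time


def retry_with_backoff(func, exceptions, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_tries):
        try:
            return func()
        except exceptions:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demo: a function that fails twice before it succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, (ConnectionError,), sleep=waits.append)
print(result, waits)
```

The decorator in the notebook sets no upper limit on retries. backoff.on_exception accepts max_tries and max_time arguments if the loop should give up eventually.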

Extracting Audio from Videos and Transcribing Using Whisper

Next, we automate audio extraction from video files and transcribe that audio using OpenAI’s Whisper model. Let’s go over how this is done:

  1. Import Libraries:
    • We use several libraries here: os for file management, tqdm to show a progress bar, and moviepy for video processing.
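  2. Define Paths:
    • videos_root_path points to the folder containing the videos, and audio_save_path is the folder where the extracted audio files are stored.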
  3. Create Audio Directory:
    • We ensure that the directory to store extracted audio exists by using os.makedirs(audio_save_path, exist_ok=True).
  4. Process Each Video File:
    • We loop through all the video files in the specified directory using os.walk(). This way, we can handle multiple videos, even if they’re located in different subfolders.
  5. Extract Audio from Videos:
    • For each video, we create a VideoFileClip object with moviepy.
    • We then extract the audio from this video and save it as an MP3 file in the audio_save_path directory. The audio extraction is done using video_clip.audio.write_audiofile(audio_path, codec='libmp3lame').
  6. Transcribe Audio Using Whisper:
    • After extracting the audio, we pass it to the run_request(audio_file) function defined earlier, which sends it to OpenAI’s Whisper model for transcription.
    • The result (transcription_text) is then stored in the transcription_results dictionary, using (author, video_id) as the key for easier reference.
  7. Error Handling:
    • A try...except block is used to catch any exceptions during video processing or transcription. This way, if there is an issue with a particular file, the script will continue running for the remaining files.
import os
from tqdm.notebook import tqdm
import moviepy.editor as mp  # MoviePy 1.x; MoviePy 2.x removed moviepy.editor (use "from moviepy import VideoFileClip")
import pandas as pd


# Define the paths
videos_root_path = 'posts/videos'
images_root_path = 'posts/images'
audio_save_path = 'posts/audio'

# Ensure the audio directory exists
os.makedirs(audio_save_path, exist_ok=True)

# Initialize a dictionary to store transcription results
transcription_results = {}

# Loop through each subfolder and video file in the videos folder
for root, dirs, files in os.walk(videos_root_path):
    for file in tqdm(files, desc=f"Processing videos in {root}"):
        if file.lower().endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Case-insensitive; add more video file extensions if needed
            video_path = os.path.join(root, file)
            author = os.path.basename(root)
            video_id, _ = os.path.splitext(file)

            # Extract audio from the video and save as MP3
            try:
                video_clip = mp.VideoFileClip(video_path)
                audio_path = os.path.join(audio_save_path, f"{video_id}.mp3")
                video_clip.audio.write_audiofile(audio_path, codec='libmp3lame')
                video_clip.close()  # Release the file handle held by MoviePy

                # Transcribe the audio using OpenAI Whisper
                with open(audio_path, "rb") as audio_file:  # Close the file after the request
                    response = run_request(audio_file)
                transcription_text = response.text

                # Store the result in the dictionary
                transcription_results[(author, video_id)] = transcription_text
            except Exception as e:
                print(f"Error processing video {video_path}: {e}")
MoviePy - Writing audio in posts/audio/DCHYJgitebc.mp3
MoviePy - Done.
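
The (author, video_id) key is derived purely from the file path: the name of the containing folder becomes the author, and the file name without its extension becomes the id. The sketch below isolates that step. The helper name key_from_path and the example path are assumptions for illustration.

```python
import os


def key_from_path(path):
    """Return (author, media_id) for a path like <root>/<author>/<media_id>.<ext>."""
    folder, filename = os.path.split(path)
    author = os.path.basename(folder)
    media_id, _ext = os.path.splitext(filename)
    return author, media_id


# Illustrative path; the folder layout is assumed to be posts/videos/<author>/<id>.mp4
example_path = os.path.join("posts", "videos", "kathaschulze", "DCHYJgitebc.mp4")
print(key_from_path(example_path))
```

Because the later lookup uses row['author'] and row['id'], the folder and file names must match those columns exactly; otherwise the transcription for that post stays empty.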

Adding Transcription Results to Your DataFrame

After transcribing the audio from your videos, the next step is to integrate these transcriptions into your existing DataFrame (df_posts). This allows you to have both the visual (OCR from images) and auditory (transcriptions from videos) data all in one place, making it easier for further analysis.

To achieve this, we add a new column called transcription_text to the DataFrame. Here’s how it’s done:

# Add a new column for transcription text in the dataframe
df_posts['transcription_text'] = df_posts.apply(lambda row: transcription_results.get((row['author'], row['id']), ''), axis=1)

Explanation:

  1. New Column Creation:
    • We add the column 'transcription_text' to store the transcription corresponding to each post.
  2. Using apply():
    • The apply() function is used to iterate over each row in the DataFrame.
    • For each row, we extract the (author, id) tuple to look up the corresponding transcription in the transcription_results dictionary.
    • If a transcription is found, it is added to the 'transcription_text' column; otherwise, an empty string is used as the default value.

Now, take a look at the result. Note the new transcription_text column.

df_posts.sample(10)
Unnamed: 0 id thread_id parent_id body author author_fullname author_avatar_url timestamp type ... hashtags num_likes num_comments num_media location_name location_latlong location_city unix_timestamp ocr_text transcription_text
5 5 DB8hUbWNcXS DB8hUbWNcXS DB8hUbWNcXS Humusverlust auf Bayerns Feldern: Eine Gefahr ... ludwighartmann Ludwig Hartmann https://scontent-fra3-2.cdninstagram.com/v/t51... 2024-11-04 10:11:40 photo ... Landwirtschaft,Humus,Bodenschutz,Klimaschutz,H... 805 50 5 NaN NaN NaN 1730715100 ludwighartmannde CSU Lmndwirten wichtige Förde...
10 10 Cl06_FgImCM Cl06_FgImCM Cl06_FgImCM "Keepin' up with news from around the world! T... dh.news.catcher DH News Collector https://scontent-fra3-2.cdninstagram.com/v/t51... 2022-12-06 12:42:59 photo ... RobotReading,CoffeeAndNewspaper,LearningMoreEv... 1 0 1 NaN NaN NaN 1670330579
4 4 DCBmO3HOcxk DCBmO3HOcxk DCBmO3HOcxk Die USA hat gewählt und sich für nationalistis... gruenebayern GRÜNE Bayern https://scontent-fra3-1.cdninstagram.com/v/t51... 2024-11-06 09:30:48 photo ... USWahl,Trump,Feminismus,Frauen,Politik,Grüne 1774 71 1 NaN NaN NaN 1730885448 Wenn die Welt verrückt spielt, braucht es eine...
0 0 DBwPNDuNdAg DBwPNDuNdAg DBwPNDuNdAg Hallo Heidelberg! Zum ersten Mal zu viert hier... kathaschulze Katharina Schulze https://scontent.cdninstagram.com/v/t51.2885-1... 2024-10-30 15:42:29 photo ... heidelberg,schlossheidelberg,badenwürttemberg,... 3816 51 1 Heidelberg 49.4122,8.71 NaN 1730302949
6 6 DCB1aieNF-o DCB1aieNF-o DCB1aieNF-o Was für ein Horror. \n \nFühlt ihr euch auch, ... kathaschulze Katharina Schulze https://scontent-fra5-1.cdninstagram.com/v/t51... 2024-11-06 11:43:28 photo ... NaN 2622 140 1 NaN NaN NaN 1730893408
11 11 DCHYJgitebc DCHYJgitebc DCHYJgitebc #kanzlerera \nIch freu mich auf den Bundestags... kathaschulze Katharina Schulze https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-08 15:26:11 video ... kanzlerera 3140 163 1 Bayern, Germany 48.894107570617,11.583000803261 NaN 1731079571 Are you ready for it?
1 1 DCOcihAOOfr DCOcihAOOfr DCOcihAOOfr When the police at the Palm Ridge Magistrate's... news24 News24 https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-11 09:16:17 photo ... NaN 358 13 1 NaN NaN NaN 1731316577 news24 Unlucky escape: Alleged serial rapist's...
8 8 CmG5tP3ohLS CmG5tP3ohLS CmG5tP3ohLS Taking the time to appreciate the morning, one... dh.news.catcher DH News Collector https://scontent-fra3-2.cdninstagram.com/v/t51... 2022-12-13 12:18:08 photo ... RobotLife,UpliftingNews,aiart,stablediffusion 4 0 1 NaN NaN NaN 1670933888 36 ELNE AK8 HCSTFOIO A 1a6 KFoB. HEA An; EPST ...
13 13 DCBmLuTv_7C DCBmLuTv_7C DCBmLuTv_7C #Klartext von @hubertaiwanger\n\n#FREIEWÄHLER ... fw_bayern FREIE WÄHLER Bayern https://scontent.cdninstagram.com/v/t51.2885-1... 2024-11-06 09:30:26 photo ... Klartext,FREIEWÄHLER,Aiwanger,Trump,USAElectio... 599 15 1 NaN NaN NaN 1730885426 Hubert Aiwanger @HubertAiwanger #Trump #USWahl...
12 12 DBwG6eEuPIg DBwG6eEuPIg DBwG6eEuPIg Carel Benjamin Schoeman, the attorney accused ... news24 News24 https://scontent.cdninstagram.com/v/t51.2885-1... 2024-10-30 14:30:06 photo ... NaN 4439 569 1 NaN NaN NaN 1730298606 news24 Meet Carel Schoeman; the attorney accus...

10 rows × 23 columns

Finally, don’t forget to save your updated DataFrame so your changes are not lost:

df_posts.to_csv('2024-11-11-Posts.csv', index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload

After extracting audio from the video files, it’s important to save those audio files so that you can access them later for further analysis without having to re-extract them from the videos.

To do this, we compress the posts/ folder into a ZIP file. This includes the extracted audio, as well as any other processed files. We use the following command in Colab:

!zip -r --update posts.zip posts/
updating: posts/ (stored 0%)
  adding: posts/audio/ (stored 0%)
  adding: posts/audio/DCHYJgitebc.mp3 (deflated 2%)

Create the Text Master

In this step, we will create a “Text Master” table that contains all the different types of text data from your posts: captions, OCR-extracted text from images, and transcriptions from videos. The goal is to follow the tidy data principle: each observation should be in one row. Here, one text type from a post is considered one observation.

Step-by-Step Explanation:

  1. Melt the DataFrame:
    • We start by transforming the df_posts DataFrame into a “long format” where each type of text (captions, OCR text, transcriptions) is represented as a separate row.
    • This is achieved using the pd.melt() function, where:
      • id_vars=['id'] indicates that the id column should remain unchanged.
      • value_vars=['body', 'ocr_text', 'transcription_text'] are the columns we want to melt, each representing a different type of text.
      • var_name='Text Type' assigns a name to the new column that identifies the type of text.
      • value_name='Text' names the column containing the text values.
# Melt the dataframe
df_long = pd.melt(df_posts, id_vars=['id'],
                  value_vars=['body', 'ocr_text', 'transcription_text'],
                  var_name='Text Type',
                  value_name='Text')

Map Text Types to Descriptive Names:

  • We map the values in the 'Text Type' column to more descriptive names for clarity:
    • 'body' becomes 'Caption'
    • 'ocr_text' becomes 'OCR'
    • 'transcription_text' becomes 'Transcription'
# Map the Text Type to more descriptive names
df_long['Text Type'] = df_long['Text Type'].map({
    'body': 'Caption',
    'ocr_text': 'OCR',
    'transcription_text': 'Transcription'
})

Add Image File References:

  • We create a new column named 'Image' that contains the name of the image file associated with each post. This is useful for linking text data to the corresponding images.
Important

The line below works with the original Zeeschuimer import notebook, where we only download one image per post. When using the updated version with gallery posts, use the column 'media_filename' instead and add it to id_vars in the pd.melt() call above.

df_long['Image'] = df_long['id'].apply(lambda x: f'{x}.jpg')

Rename Columns for Clarity:

  • We rename the 'id' column to 'Identifier' for a clearer understanding of what this column represents.
df_long.rename(columns={'id': 'Identifier'}, inplace=True)

Add Post Type Column:

  • We add a new column called 'Post Type' and set it to 'Post' for every row. This can be helpful if you later want to differentiate between different types of content (e.g., posts vs. stories). The Preprocessing Notebook on GitHub shows how to process both posts and stories: there, we apply OCR and Whisper twice, once for the posts dataset and once for the stories. We then combine the datasets, and the 'Post Type' column helps during the analysis stage (e.g., when we want to compare posts to stories).
df_long['Post Type'] = 'Post'

To make sure our “Text Master” table only contains meaningful entries, we need to filter out any rows where the text is missing or empty. This is done by keeping only rows that contain valid strings in the 'Text' column.
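
A minimal sketch of this filter, shown on a small made-up df_long so that it runs on its own. The sample rows are invented, and the exact filter in the original notebook may differ.

```python
import pandas as pd

# Made-up sample in the shape of df_long
df_long = pd.DataFrame({
    "Identifier": ["post1", "post1", "post1", "post2"],
    "Text Type": ["Caption", "OCR", "Transcription", "Caption"],
    "Text": ["Hello world", "", None, "   "],
})

# Keep only rows whose 'Text' is a string with non-whitespace content
is_valid_text = df_long["Text"].apply(lambda t: isinstance(t, str) and t.strip() != "")
df_long = df_long[is_valid_text].reset_index(drop=True)

print(df_long)
```

Only the first sample row survives: empty strings, missing values, and whitespace-only strings are all dropped.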

Conclusion

In summary, this session provided us with the foundations to use Python, pandas, and Jupyter notebooks for the computational analysis of multimodal social media data. Our adherence to Tidy Data principles and the integration of technologies like OCR and Whisper are integral to extracting and analyzing textual content from multimedia sources. In the next session, we will keep exploring the content through a textual lens. Further, we will use prompting as a technique to classify texts as part of a computational content analysis.

References

Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A C G van der Velden. 2022. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 16 (1): 1–18. https://doi.org/10.1080/19312458.2021.2015574.

Citation

BibTeX citation:
@online{achmann-denkler2024,
  author = {Achmann-Denkler, Michael},
  title = {Data {Preprocessing}},
  date = {2024-11-18},
  url = {https://social-media-lab.net/processing/preprocessing.html},
  doi = {10.5281/zenodo.10039756},
  langid = {en}
}
For attribution, please cite this work as:
Achmann-Denkler, Michael. 2024. “Data Preprocessing.” November 18, 2024. https://doi.org/10.5281/zenodo.10039756.