The analysis of textual data has a long tradition under the term Natural Language Processing (NLP). As noted by Bengfort, Bilbro, and Ojeda (2018), “Language is unstructured data that has been produced by people to be understood by other people”. This characterization highlights the contrast with structured or semi-structured data: structured data is organized in a way that computers can easily parse and analyze, whereas unstructured data like language requires more complex methods to be processed and understood.

In the context of Instagram, for example, CrowdTangle exports contain structured columns such as ‘User Name’, ‘Like Count’, or ‘Comment Count’. These pieces of data are quantifiable and can easily be sorted, filtered, or counted, e.g. using tools like Excel or Python’s pandas library. For instance, we can quickly determine the most active users by counting the number of rows associated with each username. In contrast, unstructured data is not organized in a predefined manner and is typically more challenging to process and analyze. The ‘Description’ column in our dataset, which contains the captions of Instagram posts, is a prime example. These captions, composed of sentences or paragraphs, cannot simply be counted or sorted; they require different analytical approaches to extract meaningful insights.

In our context, we refer to the collection of texts we analyze as a “corpus”. Each individual piece of text is called a “document”. Each document can be broken down into smaller units known as “features”. Features can be words, phrases, or even patterns of words, which we then use to quantify and analyze the text (compare Haim 2023, p. 230). For the goal of our research seminar, we follow three technical perspectives inspired by Haim (2023): 1. Frequency Analysis, 2. Contextual Analysis, and 3. Content Analysis.
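To make the contrast between structured and unstructured columns concrete, here is a minimal sketch. The rows are invented for illustration and do not come from a real CrowdTangle export; only the column names follow the ones mentioned above. Counting rows per user is a one-liner, whereas the captions first have to be split into features before anything can be counted.

```python
import pandas as pd

# Invented example rows; a real CrowdTangle export has many more columns.
posts = pd.DataFrame({
    "User Name": ["alice", "bob", "alice", "carol", "alice"],
    "Like Count": [10, 3, 25, 7, 1],
    "Description": [
        "Sunny day at the lake",
        "New recipe on the blog",
        "Lake views again",
        "Election day is coming",
        "Back at the lake",
    ],
})

# Structured data: count the rows per user to find the most active accounts.
posts_per_user = posts["User Name"].value_counts()
print(posts_per_user)

# Unstructured data: each caption is a document; splitting it into
# lower-cased words yields simple features that can be counted afterwards.
features = posts["Description"].str.lower().str.split()
print(features.iloc[0])
```

Splitting on whitespace is the simplest possible way to derive features; it ignores punctuation and word forms and is only meant to illustrate the idea.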
Schedule
In our first session, we begin with frequency analyses of our corpus, which involves counting words or phrases to identify the most common elements. This method provides a foundational understanding of the prominent themes or topics. Additionally, we learn to convert text embedded in images and videos into a machine-readable format, using OCR and automated audio transcription.
Next, we will engage in explorative text analysis. This step enhances our understanding of the corpus and lays the groundwork for quantitative content analysis. We plan to utilize tools like GPT (and possibly BERTopic) for an in-depth exploration of our documents.
Finally, we move towards more complex methods like classification or coding. These techniques allow us to categorize text into predefined groups or themes, enabling a more nuanced and quantitative understanding of the content. By applying these methods, we can, for example, classify Instagram captions into categories such as ‘promotional’, ‘personal’, ‘informative’, etc., based on their content and context.
Hands-On
We are working with Python and pandas; our data is structured in tables, also known as DataFrames. Each DataFrame (df) consists of rows and columns. These two dimensions allow us to store and structure data in different ways. One concept for storing research data in tables is Tidy Data (Wickham 2014). According to this standard:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
What, in the context of social media data, is an observation? Is it a post? I suggest starting by treating posts as observations, i.e. rows. Thus, we have one table for our corpus, consisting of one row per post with multiple columns for different variables, including an ID, possibly a link, a reference to the image / video, and one or more text variables for each post. When dealing with Instagram or TikTok posts, we might have three text columns: caption / description, OCR, and transcription. When dealing with stories, we have two: OCR and transcription.
Note
When dealing with more complex data, e.g. Instagram albums that may contain multiple images per post, we will have to reconsider this choice. In this case we might consider each observation to be one image / video, which has variables like OCR and transcription. Keeping the ID column for images and videos, we have a fixed reference to the original post, thus we may re-merge the data later on with the post metadata or combine variables across media for one post.
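The following sketch illustrates this idea with two small, invented tables: one row per post and one row per image / video. All IDs, file names, and column names (`caption`, `media_file`, `ocr`) are hypothetical. Because the media table keeps the post ID, it can be merged with the post metadata again, or collapsed into one value per post.

```python
import pandas as pd

# Hypothetical post-level table: one row per post.
posts = pd.DataFrame({
    "id": ["post_a", "post_b"],
    "caption": ["Album with two images", "Single video"],
})

# Hypothetical media-level table: one row per image / video,
# with the post ID kept as a reference to the original post.
media = pd.DataFrame({
    "id": ["post_a", "post_a", "post_b"],
    "media_file": [
        "media/images/post_a_1.jpg",
        "media/images/post_a_2.jpg",
        "media/videos/post_b.mp4",
    ],
    "ocr": ["SALE", "50% off", ""],
})

# Re-merge the media rows with the post metadata via the shared ID.
merged = media.merge(posts, on="id", how="left")
print(merged)

# Combine one variable across all media of a post,
# here: one OCR string per post (empty strings are skipped).
ocr_per_post = media.groupby("id")["ocr"].apply(
    lambda texts: " ".join(t for t in texts if t)
)
print(ocr_per_post)
```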
All data exported from CrowdTangle, 4CAT, and Zeeschuimer-F are saved as CSV files. Throughout the semester, we keep using this file format to save our progress. We work with multiple Jupyter notebooks, generally one notebook per task, which helps to keep our projects well structured. Each time we modify the df, we save the CSV file to our Google Drive / hard drive. In the two examples below we add an OCR and a Transcription column to our DataFrame, using one notebook for each task. After completing each task, we store the results in a file. While Google Drive provides file versioning to mitigate data loss in certain scenarios, I recommend saving your results to a new file during the experimental phase. This practice keeps your data safe until you have fully verified that your code works. Additionally, I recommend naming your files in a YYYY-MM-DD-descriptive-name.csv fashion. When working with Colab notebooks, I recommend keeping track of notebooks using notes / lists, e.g. using the Dataloom plugin for Obsidian.
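The naming scheme and the advice not to overwrite existing files can be sketched with two small helpers. Both function names (`dated_filename`, `next_free_path`) and the `-v1`, `-v2` suffix convention are my own assumptions for illustration, not part of the course notebooks.

```python
from datetime import date
from pathlib import Path


def dated_filename(description, day=None, suffix=".csv"):
    """Build a name in the YYYY-MM-DD-descriptive-name.csv pattern."""
    day = day or date.today()
    slug = "-".join(description.lower().split())
    return f"{day.isoformat()}-{slug}{suffix}"


def next_free_path(folder, filename):
    """Return a path in `folder` that does not overwrite an existing file."""
    original = Path(filename)
    path = Path(folder) / original.name
    counter = 1
    while path.exists():
        # Hypothetical convention: append -v1, -v2, ... before the suffix.
        path = Path(folder) / f"{original.stem}-v{counter}{original.suffix}"
        counter += 1
    return path


print(dated_filename("4CAT OCR", day=date(2023, 11, 24)))
```

In a notebook, the result could be passed to pandas, for example `df.to_csv(next_free_path("/content/drive/MyDrive", dated_filename("4cat ocr")))`, where the Drive folder is only an example location.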
The CSV files contain only metadata; the actual media files (images / videos) are saved to different locations. The OCR and Transcription notebooks below contain code to import media files from 4CAT and Zeeschuimer-F. I suggest saving the files to media/videos or media/images. Both notebooks introduce a column, image_file or video_file, that holds the relative location of each media file. Creating a new ZIP file from this folder structure and saving it to Google Drive allows us to use the media files in future notebooks (e.g. for image classification) without modifying the image_file or video_file columns again.
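As a rough sketch of this folder convention, the following block creates the media/images structure in a temporary folder, derives relative paths as they might be written to an image_file column, and compresses the folder. The post IDs and the archive name `media-backup` are invented, and empty placeholder files stand in for real images.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary folder so that the sketch does not touch real data.
project = Path(tempfile.mkdtemp())
image_dir = project / "media" / "images"
image_dir.mkdir(parents=True)

# Hypothetical post IDs; in the notebooks these come from the metadata CSV.
post_ids = ["post_a", "post_b"]

image_files = []
for post_id in post_ids:
    image_path = image_dir / f"{post_id}.jpg"
    image_path.write_bytes(b"")  # empty placeholder instead of a real image
    # Store the location relative to the project folder,
    # e.g. media/images/post_a.jpg
    image_files.append(image_path.relative_to(project).as_posix())

# Compress the media folder; the archive keeps the media/... structure,
# so the relative paths stay valid after unpacking elsewhere.
archive = shutil.make_archive(
    str(project / "media-backup"), "zip", root_dir=project, base_dir="media"
)

with zipfile.ZipFile(archive) as zf:
    archived = zf.namelist()

print(image_files)
print(archived)
```

Because the paths are relative, the same CSV works on any machine as long as the ZIP file is unpacked into the working directory of the notebook.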
Note
This page and all referenced notebooks deal with 4CAT and Zeeschuimer-F metadata and media files. In general, all information applies to instaloader as well. It’s advisable to use the --filename-pattern command line parameter to control the filenames of the media files, which makes it easier to map JSON metadata to the actual media objects. Once all posts / stories have been loaded using instaloader, I recommend reading all JSON files in a loop and creating a DataFrame (see Data Collection / Posts / Instaloader for more information and code examples).
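Reading all JSON files in a loop could look roughly like the sketch below. It assumes that every file holds a single JSON object and handles both plain `.json` and xz-compressed `.json.xz` files. The function name `load_json_records`, the `file_stem` key, and the demo content (`shortcode`, `caption`) are invented for illustration and do not reproduce instaloader's actual metadata schema.

```python
import json
import lzma
import tempfile
from pathlib import Path


def load_json_records(folder):
    """Read every *.json and *.json.xz file in `folder` into a list of dicts."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.name.endswith(".json.xz"):
            with lzma.open(path, "rt", encoding="utf-8") as handle:
                data = json.load(handle)
        elif path.name.endswith(".json"):
            with open(path, encoding="utf-8") as handle:
                data = json.load(handle)
        else:
            continue  # skip media files and anything else
        # Keep the file stem so that media files following the same
        # filename pattern can be matched to their metadata later.
        stem = path.name.split(".json")[0]
        records.append({"file_stem": stem, **data})
    return records


# Demo with two invented metadata files in a temporary folder.
demo = Path(tempfile.mkdtemp())
(demo / "post_a.json").write_text(
    json.dumps({"shortcode": "post_a", "caption": "Hello"}), encoding="utf-8"
)
with lzma.open(demo / "post_b.json.xz", "wt", encoding="utf-8") as handle:
    json.dump({"shortcode": "post_b", "caption": "World"}, handle)

records = load_json_records(demo)
print(records)
```

The resulting list of dictionaries can be turned into a table with `pd.DataFrame(records)`; nested metadata would need additional flattening first.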
Key Take-Aways
We organize our data following the Tidy Data principles
One row per post
One column per variable
We use one notebook per task
We save our progress to CSV files, either on our hard drive or Google Drive
We keep a reference to media files as a relative reference in our DataFrame
We keep our media files in the structure media/videos and media/images, which we compress to a ZIP file and keep on our Google Drive (or a central HDD location)
When working with experimental code, keep backups of your data file, do not overwrite the original file!
From Images / Videos to Text
Computational approaches to text analysis are established as part of computational social science research (Baden et al. 2022), and we can utilize them when dealing with visual and multimodal social media. Instagram posts often contain embedded text, and TikTok posts often contain an audio layer; both can be transformed into computer-readable text. For the first, we are going to use OCR; for the second, we apply Whisper. The following subchapters demonstrate the application of these techniques to extract textual content from images and videos. In the third subchapter, I demonstrate a simple application of corpus analytics for a first analysis of the social media content based on word frequencies.
OCR
We’re using easyocr. See the documentation for more complex configurations. On CPU only, this process takes from minutes to hours, depending on the number of images. OCR may also be outsourced (e.g. using the Google Vision API); see future sessions (and Memespector) for this.
!pip -q install easyocr
# Imports for OCR
import easyocr

reader = easyocr.Reader(['de', 'en'])
We define a very simple method that returns one string for all recognized text. The readtext method returns a list of text areas; in this example we simply concatenate these strings, so the order of the words is sometimes not correct.
Also, we save the file to Google Drive to save our results.
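The concatenation step described above can be sketched as follows. The helper name `join_ocr_results` and the optional `min_confidence` filter are assumptions of mine, not the notebook's code. The example input is invented, but it follows the shape that easyocr's `readtext` returns by default: a list of (bounding box, text, confidence) tuples.

```python
def join_ocr_results(results, min_confidence=0.0):
    """Concatenate all recognized text areas into one string.

    `results` is a list of (bounding box, text, confidence) tuples.
    Text areas are joined in the order given, which is not
    necessarily the reading order of the image.
    """
    texts = [
        text
        for _bbox, text, confidence in results
        if confidence >= min_confidence
    ]
    return " ".join(texts)


# Invented example in the shape of a readtext() result.
example_results = [
    ([[0, 0], [100, 0], [100, 20], [0, 20]], "Jetzt", 0.98),
    ([[0, 30], [160, 30], [160, 50], [0, 50]], "mitmachen", 0.91),
    ([[0, 60], [40, 60], [40, 70], [0, 70]], "x7", 0.12),
]

print(join_ocr_results(example_results))
print(join_ocr_results(example_results, min_confidence=0.5))

# In the notebook, the helper could be applied per image, e.g.:
# df["ocr"] = df["image_file"].apply(lambda p: join_ocr_results(reader.readtext(p)))
# (the column name "ocr" is an assumption)
```

Dropping low-confidence areas is optional; it trades a few lost words for less noise in the later frequency analysis.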
OpenAI offers Whisper transcriptions as a service, see their documentation. The notebook below takes you step-by-step through using the Whisper model on your own computer / colab.
Extract Audio from Video File
After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare the automated transcription.
!pip install -q moviepy
import os

# Set audio directory path
audio_path = "media/audio/"

# Check if the directory exists
if not os.path.exists(audio_path):
    # Create the directory if it does not exist
    os.makedirs(audio_path)
import os
from moviepy.editor import VideoFileClip

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]

        # Extract the audio from the video file
        audio = video.audio
        if audio is not None:
            sampling_rate = audio.fps
            # Swap only the file extension for "mp3"
            new_filename = os.path.splitext(filename)[0] + ".mp3"
            # Save the audio to a file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            new_filename = "No Audio"
            sampling_rate = -1

        # Update DataFrame inplace
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate
        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]

        # Close the video file
        video.close()
MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
MoviePy - Done.
We’ve extracted the audio content of each video file to an MP3 file in the media/audio folder. The files keep the name of the video file. We added new columns to the metadata for the audio duration and sampling_rate. If a video does not include an audio track, sampling_rate is set to -1, which we use to filter the df when transcribing the files.
df[df['video_file'] !=""].head()
| | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 227 | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 |

1 rows × 24 columns
Let’s update the zipped folder to include the audio files.
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media
And save the updated metadata file. Change filename when importing stories here!
df.to_csv(four_cat_file_path)
Transcriptions using Whisper
The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
The abstract from the paper is the following:
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
The next code snippet initializes the Whisper model. The transcribe_audio method is applied to each row of the DataFrame where sampling_rate > 0, i.e. only to rows that reference an audio file. Each audio file is transcribed using Whisper, and the result, a single text string, is saved to the transcript column.
Adjust the language variable according to your needs! The model is also capable of automated translation, e.g. setting language to english when processing German content results in an English translation of the speech. (Additionally, the task variable accepts translate).
import torch
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Folder with the extracted audio files
# (assumed to match audio_path from the extraction step)
audio_folder = "media/audio"

# Initialize the Whisper model pipeline for automatic speech recognition
# (not used by transcribe_audio below)
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    chunk_length_s=30,
    device=device,
)

# Load model and processor for multilingual support
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german'):
    try:
        # Load and resample audio file
        audio_path = f"{audio_folder}/{filename}"
        waveform, original_sample_rate = librosa.load(audio_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)

        # Get forced decoder IDs for the specified language
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")

        # Process the audio file in chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):  # 30 seconds chunks
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription
        return transcription.strip()
    except Exception as e:
        print(f"Error processing file {filename}: {e}")
        return ""

# Filter the DataFrame (a sampling_rate below 0 identifies items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)
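The chunking inside transcribe_audio is plain index arithmetic, which the following sketch isolates. The helper name `chunk_bounds` is hypothetical. The example uses the 67.89 seconds of the video from the metadata table, resampled to 16 kHz as in the function above.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices for consecutive fixed-size chunks."""
    chunk_size = sampling_rate * chunk_seconds
    return [
        (start, min(start + chunk_size, n_samples))
        for start in range(0, n_samples, chunk_size)
    ]


# 67.89 seconds at 16,000 samples per second
n_samples = round(67.89 * 16000)
bounds = chunk_bounds(n_samples)

for start, end in bounds:
    print(start, end, f"{(end - start) / 16000:.2f} s")
```

The clip is split into two full 30-second chunks and one shorter remainder. Since the cuts fall at fixed positions, a word may be split across two chunks, which is one plausible source of errors near chunk boundaries.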
df[df['video_file'] !=""].head()
| | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate | transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 | Ich bitte auf den abgelagerten Vortrag der Maa... |

1 rows × 25 columns
df.loc[4, 'transcript']
'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein. Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank. Applaus'
Overall, the transcriptions work well. The first sentence above, however, shows that we still can expect misinterpretations.
Among a variety of possibilities, we can, for example, look at the frequencies of the words contained in the corpus or examine the corpus for recurring themes it contains.
First we need to import all the required libraries once again. The Natural Language Toolkit (NLTK) gives us access to a variety of natural language processing functions (e.g. tokenisation, stop word removal, part-of-speech tagging, …).
When analysing word frequencies, we can use stop word lists to ignore words that occur frequently but are not relevant to us. We can easily download such a list. However, this can also be individually adapted to the purpose.
# Retrieve stopwords from GitHub
sw_json = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')
Now we can tokenise the existing text, remove the stop words or punctuation marks they contain, convert the words to lower case, or use bi-grams in addition to single-word tokens.
We then sum up the occurrences of the individual words and make the results available in a DataFrame.
def word_freq(text, punctuation=False, stop_words=False, lowercasing=False, bigrams=False):
    if punctuation:
        # Tokenizing, removing punctuation
        tokens = RegexpTokenizer(r'\w+').tokenize(text)  # https://regexr.com/
    else:
        # Tokenizing, w/o removing punctuation
        # tokens = text.split()
        tokens = word_tokenize(text)

    if stop_words:
        # Removing stopwords
        tokens = [w for w in tokens if w.lower() not in stop_words]

    if lowercasing:
        # Lower-casing
        tokens = [w.lower() for w in tokens]

    if bigrams:
        # Converting text tokens into bigrams
        tokens = nltk.bigrams(tokens)

    # Creating DataFrame
    freq = nltk.FreqDist(tokens)
    # display(freq)
    df = pd.DataFrame.from_dict(freq, orient='index')
    df.columns = ['Frequency']
    df.index.name = 'Term'

    # Here we calculate the total number of tokens in our frequency list
    total_tokens = sum(freq.values())  # sum([2,3,4,5,6])

    # Here we add a new column `Relative` (*100 for percentage)
    df['Relative'] = (df['Frequency'] / total_tokens) * 100

    return df
from pathlib import Path
import os

#@markdown Do you want bigrams included?
bigrams = True #@param {type:"boolean"}
#@markdown Should all words get lower cased before counting the occurrences?
lowercasing = True #@param {type:"boolean"}
#@markdown Do you want to exclude stopwords in your result list?
stopwords = True #@param {type:"boolean"}
#@markdown Do you want to remove punctuation before counting the occurrences?
punctuation = True #@param {type:"boolean"}
# Load stopwords file if necessary
if stopwords:
    stopwords = sw_json.json()

# Read source file and concat all texts
text = ' '.join(list(df[text_column]))

# Call word_freq() with specified parameters
df_freq = word_freq(text, punctuation=punctuation, stop_words=stopwords, lowercasing=lowercasing, bigrams=bigrams)

# Sort results for descending values
df_freq = df_freq.sort_values("Relative", ascending=False)
display(df_freq[0:10])
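To see what these processing steps do on a small scale, here is a simplified sketch that uses only the Python standard library instead of NLTK and pandas. The function `toy_word_freq`, the example sentence, and the one-word stop word list are invented. Unlike word_freq, it always lower-cases and removes punctuation, so it illustrates the counting logic rather than reproducing the function.

```python
import re
from collections import Counter


def toy_word_freq(text, stop_words=(), bigrams=False):
    """Count terms and return (term, frequency, relative frequency in %)."""
    # Tokenize on word characters, which also drops punctuation
    tokens = re.findall(r"\w+", text.lower())
    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    if bigrams:
        # Pair each token with its successor
        tokens = list(zip(tokens, tokens[1:]))
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(term, n, n / total * 100) for term, n in counts.most_common()]


text = "Der Landtag tagt. Der Landtag wählt."

unigram_freq = toy_word_freq(text, stop_words={"der"})
bigram_freq = toy_word_freq(text, stop_words={"der"}, bigrams=True)

print(unigram_freq)
print(bigram_freq)
```

Note that bigrams are formed after stop word removal, so words that were not adjacent in the original text can end up in the same bigram. The same holds for word_freq above.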
| Term | Frequency | Relative |
|---|---|---|
| (jüdisches, leben) | 5 | 1.259446 |
| (allerheiligen, allerseelen) | 4 | 1.007557 |
| (ilse, aigner) | 3 | 0.755668 |
| (bayerischer, landtag) | 3 | 0.755668 |
| (klare, haltung) | 2 | 0.503778 |
| (wünschen, einfach) | 2 | 0.503778 |
| (vaters, freundschaftliche) | 2 | 0.503778 |
| (tod, vaters) | 2 | 0.503778 |
| (günter, tod) | 2 | 0.503778 |
| (schwiegervater, günter) | 2 | 0.503778 |
Wordcloud
One way to visualise word frequencies and recurring themes in texts is a word cloud. It shows the most frequently occurring words in the text (similar to the table created earlier), with more frequent words depicted larger than less frequent ones.
First, we have to install the necessary library wordcloud.
!pip install -q wordcloud
The actual implementation of this approach is relatively simple. We need to combine all the texts into a single text, as we did in the previous step with the frequency analysis, and pass it to the imported library.
from wordcloud import WordCloud, STOPWORDS

def generate_wordcloud(text, path):
    text = ' '.join(list(text))

    # Generate a word cloud image
    wordcloud = WordCloud(background_color="white", width=1920, height=1080).generate(text)

    # Create the corresponding figure
    plt.imshow(wordcloud, interpolation="bilinear")  # interpolation used to render the image
    plt.axis("off")
    plt.figtext(0.5, 0.1, wordcloud_subcaption, wrap=True, horizontalalignment='center', fontsize=12)
    plt.savefig(path, dpi=300)
    plt.show()
Once again, we have the option of adjusting various parameters. Remember to specify the right file path, file name and column of your text data!
#@markdown Input for additional stopwords; whitespace separated
stopwords_extension_wc = '' #@param {type: "string"}
#@markdown Subcaption for the wordcloud, leave blank to ignore
wordcloud_subcaption = 'Markus S\xF6der' #@param {type: "string"}
Now all we have to do is load the stop word file, add our own additions and then trigger the creation of the word cloud using the function we created at the beginning.
The result image is saved in the defined data_path.
import matplotlib.pyplot as plt
import requests

# Retrieve stopwords from GitHub
r = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')
stop_words = r.json()

# Convert input into list
stopwords_extension_wc_list = stopwords_extension_wc.split(' ')
stop_words.extend(stopwords_extension_wc_list)

# Load the stop words into the WordCloud
STOPWORDS.update(stop_words)

generate_wordcloud(df[text_column], 'wordcloud.png')
In summary, this session provides us with the practical skills to use Python, pandas, and Jupyter notebooks for the computational analysis of multimodal social media data. Our adherence to Tidy Data principles and the integration of technologies like OCR and Whisper are integral to extracting and analyzing textual content from multimedia sources. In the next session we will keep exploring the content through a textual lens. Further, we will use prompting as a technique to classify texts as part of a computational content analysis.
Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A C G van der Velden. 2022. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 16 (1): 1–18. https://doi.org/10.1080/19312458.2021.2015574.
Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O’Reilly Media, Inc.
Haim, Mario. 2023. Computational Communication Science: Eine Einführung. Springer Fachmedien Wiesbaden.