Extract Audio from Video File

After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare for the automated transcription.

In [8]:

!pip install -q moviepy

In [9]:
import os

# Set audio directory path
audio_path = "media/audio/"

# Check if the directory exists
if not os.path.exists(audio_path):
    # Create the directory if it does not exist
    os.makedirs(audio_path)
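As an aside, os.makedirs also accepts an exist_ok flag, so the existence check can be folded into a single call with the same behavior:

import os

# Create the directory, silently succeeding if it already exists
os.makedirs("media/audio/", exist_ok=True)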
In [10]:
from moviepy.editor import VideoFileClip

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]

        # Extract the audio from the video file
        audio = video.audio

        if audio is not None:
            sampling_rate = audio.fps
            current_suffix = filename.split(".")[-1]
            new_filename = filename.replace(current_suffix, "mp3")

            # Save the audio to a file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            new_filename = "No Audio"
            sampling_rate = -1

        # Update DataFrame inplace
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate
        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]

        # Close the video file
        video.close()
MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
MoviePy - Done.
We’ve extracted the audio content of each video file to an mp3 file in the media/audio folder. The files keep the name of the video file. We added new columns to the metadata for the audio duration and sampling_rate. In case a video does not include an audio track, sampling_rate is set to -1, which we use to filter the df when transcribing the files.
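This filter is easy to check directly. A minimal sketch, assuming the df built above:

# Count items that were flagged as having an audio track
has_audio = df['sampling_rate'] > 0
print(f"{has_audio.sum()} of {len(df)} items include an audio track")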
In [11]:
df[df['video_file'] != ""].head()
|   | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 227 | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 |
1 rows × 24 columns
Let’s update the zipped folder to include the audio files.
In [12]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media
updating: media/ (stored 0%)
updating: media/videos/ (stored 0%)
updating: media/videos/CzD93SEIi-E.mp4 (deflated 0%)
adding: media/audio/ (stored 0%)
adding: media/audio/CzD93SEIi-E.mp3 (deflated 1%)
And save the updated metadata file. Change the filename here when importing stories!
In [14]:
df.to_csv(four_cat_file_path)
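When importing a different story set, point the export at its own file. A sketch with a purely hypothetical path, adjust it to your Drive layout:

# Hypothetical export path for a different story set
df.to_csv("/content/drive/MyDrive/my-stories-metadata.csv")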
Transcriptions using Whisper
The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
The abstract from the paper is the following:
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
– https://huggingface.co/docs/transformers/model_doc/whisper
In [15]:
!pip install -q transformers
The next code snippet initializes the Whisper model. The transcribe_audio function is applied to each row of the dataframe where sampling_rate > 0, i.e. only to those rows that reference audio files. Each audio file is transcribed using Whisper, and the result, a single text string, is saved to the transcript column.

Adjust the language variable according to your needs! The model is also capable of automated translation, e.g. setting language to english when processing German content results in an English translation of the speech. (Additionally, the task variable accepts translate.)
In [19]:
import torch
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Initialize the Whisper model pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    chunk_length_s=30,
    device=device,
)

# Load model and processor for multilingual support
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german'):
    try:
        # Load and resample the audio file (audio_path is the folder defined above)
        file_path = f"{audio_path}{filename}"
        waveform, original_sample_rate = librosa.load(file_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)

        # Get forced decoder IDs for the specified language
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")

        # Process the audio file in 30-second chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription

        return transcription.strip()
    except Exception as e:
        print(f"Error processing file {filename}: {e}")
        return ""

# Filter the DataFrame (sampling_rate == -1 identifies items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
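As noted above, getting an English translation instead of a German transcript only requires swapping the task when building the decoder prompt IDs. A minimal sketch, reusing processor, model, and the input_features of a chunk from the loop above:

# Force Whisper to translate German speech into English text
translate_ids = processor.get_decoder_prompt_ids(language='german', task='translate')
predicted_ids = model.generate(input_features, forced_decoder_ids=translate_ids)
english_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]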
In [20]:
df[df['video_file'] != ""].head()
|   | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate | transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 | Ich bitte auf den abgelagerten Vortrag der Maa... |
1 rows × 25 columns
In [21]:
df.loc[4, 'transcript']
'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein. Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank. Applaus'
Overall, the transcriptions work well. The first sentence above, however, shows that we can still expect misinterpretations.
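One possible mitigation is to prime the decoder with domain vocabulary. Recent transformers versions support passing prompt IDs to generate; a hedged sketch, where the prompt text is purely illustrative and processor, model, and input_features are reused from above:

# Prime Whisper with names it is likely to mishear (illustrative prompt text)
prompt_ids = processor.get_prompt_ids("Vereidigung des Ministerpräsidenten Markus Söder", return_tensors="pt")
predicted_ids = model.generate(input_features, prompt_ids=prompt_ids)
primed_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]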