```python
!pip install -q moviepy
```

## Extract Audio from Video File
After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare for automated transcription.
```python
import os

# Set audio directory path
audio_path = "media/audio/"

# Create the directory if it does not exist
if not os.path.exists(audio_path):
    os.makedirs(audio_path)
```

```python
from moviepy.editor import VideoFileClip

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]
        # Extract the audio track from the video file
        audio = video.audio
        if audio is not None:
            sampling_rate = audio.fps
            current_suffix = filename.split(".")[-1]
            new_filename = filename.replace(current_suffix, "mp3")
            # Save the audio to an mp3 file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            new_filename = "No Audio"
            sampling_rate = -1
        # Update the DataFrame in place
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate
        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]
        # Close the video file
        video.close()
```

MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
MoviePy - Done.
We’ve extracted the audio content of each video file to an mp3 file in the media/audio folder. The files keep the name of the video file. We added new columns to the metadata for audio duration and sampling_rate. In case a video does not include an audio track, sampling_rate is set to -1, which we use to filter the df when transcribing the files.
```python
df[df['video_file'] != ""].head()
```

|   | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 227 | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 |
1 rows × 24 columns
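The sampling_rate convention also makes it easy to double-check the extraction results; a minimal sketch:

```python
# Minimal sketch: split the DataFrame by the sampling_rate convention
with_audio = df[df['sampling_rate'] > 0]
without_audio = df[df['sampling_rate'] == -1]
print(f"{len(with_audio)} clips with audio, {len(without_audio)} clips without an audio track")
```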
Let’s update the zipped folder to include the audio files.
```python
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media
```

updating: media/ (stored 0%)
updating: media/videos/ (stored 0%)
updating: media/videos/CzD93SEIi-E.mp4 (deflated 0%)
adding: media/audio/ (stored 0%)
adding: media/audio/CzD93SEIi-E.mp3 (deflated 1%)
And save the updated metadata file. (Change the filename here when importing stories!)
```python
df.to_csv(four_cat_file_path)
```

## Transcriptions using Whisper
The Whisper model was proposed in *Robust Speech Recognition via Large-Scale Weak Supervision* by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
The abstract from the paper is the following:
> We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

– https://huggingface.co/docs/transformers/model_doc/whisper
```python
!pip install -q transformers
```

The next code snippet initializes the Whisper model. The `transcribe_audio` method is applied to each row of the dataframe where `sampling_rate > 0`, i.e. only to those rows with references to audio files. Each audio file is transcribed using Whisper; the result, a single text string, is saved to the `transcript` column.
Adjust the `language` variable according to your needs! The model is also capable of automated translation, e.g. setting `language` to `english` when processing German content results in an English translation of the speech. (Additionally, the `task` variable accepts `translate`.)
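For illustration, this is what the decoder prompt for a translation run would look like (a minimal sketch; it assumes the same `processor` object that is created in the snippet below):

```python
from transformers import WhisperProcessor

# Sketch: request an English *translation* instead of a German transcription
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="translate")
# Passing these IDs to model.generate(..., forced_decoder_ids=forced_decoder_ids),
# as in the snippet below, yields English text for German speech.
```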
```python
import torch
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Initialize the Whisper model pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    chunk_length_s=30,
    device=device,
)

# Load model and processor for multilingual support
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german'):
    try:
        # Load the audio file and resample it to the 16 kHz expected by Whisper
        file_path = f"{audio_path}{filename}"
        waveform, original_sample_rate = librosa.load(file_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)
        # Get forced decoder IDs for the specified language
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")
        # Process the audio file in 30-second chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription
        return transcription.strip()
    except Exception as e:
        print(f"Error processing file {filename}: {e}")
        return ""

# Filter the DataFrame (sampling_rate < 0 identifies items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)
```

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
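As an aside, the `pipe` object initialized above offers a built-in alternative to the manual chunking. A sketch of how it could be called on a single file (the `generate_kwargs` options for forcing language and task are an assumption and depend on your transformers version):

```python
# Hypothetical alternative: let the pipeline handle loading and 30 s chunking itself
result = pipe(
    "media/audio/CzD93SEIi-E.mp3",
    generate_kwargs={"language": "german", "task": "transcribe"},  # assumption: version-dependent
)
print(result["text"])
```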
```python
df[df['video_file'] != ""].head()
```

|   | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | ... | num_media | location_name | location_latlong | location_city | unix_timestamp | video_file | audio_file | duration | sampling_rate | transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | CzD93SEIi-E | CzD93SEIi-E | CzD93SEIi-E | Mitzuarbeiten für unser Land, Bayern zu entwic... | markus.soeder | Markus Söder | https://scontent-fra3-1.cdninstagram.com/v/t51... | 2023-10-31 12:06:23 | video | https://www.instagram.com/p/CzD93SEIi-E | ... | 1 | NaN | NaN | NaN | 1698753983 | CzD93SEIi-E.mp4 | CzD93SEIi-E.mp3 | 67.89 | 44100.0 | Ich bitte auf den abgelagerten Vortrag der Maa... |
1 rows × 25 columns
```python
df.loc[4, 'transcript']
```

'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein. Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank. Applaus'
Overall, the transcriptions work well. The first sentence above, however, shows that we can still expect misinterpretations.