Transcription using Whisper

Extract Audio from Video File

After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare for the automated transcription.

In [8]:
!pip install -q moviepy
In [9]:
import os

# Set audio directory path
audio_path = "media/audio/"

# Check if the directory exists
if not os.path.exists(audio_path):
    # Create the directory if it does not exist
    os.makedirs(audio_path)
In [10]:
from moviepy.editor import *

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]

        # Extract the audio from the video file
        audio = video.audio

        if audio is not None:
            sampling_rate = audio.fps
            current_suffix = filename.split(".")[-1]
            new_filename = filename.replace(current_suffix, "mp3")

            # Save the audio to a file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            new_filename = "No Audio"
            sampling_rate = -1

        # Update DataFrame inplace
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate

        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]

        # Close the video file
        video.close()
MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
                                                                      
MoviePy - Done.

We’ve extracted the audio track of each video file to an mp3 file in the media/audio folder. The audio files keep the name of the corresponding video file. We also added new columns to the metadata for the audio duration and sampling_rate. If a video does not contain an audio track, sampling_rate is set to -1, which we use to filter the DataFrame when transcribing the files.
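
As a quick sanity check, we can count how many rows actually received an audio file. This is just a minimal sketch based on the columns created above:

# Rows with an extracted audio track vs. videos without audio (sampling_rate == -1)
print("with audio:", (df['sampling_rate'] > 0).sum())
print("without audio:", (df['sampling_rate'] == -1).sum())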

In [11]:
df[df['video_file'] != ""].head()
id thread_id parent_id body author author_fullname author_avatar_url timestamp type url ... num_comments num_media location_name location_latlong location_city unix_timestamp video_file audio_file duration sampling_rate
4 CzD93SEIi-E CzD93SEIi-E CzD93SEIi-E Mitzuarbeiten für unser Land, Bayern zu entwic... markus.soeder Markus Söder https://scontent-fra3-1.cdninstagram.com/v/t51... 2023-10-31 12:06:23 video https://www.instagram.com/p/CzD93SEIi-E ... 227 1 NaN NaN NaN 1698753983 CzD93SEIi-E.mp4 CzD93SEIi-E.mp3 67.89 44100.0

1 rows × 24 columns

Let’s update the zipped folder to include the audio files.

In [12]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media
updating: media/ (stored 0%)
updating: media/videos/ (stored 0%)
updating: media/videos/CzD93SEIi-E.mp4 (deflated 0%)
  adding: media/audio/ (stored 0%)
  adding: media/audio/CzD93SEIi-E.mp3 (deflated 1%)

And save the updated metadata file. (Change the filename here when importing stories!)

In [14]:
df.to_csv(four_cat_file_path)
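
If you prefer to keep the original export untouched, you could write the enriched metadata to a separate file instead. This is only a sketch; it assumes four_cat_file_path ends in .csv, and the -with-audio suffix is a hypothetical name:

# Hypothetical alternative: save to a new CSV instead of overwriting the original export
df.to_csv(four_cat_file_path.replace(".csv", "-with-audio.csv"))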

Transcriptions using Whisper

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

The abstract from the paper is the following:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

– https://huggingface.co/docs/transformers/model_doc/whisper

In [15]:
!pip install -q transformers

The next code snippet initializes the Whisper model. The transcribe_audio function is applied to each row of the DataFrame where sampling_rate > 0, i.e. only to those rows that reference an audio file. Each audio file is transcribed using Whisper, and the result, a single text string, is saved to the transcript column.

Adjust the language variable according to your needs! The model is also capable of automated translation, e.g. setting language to english when processing German content results in an English translation of the speech. (Additionally, the task variable accepts translate).
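
A minimal sketch of the translation variant, which requests the forced decoder IDs with task="translate"; the file path media/audio/sample.mp3 is just a placeholder:

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# task="translate" makes the decoder produce English text for non-English speech
forced_decoder_ids = processor.get_decoder_prompt_ids(language="german", task="translate")

# Load a short clip (placeholder path) and resample directly to 16 kHz
waveform, _ = librosa.load("media/audio/sample.mp3", sr=16000, mono=True)
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])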

In [19]:
import torch
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Initialize the Whisper pipeline for automatic speech recognition
# (the transcribe_audio function below works with the processor and model directly)
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    chunk_length_s=30,
    device=device,
)

# Load model and processor for multilingual support
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german'):
    try:
        # Load and resample the audio file (audio_path = "media/audio/" was defined above)
        file_path = os.path.join(audio_path, filename)
        waveform, original_sample_rate = librosa.load(file_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)

        # Get forced decoder IDs for the specified language
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")

        # Process the audio file in chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):  # 30 seconds chunks
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription

        return transcription.strip()
    except Exception as e:
        print(f"Error processing file {filename}: {e}")
        return ""


# Filter the DataFrame (sampling_rate < 0 marks items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
In [20]:
df[df['video_file'] != ""].head()
id thread_id parent_id body author author_fullname author_avatar_url timestamp type url ... num_media location_name location_latlong location_city unix_timestamp video_file audio_file duration sampling_rate transcript
4 CzD93SEIi-E CzD93SEIi-E CzD93SEIi-E Mitzuarbeiten für unser Land, Bayern zu entwic... markus.soeder Markus Söder https://scontent-fra3-1.cdninstagram.com/v/t51... 2023-10-31 12:06:23 video https://www.instagram.com/p/CzD93SEIi-E ... 1 NaN NaN NaN 1698753983 CzD93SEIi-E.mp4 CzD93SEIi-E.mp3 67.89 44100.0 Ich bitte auf den abgelagerten Vortrag der Maa...

1 rows × 25 columns

In [21]:
df.loc[4, 'transcript']
'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein.  Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank.  Applaus'

Overall, the transcriptions work well. The first sentence above, however, shows that we can still expect misinterpretations.
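
One way to spot such glitches is to skim the opening words of each transcript; a minimal sketch:

# Print the first 80 characters of every transcript for a quick manual review
for idx, text in df.loc[df['sampling_rate'] > 0, 'transcript'].items():
    print(idx, text[:80], "...")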