In [2]:
classification_dict = {
    "Collages": [
        "A screenshot with multiple visual elements such as text, graphics, and images combined.",
    ],
    "Campaign Material": [
        "An image primarily showcasing election-related flyers, brochures, or handouts.",
        "A distinct promotional poster for a political event or campaign.",
        "Visible printed material urging people to vote or join a political cause.",
    ],
    "Political Events": [
        "An image distinctly capturing the essence of a political campaign event.",
        "A location set for a political event, possibly without a crowd.",
        "A large assembly of supporters or participants at an open-air political rally.",
        "Clear visuals of a venue set for a significant political gathering or convention.",
        "Focused visuals of attendees or participants of a political rally or event.",
        "Inside ambiance of a political convention or major political conference.",
        "Prominent figures or speakers on stage addressing a political audience.",
        "A serene image primarily focused on landscapes, travel.",
        "Food, beverages, or generic shots.",
    ],
    "Individual Contact": [
        "A politician genuinely engaging or interacting with individuals or small groups.",
        "A close-up or selfie, primarily showcasing an individual, possibly with political affiliations.",
        "An informal or candid shot with emphasis on individual engagement, perhaps in a political setting.",
    ],
    "Interviews & Media": [
        "An indoor setting, well-lit, designed for professional media interviews or broadcasts.",
        "Clear visuals of an interviewee in a controlled studio environment.",
        "Microphone or recording equipment predominantly in front of a speaker indoors.",
        "Behind-the-scenes ambiance of a media setup or broadcast preparation.",
        "Visuals from a TV or media broadcast, including distinct channel or media branding.",
        "Significant media branding or logos evident, possibly during an interview or panel discussion.",
        "Structured indoor setting of a press conference or media event with multiple participants.",
    ],
    "Social Media Moderation": [
        "Face-centric visual with the individual addressing or connecting with the camera.",
        "Emphasis on facial features, minimal background distractions, typical of online profiles.",
        "Portrait-style close-up of a face, without discernible logos, graphics, or overlays.",
    ],
}
Image classification using CLIP works by comparing a string of text with an image. If we compare multiple text strings with the same image, we can determine the phrase with the highest similarity score and infer the classification. To make the classification work for my scenario, I created the dictionary above, where each image type is mapped to multiple sentences describing what an image in this class would look like.
My implementation is inspired by this Medium story.
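For reference, here is a minimal sketch of that comparison for a single image. It assumes OpenAI's clip package is installed and uses a placeholder image path; it also initializes the model, preprocess, and device objects that the classification code below relies on:

# Minimal zero-shot classification sketch with OpenAI's CLIP
# (pip install git+https://github.com/openai/CLIP.git); "example.jpg" is a placeholder
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a political rally", "a TV interview", "a portrait photo"]
text = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(prompts[probs.argmax()])  # the phrase with the highest similarity score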
Using this dictionary, we can now compare the images to the strings using CLIP in a loop.
In [4]:
from tqdm import tqdm
import numpy as np
import pandas as pd
import torch
from PIL import Image

# Assuming clip, model, preprocess, and device are already initialized
# (e.g., via clip.load as sketched above)

def classify_images_with_clip(image_files, classification_dict, column_name, BATCH_SIZE=500):
    labels_map, flat_labels = flatten_classification_dict(classification_dict)
    text = clip.tokenize(flat_labels).to(device)

    results = []
    for batch_start in tqdm(range(0, len(image_files), BATCH_SIZE)):
        batch_end = batch_start + BATCH_SIZE
        batch_files = image_files[batch_start:batch_end]
        images = preprocess_images(batch_files)
        if not images:
            continue
        image_input = torch.tensor(np.stack(images)).to(device)

        logits_per_image = model_inference(image_input, text)

        update_results(logits_per_image, batch_files, flat_labels, labels_map, results, column_name)
    return pd.DataFrame(results)

def flatten_classification_dict(classification_dict):
    # Flatten the nested dictionary into a list of phrases plus a
    # phrase-to-category lookup table
    labels_map = {}
    flat_labels = []
    for category, phrases in classification_dict.items():
        for phrase in phrases:
            flat_labels.append(phrase)
            labels_map[phrase] = category
    return labels_map, flat_labels

def preprocess_images(image_files):
    images = []
    for img_file in image_files:
        try:
            image = preprocess(Image.open(img_file))
            images.append(image)
        except IOError:
            print(f"Error loading image: {img_file}")
    return images

def model_inference(image_input, text):
    with torch.no_grad():
        logits_per_image, _ = model(image_input, text)
    # Softmax over all phrases, scaled to percentages
    return logits_per_image.softmax(dim=-1).cpu().numpy() * 100

def update_results(logits_per_image, batch_files, flat_labels, labels_map, results, column_name):
    # Variant keeping the two highest-scoring phrases per image
    max_indices = np.argsort(logits_per_image, axis=1)[:, -2:]
    for idx, (file, top_indices) in enumerate(zip(batch_files, max_indices)):
        result = {"Image": file}
        for rank, label_idx in enumerate(top_indices[::-1], 1):
            label = flat_labels[label_idx]
            category = labels_map[label]
            prob = logits_per_image[idx, label_idx].round(2)
            result.update({
                f"{column_name}_{rank}": category,
                f"{column_name}_label_{rank}": label,
                f"{column_name}_prob_{rank}": prob,
            })
        results.append(result)

# Note: this second definition shadows the one above, so only the single
# highest-scoring phrase per image is kept
def update_results(logits_per_image, batch_files, flat_labels, labels_map, results, column_name):
    max_indices = np.argmax(logits_per_image, axis=1)
    for idx, (file, top_index) in enumerate(zip(batch_files, max_indices)):
        label = flat_labels[top_index]
        category = labels_map[label]
        prob = logits_per_image[idx, top_index].round(2)  # Fixed probability extraction

        result = {
            "Image": file,
            f"{column_name}": category,
            f"{column_name} Label": label,
            f"{column_name} Probability": prob,
        }
        results.append(result)
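For intuition, flatten_classification_dict simply inverts the nesting. On a toy dictionary (illustrative, not from the notebook) it behaves like this:

toy = {"Collages": ["a collage"], "Political Events": ["a rally", "a convention"]}
labels_map, flat_labels = flatten_classification_dict(toy)
# flat_labels == ["a collage", "a rally", "a convention"]
# labels_map == {"a collage": "Collages", "a rally": "Political Events",
#                "a convention": "Political Events"}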
In [5]:
import os
# df is assumed to hold the collected posts, with image paths in the 'Image' column
image_files = df['Image'].unique()

# Perform the classification and get the results as a DataFrame
classified_df = classify_images_with_clip(image_files, classification_dict, 'Classification')
The classified_df DataFrame contains the classification results. This implementation saves only the highest-probability label, its category, and the probability for each image.
In [7]:
classified_df.head()
| | Image | Classification | Classification Label | Classification Probability |
|---|---|---|---|---|
| 0 | /content/media/images/afd.bund/212537388606051... | Political Events | Focused visuals of attendees or participants o... | 26.78125 |
| 1 | /content/media/images/afd.bund/212537470102207... | Interviews & Media | Visuals from a TV or media broadcast, includin... | 71.31250 |
| 2 | /content/media/images/afd.bund/249085122621717... | Social Media Moderation | Emphasis on facial features, minimal backgroun... | 29.21875 |
| 3 | /content/media/images/afd.bund/260084001188499... | Interviews & Media | Clear visuals of an interviewee in a controlle... | 79.62500 |
| 4 | /content/media/images/afd.bund/260085279483160... | Interviews & Media | Clear visuals of an interviewee in a controlle... | 48.00000 |
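Beyond head(), a quick value_counts() shows how the images are distributed across the predicted categories (a standard pandas call, not part of the original notebook):

# Number of images per predicted category
classified_df["Classification"].value_counts()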
Qualitative Evaluation

Running the last cells in the notebook creates a visual overview of the classification results: n images are sampled and displayed per group. The qualitative evaluation does not replace proper external validation! Additionally, the overview is saved to {current_date}-CLIP-Classification.html. Download the file and open it in your browser for a better layout. The second validation cell creates a simple interface displaying one image and its classification result. Click the "Right" or "Wrong" button to browse through multiple images and get a rough feeling for the classification quality.
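The notebook cells that build this overview are not shown here; the helper below is a minimal sketch of how such a page could be generated, with save_classification_overview and all of its details being illustrative rather than the author's exact code. It samples up to n images per predicted category and embeds them in a single HTML file:

# Illustrative sketch, not the original notebook code: writes a per-category
# sample of classified images into {current_date}-CLIP-Classification.html
import base64
from datetime import date

def save_classification_overview(classified_df, n=5):
    html = ["<html><body>"]
    for category, group in classified_df.groupby("Classification"):
        html.append(f"<h2>{category}</h2>")
        # Sample up to n images per predicted category
        for _, row in group.sample(min(n, len(group))).iterrows():
            with open(row["Image"], "rb") as img:
                encoded = base64.b64encode(img.read()).decode()
            html.append(
                f'<img src="data:image/jpeg;base64,{encoded}" width="200" '
                f'title="{row["Classification Label"]} '
                f'({row["Classification Probability"]}%)">'
            )
    html.append("</body></html>")
    out_file = f"{date.today().isoformat()}-CLIP-Classification.html"
    with open(out_file, "w") as out:
        out.write("\n".join(html))
    return out_file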