Image Classification


Michael Achmann-Denkler


January 22, 2024

As outlined in the Images as Data chapter, there are multiple approaches to image classification. In the current version of this chapter I concentrate on zero-shot classification using CLIP, a neural network designed by OpenAI, and multimodal GPT-4. The first notebook, based on CLIP, experiments with image-type classification for the analysis of political communication. Image types have been used to analyse Alexander Van der Bellen’s storytelling in his Instagram campaign (Liebhart and Bernhardt 2017) and the 2017 German election on Instagram (Haim and Jungblut 2021). In my experiments I found some shortcomings of image types in combination with computational analyses: they require the annotator to have a deeper understanding of the image, one that goes beyond the presence or absence of single items. Additionally, the image types from the literature overlap, which is compatible neither with the theory behind image type analysis (Grittmann and Ammann 2011) nor with visual content analysis (Rose 2016), both of which suggest that categories be mutually exclusive. So, in short, expect the classification results in this case to be mixed!

For the second experiment, the classification using multimodal GPT-4, I showcase some items from visual frame analysis (Grabe and Bucy 2009), which has been applied in several social media and Instagram studies. Our experiments are based on the adaptation by Gordillo-Rodríguez and Bellido-Pérez (2023), who created a comprehensive overview of items and their theoretical grounding. With GPT-4 we classify multiple items at the same time. The notebook keeps track of the classification cost, though not in as sophisticated a way as the notebook for textual classification (once more: work in progress!).

A third approach to the classification of visual material is the ensemble approach, where we combine, for example, multiple model outputs into a final classification. I have experimented with combining automatically generated image captions, the output of object detection, and the textual content of images. Combining these variables per image into one final classification using GPT, I have obtained promising results in first (informal) experiments. A proper validation and comparison with other classification approaches is still needed.

CLIP for Image Type Classification

Image classification using CLIP works by comparing a string of text with an image. If we compare multiple text strings with the same image, we can determine the phrase with the highest similarity score and infer the classification from it. To make the classification work for my scenario, I created a dictionary in which each image type is mapped to multiple sentences describing what an image in this class would look like.
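The selection step can be illustrated with a minimal sketch that does not call CLIP at all. It assumes that one similarity score per phrase is already available for a single image; the shortened dictionary `toy_dict`, the scores in `toy_scores`, and the helper names are made up for illustration and are not part of the notebook.

```python
import math

# Hypothetical, shortened phrase dictionary (illustration only)
toy_dict = {
    "Campaign Material": ["A promotional poster for a political campaign."],
    "Political Events": [
        "A large assembly of supporters at a political rally.",
        "Speakers on stage addressing a political audience.",
    ],
}

def softmax(scores):
    """Turn raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_category(scores, classification_dict):
    """Return (category, phrase, probability) of the best-matching phrase."""
    # Flatten the dictionary in the same order the scores were computed
    phrases = [(cat, p) for cat, ps in classification_dict.items() for p in ps]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    category, phrase = phrases[best]
    return category, phrase, probs[best]

# Made-up similarity scores, one per phrase in dictionary order
toy_scores = [18.2, 24.9, 21.3]
category, phrase, prob = pick_category(toy_scores, toy_dict)
print(category, "|", phrase, "|", round(prob, 3))
```

Because the softmax runs over all phrases rather than over categories, a category described by many similar phrases splits its probability mass among them, which is worth keeping in mind when reading the per-phrase probabilities.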

An overview of the CLIP-Classification process, starting with the creation of phrases describing the target content.

My implementation is inspired by this Medium story.

classification_dict = {
    "Collages": [
        "A screenshot with multiple visual elements such as text, graphics, and images combined.",
    ],
    "Campaign Material": [
        "An image primarily showcasing election-related flyers, brochures, or handouts.",
        "A distinct promotional poster for a political event or campaign.",
        "Visible printed material urging people to vote or join a political cause."
    ],
    "Political Events": [
        "An image distinctly capturing the essence of a political campaign event.",
        "A location set for a political event, possibly without a crowd.",
        "A large assembly of supporters or participants at an open-air political rally.",
        "Clear visuals of a venue set for a significant political gathering or convention.",
        "Focused visuals of attendees or participants of a political rally or event.",
        "Inside ambiance of a political convention or major political conference.",
        "Prominent figures or speakers on stage addressing a political audience.",
        # The next two phrases describe generic, non-political content
        "A serene image primarily focused on landscapes, travel.",
        "Food, beverages, or generic shots."
    ],
    "Individual Contact": [
        "A politician genuinely engaging or interacting with individuals or small groups.",
        "A close-up or selfie, primarily showcasing an individual, possibly with political affiliations.",
        "An informal or candid shot with emphasis on individual engagement, perhaps in a political setting."
    ],
    "Interviews & Media": [
        "An indoor setting, well-lit, designed for professional media interviews or broadcasts.",
        "Clear visuals of an interviewee in a controlled studio environment.",
        "Microphone or recording equipment predominantly in front of a speaker indoors.",
        "Behind-the-scenes ambiance of a media setup or broadcast preparation.",
        "Visuals from a TV or media broadcast, including distinct channel or media branding.",
        "Significant media branding or logos evident, possibly during an interview or panel discussion.",
        "Structured indoor setting of a press conference or media event with multiple participants."
    ],
    "Social Media Moderation": [
        "Face-centric visual with the individual addressing or connecting with the camera.",
        "Emphasis on facial features, minimal background distractions, typical of online profiles.",
        "Portrait-style close-up of a face, without discernible logos, graphics, or overlays."
    ],
}

Using this dictionary, we can now compare the images to the strings using CLIP in a loop.

from tqdm import tqdm
import numpy as np
import pandas as pd
import torch
import clip  # OpenAI CLIP package, needed for clip.tokenize
from PIL import Image

# Assuming preprocess, the CLIP model (model), and device are already initialized

def classify_images_with_clip(image_files, classification_dict, column_name, BATCH_SIZE=500):
    labels_map, flat_labels = flatten_classification_dict(classification_dict)
    text = clip.tokenize(flat_labels).to(device)

    results = []
    for batch_start in tqdm(range(0, len(image_files), BATCH_SIZE)):
        batch_end = batch_start + BATCH_SIZE
        batch_files = image_files[batch_start:batch_end]
        # Keep only the files that could be loaded, so files and tensors stay aligned
        images, loaded_files = preprocess_images(batch_files)
        if not images:
            continue
        image_input = torch.stack(images).to(device)

        logits_per_image = model_inference(image_input, text)
        update_results(logits_per_image, loaded_files, flat_labels, labels_map, results, column_name)

    return pd.DataFrame(results)

def flatten_classification_dict(classification_dict):
    # Build a flat list of phrases and a lookup from phrase to category
    labels_map = {}
    flat_labels = []
    for category, phrases in classification_dict.items():
        for phrase in phrases:
            flat_labels.append(phrase)
            labels_map[phrase] = category
    return labels_map, flat_labels

def preprocess_images(image_files):
    # Returns the preprocessed image tensors and the paths they belong to
    images, loaded_files = [], []
    for img_file in image_files:
        try:
            image = preprocess(Image.open(img_file))
            images.append(image)
            loaded_files.append(img_file)
        except IOError:
            print(f"Error loading image: {img_file}")
    return images, loaded_files

def model_inference(image_input, text):
    # Softmax over all phrases, scaled to percentages
    with torch.no_grad():
        logits_per_image, _ = model(image_input, text)
        return logits_per_image.softmax(dim=-1).cpu().numpy() * 100

# Variant that stores the two best-matching phrases per image.
# It is shadowed by the top-1 definition below, which is the one the loop uses.
def update_results(logits_per_image, batch_files, flat_labels, labels_map, results, column_name):
    max_indices = np.argsort(logits_per_image, axis=1)[:, -2:]
    for idx, (file, top_indices) in enumerate(zip(batch_files, max_indices)):
        result = {"Image": file}
        for rank, label_idx in enumerate(top_indices[::-1], 1):
            label = flat_labels[label_idx]
            category = labels_map[label]
            prob = logits_per_image[idx, label_idx].round(2)
            result.update({
                f"{column_name}_{rank}": category,
                f"{column_name}_label_{rank}": label,
                f"{column_name}_prob_{rank}": prob
            })
        results.append(result)

# Top-1 variant: stores only the best-matching phrase and its category
def update_results(logits_per_image, batch_files, flat_labels, labels_map, results, column_name):
    max_indices = np.argmax(logits_per_image, axis=1)
    for idx, (file, top_index) in enumerate(zip(batch_files, max_indices)):
        label = flat_labels[top_index]
        category = labels_map[label]
        prob = logits_per_image[idx, top_index].round(2)

        result = {
            "Image": file,
            f"{column_name}": category,
            f"{column_name} Label": label,
            f"{column_name} Probability": prob
        }
        results.append(result)

import os

image_files = df['Image'].unique()

# Perform the classification and get the results as a DataFrame
classified_df = classify_images_with_clip(image_files, classification_dict, 'Classification')

The classified_df contains the classification results. This implementation saves only the highest-probability phrase per image, together with its category and probability.

|   | Image | Classification | Classification Label | Classification Probability |
| 0 | /content/media/images/afd.bund/212537388606051... | Political Events | Focused visuals of attendees or participants o... | 26.78125 |
| 1 | /content/media/images/afd.bund/212537470102207... | Interviews & Media | Visuals from a TV or media broadcast, includin... | 71.31250 |
| 2 | /content/media/images/afd.bund/249085122621717... | Social Media Moderation | Emphasis on facial features, minimal backgroun... | 29.21875 |
| 3 | /content/media/images/afd.bund/260084001188499... | Interviews & Media | Clear visuals of an interviewee in a controlle... | 79.62500 |
| 4 | /content/media/images/afd.bund/260085279483160... | Interviews & Media | Clear visuals of an interviewee in a controlle... | 48.00000 |

Qualitative Evaluation

Running the last cells in the notebook creates a visual overview of the classification results: n images are sampled and displayed per group. The qualitative evaluation does not replace proper external validation! Additionally, the overview is saved to {current_date}-CLIP-Classification.html. Download the file and open it in your browser for a better layout. The second validation cell creates a simple interface displaying one image and its classification result. Click the “Right” or “Wrong” button to browse through multiple images and get a rough feeling for the classification quality.
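If the “Right”/“Wrong” judgments are written down, they can be tallied per category to see where the classifier struggles. The following is only a sketch: how the notebook stores the button clicks is not shown here, so the list of `(category, is_correct)` pairs and the helper `accuracy_by_category` are assumptions made for illustration, and the judgments are invented.

```python
from collections import defaultdict

def accuracy_by_category(judgments):
    """judgments: iterable of (category, is_correct) pairs from manual review."""
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in judgments:
        counts[category][1] += 1
        if is_correct:
            counts[category][0] += 1
    return {cat: correct / total for cat, (correct, total) in counts.items()}

# Made-up review outcomes: True = "Right", False = "Wrong"
toy_judgments = [
    ("Interviews & Media", True),
    ("Interviews & Media", True),
    ("Political Events", False),
    ("Political Events", True),
]
scores = accuracy_by_category(toy_judgments)
for cat, share in scores.items():
    print(f"{cat}: {share:.0%} judged correct")
```

Such a tally remains a rough impression based on a small sample; it is no substitute for a validation against independently annotated data.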

Source: CLIP Classification

Multimodal GPT-4 for Visual Frame Classification

In this example we use GPT-4 to classify multiple variables at the same time. We’re using items based on visual frames (Grabe and Bucy 2009), as adapted by Gordillo-Rodríguez and Bellido-Pérez (2023). For an actual study based on visual frames we need to add more items!

I found the following sentence to be essential for the classification to work: We’re not interested in the identity of any person in the image, please anonymize any personal information and concentrate on the objective image analysis framework outlined below.

prompt = """
You are an AI assistant with years of training in image analysis for political communication. Use the following annotation manual to code the provided image. We're **not** interested in the identity of any person in the image, please anonymize any personal information and concentrate on the objective image analysis framework outlined below.
**Objective**: Perform a content analysis of a given Instagram posts by political candidates during the 2021 German election campaign, identifying the presentation of the self and visual framing as per Goffman (1956) and Grabe and Bucy (2009).
### Annotation Guide
#### 1. **Performance** (Categorical)
   - CampaignOrPartyEvent, PrivateEvent, SolidarityEvent, ProtestEvent, MediaEvent, CampaignMaterial, Other (String)
#### 2. **Environment **(Categorical)
   - **Environment**: Indoors / Outdoors / NotApplicable / Other
   - **Location**: EventHall / StreetOrPlaza / WorkPlace / Parliament / TVStudio / PrivateTransport / PublicTransport / Industry / Commerce / Nature / Home / NotApplicable / Other (String)
#### 3. **Dress Style **(Categorical)
   - **DressStyle**: Formal (Identify formal and professional attire, like suits and ties or dresses) / Casual (Look for informal attire, like sportswear, T-shirts, comfortable clothes.)
   - **RolledUpSleeves**: Tag shirts/blouses with rolled-up sleeves (Boolean).
**Analytical Process**
  - For each Instagram post, identify and record occurrences of the items listed in the provided table.
  - Categorize findings under the appropriate theory and visual frame.
**Reporting**: Summarize the findings in a structured JSON format based on the variable names in the Annotation guide. Respond only in JSON, respect the data types indicated in the manual.
"""

The following methods help with converting the LLM results back into a pandas dataframe.

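import pandas as pd
import json
import re
from pandas import json_normalize

def flatten_json(y):
    # Flatten nested dicts and lists into one level; keys are joined with '_'
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

def parse_response(response, identifier):
    try:
        # First attempt: the response is plain JSON
        if isinstance(response, str):
            response = json.loads(response)
        response = flatten_json(response)
        # Same key as in the fallback below, so that all rows can be merged on image_path
        response['image_path'] = identifier
        return response
    except json.JSONDecodeError:
        # Fallback: the JSON is wrapped in a Markdown code fence
        match = re.search(r'```json\n([\s\S]+)\n```', response)
        if match:
            try:
                json_data = json.loads(match.group(1))
                json_data = flatten_json(json_data)
                json_data['image_path'] = identifier
                return json_data
            except json.JSONDecodeError:
                pass
        # Nothing could be parsed: keep the raw response for inspection
        return {'image_path': identifier, 'error': response}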
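To see the effect of these helpers without calling the API, here is a compact, self-contained sketch of the same idea. The reply text is invented, and `extract_json` and `flatten` are simplified stand-ins written for this illustration rather than the notebook's functions. The fence marker is assembled from single backticks only to keep the example easy to read.

```python
import json
import re

FENCE = "`" * 3  # three backticks, i.e. a Markdown code fence marker

def extract_json(text):
    """Parse text as JSON; fall back to the content of a fenced json block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(FENCE + r"json\n([\s\S]+?)\n" + FENCE, text)
        if match is None:
            return None
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with '_'."""
    flat = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + key + "_"))
        else:
            flat[prefix + key] = value
    return flat

# Made-up model reply that wraps the JSON in a Markdown code fence
reply = (
    "Here is the coding:\n" + FENCE + "json\n"
    '{"Performance": "MediaEvent", '
    '"Environment": {"Environment": "Indoors", "Location": "TVStudio"}}\n'
    + FENCE
)
row = flatten(extract_json(reply))
print(row)
```

The nested "Environment" object becomes the columns Environment_Environment and Environment_Location, which is where the doubled column names in the results come from.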

The following cell contains the actual classification loop. As with text classification, we send one image at a time with the same prompt all over again. For our in-class tutorial I added two filters: 1. We sample the data and just classify a part of the dataframe. 2. I filter for one particular account.

Remove these filters for real-world applications!

import base64
from tqdm.notebook import tqdm
import openai
from google.colab import userdata
import pandas as pd
import backoff

# Retrieving OpenAI API Key
api_key = userdata.get('openai-lehrstuhl-api')

# Initialize OpenAI client
client = openai.OpenAI(api_key=api_key)

# Assumption: the model name and the token limit are not shown in this excerpt.
# Set both to the multimodal GPT-4 model and the limit you actually use.
MODEL = "gpt-4-vision-preview"
MAX_TOKENS = 600

# Cost per token
prompt_cost = 0.01 / 1000  # Cost per prompt token
completion_cost = 0.03 / 1000  # Cost per completion token

# Initialize total cost
total_cost = 0.0

def encode_image(image_path):
    """
    Encodes an image to base64.

    :param image_path: Path to the image file.
    :return: Base64 encoded string of the image.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Retry with exponential backoff on rate limits and API errors
@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(prompt, base64_image):
    """
    Sends a request to OpenAI with given prompt and image.

    :param prompt: Text prompt for the request.
    :param base64_image: Base64 encoded image.
    :return: Response from the API.
    """
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]
    return client.chat.completions.create(model=MODEL, messages=messages, max_tokens=MAX_TOKENS)

responses = []
data = []

# Tutorial filters: one account and a small sample
sample_df = df[df['Username'] == "afd.bund"]
sample_df = sample_df.sample(5)

for index, row in tqdm(sample_df.iterrows(), total=sample_df.shape[0]):
    try:
        image_path = row['image_path']
        base64_image = encode_image(image_path)
        response = run_request(prompt, base64_image)
        r = response.choices[0].message.content

        # Track the cost of this request based on the reported token usage
        current_prompt_cost = response.usage.prompt_tokens * prompt_cost
        current_completion_cost = response.usage.completion_tokens * completion_cost
        current_cost = current_prompt_cost + current_completion_cost
        total_cost += current_cost

        print(f"This round cost ${current_cost:.6f}")

        responses.append({'image_path': row['image_path'], 'classification': r})

        # Convert the raw answer into a flat dict and collect it for the dataframe
        processed_data = parse_response(r, row['image_path'])
        data.append(processed_data)

    except Exception as e:
        print(f"Error processing image {row['image_path']}: {e}")

print(f"Total cost ${total_cost:.6f}")
This round cost $0.017040
This round cost $0.016950
This round cost $0.017010
This round cost $0.017010
This round cost $0.017040
Total cost $0.085050
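
The cost arithmetic is simple enough to reuse for a rough budget estimate before classifying a full dataset. The sketch below uses the per-token prices from the notebook and the five per-image costs printed above; the helper names and the dataset size of 1,000 images are assumptions for illustration, and actual prices depend on the model and may change.

```python
PROMPT_PRICE = 0.01 / 1000      # USD per prompt token, as in the notebook
COMPLETION_PRICE = 0.03 / 1000  # USD per completion token, as in the notebook

def request_cost(prompt_tokens, completion_tokens):
    """Cost of one request given the token usage reported by the API."""
    return prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE

def projected_cost(observed_costs, n_images):
    """Extrapolate from the mean observed cost per image."""
    return sum(observed_costs) / len(observed_costs) * n_images

# Per-image costs from the sample run
observed = [0.017040, 0.016950, 0.017010, 0.017010, 0.017040]

print(f"Mean cost per image: ${sum(observed) / len(observed):.5f}")
print(f"Projected cost for 1,000 images: ${projected_cost(observed, 1000):.2f}")
```

At roughly 1.7 cents per image, a thousand images would cost about 17 US dollars under these prices, which is a useful order of magnitude to know before removing the sampling filter.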

Once we convert the Python list data to the pandas dataframe data_df, we can display the classification results neatly.

data_df = pd.DataFrame(data)
|   | Performance | Environment_Environment | Environment_Location | DressStyle_DressStyle | DressStyle_RolledUpSleeves | image_path |
| 0 | CampaignOrPartyEvent | Outdoors | StreetOrPlaza | Formal | False | /content/media/images/afd.bund/263854933600008... |
| 1 | MediaEvent | Indoors | TVStudio | Formal | False | /content/media/images/afd.bund/267144444973600... |
| 2 | ProtestEvent | Outdoors | StreetOrPlaza | Casual | False | /content/media/images/afd.bund/266546813941933... |
| 3 | CampaignOrPartyEvent | Indoors | EventHall | Formal | False | /content/media/images/afd.bund/264395151191513... |
| 4 | CampaignOrPartyEvent | Outdoors | StreetOrPlaza | Formal | False | /content/media/images/afd.bund/264597862229445... |
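
Since the model is free to answer with values outside the annotation guide, a quick consistency check on the parsed rows can be helpful. The following sketch is an addition, not part of the notebook: `CODEBOOK` copies the allowed values from the prompt, `find_violations` is a hypothetical helper, and the two rows are invented. It treats "Other" as a literal value, which is an assumption about how the guide is meant to be read.

```python
# Allowed values, copied from the annotation guide in the prompt
CODEBOOK = {
    "Performance": {"CampaignOrPartyEvent", "PrivateEvent", "SolidarityEvent",
                    "ProtestEvent", "MediaEvent", "CampaignMaterial", "Other"},
    "Environment_Environment": {"Indoors", "Outdoors", "NotApplicable", "Other"},
    "Environment_Location": {"EventHall", "StreetOrPlaza", "WorkPlace", "Parliament",
                             "TVStudio", "PrivateTransport", "PublicTransport",
                             "Industry", "Commerce", "Nature", "Home",
                             "NotApplicable", "Other"},
    "DressStyle_DressStyle": {"Formal", "Casual"},
}

def find_violations(row):
    """Return the columns of one result row whose value is not in the codebook."""
    problems = []
    for column, allowed in CODEBOOK.items():
        if row.get(column) not in allowed:
            problems.append(column)
    # RolledUpSleeves is declared as a Boolean in the guide
    if not isinstance(row.get("DressStyle_RolledUpSleeves"), bool):
        problems.append("DressStyle_RolledUpSleeves")
    return problems

# Made-up rows: one valid, one with two problems
ok_row = {"Performance": "MediaEvent", "Environment_Environment": "Indoors",
          "Environment_Location": "TVStudio", "DressStyle_DressStyle": "Formal",
          "DressStyle_RolledUpSleeves": False}
bad_row = {"Performance": "Rally", "Environment_Environment": "Indoors",
           "Environment_Location": "TVStudio", "DressStyle_DressStyle": "Formal",
           "DressStyle_RolledUpSleeves": "no"}

print(find_violations(ok_row))
print(find_violations(bad_row))
```

With pandas, the rows could be obtained via data_df.to_dict("records"); rows with violations are good candidates for manual inspection.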

Using the next cell, we can qualitatively check the classification results.

import pandas as pd
from IPython.display import display, Image

# Assuming your DataFrame is already loaded and named data_df
# data_df = pd.read_csv('your_data_file.csv') # Uncomment if you need to load the DataFrame

def display_random_image_and_classification(df):
    # Select a random row from the DataFrame
    random_row = df.sample(1).iloc[0]

    # Get the image path and classification from the row
    image_path = random_row['image_path'] # Replace 'image_path' with the actual column name

    # Display the image
    display(Image(filename=image_path))

    # Display the classification
    print(f"Performance: {random_row['Performance']}")
    print(f"Environment_Environment: {random_row['Environment_Environment']}")
    print(f"Environment_Location: {random_row['Environment_Location']}")
    print(f"DressStyle_DressStyle: {random_row['DressStyle_DressStyle']}")
    print(f"DressStyle_RolledUpSleeves: {random_row['DressStyle_RolledUpSleeves']}")

# Call the function to display an image and its classification
display_random_image_and_classification(data_df)

Finally, we merge the results with the overall dataframe.

total_df = pd.merge(df, data_df, how="left", on="image_path")
Unnamed: 0 ID Time of Posting Type of Content video_url image_url Username Video Length (s) Expiration Caption Is Verified Stickers Accessibility Caption Attribution URL image_path Performance Environment_Environment Environment_Location DressStyle_DressStyle DressStyle_RolledUpSleeves
44 44 2638549336000085388_1484534097 2021-08-12 09:13:30 Video NaN NaN afd.bund 5.000 2021-08-13 09:13:30 NaN True [] NaN /content/media/images/afd.bund/263854933600008... CampaignOrPartyEvent Outdoors StreetOrPlaza Formal False
70 70 2643951511915139810_1484534097 2021-08-19 20:06:41 Image NaN NaN afd.bund NaN 2021-08-20 20:06:41 NaN True [] Photo by Alternative für Deutschland on August... /content/media/images/afd.bund/264395151191513... CampaignOrPartyEvent Indoors EventHall Formal False
93 93 2645978622294450871_1484534097 2021-08-22 15:14:11 Video NaN NaN afd.bund 2.066 2021-08-23 15:14:11 NaN True [] NaN NaN /content/media/images/afd.bund/264597862229445... CampaignOrPartyEvent Outdoors StreetOrPlaza Formal False
162 162 2665468139419335791_1484534097 2021-09-18 12:36:23 Image NaN NaN afd.bund NaN 2021-09-19 12:36:23 NaN True [] Photo by Alternative für Deutschland on Septem... /content/media/images/afd.bund/266546813941933... ProtestEvent Outdoors StreetOrPlaza Casual False
165 165 2671444449736006853_1484534097 2021-09-26 18:30:15 Image NaN NaN afd.bund NaN 2021-09-27 18:30:15 NaN True [{'height': 0.044419695058272, 'rotation': 0, ... Photo by Alternative für Deutschland on Septem... NaN /content/media/images/afd.bund/267144444973600... MediaEvent Indoors TVStudio Formal False
Source: Multimodal GPT-4


I’m currently experimenting with ensemble classification approaches. Ensemble classification may well become obsolete with the introduction of multimodal LLMs, yet a comparison and evaluation of the approaches might still be interesting. Our approach to ensemble models combines multiple data types (captions, objects, and OCR) to enhance image classification with GPT models. For a detailed understanding and practical application of this technique, I created the separate ensemble chapter, which includes a notebook for the practical steps involved in its implementation using Google Vision APIs.
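
To give a rough idea of what combining the three signals can look like, here is a minimal sketch that merges a caption, a list of detected objects, and OCR text into one text prompt. It is not the implementation from the ensemble chapter: the function `build_ensemble_prompt`, the prompt wording, and the example values are all hypothetical.

```python
def build_ensemble_prompt(caption, objects, ocr_text):
    """Combine three per-image signals into one text prompt for an LLM."""
    object_list = ", ".join(objects) if objects else "none detected"
    return (
        "Classify the image described below.\n"
        f"Caption: {caption}\n"
        f"Detected objects: {object_list}\n"
        f"Text in image (OCR): {ocr_text or 'none'}\n"
    )

# Made-up inputs standing in for the outputs of the upstream models
example_prompt = build_ensemble_prompt(
    caption="A person speaking at a podium in front of a crowd",
    objects=["microphone", "podium", "flag"],
    ocr_text="",
)
print(example_prompt)
```

The resulting string would then be sent to a text-only GPT model, so the classification relies entirely on how well the three upstream signals describe the image.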


All image classification approaches introduced in this chapter are experimental. They show promising results for some applications, but we need to keep experimenting and evaluating, and to compare the approaches with one another to find the best and, above all, the most valid one! Your projects are going to be good test cases to see which classification approach works (best). In the next session we will come back to the Agreement & Evaluation chapter and take a look at the updated notebooks for image annotation using Label Studio.


Gordillo-Rodríguez, María-Teresa, and Elena Bellido-Pérez. 2023. “The visual frame of the political candidate on Instagram: the 2021 Catalan regional elections.” Dígitos. Revista de Comunicación Digital 0 (9).
Grabe, Maria Elizabeth, and Erik Page Bucy. 2009. Image Bite Politics: News and the Visual Framing of Elections. Oxford University Press, USA.
Grittmann, Elke, and Ilona Ammann. 2011. “Quantitative Bildtypenanalyse.” In Die Entschlüsselung der Bilder: Methoden zur Erforschung visueller Kommunikation : ein Handbuch, edited by Thomas Petersen and Clemens Schwender, 163–78. von Halem.
Haim, Mario, and Marc Jungblut. 2021. “Politicians’ Self-depiction and Their News Portrayal: Evidence from 28 Countries Using Visual Computational Analysis.” Political Communication 38 (1-2): 55–74.
Liebhart, Karin, and Petra Bernhardt. 2017. “Political storytelling on Instagram: Key aspects of Alexander Van der Bellen’s successful 2016 presidential election campaign.” Media and Communication 5 (4): 15–25.
Rose, Gillian. 2016. Visual Methodologies: An Introduction to Researching with Visual Materials. SAGE Publications.



BibTeX citation:
  author = {Achmann-Denkler, Michael},
  title = {Image {Classification}},
  date = {2024-01-22},
  url = {},
  doi = {10.5281/zenodo.10039756},
  langid = {en}
For attribution, please cite this work as:
Achmann-Denkler, Michael. 2024. “Image Classification.” January 22, 2024.