{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ":::{.content-hidden}\n", "# Visual Exploration\n", ":::\n" ] }, { "cell_type": "markdown", "metadata": { "id": "OLmiGF9cD7cU" }, "source": [ "For this notebook we use a 4CAT corpus collected from TikTok about the [2024 Farmers' Protest in Germany](https://de.wikipedia.org/wiki/Bauernproteste_in_Deutschland_ab_Dezember_2023). Let's take a look at all relevant columns. We're mostly dealing with the `image_file` column. Additionally, the image files should be extracted to the `/content/media/images/` path. (See the linked notebook for the conversion from the original 4CAT files.)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "0Otg3eZT89js", "outputId": "a5a048b2-82db-42c8-a697-12ae3780e51f" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "                    id                                               body  \\\n", "0  7321692663852404001  #Fakten #mutzurwahrheit #ulrichsiegmund #AfD #...   \n", "1  7320593840212151584  Unstoppable πŸ‡©πŸ‡ͺ #deutschland #8januar2024 #baue...   \n", "2  7321341957333060896  08.01.2024 Streik - Hoss & Hopf #hossundhopf #...   \n", "3  7321355364950117665  #streik #2024 #bauernstreik2024 #deutschland #...   \n", "4  7321656341590789409  #🌞❀️ #sunshineheart #sunshineheartforever #πŸ‡©πŸ‡ͺ ...   \n", "\n", "                                          Transcript  \\\n", "0  Liebe Freunde, schaut euch das an, das ist der...   \n", "1                                  the next, video!!   \n", "2  scheiß Bauern, die, was weiß ich, ich habe auc...   \n", "3                                          😎😎😎😎😎😎😎😎😎   \n", "4                                                NaN   \n", "\n", "                                      image_file  \n", "0  /content/media/images/7321692663852404001.jpg  \n", "1  /content/media/images/7320593840212151584.jpg  \n", "2  /content/media/images/7321341957333060896.jpg  \n", "3  /content/media/images/7321355364950117665.jpg  \n", "4  /content/media/images/7321656341590789409.jpg  " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['id', 'body', 'Transcript', 'image_file']].head()" ] }, { "cell_type": "markdown", "metadata": { "id": "Kb69tFAe09Eo" }, "source": [ "## BERTopic\n", "Let's first install `bertopic`, including the vision extras.\n", "\n", "::: {.callout-note}\n", "The following code has been taken from the [BERTopic documentation](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) and was only slightly changed.\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ScDNrgNc1JEs" }, "outputs": [], "source": [ "!pip install \"bertopic[vision]\"" ] }, { "cell_type": "markdown", "metadata": { "id": "v6g5iTWY48kX" }, "source": [ "### Images Only\n", "\n", "Next, we prepare the pipeline for an image-only model: we want to fit the topic model on the image content only. 
We follow [the BERTopic Multimodal documentation](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html#images-only) and generate image captions with the `nlpconnect/vit-gpt2-image-captioning` model. The documentation offers several other options: we can also incorporate textual content into the topic modeling, or fit the model on textual information only and then display the best-matching images for each cluster.\n", "\n", "In our example we focus on image-only topic models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1k7yDHDv493M", "outputId": "cf1460e5-f669-4ffe-bf45-82447c60fb49" }, "outputs": [], "source": [ "from bertopic.representation import VisualRepresentation\n", "from bertopic.backend import MultiModalBackend\n", "\n", "# Image embedding model\n", "embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)\n", "\n", "# Image-to-text representation model\n", "representation_model = {\n", "    \"Visual_Aspect\": VisualRepresentation(image_to_text_model=\"nlpconnect/vit-gpt2-image-captioning\")\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, select the column with the paths of your image files, in my example `image_file`, and convert it to a Python `list`." ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "id": "OgatBjBZ6V5I" }, "outputs": [], "source": [ "image_only_df = df.copy()\n", "images = image_only_df['image_file'].to_list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's time to fit the model."
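, "\n", "\n", "Before fitting, it can be worth checking that every entry in `images` points to an existing file, since the embedding step will fail on missing paths. A minimal sketch (the two example paths are taken from the dataframe above; in the notebook, reuse the `images` list instead of redefining it):\n", "\n", "```python\n", "import os\n", "\n", "# Stand-in for the `images` list built above.\n", "images = [\n", "    '/content/media/images/7321692663852404001.jpg',\n", "    '/content/media/images/7320593840212151584.jpg',\n", "]\n", "\n", "# Keep only paths that point to an actual file on disk.\n", "existing = [path for path in images if os.path.isfile(path)]\n", "missing = [path for path in images if not os.path.isfile(path)]\n", "print(f'{len(existing)} found, {len(missing)} missing')\n", "```"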
] }, { "cell_type": "code", "execution_count": 75, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HQj3G9Rs6TEK", "outputId": "987c265b-f792-47bd-9647-4309e316fe75" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [02:33<00:00, 21.88s/it]\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:02<00:00,  2.99it/s]\n" ] } ], "source": [ "from bertopic import BERTopic\n", "\n", "# Train our model on images only\n", "topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, min_topic_size=5)\n", "topics, probs = topic_model.fit_transform(documents=None, images=images)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's display the topics. **Remember:** Topic `-1` is a collection of documents that do not fit into any topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "ncAQVkkZ7Oix", "outputId": "2055f300-6d85-40b5-d675-791fe0609990" }, "outputs": [], "source": [ "# See linked notebook for code." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }