{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ":::{.content-hidden}\n", "# Visual Exploration\n", ":::\n" ] }, { "cell_type": "markdown", "metadata": { "id": "OLmiGF9cD7cU" }, "source": [ "For this notebook we use a 4CAT corpus collected from TikTok about the [2024 Farmers' Protest in Germany](https://de.wikipedia.org/wiki/Bauernproteste_in_Deutschland_ab_Dezember_2023). Let's take a look at all relevant columns. We're mostly dealing with the `image_file` column. Additionally, the image files should be extracted to the `/content/media/images/` path. (See the linked notebook for the conversion from the original 4CAT files.)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "0Otg3eZT89js", "outputId": "a5a048b2-82db-42c8-a697-12ae3780e51f" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "                    id                                               body  \\\n", "0  7321692663852404001  #Fakten #mutzurwahrheit #ulrichsiegmund #AfD #...   \n", "1  7320593840212151584  Unstoppable πŸ‡©πŸ‡ͺ #deutschland #8januar2024 #baue...   \n", "2  7321341957333060896  08.01.2024 Streik - Hoss & Hopf #hossundhopf #...   \n", "3  7321355364950117665  #streik #2024 #bauernstreik2024 #deutschland #...   \n", "4  7321656341590789409  #🌞❀️ #sunshineheart #sunshineheartforever #πŸ‡©πŸ‡ͺ ...   \n", "\n", "                                          Transcript  \\\n", "0  Liebe Freunde, schaut euch das an, das ist der...   \n", "1                                  the next, video!!   \n", "2  scheiß Bauern, die, was weiß ich, ich habe auc...   \n", "3                                          😎😎😎😎😎😎😎😎😎   \n", "4                                                NaN   \n", "\n", "                                      image_file  \n", "0  /content/media/images/7321692663852404001.jpg  \n", "1  /content/media/images/7320593840212151584.jpg  \n", "2  /content/media/images/7321341957333060896.jpg  \n", "3  /content/media/images/7321355364950117665.jpg  \n", "4  /content/media/images/7321656341590789409.jpg  " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['id', 'body', 'Transcript', 'image_file']].head()" ] }, { "cell_type": "markdown", "metadata": { "id": "Kb69tFAe09Eo" }, "source": [ "## BERTopic\n", "Let's first install `bertopic`, including the vision extras.\n", "\n", "::: {.callout-note}\n", "The following code has been taken from the [BERTopic documentation](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) and was only slightly changed.\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ScDNrgNc1JEs" }, "outputs": [], "source": [ "!pip install \"bertopic[vision]\"" ] }, { "cell_type": "markdown", "metadata": { "id": "v6g5iTWY48kX" }, "source": [ "### Images Only\n", "\n", "Next, we prepare the pipeline for an image-only model: we want to fit the topic model on the image content only. 
We follow [the BERTopic Multimodal documentation](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html#images-only) and generate image captions with the `nlpconnect/vit-gpt2-image-captioning` model. The documentation offers several other options: we can also incorporate textual content into the topic modeling, or fit the model on textual information only and then display the best-matching images for each cluster.\n", "\n", "In our example we focus on image-only topic models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1k7yDHDv493M", "outputId": "cf1460e5-f669-4ffe-bf45-82447c60fb49" }, "outputs": [], "source": [ "from bertopic.representation import VisualRepresentation\n", "from bertopic.backend import MultiModalBackend\n", "\n", "# Image embedding model\n", "embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)\n", "\n", "# Image-to-text representation model\n", "representation_model = {\n", "    \"Visual_Aspect\": VisualRepresentation(image_to_text_model=\"nlpconnect/vit-gpt2-image-captioning\")\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, select the column with the paths of your image files, in my example `image_file`, and convert it to a Python `list`." ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "id": "OgatBjBZ6V5I" }, "outputs": [], "source": [ "image_only_df = df.copy()\n", "images = image_only_df['image_file'].to_list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's time to fit the model."
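, "\n", "\n", "Before fitting, it can be worth checking that every entry in `images` points to an existing file, since the embedding step will fail on missing paths. A minimal sketch (the two example paths are taken from the dataframe above; in the notebook, reuse the `images` list instead of redefining it):\n", "\n", "```python\n", "import os\n", "\n", "# Stand-in for the `images` list built above.\n", "images = [\n", "    '/content/media/images/7321692663852404001.jpg',\n", "    '/content/media/images/7320593840212151584.jpg',\n", "]\n", "\n", "# Keep only paths that point to an actual file on disk.\n", "existing = [path for path in images if os.path.isfile(path)]\n", "missing = [path for path in images if not os.path.isfile(path)]\n", "print(f'{len(existing)} found, {len(missing)} missing')\n", "```"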
] }, { "cell_type": "code", "execution_count": 75, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HQj3G9Rs6TEK", "outputId": "987c265b-f792-47bd-9647-4309e316fe75" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [02:33<00:00, 21.88s/it]\n", "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:02<00:00,  2.99it/s]\n" ] } ], "source": [ "from bertopic import BERTopic\n", "\n", "# Train our model on images only\n", "topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, min_topic_size=5)\n", "topics, probs = topic_model.fit_transform(documents=None, images=images)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's display the topics. **Remember:** Topic `-1` is a collection of documents that do not fit into any topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "ncAQVkkZ7Oix", "outputId": "2055f300-6d85-40b5-d675-791fe0609990" }, "outputs": [], "source": [ "# See linked notebook for code." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }