{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ":::{.content-hidden}\n", "# Human Annotations\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the `label-studio-sdk` package for programmatic control of Label Studio:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "J1sLoUSsXzS4" }, "outputs": [], "source": [ "!pip -q install label-studio-sdk" ] }, { "cell_type": "markdown", "metadata": { "id": "NBow_A02gg8j" }, "source": [ "Next, let's read the text master from the previous sessions:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "i9hkqOpUb8kQ" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('/content/drive/MyDrive/2023-12-01-Export-Posts-Text-Master.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "-TbauZ65PWhU" }, "source": [ "In my [video on GPT text classification](https://youtu.be/QcYGwC4QzW0) I mentioned the unique-identifier problem: the annotations need a unique identifier for every row as well. The code below builds one by combining `shortcode` and `Text Type`. Use it in our text classification notebook when working with multi-document classifications!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "4IGsSLHCPypm" }, "outputs": [], "source": [ "df['identifier'] = df.apply(lambda x: f\"{x['shortcode']}-{x['Text Type']}\", axis=1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Y3NRaT_4gxes", "outputId": "155af2da-17ca-46c4-947e-6565f40640de" }, "outputs": [ { "data": { "text/html": [ "\n", "
| | Unnamed: 0 | shortcode | Text | Text Type | Policy Issues | identifier |
|---|---|---|---|---|---|---|
| 0 | 0 | CyMAe_tufcR | #Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da... | Caption | ['1. Political parties:\\n- FREIEWÄHLER\\n- Aiwa... | CyMAe_tufcR-Caption |
| 1 | 1 | CyL975vouHU | Die Landtagswahl war für uns als Liberale hart... | Caption | ['Landtagswahl'] | CyL975vouHU-Caption |
| 2 | 2 | CyL8GWWJmci | Nach einem starken Wahlkampf ein verdientes Er... | Caption | ['1. Wahlkampf und Wahlergebnis:\\n- Wahlkampf\\... | CyL8GWWJmci-Caption |
| 3 | 3 | CyL7wyJtTV5 | So viele Menschen am Odeonsplatz heute mit ein... | Caption | ['Israel', 'Terrorismus', 'Hamas', 'Entwicklun... | CyL7wyJtTV5-Caption |
| 4 | 4 | CyLxwHuvR4Y | Herzlichen Glückwunsch zu diesem grandiosen Wa... | Caption | ['1. Wahlsieg und Parlamentseinstieg\\n- Wahlsi... | CyLxwHuvR4Y-Caption |
\n" ], "text/plain": [ " Unnamed: 0 shortcode Text \\\n", "0 0 CyMAe_tufcR #Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da... \n", "1 1 CyL975vouHU Die Landtagswahl war für uns als Liberale hart... \n", "2 2 CyL8GWWJmci Nach einem starken Wahlkampf ein verdientes Er... \n", "3 3 CyL7wyJtTV5 So viele Menschen am Odeonsplatz heute mit ein... \n", "4 4 CyLxwHuvR4Y Herzlichen Glückwunsch zu diesem grandiosen Wa... \n", "\n", " Text Type Policy Issues \\\n", "0 Caption ['1. Political parties:\\n- FREIEWÄHLER\\n- Aiwa... \n", "1 Caption ['Landtagswahl'] \n", "2 Caption ['1. Wahlkampf und Wahlergebnis:\\n- Wahlkampf\\... \n", "3 Caption ['Israel', 'Terrorismus', 'Hamas', 'Entwicklun... \n", "4 Caption ['1. Wahlsieg und Parlamentseinstieg\\n- Wahlsi... \n", "\n", " identifier \n", "0 CyMAe_tufcR-Caption \n", "1 CyL975vouHU-Caption \n", "2 CyL8GWWJmci-Caption \n", "3 CyL7wyJtTV5-Caption \n", "4 CyLxwHuvR4Y-Caption " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Label Studio Setup\n", "Please specify the URL and API key for your Label Studio instance." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "cellView": "form", "id": "71Cj4_19X3AV" }, "outputs": [], "source": [ "import json\n", "from google.colab import userdata\n", "\n", "labelstudio_key_name = \"label2-key\"\n", "labelstudio_key = userdata.get(labelstudio_key_name)\n", "labelstudio_url = \"https://label2.digitalhumanities.io\"" ] }, { "cell_type": "markdown", "metadata": { "id": "NyWtV-3PDxn3" }, "source": [ "#### Create Label Studio Interface\n", "Before creating the Label Studio project, you need to define your labelling interface. Once the project is set up, you can only edit the interface in Label Studio itself." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "kVhp0vEGE4an" }, "outputs": [], "source": [ "interface = \"\"\"\n", "\n", " \n", " \n", " \n", " \n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add a simple coding interface\n", "Do you want to add codes (classification) to the texts? Please name your coding variable and add its options.
**By running this cell multiple times you're able to add multiple variables (not recommended)**\n", "\n", "Set `coding_name` to the variable name, list the comma-separated checkbox labels in `coding_values`, and set `coding_choice` to `single` or `multiple` depending on whether annotators may pick one or several options for this variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "iYYcR7nnIaj7" }, "outputs": [], "source": [ "coding_name = \"Sentiment\"\n", "coding_values = \"Positive,Neutral,Negative\"\n", "coding_choice = \"single\"\n", "\n", "coding_interface = '
'.format(coding_name, coding_name, coding_choice)\n", "\n", "for value in coding_values.split(\",\"):\n", "    value = value.strip()\n", "    coding_interface += ''.format(value)\n", "\n", "coding_interface += \"\"\n", "\n", "interface += coding_interface\n", "\n", "print(\"Added {}\".format(coding_name))" ] }, { "cell_type": "markdown", "metadata": { "id": "KJPxxRGZQvwe" }, "source": [ "Finally, run the next cell to close the XML of the annotation interface. **Run this cell even if you do not want to add any variables at the moment!** " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "I1B_JTpMUbjy" }, "outputs": [], "source": [ "interface += \"\"\"\n", " \n", " \n", " \"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Project Upload\n", "This final step creates a Label Studio project and configures the interface. Define a `project_name` and select the `text_column` and `identifier_column`. Additionally, you may define a `sample_percentage` for sampling; we start with $30\\%$. When working with the open-source version of Label Studio we need to create one project per annotator. Enter the number of annotators in `num_copies` to create multiple copies at once." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/" }, "id": "Uyvam3dH7uB5", "outputId": "3e0856ad-0bf7-42a1-9ee6-ea17887f83fb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All done, created project #0! 
Visit https://label2.digitalhumanities.io/projects/61/ and get started labelling!\n" ] } ], "source": [ "from label_studio_sdk import Client\n", "import contextlib\n", "import io\n", "\n", "project_name = \"vSMA Test 1\"\n", "text_column = \"Text\"\n", "identifier_column = \"identifier\"\n", "sample_percentage = 30\n", "num_copies = 1\n", "\n", "# Number of rows to draw for annotation\n", "sample_size = round(len(df) * (sample_percentage / 100))\n", "\n", "ls = Client(url=labelstudio_url, api_key=labelstudio_key)\n", "\n", "# Keep only the identifier and text columns, sample once, and replace missing values\n", "df_tasks = df[[identifier_column, text_column]]\n", "df_tasks = df_tasks.sample(sample_size)\n", "df_tasks = df_tasks.fillna(\"\")\n", "\n", "for i in range(num_copies):\n", "    # Build the title from the base name so suffixes do not accumulate across copies\n", "    title = f\"{project_name} #{i}\"\n", "    # Create the project\n", "    project = ls.start_project(\n", "        title=title,\n", "        label_config=interface,\n", "        sampling=\"Uniform sampling\"\n", "    )\n", "\n", "    # Suppress the SDK's console output while importing the tasks\n", "    with contextlib.redirect_stdout(io.StringIO()):\n", "        project.import_tasks(\n", "            df_tasks.to_dict('records')\n", "        )\n", "\n", "    print(f\"All done, created project #{i}! Visit {labelstudio_url}/projects/{project.id}/ and get started labelling!\")" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }