Human Annotations

Install the label-studio-sdk package for programmatic control of Label Studio:

!pip -q install label-studio-sdk

Next, let’s read the text master from the previous sessions:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2023-12-01-Export-Posts-Text-Master.csv')

In my video on GPT text classification I mentioned the problem of the unique identifier; we need a unique identifier for the annotations as well. Use the code below in our text classification notebook when working with multi-document classifications!

df['identifier'] = df.apply(lambda x: f"{x['shortcode']}-{x['Text Type']}", axis=1)
df.head()
|   | Unnamed: 0 | shortcode | Text | Text Type | Policy Issues | identifier |
|---|---|---|---|---|---|---|
| 0 | 0 | CyMAe_tufcR | #Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da... | Caption | ['1. Political parties:\n- FREIEWÄHLER\n- Aiwa... | CyMAe_tufcR-Caption |
| 1 | 1 | CyL975vouHU | Die Landtagswahl war für uns als Liberale hart... | Caption | ['Landtagswahl'] | CyL975vouHU-Caption |
| 2 | 2 | CyL8GWWJmci | Nach einem starken Wahlkampf ein verdientes Er... | Caption | ['1. Wahlkampf und Wahlergebnis:\n- Wahlkampf\... | CyL8GWWJmci-Caption |
| 3 | 3 | CyL7wyJtTV5 | So viele Menschen am Odeonsplatz heute mit ein... | Caption | ['Israel', 'Terrorismus', 'Hamas', 'Entwicklun... | CyL7wyJtTV5-Caption |
| 4 | 4 | CyLxwHuvR4Y | Herzlichen Glückwunsch zu diesem grandiosen Wa... | Caption | ['1. Wahlsieg und Parlamentseinstieg\n- Wahlsi... | CyLxwHuvR4Y-Caption |
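
Before uploading anything, it can be worth confirming that the identifier really is unique, since duplicates would make it ambiguous which row an annotation belongs to. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not rows from the text master):

```python
import pandas as pd

# Toy data standing in for the text master (values are invented for illustration).
toy = pd.DataFrame({
    "shortcode": ["abc", "abc", "xyz", "xyz"],
    "Text Type": ["Caption", "OCR", "Caption", "Caption"],
})

# Same construction as above: shortcode plus text type.
toy["identifier"] = toy["shortcode"] + "-" + toy["Text Type"]

# True only if no identifier occurs more than once.
is_unique = toy["identifier"].is_unique

# All rows that share an identifier with at least one other row.
duplicates = toy[toy["identifier"].duplicated(keep=False)]

print(is_unique)
print(duplicates)
```

In the toy data the last two rows collide, so the check fails; on the real `df` the same two lines applied to `df["identifier"]` would show whether any rows need attention.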

Label Studio Setup

Please specify the URL and API key for your Label Studio instance.

import json
from google.colab import userdata

labelstudio_key_name = "label2-key"
labelstudio_key = userdata.get(labelstudio_key_name)
labelstudio_url = "https://label2.digitalhumanities.io"
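
A missing secret or a malformed URL tends to surface only later, when the project is created. As a hedged sketch, a small sanity check could run right after the settings are defined; `check_labelstudio_settings` is a hypothetical helper introduced here for illustration, not part of the Label Studio SDK or of Colab:

```python
def check_labelstudio_settings(url, key):
    """Return the URL without a trailing slash, or raise if a setting looks unusable."""
    if not key:
        raise ValueError("API key is empty; check the name of the Colab secret.")
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must start with http:// or https://")
    # Strip trailing slashes so that f"{url}/projects/..." does not contain "//".
    return url.rstrip("/")


# Example with placeholder values (not real credentials).
clean_url = check_labelstudio_settings("https://example.org/", "dummy-key")
print(clean_url)
```

In the notebook, the call would be `labelstudio_url = check_labelstudio_settings(labelstudio_url, labelstudio_key)`.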

Create Label Studio Interface

Before creating the Label Studio project, you need to define your labelling interface. Once the project is set up, you will only be able to edit the interface in Label Studio itself.

interface = """
<View style="display:flex;">
  <View style="flex:33%">
    <Text name="Text" value="$Text"/>
  </View>
  <View style="flex:66%">
"""

Add a simple coding interface

Do you want to add codes (classification) to the texts? Please name your coding instance and add options.
By running this cell multiple times you can add multiple variables (not recommended).

Set the variable name in coding_name, the comma-separated choice labels in coding_values, and whether to expect single-choice or multiple-choice input for this variable in coding_choice ("single" or "multiple").

coding_name = "Sentiment"
coding_values = "Positive,Neutral,Negative"
coding_choice = "single"

coding_interface = '<Header value="{}" /><Choices name="{}" choice="{}" toName="Text">'.format(coding_name, coding_name,coding_choice)

for value in coding_values.split(","):
  value = value.strip()
  coding_interface += '<Choice value="{}" />'.format(value)

coding_interface += "</Choices>"

interface += coding_interface

print("Added {}".format(coding_name))
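
The cell above concatenates strings directly, which works for simple labels but would produce broken XML if a name or label contained characters such as `&` or `"`. A minimal sketch of the same idea as a reusable function with attribute escaping follows; `build_choices` is a hypothetical helper name, and the tag layout simply mirrors the cell above:

```python
from xml.sax.saxutils import quoteattr


def build_choices(name, values, choice="single", to_name="Text"):
    """Build a <Header> plus <Choices> fragment for one coding variable."""
    # quoteattr escapes special characters and adds the surrounding quotes.
    parts = [
        f"<Header value={quoteattr(name)} />",
        f"<Choices name={quoteattr(name)} choice={quoteattr(choice)} "
        f"toName={quoteattr(to_name)}>",
    ]
    for value in values:
        parts.append(f"<Choice value={quoteattr(value.strip())} />")
    parts.append("</Choices>")
    return "".join(parts)


fragment = build_choices("Sentiment", "Positive,Neutral,Negative".split(","))
print(fragment)
```

Calling the function once per variable and appending each result to `interface` would replace re-running the cell by hand.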

Finally, run the next cell to close the XML of the annotation interface. Run it even if you do not want to add any variables at the moment!

interface += """
        </View>
    </View>
    """
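
Because the interface is assembled from several string pieces, a forgotten closing cell leaves unbalanced tags. A quick well-formedness check with Python's standard XML parser can catch that before the project is created. This is only a sketch: `example_interface` is a hand-written stand-in for the assembled `interface` string, and the check says nothing about whether Label Studio accepts the tags themselves.

```python
import xml.etree.ElementTree as ET

# Stand-in for the fully assembled interface string.
example_interface = """
<View style="display:flex;">
  <View style="flex:33%">
    <Text name="Text" value="$Text"/>
  </View>
  <View style="flex:66%">
    <Header value="Sentiment" />
    <Choices name="Sentiment" choice="single" toName="Text">
      <Choice value="Positive" />
      <Choice value="Neutral" />
      <Choice value="Negative" />
    </Choices>
  </View>
</View>
"""


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string.strip())
        return True
    except ET.ParseError:
        return False


print(is_well_formed(example_interface))
print(is_well_formed("<View><View></View>"))  # outer <View> is never closed
```

In the notebook, `is_well_formed(interface)` would be the corresponding call.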

Project Upload

This final step creates a Label Studio project and configures the interface. Define a project_name and select the text_column and identifier_column. Additionally, you may define a sample_percentage to annotate only a random sample of the rows; we start with 30%. When working with the open-source version of Label Studio we need to create one project per annotator; enter the number of annotators in num_copies to create multiple copies at once.

from label_studio_sdk import Client
import contextlib
import io

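project_name = "vSMA Test 1"      # base title; a copy number is appended per project
text_column = "Text"              # column holding the text to annotate
identifier_column = "identifier"  # column holding the unique identifier
sample_percentage = 30            # share of rows to sample, in percent
num_copies = 1                    # number of projects to create (one per annotator)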

sample_size = round(len(df) * (sample_percentage / 100))

ls = Client(url=labelstudio_url, api_key=labelstudio_key)

df_tasks = df[[identifier_column, text_column]]
df_tasks = df_tasks.sample(sample_size)
df_tasks = df_tasks.fillna("")

for i in range(num_copies):
  # Build a fresh title per copy; reassigning project_name would stack suffixes ("... #0 #1")
  title = f"{project_name} #{i}"
  # Create the project
  project = ls.start_project(
      title=title,
      label_config=interface,
      sampling="Uniform sampling"
  )

  # Import the sampled rows as tasks, hiding the console output of the import
  with contextlib.redirect_stdout(io.StringIO()):
    project.import_tasks(df_tasks.to_dict('records'))

  print(f"All done, created project #{i}! Visit {labelstudio_url}/projects/{project.id}/ and get started labelling!")
All done, created project #0! Visit https://label2.digitalhumanities.io/projects/61/ and get started labelling!
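
To make the sampling and the task format concrete, here is a small sketch on toy data that needs no Label Studio connection. The `toy` frame is invented for illustration, and `random_state=42` is an addition that the cell above does not use; it only makes the sample reproducible across runs.

```python
import pandas as pd

# Ten invented rows; the last text is missing on purpose.
toy = pd.DataFrame({
    "identifier": [f"post{i}-Caption" for i in range(10)],
    "Text": [f"text {i}" for i in range(9)] + [None],
})

sample_percentage = 30
# Same formula as above: 10 rows * 30 / 100 -> 3 rows.
sample_size = round(len(toy) * (sample_percentage / 100))

tasks_df = (
    toy[["identifier", "Text"]]
    .sample(sample_size, random_state=42)  # assumption: fixed seed for reproducibility
    .fillna("")                            # missing texts become empty strings
)

# One dict per row; the "Text" key is what value="$Text" in the interface refers to.
tasks = tasks_df.to_dict("records")
print(sample_size)
print(tasks[0].keys())
```

Each dictionary in `tasks` has the same shape as the records passed to `import_tasks` above, with the column names as keys.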