Label Studio for Visual Annotations

First, let's install the packages:

In [1]:

!pip -q install label-studio-sdk gcloud

  Preparing metadata (setup.py) ... done
  Building wheel for gcloud (setup.py) ... done

Next, set up Google Cloud. Specify the path to the credentials file (provided via GRIPS, or your own) so that the images can be uploaded to the Google Cloud bucket.

In [2]:

#@title ## Gcloud Setup
#@markdown 

import json
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

gcloud_credentials_path = '/content/vsma-course-2324-72da2075ad3a.json' #@param {type: "string"}
gcloud_bucket = 'label-studio-vsma' #@param {type: "string"}

with open(gcloud_credentials_path, 'rb') as f:
  credentials_dict = json.loads(f.read())

credentials = ServiceAccountCredentials.from_json_keyfile_dict(
    credentials_dict
)
client = storage.Client(credentials=credentials, project='local-grove-153811')
bucket = client.get_bucket(gcloud_bucket)
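A malformed or wrong credentials file only surfaces as an error once the client contacts the bucket. The following is a minimal sketch of an early check, assuming the file is a standard Google service-account key with the fields `type`, `project_id`, `private_key` and `client_email`. The helper name `validate_service_account` and the placeholder values are hypothetical and not part of the original notebook.

```python
# Fields that a Google service-account key file is assumed to contain.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def validate_service_account(info: dict) -> str:
    """Check a parsed key file and return its project ID (hypothetical helper)."""
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise ValueError(f"Credentials file is missing keys: {sorted(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"Expected a service account key, got type {info['type']!r}")
    return info["project_id"]


# Placeholder values for illustration only; real keys come from the JSON file.
example_info = {
    "type": "service_account",
    "project_id": "my-example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----...",
    "client_email": "uploader@my-example-project.iam.gserviceaccount.com",
}
project_id = validate_service_account(example_info)
```

In the notebook, the returned value could be passed as `project=` instead of a hard-coded project ID, provided the bucket lives in the same project as the service account.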

Let’s read the DataFrame exported in the previous sessions.

In [4]:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2024-01-19-AfD-Stories-Exported.csv')

In [6]:

df.head()

	Unnamed: 0.3	Unnamed: 0.2	Unnamed: 0.1	Unnamed: 0	ID	Time of Posting	Type of Content	video_url	image_url	Username	...	Is Verified	Stickers	Accessibility Caption	Attribution URL	image_path	OCR	Objects	caption	Vertex Caption	Ensemble
0	0	0	0	1	2125373886060513565_1484534097	2019-09-04 08:05:27	Image	NaN	NaN	afd.bund	...	True	[]	Photo by Alternative für Deutschland on Septem...	NaN	/content/media/images/afd.bund/212537388606051...	FACEBOOK\nAfD\nf\nSwipe up\nund werde Fan!	NaN	a collage of a picture of a person flying a kite	an ad for facebook shows a drawing of a facebo...	Digital and Social Media Campaigning
1	1	1	1	2	2125374701022077222_1484534097	2019-09-04 08:07:04	Image	NaN	NaN	afd.bund	...	True	[]	Photo by Alternative für Deutschland on Septem...	NaN	/content/media/images/afd.bund/212537470102207...	YOUTUBE\nAfD\nSwipe up\nund abonniere uns!	NaN	a poster of a man with a red face	an advertisement for youtube with a red backgr...	Digital and Social Media Campaigning
2	2	2	2	3	2490851226217175299_1484534097	2021-01-20 14:23:30	Image	NaN	NaN	afd.bund	...	True	[]	Photo by Alternative für Deutschland on Januar...	NaN	/content/media/images/afd.bund/249085122621717...	TELEGRAM\nAfD\nSwipe up\nund folge uns!	NaN	a large blue and white photo of a plane	an advertisement for telegram with a blue back...	Digital and Social Media Campaigning
3	3	3	3	4	2600840011884997131_1484534097	2021-06-21 08:31:45	Image	NaN	NaN	afd.bund	...	True	[]	Photo by Alternative für Deutschland on June 2...	NaN	/content/media/images/afd.bund/260084001188499...	Pol\nBeih	3x Person, 1x Chair, 1x Table, 1x Picture frame	a woman sitting at a desk with a laptop	two women are sitting at a table talking to ea...	Public Engagement
4	4	4	4	5	2600852794831609459_1484534097	2021-06-21 08:57:09	Image	NaN	NaN	afd.bund	...	True	[]	Photo by Alternative für Deutschland in Berlin...	NaN	/content/media/images/afd.bund/260085279483160...	BERLIN, GERMANY\n2160 25.000\nMON 422 150M\nA0...	4x Person, 1x Furniture, 1x Television	a man sitting in front of a screen with a tv	a camera is recording a man sitting at a table...	Traditional Media Campaigning

5 rows × 23 columns

Next, let’s unzip the images.

In [5]:

!unzip /content/drive/MyDrive/2024-01-19-AfD-Stories-Exported.zip

Upload files to Cloud Bucket

We’re using the naming convention {cloud-bucket}/{username}/{id}.jpg. The naming convention is important, as we will use it later on to map the manual and computational annotations into one dataframe. (See Identifier in the text annotation project).
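Because the identifier is encoded in the object name, it can be recovered from the URI later when merging annotations. A minimal sketch of both directions follows; the helper names `image_uri` and `parse_image_uri` are hypothetical and only illustrate the convention described above.

```python
def image_uri(bucket: str, username: str, post_id: str) -> str:
    """Build the gs:// URI following {cloud-bucket}/{username}/{id}.jpg."""
    return f"gs://{bucket}/{username}/{post_id}.jpg"


def parse_image_uri(uri: str) -> tuple[str, str, str]:
    """Split a URI built by image_uri back into (bucket, username, id)."""
    prefix, suffix = "gs://", ".jpg"
    if not uri.startswith(prefix) or not uri.endswith(suffix):
        raise ValueError(f"Unexpected image URI: {uri}")
    # Split into exactly three parts: bucket, username, file name.
    bucket, username, filename = uri[len(prefix):].split("/", 2)
    return bucket, username, filename[: -len(suffix)]


uri = image_uri("label-studio-vsma", "afd.bund", "2125373886060513565_1484534097")
parts = parse_image_uri(uri)
```

The round trip only works if usernames contain no slash, which holds for the account names used here.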

In [30]:

df["Image"] =  df.apply(lambda row: "gs://{}/{}/{}.jpg".format(gcloud_bucket, row['Username'], row['ID']), axis=1)

In [34]:

from tqdm.notebook import tqdm

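# Build the URI per row; calling str.format on whole columns would insert the
# text representation of the entire Series into every row.
df["Image"] = "gs://" + gcloud_bucket + "/" + df["Username"].astype(str) + "/" + df["ID"].astype(str) + ".jpg"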

uploaded_count = 0
skipped_count = 0

# Use tqdm for progress bar
for row in tqdm(df.itertuples(), total=len(df), desc="Uploading Images"):
    filename = "{}/{}.jpg".format(row.Username, row.ID)
    source_filename = row.image_path
    blob = bucket.blob(filename)

    if not blob.exists(client):
        try:
            blob.upload_from_filename(source_filename)
            uploaded_count += 1
        except FileNotFoundError:
            print(f"Uploading {source_filename} failed: Missing File")
    else:
        skipped_count += 1

print()
print(f"Uploaded {uploaded_count} images successfully, skipped {skipped_count} existing files.")

Uploading /content/media/images/afd.bund/2632909594311219564_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2637169242765597715_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2637310044636651340_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2640856259194124126_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2643802824089930195_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2653863205891438589_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2664113842957989541_1484534097.jpg failed: Missing File
Uploading /content/media/images/afd.bund/2671444844831156334_1484534097.jpg failed: Missing File

Uploaded 1 images successfully, skipped 171 existing files.
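The "Missing File" messages above come from rows whose `image_path` does not exist on disk. One way to see these before starting the upload is to split the paths by existence first. This is a minimal sketch using only the standard library; `split_by_existence` is a hypothetical helper, and the temporary files stand in for the real image folder.

```python
import os
import tempfile


def split_by_existence(paths):
    """Return (present, missing) lists for the given file paths."""
    present = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return present, missing


# Demo with one real file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "a.jpg")
    open(real, "wb").close()
    ghost = os.path.join(tmp, "b.jpg")
    present, missing = split_by_existence([real, ghost])
    demo_result = (len(present), len(missing))
```

In the notebook, the same function could be called with `df["image_path"].tolist()` to list the rows that need attention.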

LabelStudio Setup

Please specify the URL and API key for your LabelStudio instance.

In [13]:

import json
from google.colab import userdata

labelstudio_key_name = "label2-key" #@param {type: "string"}
labelstudio_key = userdata.get(labelstudio_key_name)
labelstudio_url = "https://label2.digitalhumanities.io" #@param {type: "string"}

Create LabelStudio Interface

Before creating the LabelStudio project you will need to define your labelling interface. Once the project is set up you will only be able to edit the interface in LabelStudio.

In [21]:

interface = """
<View style="display:flex;">
  <View style="flex:33%">
    <Image name="Image" value="$Image"/>
  </View>
  <View style="flex:66%">
"""

Add a simple coding interface

Do you want to add codes (classification) to the images? Please name your coding instance and add options.
Running this cell multiple times adds multiple variables (not recommended).

Add the variable name to coding_name, the checkbox labels in coding_values, and define whether to expect single choice or multiple choice input for this variable in coding_choice.

In [22]:

#@title ### Codes
#@markdown Do you want to add codes (Classification) to the images? Please name your coding instance and add options. <br/> **By running this cell multiple times you're able to add multiple variables (not recommended)**

coding_name = "Sentiment" #@param {type:"string"}
coding_values = "Positive,Neutral,Negative" #@param {type:"string"}
coding_choice = "single" #@param ["single", "multiple"]

coding_interface = '<Header value="{}" /><Choices name="{}" choice="{}" toName="Image">'.format(coding_name, coding_name,coding_choice)

for value in coding_values.split(","):
  value = value.strip()
  coding_interface += '<Choice value="{}" />'.format(value)

coding_interface += "</Choices>"

interface += coding_interface

print("Added {}".format(coding_name))

Added Sentiment
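Plain string formatting produces invalid XML when a variable name or label contains characters such as `&` or a double quote. The following minimal sketch builds the same `<Header>`/`<Choices>` fragment with escaped attribute values. `build_choices` is a hypothetical helper, not part of the notebook or the Label Studio SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def build_choices(name: str, values: str, choice: str = "single") -> str:
    """Build the Header/Choices fragment with XML-escaped attribute values."""
    # quoteattr returns the value already wrapped in quotes and escaped.
    parts = [
        f"<Header value={quoteattr(name)} />",
        f'<Choices name={quoteattr(name)} choice={quoteattr(choice)} toName="Image">',
    ]
    for value in values.split(","):
        value = value.strip()
        if value:  # skip empty entries such as a trailing comma
            parts.append(f"<Choice value={quoteattr(value)} />")
    parts.append("</Choices>")
    return "".join(parts)


snippet = build_choices("Sentiment", "Positive, Neutral, Negative")
# Wrap in a root element so the fragment can be parsed as one document.
root = ET.fromstring(f"<View>{snippet}</View>")
labels = [c.get("value") for c in root.iter("Choice")]
```

Parsing the fragment right away confirms that it is well-formed before it is appended to the interface.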

Don’t forget to run the next line! It closes the interface XML!

In [23]:

interface += """
        </View>
    </View>
    """
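Since the interface is assembled across several cells, a forgotten closing cell leaves the XML unbalanced. A minimal sketch of a well-formedness check with the standard library is shown below. `check_interface` is a hypothetical helper, and the example string only mirrors the structure built above; it checks XML syntax, not whether Label Studio accepts the tags.

```python
import xml.etree.ElementTree as ET


def check_interface(xml_text: str) -> ET.Element:
    """Parse the interface XML and return its root, or raise ValueError."""
    try:
        return ET.fromstring(xml_text.strip())
    except ET.ParseError as err:
        raise ValueError(f"Interface XML is not well-formed: {err}") from err


example_interface = """
<View style="display:flex;">
  <View style="flex:33%">
    <Image name="Image" value="$Image"/>
  </View>
  <View style="flex:66%">
    <Header value="Sentiment" />
    <Choices name="Sentiment" choice="single" toName="Image">
      <Choice value="Positive" />
    </Choices>
  </View>
</View>
"""
root = check_interface(example_interface)
```

Calling `check_interface(interface)` before creating the project would catch a missing closing tag early.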

Project Upload

This final step creates a LabelStudio project and configures the interface. Define a project_name and an identifier_column. Additionally, you may define a sample_percentage for sampling; we start with 30%. The open-source version of Label Studio requires one project per annotator, so enter the number of annotators in num_copies to create multiple copies at once.

In [33]:

from label_studio_sdk import Client
import contextlib
import io

project_name = "vSMA Image Test 1"  #@param {type: "string"}
identifier_column = "ID"  #@param {type: "string"}
#@markdown Percentage for drawing a sample to annotate, e.g. 30%
sample_percentage = 30  #@param {type: "number", min:0, max:100}
#@markdown Number of project copies. **Start testing with 1!**
num_copies = 1  #@param {type: "number", min:0, max:3}

sample_size = round(len(df) * (sample_percentage / 100))

ls = Client(url=labelstudio_url, api_key=labelstudio_key)


# Import all tasks
df_tasks = df[[identifier_column, 'Image']]
df_tasks = df_tasks.sample(sample_size)
df_tasks = df_tasks.fillna("")

for i in range(num_copies):
  # Build the title from the base name so the suffix does not accumulate across copies
  title = f"{project_name} #{i}"
  # Create the project
  project = ls.start_project(
      title=title,
      label_config=interface,
      sampling="Uniform sampling"
  )
  # Configure Cloud Storage (in order to be able to view the images)
  project.connect_google_import_storage(bucket=gcloud_bucket, google_application_credentials=json.dumps(credentials_dict))


  with contextlib.redirect_stdout(io.StringIO()):
    project.import_tasks(
          df_tasks.to_dict('records')
        )

  print(f"All done, created project #{i}! Visit {labelstudio_url}/projects/{project.id}/ and get started labelling!")

All done, created project #0! Visit https://label2.digitalhumanities.io/projects/71/ and get started labelling!
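The sample size and the per-annotator project titles depend only on simple arithmetic and string formatting, so they can be checked without contacting the server. A minimal sketch follows; the helper names `sample_size_for` and `project_titles` and the row count of 200 are hypothetical.

```python
def sample_size_for(n_rows: int, percentage: float) -> int:
    """Number of rows to sample for a percentage between 0 and 100."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return round(n_rows * percentage / 100)


def project_titles(base_name: str, copies: int) -> list[str]:
    """One title per annotator copy, each derived from the same base name."""
    return [f"{base_name} #{i}" for i in range(copies)]


size = sample_size_for(200, 30)
titles = project_titles("vSMA Image Test 1", 2)
```

Deriving every title from the unchanged base name keeps the suffixes from stacking when more than one copy is created.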
