Agreement & Evaluation

Read the LabelStudio Annotations from file

Enter the location (e.g. on Google Drive) of your exported JSON file and run this cell. It loads the annotations in long format into the variable all_annotations_df.

Warning

This version does not handle reviews!

Warning

This version has been tested with binary classifications. Other types of nominal data should work out of the box for inter-rater agreement; model evaluation needs some refactoring!

human_annotations_json = '/content/drive/MyDrive/2024-01-04-Text-Annotation-Sample.json' # @param {type:"string"}

import json

with open(human_annotations_json, 'r') as f:
  j = f.read()
  exported_annotations = json.loads(j)

import pandas as pd
from tqdm.notebook import tqdm

def process_result(result, coder, md):
    """Flatten a single LabelStudio result into one row per selected label (choice or taxonomy path)."""
    value_type = result.get('type', "")
    metadata = {
        **md,
        "coder": coder,
        "from_name": result.get('from_name', "")
    }
    annotations = []
    if value_type == "choices":
        choices = result['value'].get('choices', [])
        for choice in choices:
            r = {**metadata}
            r['value'] = choice
            annotations.append(r)
    elif value_type == "taxonomy":
        taxonomies = result['value'].get('taxonomy', [])
        for taxonomy in taxonomies:
            if len(taxonomy) > 1:
                taxonomy = " > ".join(taxonomy)
            elif len(taxonomy) == 1:
                taxonomy = taxonomy[0]
            r = {**metadata}
            r['value'] = taxonomy
            annotations.append(r)
    return annotations
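
# Quick illustration of the long format produced by process_result.
# The values below are illustrative assumptions, modelled on the export shown further down.
example_result = {"type": "choices", "from_name": "Positioning", "value": {"choices": ["True"]}}
process_result(example_result, coder=13724, md={"uuid": 3733})
# -> [{'uuid': 3733, 'coder': 13724, 'from_name': 'Positioning', 'value': 'True'}]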

all_annotations = []

for data in tqdm(exported_annotations):
    annotations = data.get("annotations")
    metadata = {
        **data.get("data")
    }

    for annotation in annotations:
        coder = annotation['completed_by']['id']
        results = annotation.get("result")
        if results:
            for result in results:
                all_annotations.extend(process_result(result, coder, metadata))
        else:
            print("Skipped Missing Result")


all_annotations_df = pd.DataFrame(all_annotations)
# Check the dataframe
all_annotations_df.head()
Text uuid Filename Identifier coder from_name value
0 CDU #SOFORTPROGRAMM MITTELSTANDSPAKET WIR MACH... 3733 cdu/2021-09-13_15-01-28_UTC.jpg CTxB6cMKe2U 13724 Positioning True
1 CDU #SOFORTPROGRAMM MITTELSTANDSPAKET WIR MACH... 3733 cdu/2021-09-13_15-01-28_UTC.jpg CTxB6cMKe2U 13724 Call to Action False
2 CDU #SOFORTPROGRAMM MITTELSTANDSPAKET WIR MACH... 3733 cdu/2021-09-13_15-01-28_UTC.jpg CTxB6cMKe2U 13724 Documentation False
3 CDU #SOFORTPROGRAMM MITTELSTANDSPAKET WIR MACH... 3733 cdu/2021-09-13_15-01-28_UTC.jpg CTxB6cMKe2U 13724 OCR False
4 CDU #SOFORTPROGRAMM MITTELSTANDSPAKET WIR MACH... 3733 cdu/2021-09-13_15-01-28_UTC.jpg CTxB6cMKe2U 7379 Positioning True

Contingency Table

We select one coding in this step (e.g. Positioning) and create a contingency table where each annotated item (text, image) occupies one row and each coder one column. The identifier should be set to a column that uniquely identifies each item, like uuid or Filename.

Enter the from_name for the variable you’re interested in at the moment. (Refer to your LabelStudio Interface for the right from_name).

import pandas as pd
import numpy as np

from_name = 'Positioning'
identifier = 'uuid'

filtered_df = all_annotations_df[all_annotations_df['from_name'].str.contains(from_name, case=False)]

def to_bool(val):
  if isinstance(val, bool):
    return val
  if isinstance(val, str):
    return val.lower() == "true"

  return bool(val)

values = filtered_df['value'].unique()
identifier_values = filtered_df[identifier].unique()

# One row per annotated item, one column per coder; if a coder labelled an item more than once, only the first value is kept
contingency_matrix = pd.crosstab(filtered_df[identifier], filtered_df['coder'], values=filtered_df['value'], aggfunc='first')
contingency_matrix = contingency_matrix.reindex(identifier_values)
# Let's take a look at the contingency table. We refer to each coder using a pseudonymous number.
contingency_matrix.head()
coder 7107 7379 10506 13724
uuid
3733 NaN True True True
4185 False True True NaN
530 NaN True False True
2721 NaN False False False
3384 NaN False False False

Calculate Pairwise \(\kappa\)

Let’s calculate Cohen’s Kappa for each coder pair.
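
Cohen’s \(\kappa\) corrects the observed agreement \(p_o\) between two coders for the agreement \(p_e\) expected by chance:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

A \(\kappa\) of 1 indicates perfect agreement, 0 indicates agreement at chance level.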

from sklearn.metrics import cohen_kappa_score

# Function to calculate Cohen's Kappa for each pair of raters
def calculate_kappa(matrix):
    raters = matrix.columns
    kappa_scores = []
    for i in range(len(raters)):
        for j in range(i+1, len(raters)):
            rater1, rater2 = raters[i], raters[j]
            # Drop NA values for the pair of raters
            pair_matrix = matrix[[rater1, rater2]].dropna()
            # Skip pairs without overlaps
            if len(pair_matrix) > 0:
              kappa = cohen_kappa_score(pair_matrix[rater1], pair_matrix[rater2])
              kappa_scores.append({
                  "Coder 1": rater1,
                  "Coder 2": rater2,
                  "Kappa": kappa,
                  "Overlap": len(pair_matrix),
                  "Coding": from_name
                  })
    return kappa_scores

kappa_scores = calculate_kappa(contingency_matrix)
kappa_scores_df = pd.DataFrame(kappa_scores)
# Let's display the pairwise Kappa agreements.
kappa_scores_df
Coder 1 Coder 2 Kappa Overlap Coding
0 7107 7379 0.469027 20 Positioning
1 7107 10506 0.782609 20 Positioning
2 7379 10506 0.669500 50 Positioning
3 7379 13724 0.888889 30 Positioning
4 10506 13724 0.714286 30 Positioning

Overall Agreement: Krippendorff’s \(\alpha\)

Next, we calculate the overall agreement using Krippendorff’s \(\alpha\), which, unlike pairwise \(\kappa\), also handles more than two coders and missing ratings.
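
Krippendorff’s \(\alpha\) compares the observed disagreement \(D_o\) with the disagreement \(D_e\) expected by chance:

\[
\alpha = 1 - \frac{D_o}{D_e}
\]

First, we need to install the krippendorff package.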

!pip install krippendorff
import krippendorff
import pandas as pd

def convert_to_reliability_data(matrix):
    transposed_matrix = matrix.T
    reliability_data = []
    for _, ratings in transposed_matrix.iterrows():
        reliability_data.append(ratings.tolist())
    return reliability_data

reliability_data = convert_to_reliability_data(contingency_matrix)

# Calculate Krippendorff's Alpha for nominal data
alpha = krippendorff.alpha(reliability_data=reliability_data, level_of_measurement='nominal')

print("Krippendorff's Alpha:", alpha)
Krippendorff's Alpha: 0.6977687626774848

Majority Decision

One approach to assessing the quality of machine-labelled data is to compare the machine-generated labels with human-generated labels, commonly known as “gold standard” labels. Treating the model as one more coder, this comparison works just like inter-rater agreement and is widely used in natural language processing, machine learning, and computational social science.

We are going to create a Majority Decision column from the human annotations: we have chosen an uneven number of annotators so that a majority can be found for each label. Using the contingency table (or matrix) created above, we determine the majority decision for each item.

import numpy as np
import pandas as pd

# Each row represents an item, and each column a decision from a different annotator.

# Step 1: Find the mode (most common decision) for each row
# The mode is used as it represents the majority decision.
# decisions will have the most frequent value in each row, handling ties by keeping all modes.
decisions = contingency_matrix.mode(axis=1)

# Step 2: Extract the primary mode (first column after mode operation)
# This represents the majority decision. If there's a tie, it takes the first one.
majority_decisions = decisions.iloc[:, 0]

##########
## Warning: This part needs some refactoring. Will be updated shortly. (A possible sketch follows below the output.)
##########
# Step 3: Count the number of non-NaN values (actual decisions) per row, excluding the first column
# Use .iloc[:, 1:] to exclude the first column
# row_non_nan_counts = contingency_matrix.iloc[:, 1:].notnull().sum(axis=1)

# Step 4: Invalidate the majority decision where the number of decisions is insufficient or even
# Majority decisions are only considered valid if there are more than 2 decisions and the number of decisions is odd.

# Define a condition for invalidating rows
# invalid_rows_condition = (row_non_nan_counts < 3) | (row_non_nan_counts % 2 == 0)


# Step 5: Append the majority decision as a new column in the contingency matrix
# This column now represents the aggregated decision from the annotators per item.
contingency_matrix['Majority Decision'] = majority_decisions
contingency_matrix.head()
coder 7107 7379 10506 13724 Majority Decision
uuid
3733 NaN True True True True
4185 False True True NaN True
530 NaN True False True True
2721 NaN False False False False
3384 NaN False False False False
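
The commented-out Steps 3 and 4 above describe a validity check: a majority decision should only count if at least three coders rated the item and their number is odd. A minimal sketch of how this could look (an untested assumption, not part of the notebook run shown here):

# Count how many coders actually rated each item (the 'Majority Decision' column is excluded)
coder_columns = [c for c in contingency_matrix.columns if c != 'Majority Decision']
row_non_nan_counts = contingency_matrix[coder_columns].notnull().sum(axis=1)

# Invalidate majority decisions where fewer than 3 coders rated the item or their number is even
invalid_rows_condition = (row_non_nan_counts < 3) | (row_non_nan_counts % 2 == 0)
# A masked copy; the cells below keep using the unfiltered 'Majority Decision' column
valid_majority_decisions = contingency_matrix['Majority Decision'].where(~invalid_rows_condition)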

Import computational annotations

Read the GPT annotations. Enter the correct file path below. coding_column needs to point to the column with your computational annotations, annotated_identifier to an id / filename that has been passed to the LabelStudio project. The computational annotations are merged with the human codings based on annotated_identifier, in my case uuid.

annotated_file = '/content/drive/MyDrive/2024-01-04-Text-Annotation-Sample.csv'
coding_column = 'Positioning'
annotated_identifier = 'uuid' 

annotated_df = pd.read_csv(annotated_file)
annotated_df.head()
Unnamed: 0 Text Post Type Positioning uuid
0 54 Gemeinsam als #eineUnion stehen wir für eine ... Post True 54
1 200 Wir wünschen allen Schülerinnen und Schüler... Post True 200
2 530 Fränkisches Essen gibt Kraft: Nach einem Mitt... Post False 530
3 707 Kleines Zwischenfazit zum #TvTriell. Post False 707
4 899 In einem paar Tagen sind Wahlen, wichtige Wahl... Story False 899
contingency_table = pd.merge(contingency_matrix, annotated_df[[annotated_identifier, coding_column]], left_on=identifier, right_on=annotated_identifier, how='left')
contingency_table.rename(columns={coding_column: "Model"}, inplace=True)
contingency_table.head()
uuid 7107 7379 10506 13724 Majority Decision Model
0 3733 NaN True True True True True
1 4185 False True True NaN True True
2 530 NaN True False True True False
3 2721 NaN False False False False False
4 3384 NaN False False False False False

Let’s suppose we’re dealing with binary data. We convert all values to boolean so that we can compare them correctly.

Note

The cell below needs refactoring for other types of data. For nominal data stored as strings it might be enough to skip the cell below (see the sketch after it). A custom solution is needed for more complex use cases. Additionally, make sure to adapt the majority decision cells to any changes down here and vice versa.

# Function to convert a column to boolean if it's not already
def convert_to_bool(column):
    if contingency_table[column].dtype != 'bool':
        bool_map = {'True': True, 'False': False}
        return contingency_table[column].map(bool_map)
    return contingency_table[column]

# Convert columns to boolean if they are not already
contingency_table['Majority Decision']= convert_to_bool('Majority Decision')
contingency_table['Model'] = convert_to_bool('Model')
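
For nominal labels stored as strings, a light normalisation instead of the boolean conversion might already be enough. A minimal sketch (an assumption for plain string categories, not used in the run shown here):

# Normalise string labels for comparison instead of converting them to boolean
def normalize_label(val):
    if isinstance(val, str):
        return val.strip().casefold()
    return val

# contingency_table['Majority Decision'] = contingency_table['Majority Decision'].map(normalize_label)
# contingency_table['Model'] = contingency_table['Model'].map(normalize_label)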

Let’s quickly check pairwise agreement between the Model and Majority Decision using Cohen’s Kappa:

kappa_scores = calculate_kappa(contingency_table[['Majority Decision', 'Model']])
pd.DataFrame(kappa_scores).head()
Coder 1 Coder 2 Kappa Overlap Coding
0 Majority Decision Model 0.831461 50 Positioning

Machine Learning Metrics

Finally, let’s calculate Accuracy, Precision, Recall, and F1 Score and plot a confusion matrix.
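
Treating the Majority Decision as ground truth and the Model column as prediction, the metrics for the binary case can be written in terms of true/false positives and negatives (TP, FP, TN, FN):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]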

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from IPython.display import display, Markdown


# Calculating metrics
accuracy = accuracy_score(contingency_table['Majority Decision'], contingency_table['Model'])
precision = precision_score(contingency_table['Majority Decision'], contingency_table['Model'], average='binary')
recall = recall_score(contingency_table['Majority Decision'], contingency_table['Model'], average='binary')
f1 = f1_score(contingency_table['Majority Decision'], contingency_table['Model'], average='binary')

# Creating a DataFrame for the metrics
metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Value': [accuracy, precision, recall, f1]
})

# Displaying the DataFrame as a table
display(metrics_df)
Metric Value
0 Accuracy 0.940000
1 Precision 1.000000
2 Recall 0.769231
3 F1 Score 0.869565
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Generate the confusion matrix
cm = confusion_matrix(contingency_table['Majority Decision'], contingency_table['Model'])

# Plotting the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['False', 'True'], yticklabels=['False', 'True'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()