Back to Article
GPT Text Classification
Download Notebook

Run the extraction of multiple variables.

In [8]:
system_prompt = """
You're an expert in detecting calls-to-action (CTAs) from texts.
**Objective:**
Determine the presence or absence of explicit and implicit CTAs within German-language content sourced from Instagram texts such as posts, stories, video transcriptions, and captions related to political campaigns from the given markdown table.
**Instructions:**
1. Examine each user input as follows:
2. Segment the content into individual sentences.
3. For each sentence, identify:
   a. Explicit CTA: Direct requests for an audience to act which are directed at the reader, e.g., "beide Stimmen CDU!", "Am 26. September #FREIEWÄHLER in den #Bundestag wählen."
   b. Explicit CTA: A clear direction on where or how to find additional information, e.g. "Mehr dazu findet ihr im Wahlprogramm auf fdp.de/vielzutun", "Besuche unsere Website für weitere Details."
   c. Implicit CTA: Suggestions or encouragements that subtly propose an action directed at the reader without a direct command, e.g., "findet ihr unter dem Link in unserer Story."
4. Classify whether an online or offline action is referrenced.
5. CTAs should be actions that the reader or voter can perform directly, like voting for a party, clicking a link, checking more information, etc. General statements, assertions, or suggestions not directed at the reader should not be classified as CTAs.
5. Return boolean variables for Implicit CTAs (`Implicit`), Explicit CTAs (`Explicit`), `Online`, and `Offline` as a JSON objet.
**Formatting:**
Only return the JSON object, nothing else. Do not repeat the text input.
"""

The following code snippet uses my gpt-cost-estimator package to simulate API requests and calculate a cost estimate. Please run the estimation whne possible to asses the price-tag before sending requests to OpenAI!

Note: This code block adds some logic to deal with multiple variables contained in the JSON object: {"Implicit": false, "Explicit": false, "Online": false, "Offline": false}. We add the columns Implicit, Explicit, Online, and Offline accordingly. To classify different variables the code need to be modified accordingly. ChatGPT can help with this task!

Fill in the MOCK, RESET_COST, SAMPLE_SIZE, COLUMNS and MODEL variables as needed (see comments above each variable.)

In [22]:
from tqdm.auto import tqdm
import json

#@markdown Do you want to mock the OpenAI request (dry run) to calculate the estimated price?
MOCK = False # @param {type: "boolean"}
#@markdown Do you want to reset the cost estimation when running the query?
RESET_COST = True # @param {type: "boolean"}
#@markdown Do you want to run the request on a smaller sample of the whole data? (Useful for testing). Enter 0 to run on the whole dataset.
SAMPLE_SIZE = 5 # @param {type: "number", min: 0}

#@markdown Which model do you want to use?
MODEL = "gpt-3.5-turbo-0613" # @param ["gpt-3.5-turbo-0613", "gpt-4-1106-preview", "gpt-4-0613"] {allow-input: true}

#@markdown Which variables did you define in your Prompt?
COLUMNS = ["Implicit", "Explicit", "Online", "Offline"] # @param {type: "raw"}

# This method extracts the four variables from the response.
def extract_variables(response_str):
    # Initialize the dictionary
    extracted = {}

    for column in COLUMNS:
      extracted[column] = None

    try:
        # Parse the JSON string
        data = json.loads(response_str)

        for column in COLUMNS:
          # Extract variables
          extracted[column] = data.get(column, None)

        return extracted

    except json.JSONDecodeError:
        # Handle JSON decoding error (e.g., malformed JSON)
        print("Error: Response is not a valid JSON string.")
        return extracted
    except KeyError:
        # Handle cases where a key is missing
        print("Error: One or more keys are missing in the JSON object.")
        return extracted
    except Exception as e:
        # Handle any other exceptions
        print(f"An unexpected error occurred: {e}")
        return extracted


# Initializing the empty column
if COLUMN not in df.columns:
  df[COLUMN] = None

# Reset Estimates
CostEstimator.reset()
print("Reset Cost Estimation")

filtered_df = df.copy()

# Skip previously annotated rows
filtered_df = filtered_df[pd.isna(filtered_df[COLUMN])]

if SAMPLE_SIZE > 0:
  filtered_df = filtered_df.sample(SAMPLE_SIZE)

for index, row in tqdm(filtered_df.iterrows(), total=len(filtered_df)):
    try:
        p = row['Text']
        response = run_request(system_prompt, p, MODEL, MOCK)

        if not MOCK:
          # Extract the response content
          # Adjust the following line according to the structure of the response
          r = response.choices[0].message.content
          extracted = extract_variables(r)

          for column in COLUMNS:
            df.at[index, column] = extracted[column]

    except Exception as e:
        print(f"An error occurred: {e}")
        # Optionally, handle the error (e.g., by logging or by setting a default value)

print()
Reset Cost Estimation
Cost: $0.0191 | Total: $0.0838
In [24]:
df[~pd.isna(df['Implicit'])]
Unnamed: 0 shortcode Text Text Type Policy Issues Call Implicit Explicit Online Offline
442 442 CxxXJBtAHhv Friedrich Merz ist nicht gerade bekannt für se... Caption ['Asylbewerberleistungsgesetz', 'Zahnsanierung... None False False False False
453 453 CxvqTwmtlJK Damit es uns nicht so ergeht wie den Indianern... Caption NaN None False True False True
494 494 Cxs9ujENMqI 🔹#Krankenhäuser🔹#Geburtsstationen und 🔹#Hebamm... Caption ['Krankenhäuser', 'Geburtsstationen', 'Hebamme... None False True False True
839 839 CxWF0mcqrhg Unterwegs im oberbayerischen Moosburg: Herzlic... Caption NaN None False True False True
1818 1818 CxvKsBBos0j 9801 Bayerische Staatsregierung MISSION 7272 9... OCR NaN None False False False False