GPT Text Classification
Let's read last week's Text DataFrame.
In [3]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/2023-12-01-Export-Posts-Text-Master.csv')
In [4]:
df.head()
| | Unnamed: 0 | shortcode | Text | Text Type | Policy Issues |
|---|---|---|---|---|---|
| 0 | 0 | CyMAe_tufcR | #Landtagswahl23 🤩🧡 #FREIEWÄHLER #Aiwanger #Da... | Caption | ['1. Political parties:\n- FREIEWÄHLER\n- Aiwa... |
| 1 | 1 | CyL975vouHU | Die Landtagswahl war für uns als Liberale hart... | Caption | ['Landtagswahl'] |
| 2 | 2 | CyL8GWWJmci | Nach einem starken Wahlkampf ein verdientes Er... | Caption | ['1. Wahlkampf und Wahlergebnis:\n- Wahlkampf\... |
| 3 | 3 | CyL7wyJtTV5 | So viele Menschen am Odeonsplatz heute mit ein... | Caption | ['Israel', 'Terrorismus', 'Hamas', 'Entwicklun... |
| 4 | 4 | CyLxwHuvR4Y | Herzlichen Glückwunsch zu diesem grandiosen Wa... | Caption | ['1. Wahlsieg und Parlamentseinstieg\n- Wahlsi... |
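Note: the CSV path above lives on Google Drive, so pd.read_csv only resolves once Drive is mounted in the Colab session. A minimal sketch using the standard Colab mount (run it before the read_csv cell; it will ask you to authorize access):

from google.colab import drive
# Mount Google Drive at /content/drive so the MyDrive path used above resolves
drive.mount('/content/drive')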
Setup for GPT
In [6]:
!pip install -q openai backoff gpt-cost-estimator
We're using Colab's Secrets feature to store API keys safely within the Colab environment. Click the key icon on the left to add your API key and enable it for this notebook, then enter the name of your key in the api_key_name variable.
In [8]:
import openai
from openai import OpenAI
from google.colab import userdata
import backoff
from gpt_cost_estimator import CostEstimator

api_key_name = "openai-lehrstuhl-api"
api_key = userdata.get(api_key_name)

# Initialize OpenAI using the key
client = OpenAI(
    api_key=api_key
)

@CostEstimator()
def query_openai(model, temperature, messages, mock=True, completion_tokens=10):
    return client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
        max_tokens=600)

# We define the run_request method to wrap it with the @backoff decorator
@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(system_prompt, user_prompt, model, mock):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    return query_openai(
        model=model,
        temperature=0.0,
        messages=messages,
        mock=mock
    )
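With the helpers in place, a single request can be tried out before looping over a whole DataFrame. A minimal sketch (the prompts here are made-up placeholders; with mock=True the gpt-cost-estimator decorator only simulates the call and updates the running cost estimate instead of contacting OpenAI):

# Hypothetical smoke test: mock=True keeps this free of charge;
# set mock=False to send a real request.
test_response = run_request(
    system_prompt="You are a helpful assistant.",
    user_prompt="Say hello in one word.",
    model="gpt-3.5-turbo-0613",
    mock=True,
)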
Next, we create a system prompt describing what we want to classify. For further examples of prompts and advice on prompt engineering see e.g. the prompting guide and further resources linked at the bottom of the page.
For the moment we are going to use the prompt from the literature.
Do not forget the Prompt Archive when experimenting. Share your successful prompts with us!
In [10]:
= """
system_prompt You are an advanced classifying AI. Your task is to classify the sentiment of a text. Sentiment can be either βpositiveβ, βnegativeβ, or βneutralβ.
"""
In [11]:
= """
prompt Please classify the following social media comment into either βnegativeβ, βneutralβ, or βpositiveβ. Your answer MUST be one of [βnegativeβ, βneutralβ, βpositiveβ], and it should be presented in lowercase.
Text: [TEXT]
"""
Running the request
The following code snippet uses my gpt-cost-estimator package to simulate API requests and calculate a cost estimate. Please run the estimation when possible to assess the price tag before sending requests to OpenAI! Make sure run_request and system_prompt (see Setup for GPT) are defined before this block by running the two blocks above!
Fill in the MOCK, RESET_COST, COLUMN, SAMPLE_SIZE, and MODEL variables as needed (see the comments above each variable).
In [13]:
from tqdm.auto import tqdm

#@markdown Do you want to mock the OpenAI request (dry run) to calculate the estimated price?
MOCK = False # @param {type: "boolean"}
#@markdown Do you want to reset the cost estimation when running the query?
RESET_COST = True # @param {type: "boolean"}
#@markdown What's the column name to save the results of the data extraction task to?
COLUMN = 'Sentiment' # @param {type: "string"}
#@markdown Do you want to run the request on a smaller sample of the whole data? (Useful for testing). Enter 0 to run on the whole dataset.
SAMPLE_SIZE = 25 # @param {type: "number", min: 0}

#@markdown Which model do you want to use?
MODEL = "gpt-3.5-turbo-0613" # @param ["gpt-3.5-turbo-0613", "gpt-4-1106-preview", "gpt-4-0613"] {allow-input: true}

# Initializing the empty column
if COLUMN not in df.columns:
    df[COLUMN] = None

# Reset Estimates
if RESET_COST:
    CostEstimator.reset()
    print("Reset Cost Estimation")

filtered_df = df.copy()

# Skip previously annotated rows
filtered_df = filtered_df[pd.isna(filtered_df[COLUMN])]

if SAMPLE_SIZE > 0:
    filtered_df = filtered_df.sample(SAMPLE_SIZE)

for index, row in tqdm(filtered_df.iterrows(), total=len(filtered_df)):
    try:
        p = prompt.replace('[TEXT]', row['Text'])
        response = run_request(system_prompt, p, MODEL, MOCK)

        if not MOCK:
            # Extract the response content
            # (adjust the following line according to the structure of the response)
            r = response.choices[0].message.content

            # Write the result back into the original DataFrame
            df.at[index, COLUMN] = r

    except Exception as e:
        print(f"An error occurred: {e}")
        # Optionally, handle the error (e.g., by logging or by setting a default value)

print()
Reset Cost Estimation
Cost: $0.0002 | Total: $0.0069
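Since only a sample of 25 rows was classified in this run, it is worth checking how many rows still lack a label before deciding whether to re-run the block:

# Count rows that have not received a sentiment label yet
df['Sentiment'].isna().sum()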
In [14]:
df[~pd.isna(df['Sentiment'])].head()
| | Unnamed: 0 | shortcode | Text | Text Type | Policy Issues | Sentiment |
|---|---|---|---|---|---|---|
| 6 | 6 | CyLt56wtNgV | Viele gemischte Gefühle waren das gestern Aben... | Caption | ['Demokratie'] | negative |
| 27 | 27 | CyKwo3Ft6tp | Swipe dich rückwärts durch die Kampagne ✨\n\n🤯... | Caption | ['Soziale Gerechtigkeit'] | positive |
| 29 | 29 | CyKwBKcqi31 | #FREIEWÄHLER jetzt zweite Kraft in Bayern! Gro... | Caption | ['Stärkung der Demokratie', 'Sorgen der Bürger... | positive |
| 66 | 66 | CyIjC3QogWT | In einer gemeinsamen Erklärung der Parteivorsi... | Caption | ['Israel'] | positive |
| 212 | 212 | CyAmHU7qlVc | #FREIEWÄHLER #Aiwanger | Caption | NaN | neutral |
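The model occasionally returns labels with extra whitespace or capitalization, so a quick normalization and sanity check before saving can help; a minimal sketch (adjust the cleaning to your own needs):

# Normalize the raw answers and list anything outside the expected label set
valid_labels = {"negative", "neutral", "positive"}
df['Sentiment'] = df['Sentiment'].str.strip().str.lower()
print(df.loc[df['Sentiment'].notna() & ~df['Sentiment'].isin(valid_labels), 'Sentiment'])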
In [15]:
# Save Results
df.to_csv('/content/drive/MyDrive/2023-12-01-Export-Posts-Text-Master.csv')
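Note that to_csv writes the DataFrame index as an extra column by default, which is how columns like 'Unnamed: 0' accumulate over repeated save/load cycles; if you prefer a cleaner file, an optional variant is:

# Optional: save without the index to avoid stacking up 'Unnamed: 0' columns
df.to_csv('/content/drive/MyDrive/2023-12-01-Export-Posts-Text-Master.csv', index=False)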
Let's plot the results to get a first big-picture view.
In [17]:
import matplotlib.pyplot as plt

# Count the occurrences of each sentiment
sentiment_counts = df['Sentiment'].value_counts()

# Create a bar chart
sentiment_counts.plot(kind='bar')

# Adding labels and title
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.title('Sentiment Counts')

# Show the plot
plt.show()
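If relative shares are more informative than raw counts, value_counts can normalize directly; a small variant of the plot above:

# Plot the share of each sentiment label in percent
(df['Sentiment'].value_counts(normalize=True) * 100).plot(kind='bar')
plt.ylabel('Share (%)')
plt.show()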