print("Install Packages")
!pip install -q openai crossref-commons
GPT Literature Review Assistant
Introduction
In this session, we build upon our previous exploration of literature research methods and guidelines for composing related work sections in your projects. As you prepare to develop your project reports, a thorough and efficient literature review becomes crucial.
We will introduce you to the use of GPT, a state-of-the-art language model, to assist in extracting key information from research paper abstracts. This technique will not only streamline your literature review process but also serve as a practical introduction to automated text analysis—a topic we’ll delve deeper into in upcoming sessions.
By integrating Python with the OpenAI GPT API, you’ll learn how to automate the extraction of features such as research questions, methodologies, data sources, populations, and scientific disciplines from abstracts. This hands-on experience will demonstrate the potential of LLMs.
What You’ll Learn:
- Automating Literature Review Tasks: Use GPT to quickly identify and summarize key aspects of academic papers.
- Practical Python Skills: Enhance your coding abilities by working with APIs and handling data within Jupyter notebooks.
- Foundations for Text Classification: Gain insights that will prepare you for our future sessions on automated text classification techniques.
By the end of this session, you’ll have a functional literature review assistant powered by GPT, positioning you well for the advanced text analysis topics ahead.
This notebook provides a quick introduction to automated text extraction using GPT. Note: We are not evaluating the results in this session, so do not use this notebook for production purposes or in your actual project reports! Additionally, we use certain shortcuts for demonstration purposes—for example, removing search results that lack DOIs.
Setup
First, we need to install the necessary packages. Hit run and wait.

print("Install Packages")
!pip install -q openai crossref-commons
Import Publish or Perish Data
If this is the start of your review process, upload the `csv` file exported from Publish or Perish in the left-hand Files pane. Enter the filename in `publish_or_perish_file_name`. Define the output name in `file_name`. If you want to save the imported file in Google Drive, add `/content/drive/MyDrive/` to the path.
Skip this cell if you want to work with a file that has been imported in the past.
We delete rows with missing DOIs. Without a DOI our code cannot retrieve abstracts. When importing the Publish or Perish file, the following code will display the number of rows that have been deleted due to missing DOIs. When using this notebook for real-world projects, you should be aware of the missing rows and manually review them!
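If you want to review those rows before they are dropped, a quick look at the raw export helps. A small sketch (assuming the unprocessed Publish or Perish CSV is named `scholar.csv`, as in the cell below):

import pandas as pd

raw = pd.read_csv("scholar.csv")  # the unprocessed Publish or Perish export
missing_doi = raw[pd.isna(raw['DOI'])]
print(f"{len(missing_doi)} rows without DOI:")
missing_doi[['Title', 'Authors', 'Year', 'Cites']]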
#@title Import from Publish or Perish Data.
#@markdown If this is the start of your review process, upload the `csv` file exported from [Publish or Perish](https://harzing.com/resources/publish-or-perish) in the left-hand *Files* pane. Enter the filename in `publish_or_perish_file_name`. Define the output name in `file_name`. If you want to save the imported file in Google Drive, add `/content/drive/MyDrive/` to the path. <br/> **Skip this cell if you want to work with a file that has been imported in the past.**
import pandas as pd
import numpy as np
import io
= "scholar.csv" # @param {type: "string"}
publish_or_perish_file_name = "2023-10-31-Literature-Review.csv" # @param {type: "string"}
file_name
# Initialize empty DataFrame
= pd.DataFrame()
all_data
try:
= pd.read_csv(publish_or_perish_file_name)
all_data
# Remove Duplicates
= len(all_data)
initial_len = all_data.drop_duplicates(subset='DOI', keep='first')
all_data = initial_len - len(all_data)
removed_len print(f'Removed {removed_len} duplicates based on DOI.')
# Remove missing DOIs
= len(all_data)
initial_len = all_data[~pd.isna(all_data['DOI'])]
all_data = initial_len - len(all_data)
removed_len print(f'Removed {removed_len} rows without DOI.')
= all_data.sort_values(by='Cites', ascending=False).reset_index(drop=True)
all_data
print('Sorted Table by Cites.')
# Create empty columns for Literature Review
"Relevant"] = ""
all_data["Notes"] = ""
all_data["Checked"] = False
all_data[
print('Initialized Columns')
all_data.to_csv(file_name)print(f"Success: Saved data to {file_name}")
print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
print(f"Error: Failed to load data from File. {str(e)}")
Removed 172 duplicates based on DOI.
Removed 1 rows without DOI.
Sorted Table by Cites.
Initialized Columns
Success: Saved data to 2023-10-31-Literature-Review.csv
Success: Data loaded from File "2023-10-31-Literature-Review.csv".
Read previously imported File
If you want to continue a previous review process, we can read an uploaded file or a file from Google Drive. Only run one of the two cells: this one or the one above.
#@title Read previously imported File
#@markdown If you want to continue a previous review process, we can read an uploaded file or a file from Google Drive. **Only run one of the two cells: this one or the one above.**
import pandas as pd
import numpy as np
import io
= "2023-10-31-Literature-Review.csv" # @param {type: "string"}
file_name
try:
= pd.read_csv(file_name)
all_data
print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
print(f"Error: Failed to load data from File. {str(e)}")
Success: Data loaded from File "2023-10-31-Literature-Review.csv".
In this example we've saved the file locally. When working with Colab, the file will be deleted when we disconnect. For Colab, you should link your Google Drive (open the Files pane on the left, click the Google Drive button). Once connected, save the file to `/content/drive/MyDrive/YOUR-FILENAME.csv`. It will be accessible through Drive, and from now on Colab will connect to Drive automatically.
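Alternatively, Drive can be mounted programmatically with the standard Colab helper. A sketch (Colab only):

from google.colab import drive

# Mounts your Google Drive at /content/drive; Colab asks for authorization.
drive.mount('/content/drive')

# Files saved under MyDrive persist across sessions:
file_name = '/content/drive/MyDrive/2023-10-31-Literature-Review.csv'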
Check the imported data. We're using pandas; the imported data is saved in the `all_data` variable. `head(2)` displays the top two rows of the table. Additionally, we have added three columns: `Relevant`, `Notes`, and `Checked`. We are going to use them to keep track of our progress.
# Check the structure (and content) of the file
all_data.head(2)
Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | Cites | Authors | Title | Year | Source | Publisher | ArticleURL | ... | Age | Abstract | FullTextURL | RelatedURL | babbage_similarity | babbage_search | similarities | Relevant | Notes | Checked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 746 | 844 | 21 | Florian Arendt | Suicide on Instagram – Content Analysis of a G... | 2019.0 | Crisis | Hogrefe Publishing Group | http://dx.doi.org/10.1027/0227-5910/a000529 | ... | 3.0 | Abstract. Background: Suicide is the second le... | https://econtent.hogrefe.com/doi/pdf/10.1027/0... | NaN | [-0.0018475924152880907, 0.022463073953986168,... | [-0.014954154379665852, 0.026176564395427704, ... | -1 | NaN | NaN | False |
1 | 1 | 770 | 868 | 4 | Paloma de H. Sánchez-Cobarro, Francisco-Jose M... | The Brand-Generated Content Interaction of Ins... | 2020.0 | Journal of Theoretical and Applied Electronic ... | MDPI AG | http://dx.doi.org/10.3390/jtaer16030031 | ... | 2.0 | The last decade has seen a considerable increa... | https://www.mdpi.com/0718-1876/16/3/31/pdf | NaN | [-0.0029447057750076056, 0.01190990675240755, ... | [-0.01012819167226553, 0.02539714053273201, -0... | -1 | NaN | NaN | False |
2 rows × 35 columns
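Since the `Checked` and `Relevant` columns live in the table itself, you can check your screening progress at any point; for example (a small sketch, not part of the original notebook):

# How far along are we, and how many papers were marked relevant?
checked = all_data['Checked'].sum()
relevant = (all_data['Relevant'] == True).sum()
print(f"Screened {checked} of {len(all_data)} papers; {relevant} marked relevant.")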
In the next step we are going to start our literature review:
- We filter for the first unchecked row, ordered by citation count.
- We retrieve the abstract from the CrossRef API using the DOI.
- We display all information.
- We answer whether the paper appears to be relevant by entering y or n for yes or no.
For our session, the cell only runs through one row and finishes afterwards. For a real-world application you'd probably want to add some kind of loop, as sketched below.
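Such a loop could look roughly like the following sketch. It assumes the retrieval and display logic from the cell below is reused inside the loop (that elided code sets `index` to the current row and shows the paper); saving after every answer is my addition, not part of the original cell.

# Sketch: review papers one after another until everything is checked
# or the user quits.
while not all_data['Checked'].all():
    # ... retrieve the abstract and display the paper (see the cell below);
    # this elided block sets `index` to the current row ...
    answer = input('Relevant? (y/n, q to quit): ').lower().strip()
    if answer == 'q':
        break
    all_data.loc[index, 'Checked'] = True
    all_data.loc[index, 'Relevant'] = answer == 'y'
    all_data.to_csv(file_name)  # persist progress after every paper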
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re
# Get one row: not checked, highest citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML tags from the abstract
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'], highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input
Suicide on Instagram – Content Analysis of a German Suicide-Related Hashtag
Florian Arendt
Abstract. Background: Suicide is the second leading cause of death among
15–29-year-olds globally. Unfortunately, the suicide-related content on
Instagram, a popular social media platform for youth, has not received the
scholarly attention it deserves. Method: The present study provides a content
analysis of posts tagged as #selbstmord, a German suicide-related hashtag. These
posts were created between July 5 and July 11, 2017. Results: Approximately half
of all posts included words or visuals related to suicide. Cutting was by far
the most prominent method. Although sadness was the dominant emotion, self-hate
and loneliness also appeared regularly. Importantly, inconsistency – a gap
between one's inner mental state (e.g., sadness) and one's overtly expressed
behavior (e.g., smiling) – was also a recurring theme. Conversely, help-seeking,
death wishes, and professional awareness–intervention material were very rare.
An explorative analysis revealed that some videos relied on very fast cutting
techniques. We provide tentative evidence that users may be exposed to
purposefully inserted suicide-related subliminal messages (i.e., exposure to
content without the user's conscious awareness). Limitations: We only
investigated the content of posts on one German hashtag, and the sample size was
rather small. Conclusion: Suicide prevention organizations may consider posting
more awareness–intervention materials. Future research should investigate
suicide-related subliminal messages in social media video posts. Although
tentative, this finding should raise a warning flag for suicide prevention
scholars.
Relevant? (y/n): y
Next, we check whether our input has been saved:
# Check the result
all_data.iloc[index]
Unnamed: 0.2 0
Unnamed: 0.1 746
Unnamed: 0 844
Cites 21
Authors Florian Arendt
Title Suicide on Instagram – Content Analysis of a G...
Year 2019.0
Source Crisis
Publisher Hogrefe Publishing Group
ArticleURL http://dx.doi.org/10.1027/0227-5910/a000529
CitesURL NaN
GSRank 26
QueryDate 2022-09-08 10:44:44
Type journal-article
DOI 10.1027/0227-5910/a000529
ISSN 0227-5910
CitationURL NaN
Volume 40.0
Issue 1.0
StartPage 36.0
EndPage 41.0
ECC 21
CitesPerYear 7.0
CitesPerAuthor 21
AuthorCount 1
Age 3.0
Abstract Abstract. Background: Suicide is the second le...
FullTextURL https://econtent.hogrefe.com/doi/pdf/10.1027/0...
RelatedURL NaN
babbage_similarity [-0.0018475924152880907, 0.022463073953986168,...
babbage_search [-0.014954154379665852, 0.026176564395427704, ...
similarities -1
Relevant True
Notes NaN
Checked True
Name: 0, dtype: object
Using GPT to extract information from abstracts
Now for the fun part: Is it possible to use GPT to help us during the review process? We are going to try to extract text features automatically. For the moment we are going to use `gpt-3.5-turbo`.
Note: Please feel free to test different prompts and questions. The Prompting Guide is a good resource to learn more about different prompting techniques. Use the ChatGPT interface to cheaply test prompts before using them with the API. Use the OpenAI Playground to optimize your prompts with a visual user interface for different settings and a prompting history (trust me, this can save your life!).
Prompts: We’re going to use the system prompt for our instructions, and the user prompt to send our content.
A word of warning: You should not trust the quality of the GPT output at this stage. The prompt has not been evaluated; overall, LLMs produce output that appears meaningful most of the time. Sometimes, however, it is a hallucination. Thus, before using prompts and LLMs for production, we have to make sure we can trust their outputs. We will dive deeper into this topic in the classification sessions.
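In the Chat Completions API, this division maps onto a list of role-tagged messages. Schematically (the same structure the cell below builds):

messages = [
    {"role": "system", "content": system_prompt},            # our instructions
    {"role": "user", "content": "title and abstract here"},  # our content
]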
= """
system_prompt You're an advanced AI research assistant. Your task is to extract **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines** from user input. Return "None" if you can't find the information in user input.
**Formatting**
Return a markdown table, one row for each extracted feature: **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines**.
"""
Please enter your API key in the next code cell for the `openai.api_key` variable. We have changed the cell to include the `gpt_prompt` variable, which sends the title and abstract as the user prompt. We're using the `openai.ChatCompletion.create()` method to send our request to the API. We expect the response in `api_response['choices'][0]['message']['content']` to be markdown (see prompt above), so we display it as markdown in our notebook.
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re
# Enter your OpenAI API key
openai.api_key = "sk-XXXXXXXXX"

# Get one row: not checked, highest citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML tags from the abstract
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information (before we send the request to OpenAI)
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'], highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))

gpt_prompt = f"""
**Title**: {highest_cites_unchecked['Title']}
**Abstract**: {abstract}
"""

# Sending the request takes a moment. In the meantime you may read the abstract.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": gpt_prompt}
]

try:
    api_response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        timeout=30
    )
    gpt_result = api_response['choices'][0]['message']['content']

    # Display the GPT result
    display(IPython.display.HTML("<h3>GPT Extracted Data</h3>"))
    display(IPython.display.Markdown(gpt_result))
except Exception as e:
    print(f"GPT API Error: {str(e)}")

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input
The Brand-Generated Content Interaction of Instagram Stories and Publications: A Comparison between Retailers and Manufacturers
Paloma de H. Sánchez-Cobarro, Francisco-Jose Molina-Castillo, Cristina Alcazar-Caceres
The last decade has seen a considerable increase in entertainment-oriented
communication techniques. Likewise, the rise of social networks has evolved,
offering different formats such as publication and stories. Hence, there has
been a growing interest in knowing which strategies have the greatest social
impact to help position organizations in the mind of the consumer. This research
aims to analyze the different impact that stories and publications can have on
the Instagram social network as a tool for generating branded content. To this
end, it analyses the impact of the different Instagram stories and publications
in various sectors using a methodology of structural equations with composite
constructs. The results obtained, based on 800 stories and publications in four
different companies (retailers and manufacturers), show that the reach of the
story generally explains the interaction with Instagram stories. In contrast, in
the case of publications, impressions are of greater importance in explaining
the interaction with the publication. Among the main contributions of the work,
we find that traditional pull communication techniques have been losing
effectiveness in front of new formats of brand content generation that have been
occupying the time in the relationship between users and brands.
GPT Extracted Data
Feature | Value |
---|---|
Research questions | - What strategies have the greatest social impact on Instagram? - How do stories and publications on Instagram impact the consumer’s perception of brands? - What is the relationship between reach and interaction with Instagram stories? - What is the relationship between impressions and interaction with Instagram publications? |
Operationalization | - Analyzing the impact of Instagram stories and publications in various sectors - Using a methodology of structural equations with composite constructs |
Data sources | - 800 stories and publications on Instagram |
Population | - Four different companies (retailers and manufacturers) |
Scientific disciplines | - Marketing - Communication |
Relevant? (y/n): y
The above output shows a formatted table listing all extracted features. In this short warm-up session on GPT we have seen one use case of the LLM: the extraction of text features. In future sessions we are going to dive deeper into this topic.
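A practical note: the cell above uses the legacy pre-1.0 `openai` interface; in `openai>=1.0`, `openai.ChatCompletion.create()` was removed. The equivalent request looks roughly like this (a sketch reusing the `system_prompt` and `gpt_prompt` variables from above):

from openai import OpenAI

client = OpenAI(api_key="sk-XXXXXXXXX")  # or set the OPENAI_API_KEY environment variable

api_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": gpt_prompt},
    ],
    temperature=0,
)
gpt_result = api_response.choices[0].message.content  # attribute access instead of dict keys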
Did you create an excellent prompt? Share it with us! Enter your prompt into this Excel Sheet.
Save your Progress
The following line saves all progress to `file_name`. If `file_name` is a path to Google Drive, you will be able to pick up your work later on.
all_data.to_csv(file_name)
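One caveat: the `Unnamed: 0` columns visible in the table output above accumulate because `to_csv()` also writes the DataFrame index, which `read_csv()` later reads back in as a regular column. Two common ways to avoid this (a sketch):

# Option 1: don't write the index at all
all_data.to_csv(file_name, index=False)

# Option 2: when loading, treat the first column as the index
all_data = pd.read_csv(file_name, index_col=0)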
Citation
@online{achmann-denkler2023,
author = {Achmann-Denkler, Michael},
title = {GPT {Literature} {Review} {Assistant}},
date = {2023-10-31},
url = {https://social-media-lab.net/notebooks/literature-review.html},
doi = {10.5281/zenodo.10039756},
langid = {en}
}