print("Install Packages")
!pip install -q openai crossref-commons
GPT Literature Review Assistant
Introduction
In this session, we build upon our previous exploration of literature research methods and guidelines for composing related work sections in your projects. As you prepare to develop your project reports, a thorough and efficient literature review becomes crucial.
We will introduce you to the use of GPT, a state-of-the-art language model, to assist in extracting key information from research paper abstracts. This technique will not only streamline your literature review process but also serve as a practical introduction to automated text analysis—a topic we’ll delve deeper into in upcoming sessions.
By integrating Python with the OpenAI GPT API, you’ll learn how to automate the extraction of features such as research questions, methodologies, data sources, populations, and scientific disciplines from abstracts. This hands-on experience will demonstrate the potential of LLMs.
What You’ll Learn:
- Automating Literature Review Tasks: Use GPT to quickly identify and summarize key aspects of academic papers.
- Practical Python Skills: Enhance your coding abilities by working with APIs and handling data within Jupyter notebooks.
- Foundations for Text Classification: Gain insights that will prepare you for our future sessions on automated text classification techniques.
By the end of this session, you’ll have a functional literature review assistant powered by GPT, positioning you well for the advanced text analysis topics ahead.
This notebook provides a quick introduction to automated text extraction using GPT. Note: We are not evaluating the results in this session, so do not use this notebook for production purposes or in your actual project reports! Additionally, we use certain shortcuts for demonstration purposes—for example, removing search results that lack DOIs.
Setup
First, we need to install the necessary packages. Hit run and wait.

print("Install Packages")
!pip install -q openai crossref-commons
Import Publish or Perish Data
If this is the start of your review process, upload the `csv` file exported from Publish or Perish in the left-hand Files pane. Enter the filename in `publish_or_perish_file_name`. Define the output name in `file_name`. If you want to save the imported file in Google Drive, add `/content/drive/MyDrive/` to the path.
Skip this cell if you want to work with a file that has been imported in the past.
We delete rows with missing DOIs. Without a DOI our code cannot retrieve abstracts. When importing the Publish or Perish file, the following code will display the number of rows that have been deleted due to missing DOIs. When using this notebook for real-world projects, you should be aware of the missing rows and manually review them!
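If you want to review those rows before they are dropped, a quick look at the raw export helps. A small sketch (assuming the unprocessed Publish or Perish CSV is named `scholar.csv`, as in the cell below):

import pandas as pd

raw = pd.read_csv("scholar.csv")  # the unprocessed Publish or Perish export
missing_doi = raw[pd.isna(raw['DOI'])]
print(f"{len(missing_doi)} rows without DOI:")
missing_doi[['Title', 'Authors', 'Year', 'Cites']]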
#@title Import from Publish or Perish Data.
#@markdown If this is the start of your review process, upload the `csv` file exported from [Publish or Perish](https://harzing.com/resources/publish-or-perish) in the left-hand *Files* pane. Enter the filename in `publish_or_perish_file_name`. Define the output name in `file_name`. If you want to save the imported file in Google Drive, add `/content/drive/MyDrive/` to the path. <br/> **Skip this cell if you want to work with a file that has been imported in the past.**
import pandas as pd
import numpy as np
import io
= "scholar.csv" # @param {type: "string"}
publish_or_perish_file_name = "2023-10-31-Literature-Review.csv" # @param {type: "string"}
file_name
# Initialize empty DataFrame
= pd.DataFrame()
all_data
try:
= pd.read_csv(publish_or_perish_file_name)
all_data
# Remove Duplicates
= len(all_data)
initial_len = all_data.drop_duplicates(subset='DOI', keep='first')
all_data = initial_len - len(all_data)
removed_len print(f'Removed {removed_len} duplicates based on DOI.')
# Remove missing DOIs
= len(all_data)
initial_len = all_data[~pd.isna(all_data['DOI'])]
all_data = initial_len - len(all_data)
removed_len print(f'Removed {removed_len} rows without DOI.')
= all_data.sort_values(by='Cites', ascending=False).reset_index(drop=True)
all_data
print('Sorted Table by Cites.')
# Create empty columns for Literature Review
"Relevant"] = ""
all_data["Notes"] = ""
all_data["Checked"] = False
all_data[
print('Initialized Columns')
all_data.to_csv(file_name)print(f"Success: Saved data to {file_name}")
print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
print(f"Error: Failed to load data from File. {str(e)}")
Removed 172 duplicates based on DOI.
Removed 1 rows without DOI.
Sorted Table by Cites.
Initialized Columns
Success: Saved data to 2023-10-31-Literature-Review.csv
Success: Data loaded from File "2023-10-31-Literature-Review.csv".
Read previously imported File
If you want to continue a previous review process, we can read an uploaded file or a file from Google Drive. Only run one of the two cells: this one or the one above.
#@title Read previously imported File
#@markdown If you want to continue a previous review process, we can read an uploaded file or a file from Google Drive. **Only run one of the two cells: this one or the one above.**
import pandas as pd
import numpy as np
import io
= "2023-10-31-Literature-Review.csv" # @param {type: "string"}
file_name
try:
= pd.read_csv(file_name)
all_data
print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
print(f"Error: Failed to load data from File. {str(e)}")
Success: Data loaded from File "2023-10-31-Literature-Review.csv".
In this example we've saved the file locally. When working with Colab, the file will be deleted when we disconnect. For Colab, you should link your Google Drive (open the Files pane on the left, click the Google Drive button). Once connected, save the file to `/content/drive/MyDrive/YOUR-FILENAME.csv`. It will be accessible through Drive, and from now on Colab will connect to Drive automatically.
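Alternatively, Drive can be mounted programmatically with the standard Colab helper. A sketch (Colab only):

from google.colab import drive

# Mounts your Google Drive at /content/drive; Colab asks for authorization.
drive.mount('/content/drive')

# Files saved under MyDrive persist across sessions:
file_name = '/content/drive/MyDrive/2023-10-31-Literature-Review.csv'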
Check the imported data. We're using pandas; the imported data is saved in the `all_data` variable. `head(2)` displays the top two rows of the table. Additionally, we have added three columns: `Relevant`, `Notes`, and `Checked`. We are going to use them to keep track of our progress.
# Check the structure (and content) of the file
all_data.head(2)
Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | Cites | Authors | Title | Year | Source | Publisher | ArticleURL | ... | Age | Abstract | FullTextURL | RelatedURL | babbage_similarity | babbage_search | similarities | Relevant | Notes | Checked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 746 | 844 | 21 | Florian Arendt | Suicide on Instagram – Content Analysis of a G... | 2019.0 | Crisis | Hogrefe Publishing Group | http://dx.doi.org/10.1027/0227-5910/a000529 | ... | 3.0 | Abstract. Background: Suicide is the second le... | https://econtent.hogrefe.com/doi/pdf/10.1027/0... | NaN | [-0.0018475924152880907, 0.022463073953986168,... | [-0.014954154379665852, 0.026176564395427704, ... | -1 | NaN | NaN | False |
1 | 1 | 770 | 868 | 4 | Paloma de H. Sánchez-Cobarro, Francisco-Jose M... | The Brand-Generated Content Interaction of Ins... | 2020.0 | Journal of Theoretical and Applied Electronic ... | MDPI AG | http://dx.doi.org/10.3390/jtaer16030031 | ... | 2.0 | The last decade has seen a considerable increa... | https://www.mdpi.com/0718-1876/16/3/31/pdf | NaN | [-0.0029447057750076056, 0.01190990675240755, ... | [-0.01012819167226553, 0.02539714053273201, -0... | -1 | NaN | NaN | False |
2 rows × 35 columns
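Since the `Checked` and `Relevant` columns live in the table itself, you can check your screening progress at any point; for example (a small sketch, not part of the original notebook):

# How far along are we, and how many papers were marked relevant?
checked = all_data['Checked'].sum()
relevant = (all_data['Relevant'] == True).sum()
print(f"Screened {checked} of {len(all_data)} papers; {relevant} marked relevant.")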
In the next step we are going to start our literature review:
- We filter for the first unchecked row, ordered by citation count.
- We retrieve the abstract from the CrossRef API using the DOI.
- We display all information.
- We answer whether the paper appears to be relevant by entering y or n for yes or no.
For our session, the cell only runs through one row and finishes afterwards. For a real-world application you'd probably want to add some kind of loop, as sketched below.
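Such a loop could look roughly like the following sketch. It assumes the retrieval and display logic from the cell below is reused inside the loop (that elided code sets `index` to the current row and shows the paper); saving after every answer is my addition, not part of the original cell.

# Sketch: review papers one after another until everything is checked
# or the user quits.
while not all_data['Checked'].all():
    # ... retrieve the abstract and display the paper (see the cell below);
    # this elided block sets `index` to the current row ...
    answer = input('Relevant? (y/n, q to quit): ').lower().strip()
    if answer == 'q':
        break
    all_data.loc[index, 'Checked'] = True
    all_data.loc[index, 'Relevant'] = answer == 'y'
    all_data.to_csv(file_name)  # persist progress after every paper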
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re
# Get one row: not checked, highest citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML tags from the abstract
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'], highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input
Suicide on Instagram – Content Analysis of a German Suicide-Related Hashtag
Florian Arendt
Abstract. Background: Suicide is the second leading cause of death among
15–29-year-olds globally. Unfortunately, the suicide-related content on
Instagram, a popular social media platform for youth, has not received the
scholarly attention it deserves. Method: The present study provides a content
analysis of posts tagged as #selbstmord, a German suicide-related hashtag. These
posts were created between July 5 and July 11, 2017. Results: Approximately half
of all posts included words or visuals related to suicide. Cutting was by far
the most prominent method. Although sadness was the dominant emotion, self-hate
and loneliness also appeared regularly. Importantly, inconsistency – a gap
between one's inner mental state (e.g., sadness) and one's overtly expressed
behavior (e.g., smiling) – was also a recurring theme. Conversely, help-seeking,
death wishes, and professional awareness–intervention material were very rare.
An explorative analysis revealed that some videos relied on very fast cutting
techniques. We provide tentative evidence that users may be exposed to
purposefully inserted suicide-related subliminal messages (i.e., exposure to
content without the user's conscious awareness). Limitations: We only
investigated the content of posts on one German hashtag, and the sample size was
rather small. Conclusion: Suicide prevention organizations may consider posting
more awareness–intervention materials. Future research should investigate
suicide-related subliminal messages in social media video posts. Although
tentative, this finding should raise a warning flag for suicide prevention
scholars.
Relevant? (y/n): y
Next, we check whether our input has been saved:
# Check the result
all_data.iloc[index]
Unnamed: 0.2 0
Unnamed: 0.1 746
Unnamed: 0 844
Cites 21
Authors Florian Arendt
Title Suicide on Instagram – Content Analysis of a G...
Year 2019.0
Source Crisis
Publisher Hogrefe Publishing Group
ArticleURL http://dx.doi.org/10.1027/0227-5910/a000529
CitesURL NaN
GSRank 26
QueryDate 2022-09-08 10:44:44
Type journal-article
DOI 10.1027/0227-5910/a000529
ISSN 0227-5910
CitationURL NaN
Volume 40.0
Issue 1.0
StartPage 36.0
EndPage 41.0
ECC 21
CitesPerYear 7.0
CitesPerAuthor 21
AuthorCount 1
Age 3.0
Abstract Abstract. Background: Suicide is the second le...
FullTextURL https://econtent.hogrefe.com/doi/pdf/10.1027/0...
RelatedURL NaN
babbage_similarity [-0.0018475924152880907, 0.022463073953986168,...
babbage_search [-0.014954154379665852, 0.026176564395427704, ...
similarities -1
Relevant True
Notes NaN
Checked True
Name: 0, dtype: object
Using GPT to extract information from abstracts
Now for the fun part: Is it possible to use GPT to help us during the review process? We are going to try to extract text features automatically. For the moment we are going to use `gpt-3.5-turbo`.
Note: Please feel free to test different prompts and questions. The Prompting Guide is a good resource to learn more about different prompting techniques. Use the ChatGPT interface to cheaply test prompts before using them with the API. Use the OpenAI Playground to optimize your prompts with a visual user interface for different settings and a prompting history (trust me, this can save your life!).
Prompts: We’re going to use the system prompt for our instructions, and the user prompt to send our content.
A word of warning: You should not trust the quality of the GPT output at this stage. The prompt has not been evaluated; overall, LLMs produce output that appears meaningful most of the time. Sometimes, however, it is a hallucination. Thus, before using prompts and LLMs for production, we have to make sure we can trust their outputs. We will dive deeper into this topic in the classification sessions.
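In the Chat Completions API, this division maps onto a list of role-tagged messages. Schematically (the same structure the cell below builds):

messages = [
    {"role": "system", "content": system_prompt},            # our instructions
    {"role": "user", "content": "title and abstract here"},  # our content
]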
= """
system_prompt You're an advanced AI research assistant. Your task is to extract **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines** from user input. Return "None" if you can't find the information in user input.
**Formatting**
Return a markdown table, one row for each extracted feature: **research questions**, **operationalization**, **data sources**, **population**, and **scientific disciplines**.
"""
Please enter your API key in the next code cell for the `openai.api_key` variable. We have changed the cell to include the `gpt_prompt` variable, which sends the title and abstract as the user prompt. We're using the `openai.ChatCompletion.create()` method to send our request to the API. We expect the response in `api_response['choices'][0]['message']['content']` to be markdown (see prompt above), so we display it as markdown in our notebook.
from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re
# Enter your OpenAI API key
openai.api_key = "sk-XXXXXXXXX"

# Get one row: not checked, highest citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML tags from the abstract
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information (before we send the request to OpenAI)
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'], highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))

gpt_prompt = f"""
**Title**: {highest_cites_unchecked['Title']}
**Abstract**: {abstract}
"""

# Sending the request takes a moment. In the meantime you may read the abstract.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": gpt_prompt}
]

try:
    api_response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        timeout=30
    )
    gpt_result = api_response['choices'][0]['message']['content']

    # Display the GPT result
    display(IPython.display.HTML("<h3>GPT Extracted Data</h3>"))
    display(IPython.display.Markdown(gpt_result))
except Exception as e:
    print(f"GPT API Error: {str(e)}")

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input
The Brand-Generated Content Interaction of Instagram Stories and Publications: A Comparison between Retailers and Manufacturers
Paloma de H. Sánchez-Cobarro, Francisco-Jose Molina-Castillo, Cristina Alcazar-Caceres
The last decade has seen a considerable increase in entertainment-oriented
communication techniques. Likewise, the rise of social networks has evolved,
offering different formats such as publication and stories. Hence, there has
been a growing interest in knowing which strategies have the greatest social
impact to help position organizations in the mind of the consumer. This research
aims to analyze the different impact that stories and publications can have on
the Instagram social network as a tool for generating branded content. To this
end, it analyses the impact of the different Instagram stories and publications
in various sectors using a methodology of structural equations with composite
constructs. The results obtained, based on 800 stories and publications in four
different companies (retailers and manufacturers), show that the reach of the
story generally explains the interaction with Instagram stories. In contrast, in
the case of publications, impressions are of greater importance in explaining
the interaction with the publication. Among the main contributions of the work,
we find that traditional pull communication techniques have been losing
effectiveness in front of new formats of brand content generation that have been
occupying the time in the relationship between users and brands.
GPT Extracted Data
Feature | Value |
---|---|
Research questions | - What strategies have the greatest social impact on Instagram? - How do stories and publications on Instagram impact the consumer’s perception of brands? - What is the relationship between reach and interaction with Instagram stories? - What is the relationship between impressions and interaction with Instagram publications? |
Operationalization | - Analyzing the impact of Instagram stories and publications in various sectors - Using a methodology of structural equations with composite constructs |
Data sources | - 800 stories and publications on Instagram |
Population | - Four different companies (retailers and manufacturers) |
Scientific disciplines | - Marketing - Communication |
Relevant? (y/n): y
The above output shows a formatted table listing all extracted features. In this short warm-up session on GPT we have seen one use case of the LLM: the extraction of text features. In future sessions we are going to dive deeper into this topic.
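A practical note: the cell above uses the legacy pre-1.0 `openai` interface; in `openai>=1.0`, `openai.ChatCompletion.create()` was removed. The equivalent request looks roughly like this (a sketch reusing the `system_prompt` and `gpt_prompt` variables from above):

from openai import OpenAI

client = OpenAI(api_key="sk-XXXXXXXXX")  # or set the OPENAI_API_KEY environment variable

api_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": gpt_prompt},
    ],
    temperature=0,
)
gpt_result = api_response.choices[0].message.content  # attribute access instead of dict keys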
Did you create an excellent prompt? Share it with us! Enter your prompt into this Excel Sheet.
Save your Progress
The following line saves all progress to `file_name`. If `file_name` is a path to Google Drive, you will be able to pick up your work later on.
all_data.to_csv(file_name)
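One caveat: the `Unnamed: 0` columns visible in the table output above accumulate because `to_csv()` also writes the DataFrame index, which `read_csv()` later reads back in as a regular column. Two common ways to avoid this (a sketch):

# Option 1: don't write the index at all
all_data.to_csv(file_name, index=False)

# Option 2: when loading, treat the first column as the index
all_data = pd.read_csv(file_name, index_col=0)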
Citation
@online{achmann-denkler2023,
author = {Achmann-Denkler, Michael},
title = {GPT {Literature} {Review} {Assistant}},
date = {2023-10-31},
url = {https://social-media-lab.net/notebooks/literature-review.html},
doi = {10.5281/zenodo.10039756},
langid = {en}
}