GPT Literature Review Assistant

Author

Michael Achmann-Denkler

Published

October 31, 2023

Modified

October 28, 2024

Introduction

In this session, we build upon our previous exploration of literature research methods and guidelines for composing related work sections in your projects. As you prepare to develop your project reports, a thorough and efficient literature review becomes crucial.

We will introduce you to the use of GPT, a state-of-the-art language model, to assist in extracting key information from research paper abstracts. This technique will not only streamline your literature review process but also serve as a practical introduction to automated text analysis—a topic we’ll delve deeper into in upcoming sessions.

By integrating Python with the OpenAI GPT API, you’ll learn how to automate the extraction of features such as research questions, methodologies, data sources, populations, and scientific disciplines from abstracts. This hands-on experience will demonstrate the potential of LLMs.

What You’ll Learn:

  • Automating Literature Review Tasks: Use GPT to quickly identify and summarize key aspects of academic papers.
  • Practical Python Skills: Enhance your coding abilities by working with APIs and handling data within Jupyter notebooks.
  • Foundations for Text Classification: Gain insights that will prepare you for our future sessions on automated text classification techniques.

By the end of this session, you’ll have a functional literature review assistant powered by GPT, positioning you well for the advanced text analysis topics ahead.

Warning

This notebook provides a quick introduction to automated text extraction using GPT. Note: We are not evaluating the results in this session, so do not use this notebook for production purposes or in your actual project reports! Additionally, we use certain shortcuts for demonstration purposes—for example, removing search results that lack DOIs.

Prepare your Literature List

Use Publish or Perish, enter your search term, and export the results as a CSV file.

Right-click on the search and select Save Results > Results as CSV…

Start the notebook on Colab and upload the CSV file. The notebook is available on GitHub.

Setup

First, we need to install the necessary packages. Hit run and wait.

print("Install Packages")
!pip install -q openai crossref-commons gradio backoff
Install Packages
  Preparing metadata (setup.py) ... done
  Building wheel for ratelimit (setup.py) ... done

Using GPT to extract information from abstracts

Now for the exciting part: can we leverage GPT to streamline our literature review process? Here, we’ll experiment with automatically extracting key features from abstracts using gpt-4o-mini.

Note: Feel free to test different prompts and questions to optimize results. The Prompting Guide is a good resource for learning more about different prompting techniques. Use the ChatGPT interface to test prompts cheaply before using them with the API. Use the OpenAI Playground to refine your prompts with a visual user interface for different settings and a prompting history (trust me, this can save your life!). See also the Text Classification chapter.

Prompts: We’re going to use the system prompt for our instructions, and the user prompt to send our content.
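The split between the two roles can be sketched as follows. This is a minimal illustration of how the messages are assembled before being passed to the API; the `system_prompt`, `title`, and `abstract` values here are shortened placeholders, not the notebook's real prompt.

```python
# Sketch: the system prompt carries our extraction instructions,
# the user prompt carries the paper's title and abstract.
system_prompt = "You are a research assistant. Extract key features from the abstract."
title = "Every tweet counts?"
abstract = "The growing usage of social media ..."

messages = [
    {"role": "system", "content": system_prompt},  # instructions
    {"role": "user", "content": f"**{title}**\n**Abstract:** {abstract}"},  # content
]

# Later this list is passed to:
# client.chat.completions.create(model=..., messages=messages)
```

Keeping the instructions in the system prompt means we can swap in a new abstract for every paper without touching the task description.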

Import Data from Publish or Perish.

If this is the start of your review process, upload the CSV file exported from Publish or Perish in the left-hand Files pane. Enter the filename in publish_or_perish_file_name and define the output name in file_name. If you want to save the imported file to your Google Drive, add /content/drive/MyDrive/ to the path.
Skip this cell if you want to work with a file that has been imported in the past.

import pandas as pd
import numpy as np
import io

publish_or_perish_file_name = "/content/2024-10-24-Lit-Review-Seminar.csv" # @param {type: "string"}
file_name = "2024-10-24-Literature-Review.csv" # @param {type: "string"}

# Initialize empty DataFrame
all_data = pd.DataFrame()


try:
    all_data = pd.read_csv(publish_or_perish_file_name)

    # Remove missing DOIs
    initial_len = len(all_data)
    all_data = all_data[~pd.isna(all_data['DOI'])]
    removed_len = initial_len - len(all_data)
    print(f'Removed {removed_len} rows without DOI.')

    # Remove Duplicates
    initial_len = len(all_data)
    all_data = all_data.drop_duplicates(subset='DOI', keep='first')
    removed_len = initial_len - len(all_data)
    print(f'Removed {removed_len} duplicates based on DOI.')

    all_data = all_data.sort_values(by='Cites', ascending=False).reset_index(drop=True)

    print('Sorted Table by Cites.')

    # Create empty columns for Literature Review
    all_data["Relevant"] = ""
    all_data["Notes"] = ""
    all_data["Checked"] = False

    print('Initialized Columns')

    all_data.to_csv(file_name)
    print(f"Success: Saved data to {file_name}")

    print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
    print(f"Error: Failed to load data from File. {str(e)}")
Removed 114 rows without DOI.
Removed 0 duplicates based on DOI.
Sorted Table by Cites.
Initialized Columns
Success: Saved data to 2024-10-24-Literature-Review.csv
Success: Data loaded from File "2024-10-24-Literature-Review.csv".

Read previously imported File

If you want to continue a previous review process, we can read an uploaded file or a file from Google Drive. Only run one of the two cells, this one or the one above.

import pandas as pd
import numpy as np
import io

file_name = "2023-10-31-Literature-Review.csv" # @param {type: "string"}

try:
    all_data = pd.read_csv(file_name)

    print(f'Success: Data loaded from File "{file_name}".')
except Exception as e:
    print(f"Error: Failed to load data from File. {str(e)}")
Success: Data loaded from File "2023-10-31-Literature-Review.csv".

In this example we’ve saved the file locally. When working with Colab, the file will be deleted when we disconnect. On Colab you should therefore link your Google Drive (open the Files pane on the left and click the Google Drive button). Once connected, save the file in the folder /content/drive/MyDrive/YOUR-FILENAME.csv. It will be accessible through Drive, and Colab will from then on connect to Drive automatically.

Check the imported data. We’re using pandas; the imported data is saved in the all_data variable. head(2) displays the top two rows of the table. Additionally, we have added three columns: Relevant, Notes, and Checked. We are going to use them to keep track of our progress.

# Check the structure (and content) of the file
all_data.head(2)
Cites Authors Title Year Source Publisher ArticleURL CitesURL GSRank QueryDate ... CitesPerYear CitesPerAuthor AuthorCount Age Abstract FullTextURL RelatedURL Relevant Notes Checked
0 764 A Ceron, L Curini, SM Iacus, G Porro Every tweet counts? How sentiment analysis of ... 2014.0 New media &society journals.sagepub.com https://journals.sagepub.com/doi/abs/10.1177/1... https://scholar.google.com/scholar?cites=18273... 175 2024-10-16 10:47:35 ... 76.4 191 4 10.0 … Due to recent developments in quantitative t... https://journals.sagepub.com/doi/full/10.1177/... https://scholar.google.com/scholar?q=related:J... False
1 470 M Lalancette, V Raynauld The power of political image: Justin Trudeau, ... 2019.0 American behavioral scientist journals.sagepub.com https://journals.sagepub.com/doi/abs/10.1177/0... https://scholar.google.com/scholar?cites=15892... 38 2024-10-16 10:47:35 ... 94.0 235 2 5.0 … As few studies have been done on Instagram-b... https://journals.sagepub.com/doi/full/10.1177/... https://scholar.google.com/scholar?q=related:7... False

2 rows × 29 columns

In the next step we are going to start our literature review: 1. We filter for the first unchecked row, ordered by citation count. 2. We retrieve the abstract from the CrossRef API using the DOI. 3. We display all information. 4. We answer whether the paper appears to be relevant by entering y or n for yes or no.

For our session, the cell only processes one row and then finishes. For a real-world application you would probably want to add some kind of loop.
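Such a loop might look like the following sketch. It uses a hypothetical mini table in place of the real all_data; in the notebook, the loop body would contain the CrossRef fetch, display, and user-input steps of the full cell.

```python
import pandas as pd

# Hypothetical stand-in for all_data.
all_data = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Cites": [10, 50, 30],
    "Checked": [False, False, True],
    "Relevant": ["", "", True],
})

processed = []  # keep track of the order in which papers are reviewed

# Process every unchecked row, highest citation count first.
while (all_data["Checked"] == False).any():
    row = all_data[all_data["Checked"] == False].sort_values(
        "Cites", ascending=False
    ).iloc[0]
    # ... here: fetch abstract via CrossRef, display, ask for y/n input ...
    processed.append(row["Title"])
    all_data.loc[row.name, "Checked"] = True
```

Because each iteration marks its row as Checked, the loop terminates once every paper has been reviewed, and interrupting it at any point leaves the progress columns in a consistent state.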

from crossref_commons.retrieval import get_publication_as_json
import json
import openai
import textwrap
import IPython
import re

# Get one row: Not checked, highest Citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract =  response.get("abstract", "")

# Remove XML
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract


# Display all information
IPython.display.clear_output(wait=True)
title_disp = IPython.display.HTML("<h2>{}</h2>".format(highest_cites_unchecked['Title']))
authors_disp = IPython.display.HTML("<p>{}</p>".format(highest_cites_unchecked['Authors']))
doi_disp = IPython.display.HTML("<p><a target='_blank' href='https://doi.org/{}'>{}</a></p>".format(highest_cites_unchecked['DOI'],highest_cites_unchecked['DOI']))
display(title_disp, authors_disp, doi_disp)
print(textwrap.fill(abstract, 80))
relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input

Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens' political preferences with an application to Italy and France

A Ceron, L Curini, SM Iacus, G Porro

 The growing usage of social media by a wider audience of citizens sharply
increases the possibility of investigating the web as a device to explore and
track political preferences. In the present paper we apply a method recently
proposed by other social scientists to three different scenarios, by analyzing
on one side the online popularity of Italian political leaders throughout 2011,
and on the other the voting intention of French Internet users in both the 2012
presidential ballot and the subsequent legislative election. While Internet
users are not necessarily representative of the whole population of a country’s
citizens, our analysis shows a remarkable ability for social media to forecast
electoral results, as well as a noteworthy correlation between social media and
the results of traditional mass surveys. We also illustrate that the predictive
ability of social media analysis strengthens as the number of citizens
expressing their opinion online increases, provided that the citizens act
consistently on these opinions.
Relevant? (y/n): y

Next, we check whether our input has been saved:

# Check the result
all_data.iloc[index]
0
Cites 764
Authors A Ceron, L Curini, SM Iacus, G Porro
Title Every tweet counts? How sentiment analysis of ...
Year 2014.0
Source New media &society
Publisher journals.sagepub.com
ArticleURL https://journals.sagepub.com/doi/abs/10.1177/1...
CitesURL https://scholar.google.com/scholar?cites=18273...
GSRank 175
QueryDate 2024-10-16 10:47:35
Type NaN
DOI 10.1177/1461444813480466
ISSN NaN
CitationURL NaN
Volume NaN
Issue NaN
StartPage NaN
EndPage NaN
ECC 764
CitesPerYear 76.4
CitesPerAuthor 191
AuthorCount 4
Age 10.0
Abstract The growing usage of social media by a wider ...
FullTextURL https://journals.sagepub.com/doi/full/10.1177/...
RelatedURL https://scholar.google.com/scholar?q=related:J...
Relevant True
Notes
Checked True

system_prompt = """
You are an advanced AI research assistant. Your task is to extract the following information from the given research abstract:

- Research Questions
- Operationalization
- Data Sources
- Population
- Scientific Disciplines

If any of the above information is not available in the abstract, return "None" for that field.

**Output Format:**

Return the extracted information in the following JSON format:

{
  "research_questions": "...",
  "operationalization": "...",
  "data_sources": "...",
  "population": "...",
  "scientific_disciplines": "..."
}

Ensure the JSON is properly formatted.
"""

The following lines initialize access to the OpenAI API, allowing us to make GPT requests directly from this notebook. Here’s a breakdown of each part:

  1. API Key Setup: We retrieve the API key securely from Colab’s userdata storage. Ensure your key is stored with the name specified in api_key_name.
  2. OpenAI Initialization: By passing the API key to the OpenAI client, we can now make requests to the API for GPT-powered text analysis.
OpenAI Basics

Information on how to set up the OpenAI API Keys is available on the OpenAI Basics page.

import openai
from google.colab import userdata
import backoff
import json

# Fetch the API key from Colab
api_key_name = "openai-forschung-mad"  # @param {type: "string"}
api_key = userdata.get(api_key_name)

# Initialize OpenAI with the provided API key
client = openai.OpenAI(api_key=api_key)

Next, we define the get_gpt_extracted_data function, which sends a formatted request containing the research paper’s title and abstract to the OpenAI API. This function uses retry logic to handle transient API errors, automatically retrying the request if needed. Upon success, the extracted information is returned as JSON, making it straightforward to parse and analyze the results.

# Retry logic in case of API errors using backoff.
# Note: API errors must propagate for the backoff decorator to retry them,
# so we only catch JSON parsing errors here.
@backoff.on_exception(backoff.expo, openai.APIError, max_tries=8)
def get_gpt_extracted_data(title, abstract):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"""**{title}** \n **Abstract:** {abstract}"""},
        ],
        response_format={ "type": "json_object" }
    )

    gpt_result = response.choices[0].message.content

    try:
        return json.loads(gpt_result)
    except json.JSONDecodeError as e:
        return f"GPT JSON Error: {str(e)}"

We have extended the earlier cell to call the get_gpt_extracted_data(title, abstract) function defined above, which sends the title and abstract as the user prompt. We expect the response in response.choices[0].message.content to be JSON. Visit the OpenAI Docs for more information.

from crossref_commons.retrieval import get_publication_as_json
import json
import textwrap
import IPython
import re

# Get one row: Not checked, highest Citation count.
highest_cites_unchecked = all_data[all_data['Checked'] == False].sort_values(by="Cites", ascending=False).iloc[0]
index = highest_cites_unchecked.name

# Retrieve Abstract from Crossref
response = get_publication_as_json(highest_cites_unchecked['DOI'])
abstract = response.get("abstract", "")

# Remove XML tags
abstract = re.sub(r'<[^>]+>', '', abstract)
all_data.loc[index, 'Abstract'] = abstract

# Display all information (before we send the request to OpenAI)
IPython.display.clear_output(wait=True)
title_disp = IPython.display.Markdown(f"## {highest_cites_unchecked['Title']}")
authors_disp = IPython.display.Markdown(f"**Authors**: {highest_cites_unchecked['Authors']}")
doi_disp = IPython.display.Markdown(f"**DOI**: [https://doi.org/{highest_cites_unchecked['DOI']}](https://doi.org/{highest_cites_unchecked['DOI']})")
abstract_disp = IPython.display.Markdown(f"### Abstract\n\n{abstract}")
display(title_disp, authors_disp, doi_disp, abstract_disp)

try:
    gpt_results = get_gpt_extracted_data(highest_cites_unchecked['Title'], abstract)

    research_questions = gpt_results.get('research_questions', 'None')
    operationalization = gpt_results.get('operationalization', 'None')
    data_sources = gpt_results.get('data_sources', 'None')
    population = gpt_results.get('population', 'None')
    disciplines = gpt_results.get('scientific_disciplines', 'None')

    # Add callout block warning for Quarto about reliability
    warning_disp = IPython.display.Markdown("""
::: warning
The following results are generated using AI extraction and may contain inaccuracies. Please review and verify the information before relying on it for further analysis.
:::
""")
    display(warning_disp)

    # Display the GPT result with Markdown formatting
    gpt_disp = IPython.display.Markdown(f"""
## GPT Extracted Data

### Research Questions
{research_questions}

### Operationalization
{operationalization}

### Data Sources
{data_sources}

### Population
{population}

### Scientific Disciplines
{disciplines}
""")
    display(gpt_disp)
except Exception as e:
    print(f"Error displaying GPT results: {str(e)}")

relevant_input = input('Relevant? (y/n): ').lower().strip() == 'y'

# Save user input
all_data.loc[index, 'Checked'] = True
all_data.loc[index, 'Relevant'] = relevant_input

The power of political image: Justin Trudeau, Instagram, and celebrity politics

Authors: M Lalancette, V Raynauld

Abstract

This article explores dynamics of online image management and its impact on leadership in a context of digital permanent campaigning and celebrity politics in Canada. Recent studies have shown that images can play a critical role when members of the public are evaluating politicians. Specifically, voters are looking for specific qualities in political leaders, including honesty, intelligence, friendliness, sincerity, and trustworthiness, when making electoral decisions. Image management techniques can help create the impression that politicians possess these qualities. Heads of governments using social media to capture attention through impactful images or videos on an almost daily basis seems like a new norm. Specifically, this article takes interest in Justin Trudeau’s use of Instagram during the first year immediately following his election on October 19, 2015. Through a hybrid quantitative and qualitative approach, we examine how Trudeau and his party convey a specific image to voters in a context of permanent and increasingly personalized campaigning. We do so through an analysis of his Instagram feed focusing on different elements, including how he frames his governing style visually, how his personal life is used on his Instagram to support the Liberal Party of Canada’s values and ideas, and how celebrity culture codes are mobilized to discuss policy issues such as environment, youth, and technology. This analysis sheds light on the effects and implications of image management in Canada. More generally, it offers a much-needed look at image-based e-politicking and contributes to the academic literature on social media, permanent campaigning, as well as celebrity and politics in Canada.

The following results are generated using AI extraction and may contain inaccuracies. Please review and verify the information before relying on it for further analysis.

GPT Extracted Data

Research Questions

How does Justin Trudeau’s use of Instagram impact the perception of his leadership qualities among voters in the context of digital permanent campaigning and celebrity politics?

Operationalization

Analysis of Trudeau’s Instagram feed focusing on visual framing of his governing style, use of personal life to support party values, and mobilization of celebrity culture codes in discussing policy issues.

Data Sources

Justin Trudeau’s Instagram feed.

Population

Voters in Canada.

Scientific Disciplines

Political Science, Media Studies, Communication Studies.

Relevant? (y/n): y

The above output shows a formatted listing of all extracted features. In this short warm-up session on GPT we have seen one use case of the LLM: the extraction of text features. In future sessions we are going to dive deeper into this topic.

Save your Progress

The following line saves all progress to file_name. If file_name is a path on Google Drive, you will be able to pick up your work later.

all_data.to_csv(file_name)
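One caveat worth knowing: by default, to_csv also writes the DataFrame index, which reappears as an extra "Unnamed: 0" column when the file is re-imported. A minimal round-trip sketch (using an in-memory buffer instead of a real file) shows how passing index=False avoids this:

```python
import io
import pandas as pd

df = pd.DataFrame({"Title": ["A"], "Checked": [True]})

# index=False prevents the row index from becoming an extra column.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

restored = pd.read_csv(buf)
```

If you keep the default behavior instead, you can drop the stray column after loading with read_csv(file_name, index_col=0).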

Reuse

GNU GPLv3

Citation

BibTeX citation:
@online{achmann-denkler2023,
  author = {Achmann-Denkler, Michael},
  title = {GPT {Literature} {Review} {Assistant}},
  date = {2023-10-31},
  url = {https://social-media-lab.net/notebooks/literature-review-v2.html},
  doi = {10.5281/zenodo.10039756},
  langid = {en}
}
For attribution, please cite this work as:
Achmann-Denkler, Michael. 2023. “GPT Literature Review Assistant.” October 31, 2023. https://doi.org/10.5281/zenodo.10039756.