Instagram Posts

To download posts and stories from Instagram, we use the package instaloader. You can install a package for Python with pip install <package>; the -q flag minimizes the output.

!pip -q install instaloader

Once instaloader is installed, we log in with your username and password. The session information (not your credentials!) is stored in Google Drive to minimize the need for repeated sign-ins.

To minimize the risk of your account being disabled, we suggest creating a new account on your phone before proceeding!

username = 'your.username'

# We save the session file to the following directory. The default is a new folder `.instaloader` in your Google Drive. (This is optional.)
session_directory = '/content/drive/MyDrive/.instaloader/'

import instaloader
from os.path import exists
from pathlib import Path

# Create the session directory if it does not exist yet
Path(session_directory).mkdir(parents=True, exist_ok=True)

filename = "{}session-{}".format(session_directory, username)
sessionfile = Path(filename)


# Get instance
L = instaloader.Instaloader(compress_json=False)

# Check if the session file exists. If so, load the session;
# otherwise log in interactively and save the session for next time.
if exists(sessionfile):
  L.load_session_from_file(username, sessionfile)
else:
  L.interactive_login(username)
  L.save_session_to_file(sessionfile)
Loaded session from /content/drive/MyDrive/.instaloader/session-mi_sm_lab05.

Downloading First Posts

Next, we try to download all posts of a profile. Provide a username and a destination folder:

dest_username = 'some.profile' 
dest_dir = '/content/drive/MyDrive/insta-posts/' # Once more we save the files to Google Drive. Replace this with a local directory if necessary.

t = Path("{}{}".format(dest_dir, dest_username))
t.mkdir(parents=True, exist_ok=True)

profile = instaloader.Profile.from_username(L.context, dest_username)
for post in profile.get_posts():
    L.download_post(post, target=t)
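
Downloading every post of a large profile can take very long and increases the risk of rate limiting. One way to sample only the first few posts is itertools.islice, which truncates the (lazy) post generator. A sketch; the function name and the target folder 'sample-posts' are placeholders for this example, not part of instaloader:

```python
from itertools import islice

def download_first_n(L, profile_name, n=10, target='sample-posts'):
    """Download only the first n posts of a profile.

    L is an authenticated instaloader.Instaloader instance.
    """
    import instaloader
    profile = instaloader.Profile.from_username(L.context, profile_name)
    # islice stops pulling from the post generator after n items,
    # so at most n posts are ever fetched
    for post in islice(profile.get_posts(), n):
        L.download_post(post, target=target)
```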

Well, you just downloaded your first posts! Open Google Drive and check the folder insta-posts/ (or whatever folder you chose above)! There should be three files for each post: the image, a .json file, and a .txt file. The .txt file contains the image caption, the .json file lots of metadata about the post.
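
You can also check this from Python: the image, .json and .txt of one post share the same timestamp-based filename prefix, so grouping the files by stem reunites them. A small sketch; call it with the target folder t from above:

```python
from collections import defaultdict
from pathlib import Path

def group_post_files(folder):
    """Group downloaded files by their shared name stem.

    Returns a dict mapping each filename stem to the list of
    suffixes found for it, e.g. ['.jpg', '.json', '.txt'].
    """
    groups = defaultdict(list)
    for f in Path(folder).iterdir():
        if f.is_file():
            groups[f.stem].append(f.suffix)
    return dict(groups)
```

Each post should appear as one stem with three suffixes.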

Diving into the metadata

The next cell reads all .json files of the downloaded posts. Then we browse through some interesting data.

# Collect the paths of all JSON files below dest_dir
import os

json_files = []

for subdir, dirs, files in os.walk(t):
    for file in files:
        fullpath = os.path.join(subdir, file)
        filename, file_extension = os.path.splitext(fullpath)
        if file_extension == ".json":
            json_files.append(fullpath)

# Reading all JSON files
from tqdm.notebook import tqdm
import json

json_data = []

for file in tqdm(json_files):
  with open(file, 'r') as f:
    data = json.load(f)
    json_data.append(data)

Ok, now the metadata for all posts is stored in the variable json_data. Run the next line and copy its output to http://jsonviewer.stack.hu/. Your output should look similar; go ahead and play around to explore your data! What information can you extract?

print(json.dumps(json_data[0]))
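
If you prefer to stay inside the notebook, you can inspect the nesting directly. A sketch on a tiny hand-made stand-in for json_data[0] (real posts contain many more fields):

```python
import json

# Minimal stand-in for one entry of json_data
sample = {"node": {"shortcode": "abc123",
                   "taken_at_timestamp": 1619870400,
                   "owner": {"username": "some.profile"}}}

# List the fields available on the post node ...
print(sorted(sample["node"].keys()))
# ... or pretty-print the whole structure instead of using jsonviewer
print(json.dumps(sample, indent=2))
```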

Metadata Preprocessing

Posts contain plenty of data, like the time and location of the post, the authoring user, a caption, tagged users and more. The following cells demonstrate how to normalize the data into a table format, which is useful when working with pandas. That said, this step is optional!

# Use boolean (True / False) values to select which types of data you'd like to analyse.
username = True #@param {type:"boolean"}
timestamp = True #@param {type:"boolean"}
caption = True #@param {type:"boolean"}
location = True #@param {type:"boolean"}
shortcode = True #@param {type:"boolean"}
id = True #@param {type:"boolean"}
tagged_users = True #@param {type:"boolean"}

Next we loop through the data and create a new pandas DataFrame. The DataFrame will have one column for each variable selected above and one row for each downloaded post.

If you are not yet familiar with the concept of dataframes, have a look at YouTube; there are plenty of introductory videos available.
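
In miniature, the pattern used below is: collect one dict per record in a list, then build the DataFrame in one go (the example data here is made up):

```python
import pandas as pd

rows = []
for name, year in [("alice", 2020), ("bob", 2021)]:
    rows.append({"username": name, "year": year})

# One row per dict, one column per dict key
df = pd.DataFrame.from_dict(rows)
print(df.shape)  # (2, 2)
```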

import pandas as pd

posts = []  # Initializing an empty list for all posts
for post in tqdm(json_data):
  row = {} # Initializing an empty row for the post

  node = post.get("node")

  if username:
    owner = node.get("owner")
    row['username'] = owner.get("username")

  if timestamp:
    row['timestamp'] = node.get("taken_at_timestamp")

  if location:
    l = node.get("location", None)
    if l:
      row['location'] = l.get("name")

  if shortcode:
    row['shortcode'] = node.get("shortcode")

  if id:
    row['id'] = node.get("id")
  
  if tagged_users:
    # Tagged users are nested under edge_media_to_tagged_user -> edges
    tagged = []
    for element in node.get("edge_media_to_tagged_user", {}).get("edges", []):
      user = element.get("node", {}).get("user", {})
      tagged.append(user.get("username"))
    row['tagged_users'] = tagged

  if caption:
    # Concatenate all caption edges (usually zero or one)
    c = ""
    emtc = node.get("edge_media_to_caption", {})
    for element in emtc.get("edges", []):
      caption_node = element.get("node")
      c = c + caption_node.get("text")
    row['caption'] = c

  # Finally add row to posts
  posts.append(row)

# After looping through all posts create data frame from list
posts_df = pd.DataFrame.from_dict(posts)

Now all information selected above is saved to the dataframe posts_df. Run the next cell and it will return a nicely formatted table. If your data is quite long, the output will be cropped. Click the wand icon, and after a few seconds you can browse through the data or filter by column.

posts_df

To get a first impression of a dataframe, the head() method is also useful. Run the next cell to see the result:

posts_df.head()
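
If you selected timestamp above, that column holds Unix timestamps (seconds since 1970). pandas converts them to readable datetimes with to_datetime; a sketch on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": [1609459200, 1612137600]})
# unit="s" tells pandas the values are seconds since the Unix epoch
df["datetime"] = pd.to_datetime(df["timestamp"], unit="s")
print(df["datetime"].tolist())
```

The same line works on posts_df: posts_df["datetime"] = pd.to_datetime(posts_df["timestamp"], unit="s").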

The dataframe is only kept in memory, so when the runtime is disconnected and deleted, the dataframe is lost. Running the next cell saves the table to a CSV file on your Drive.

Now the processed data may be recovered or used in another notebook.

# Note: use dest_username here; the variable username was reassigned to a boolean above
posts_df.to_csv('{}{}.csv'.format(dest_dir, dest_username))
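
To recover the table later (or in another notebook), read the CSV back with pd.read_csv. Since to_csv also writes the index, index_col=0 restores it instead of adding a new column. A self-contained sketch on a throwaway file:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"shortcode": ["abc", "def"], "timestamp": [1, 2]})

path = os.path.join(tempfile.mkdtemp(), "posts.csv")
df.to_csv(path)

# index_col=0 re-uses the saved index column instead of adding a new one
restored = pd.read_csv(path, index_col=0)
print(restored.equals(df))  # True
```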