The First Step of Analyzing Instagram Data: Scraping and Parsing JSON Data Using Python
The COVID-19 pandemic has led to a decline in consumer spending on non-essential goods. The garment industry is among the hardest hit, and luxury fashion houses are taking action to broadcast the ‘right message’ to consumers through major social media channels. We are going to investigate fashion brands’ Instagram posts in order to study the ‘message’ they want to deliver in these particularly tough days.
To that end, we will capture the latest post data from the fashion houses’ IG accounts for our analysis.
Louis Vuitton Official Instagram Account | @louisvuitton
We will be using Python to accomplish this goal, given its versatility and the wide range of open-source libraries we can easily leverage. Here are the steps we will be taking:
Choice of Python Packages:
- Data collection: instagram-scraper (an unofficial, open-source API)
- Data cleansing: pandas, json, glob, emoji, nltk
- Data analysis: Google Sentiment Analysis API, matplotlib
import pandas as pd
import glob
import json
import re
import datetime
import emoji
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from google.cloud import language as lg
from google.cloud.language import enums
from google.cloud.language import types
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
In Part 1, we are going to talk about how to scrape and process JSON data. We will work on some simple text analyses in Part 2.
Data Collection
Command-line application: Instagram-scraper
This is an open-source API that simplifies scraping tasks with a wide range of options (e.g. fetching a user’s profile description, downloading user media). Please find the installation and detailed usage in this Github Repo.
After everything is set-up, run the command in terminal (Mac OS) or Command Line (Windows):
For example, we could structure the following command to scrape a list of user IDs (listed in user.txt). In addition to the default media metadata, we also want to include the user profile information; last but not least, we want to scrape the 50 most recent posts and save them in a folder called output.
instagram-scraper -f user.txt --media-metadata --profile-metadata -m 50 -d output
Output: JSON files containing the post metadata
Some people might prefer scraping IG posts with their own script, given particular limitations of this API. Not a problem! You can perform a similar task using Selenium and a web-browser agent. Here is an amazing article that shows you the steps.
Extract & Cleanse Data From JSON Tree
Back to what we have been discussing so far: unfortunately, the output we got by running the command isn’t ready to be used yet! Notice that the data we want is nested in a JSON tree. Therefore, we will write a Python script that parses and cleanses the data and makes it usable for analysis.
As always, the first step: investigate the JSON tree, where we locate the post text under the key ‘text’.
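To make the nesting concrete, here is a trimmed-down mock of one scraped file. The key names (GraphImages, edge_media_to_caption, and so on) follow instagram-scraper’s GraphQL-style output, but treat the exact layout as an assumption and verify it against your own files:

```python
import json

# A minimal mock of one scraped JSON file (structure assumed from
# instagram-scraper's GraphQL-style output; verify against your files).
sample = json.loads("""
{
  "GraphImages": [
    {
      "edge_media_to_caption": {
        "edges": [{"node": {"text": "New collection out now"}}]
      },
      "edge_media_preview_like": {"count": 1200},
      "taken_at_timestamp": 1588291200
    }
  ]
}
""")

# Walk the tree down to the post text we just located.
post = sample["GraphImages"][0]
caption = post["edge_media_to_caption"]["edges"][0]["node"]["text"]
print(caption)  # -> New collection out now
```

Note how the caption sits four levels deep; this is exactly why we need a parsing script rather than a flat read.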
Remember that each user ID generates one distinct JSON file saved under the output folder, so we can use the glob library with wildcard matching to find all the files with a .json extension and read the data inside each of them:
file_list = glob.glob("*.json")  # run from inside the output folder, or use "output/*.json"

file_merge = []
for file in file_list:
    with open(file, 'r') as text:
        jdata = json.load(text)
        if jdata:
            file_merge.append(jdata)
Output of this step: a list holding the parsed JSON data from every file
We will start by creating a data frame that saves the parsed and cleaned data:
df = pd.DataFrame(columns=['Id', 'post', 'likes', 'comments', 'date', 'sentiment', 'followers'])
Here are what we will do next:
- Loop through each JSON string. In each string, search for the path to the keys corresponding to each value we want to extract (e.g. the key ‘text’ holds the post content).
- For each value captured, apply simple text-processing methods such as removing stop words, emoji, and special characters:
stop_words = set(stopwords.words('english'))

def filter_stop(txt):
    txt_tokens = word_tokenize(txt)
    txt_tokens = [word for word in txt_tokens if word not in stop_words]
    return ' '.join(txt_tokens)

def strip_emoji(text):
    # note: get_emoji_regexp() was removed in emoji>=2.0;
    # there, use emoji.replace_emoji(text, replace="") instead
    new_text = re.sub(emoji.get_emoji_regexp(), r"", text)
    return new_text
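If you want to sanity-check the cleaning logic without pulling in the NLTK corpora or the emoji package, here is a dependency-free sketch that mimics the two helpers above. The tiny stop-word set and the rough emoji character range are deliberate simplifications, not replacements for the NLTK and emoji versions:

```python
import re

# Tiny illustrative stop-word set; the real script uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def filter_stop(txt):
    # Keep only tokens that are not stop words.
    tokens = txt.lower().split()
    return " ".join(w for w in tokens if w not in STOP_WORDS)

def strip_emoji(text):
    # Rough match over common Unicode emoji blocks; the real script
    # relies on the emoji package's regexp instead.
    emoji_pattern = re.compile(
        "[\U0001F300-\U0001FAFF\u2600-\u27BF]", flags=re.UNICODE
    )
    return emoji_pattern.sub("", text)

raw = "The new collection is a tribute to the house \U0001F525"
print(filter_stop(strip_emoji(raw)))  # -> new collection tribute house
```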
Start looping through the list!
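The loop itself might look something like the sketch below. The key paths (GraphImages, edge_media_to_caption, edge_media_preview_like, edge_media_to_comment) are assumptions based on instagram-scraper’s output format, and the sentiment and follower columns are filled in later from the profile metadata and the sentiment API, so they are omitted here; adjust everything to match your own files:

```python
import pandas as pd

def parse_posts(file_merge):
    """Flatten a list of scraped JSON trees into tidy rows.

    Key paths follow instagram-scraper's GraphQL-style layout and
    should be double-checked against your own output files.
    """
    rows = []
    for jdata in file_merge:
        for post in jdata.get("GraphImages", []):
            edges = post["edge_media_to_caption"]["edges"]
            text = edges[0]["node"]["text"] if edges else ""
            rows.append({
                "Id": post.get("id"),
                "post": text,
                "likes": post["edge_media_preview_like"]["count"],
                "comments": post["edge_media_to_comment"]["count"],
                "date": post.get("taken_at_timestamp"),
            })
    return pd.DataFrame(rows)

# Quick check with one mock file.
mock = [{
    "GraphImages": [{
        "id": "1",
        "edge_media_to_caption": {"edges": [{"node": {"text": "hello"}}]},
        "edge_media_preview_like": {"count": 10},
        "edge_media_to_comment": {"count": 2},
        "taken_at_timestamp": 1588291200,
    }]
}]
df = parse_posts(mock)
print(df.shape)  # -> (1, 5)
```

Applying the cleaning helpers to the ‘post’ column afterwards is then a simple `df['post'].apply(...)`.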
Output of this step: a data frame of cleaned data that is ready to be exported in any Excel-friendly format:
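The export itself is a one-liner with pandas; for instance, to a CSV that Excel opens directly (the filename here is just an example):

```python
import pandas as pd

# A stand-in frame with the same shape as our cleaned data.
df = pd.DataFrame({
    "Id": ["1"], "post": ["hello"], "likes": [10],
    "comments": [2], "date": [1588291200],
})

# index=False keeps the row index out of the exported file.
df.to_csv("ig_posts.csv", index=False)
# df.to_excel("ig_posts.xlsx", index=False)  # needs openpyxl installed
```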
WHEW! I hope we have enough at this point! Now we’ve got both media content and some user metrics to create different types of analysis! In the next blog post, we’ll explore WordCloud and leverage the Google Sentiment Analysis API to take an in-depth look at our dataset. See ya!
A remark regarding the ethical and legal issues of scraping without an official API: please keep in mind that this is just a short case study demonstrating one of the many ways of collecting IG media. Scraping at large scale and volume bears the risk of legal consequences. Think carefully and act smart!