TITLE: Scraping Instagram without an account
DATE: 2019-10-20
AUTHOR: John L. Godlee
====================================================================


There are lots of people I would like to follow on Instagram, mostly
woodworkers, bicycle people, and outdoors people. It seems to be a
really good method of delivering content. Unfortunately for
Instagram, there is absolutely no way I would make an account with
them. I fear it would be too much of a time sink, and I’m paranoid
of giving too much detail of my personal interests to Facebook.

I found a command line tool called [InstaLooter] which can scrape
public Instagram profiles without an account and save the images to
my local machine, where I can look through them at my leisure, in
the spirit of RSS. This is how I set it up.

  [InstaLooter]: https://github.com/althonos/InstaLooter

I created a text file called .ig_subs.txt which lives in my $HOME.
The file holds the Instagram usernames of the accounts I want to
scrape, one per line:

    kelsoparadiso
    lloyd.kahn
    exploringalternatives
    barnthespoon
    terrybarentsen
    woodlands.co.uk
    zedoutdoors
    mossy_bottom

Then I made a shell script which lives in my path, called insta_dl:

    #!/bin/bash

    # Make directory if it doesn't exist
    mkdir -p "$HOME/Downloads/ig"

    # make newlines the only separator
    IFS=$'\n'

    # disable globbing
    set -f

    # Loop over each username in the subscriptions file
    for i in $(cat "$HOME/.ig_subs.txt"); do
      instalooter user "$i" "$HOME/Downloads/ig/" -n 1 -N -T "{username}.{date}.{id}"
    done

instalooter user $i downloads photos from each user i. -n 1 only
downloads the most recent post, whether that post is one photo or
multiple. -N only downloads images which don’t already exist in the
destination directory ($HOME/Downloads/ig/), based on the filename.
-T {username}.{date}.{id} sets the filename of each photo. {id} is
unique for each photo on Instagram, so it uniquely identifies each
file downloaded for use by -N. The filenames then look something
like this:

    exploringalternatives.2019-09-27.2142383070393557093.jpg
    kelsoparadiso.2019-10-09.2150831532411304437.jpg
    kelsoparadiso.2019-10-09.2150831532419588103.jpg
    kelsoparadiso.2019-10-09.2150831532419839765.jpg
    lloyd.kahn.2019-10-11.2152638264107259024.jpg
    mossy_bottom.2019-10-09.2151026330651686709.jpg
    terrybarentsen.2019-10-03.2146722625883638769.jpg
    terrybarentsen.2019-10-03.2146722625900303797.jpg
    terrybarentsen.2019-10-03.2146722625950630270.jpg
    woodlands.co.uk.2019-10-11.2152273592812162360.jpg
    zedoutdoors.2019-10-02.2145942922787735607.jpg

If I wanted to, I could file each image into its own directory
based on username or date, but I don’t want that.
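If I did change my mind, a minimal sketch of filing by username might
look like the following. It assumes the {username}.{date}.{id}.jpg
naming shown above, and works from the right-hand end of the filename
because usernames such as woodlands.co.uk contain dots themselves.
sort_by_user is a hypothetical helper name, not part of InstaLooter.

```shell
#!/bin/bash

# Hypothetical helper: move each downloaded image into a directory named
# after the account it came from. Assumes {username}.{date}.{id}.jpg names.
sort_by_user() {
    local dir=$1 f base stem user
    for f in "$dir"/*.jpg; do
        # Skip the unexpanded pattern when there are no matches
        [ -e "$f" ] || continue
        base=${f##*/}        # drop the leading path
        stem=${base%.*}      # drop the extension (.jpg)
        stem=${stem%.*}      # drop the {id}
        user=${stem%.*}      # drop the {date}, leaving {username}
        mkdir -p "$dir/$user"
        mv "$f" "$dir/$user/"
    done
}
```

Calling sort_by_user "$HOME/Downloads/ig" after the download loop would do
the filing. One likely catch: since -N decides what is new from the
filenames in the destination directory, moving files out of it would
probably cause the latest post to be downloaded again on the next run.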

I can now create a cron job or a LaunchAgent to run this
automatically in the background every day or every week.
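For the cron route, here is a rough sketch of what the crontab entry
could look like. The cron_line helper, the 08:00 schedule and the
$HOME/bin/insta_dl path are all assumptions for illustration; the
script may live somewhere else on your path.

```shell
#!/bin/bash

# Hypothetical helper: print a crontab entry that runs a command once a
# day at the given hour (fields: minute hour day-of-month month weekday).
cron_line() {
    printf '0 %s * * * %s\n' "$1" "$2"
}

# Example: run insta_dl every day at 08:00. The script path is an assumption.
cron_line 8 "$HOME/bin/insta_dl"

# To install it, append the line to the existing crontab:
# ( crontab -l 2>/dev/null; cron_line 8 "$HOME/bin/insta_dl" ) | crontab -
```

Cron usually runs jobs with a very short PATH, so instalooter may need
to be called by its full path inside the script for this to work.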

Update - 2019_10_31

I updated the insta_dl shell script so that it also grabs the
caption of each Instagram post downloaded and stores it in a text
file. InstaLooter can download post metadata as a JSON file by
adding the -d flag (--dump-json). Then I use jq to parse the JSON
file for each post to extract the full name of the account
(.owner.full_name), the @username of the account (.owner.username)
and the content of the caption of the post
(.edge_media_to_caption[][].text). Then I use sed to put a blank
line after each caption to make it easier to read, and finally I
delete the original JSON files:

    #!/bin/bash

    # Download directory; make it if it doesn't exist
    DIR="$HOME/Downloads/ig"
    mkdir -p "$DIR"

    # make newlines the only separator
    IFS=$'\n'

    # Loop over each username, also dumping post metadata as JSON (-d)
    for i in $(cat "$HOME/.ig_subs.txt"); do
        instalooter user "$i" "$DIR" -v -d -n 1 -N -T "{username}.{date}.{id}"
    done

    # Build a "Full Name (username): caption" line for each post
    for i in "$DIR"/*.json; do
        jq '(.owner.full_name + " (" + .owner.username + "): " + .edge_media_to_caption[][].text)' "$i"
    done > "$DIR/description.txt"

    # Put a blank line after each caption
    sed -i 'G' "$DIR/description.txt"

    # Delete the JSON files now that the captions are extracted
    rm "$DIR"/*.json
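To show what the sed 'G' step does on its own, here is a small sketch
that runs it on two made-up caption lines instead of the real
description.txt. double_space is a hypothetical wrapper name used only
for this example.

```shell
#!/bin/bash

# Hypothetical helper wrapping the sed 'G' step: G appends a newline plus
# the (empty) hold space to every line, so each line gains a blank line.
double_space() {
    sed 'G'
}

# Two made-up caption lines, each followed by a blank line in the output
printf '%s\n' '"first caption"' '"second caption"' | double_space
```

On macOS the stock BSD sed expects a backup suffix after -i, so the
in-place call in the script would need to be written as
sed -i '' 'G' on that system, or GNU sed used instead.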