___________________________________________
title: Bookmarking and Creating a Local Internet Archive
tags: hack bookmarks archiving
date: 2022-01-04
___________________________________________

Intro

While I’m a heavy user of RSS with Newsblur, I’ve never had a coherent
bookmarking solution. Last week I finally set up a proper bookmark
manager and, as a bonus, archived bookmarked pages locally in my own
personal Internet Archive.

  [Newsblur]: https://www.newsblur.com
  [Internet Archive]: https://archive.org

Bookmarking With Raindrop.io

A long time ago I was a user of del.icio.us, and I first looked into
using Pinboard for bookmarking. While Pinboard had most of the
features I wanted, it lacked a native mobile app. I looked into other
alternatives and ended up subscribing to Raindrop.io after trying it
for a few days.

  [del.icio.us]: https://en.wikipedia.org/wiki/Delicious_(website)
  [Pinboard]: https://pinboard.in
  [Raindrop.io]: https://raindrop.io

Raindrop.io had all the features I was looking for:

-   Free and Paid Subscription models
-   Mobile App
-   Permanent Copies
-   Full text search
-   Tagging
-   Dropbox sync

  [Paid Subscription]: https://help.raindrop.io/premium-features
  [Mobile App]: https://help.raindrop.io/mobile-app
  [Permanent Copies]: https://help.raindrop.io/backups/#permanent-library
  [Full text search]: https://help.raindrop.io/using-search/#full-text-search
  [Tagging]: https://help.raindrop.io/tags
  [Dropbox sync]: https://help.raindrop.io/backups#backup-to-dropbox

Using the apps or browser extensions is seamless, making it easy to
save bookmarks no matter what device or browser I’m using. I also
imported all of my Firefox bookmarks, which required some cleaning up
but gave a good impression of how the tool looks with content. I’m
still experimenting with categories and tagging, but overall there’s
some order to the chaos, and with tags and full-text search I can
quickly find sites without having to resort to a search engine.

I also set up an IFTTT automation so that whenever a page is
bookmarked in Raindrop.io, it is added to Saved Stories in Newsblur
along with its tags. It’s not especially useful on its own, but it
acts as a backup of sorts, and maybe in the future Newsblur will add a
feature that makes it more valuable.

  [IFTTT]: https://ifttt.com

[Raindrop.io Mac App]

  [Raindrop.io Mac App]: /assets/images/posts/bookmarking/raindropapp.png

Raindrop.io Mac App

Archiving Locally with ArchiveBox

While researching bookmarking services, I stumbled across ArchiveBox,
which can create a local Internet Archive of webpages, media, and
other things from the web. It can also import bookmarks, archiving
them locally and submitting them to archive.org.

  [ArchiveBox]: https://archivebox.io
  [Internet Archive]: https://archive.org
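
I run ArchiveBox with Docker Compose (the import script further down
assumes a compose file in ~/archivebox). Setting it up looks roughly
like the upstream Docker quickstart; treat this as a sketch, since the
exact URL and flags may have changed between versions:

    # Rough sketch of the ArchiveBox Docker Compose quickstart
    mkdir -p ~/archivebox && cd ~/archivebox
    curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
    docker-compose run --rm archivebox init --setup
    docker-compose up -d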

This got me thinking: the Backup feature in Raindrop.io saves an
Export.html to Dropbox. I could set up Dropbox sync on my server and
have an hourly cron job import it, syncing all new bookmarks from
Raindrop.io into ArchiveBox automatically.

  [Backup feature]: https://help.raindrop.io/backups
  [Dropbox]: https://www.dropbox.com/home

My first attempt at this was successful, but the Export.html backup is
in the <!DOCTYPE NETSCAPE-Bookmark-file-1> format, and while
ArchiveBox can import it, it doesn’t do so very well - it archived
parts of pages, didn’t apply tags, and overall wasn’t very consistent.
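
For reference, the export is the old Netscape bookmark format, which
looks roughly like this (abridged example; the exact attributes
Raindrop writes may differ slightly):

    <!DOCTYPE NETSCAPE-Bookmark-file-1>
    <TITLE>Bookmarks</TITLE>
    <H1>Bookmarks</H1>
    <DL><p>
        <DT><A HREF="https://archivebox.io/" ADD_DATE="1641254400" TAGS="archiving,selfhosted">ArchiveBox</A>
        <DT><A HREF="https://www.newsblur.com/" ADD_DATE="1641254400" TAGS="rss">NewsBlur</A>
    </DL><p>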

I found that ArchiveBox can also import a JSON-formatted file with a
simple array schema containing url and tags fields, so I wrote a quick
script to convert the Export.html into an import.json and then import
it into ArchiveBox using docker-compose. Putting this into an hourly
cron job automatically imports and archives any new bookmarks.

The script pulls all bookmarked URLs and their tags out of Export.html
and creates an import.json with the following structure:

import.json:

    [
    { "url": "https://www.friendlyskies.net/notebook/giving-haiku-os-beta-3-a-try", "tags": "haiku,os" },
    { "url": "https://github.com/dwmkerr/hacker-laws", "tags": "patterns" },
    ...
    { "url": "https://www.benoakley.co.uk/tartan-asia-extreme", "tags": "korean,japanese,film,movies" },
    { "url": "https://drodio.com/creating-your-own-remote-workspace-for-under-5k/", "tags": "wfh,shed,backyard" }
    ]

raindrop-import.sh:

    #!/bin/bash
    # Imports bookmarks from Raindrop.io into ArchiveBox

    # Backup files from Raindrop sync on Dropbox
    exportfile="${HOME}/Dropbox/Apps/Raindrop.io/Export.html"
    importfile="${HOME}/Dropbox/Apps/Raindrop.io/import.json"

    delimiter=','

    echo "[" > "${importfile}"

    # Pull the raw links out of the export, then look up each link's tags
    for url in $(grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' "${exportfile}" | sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'); do
      tags=$(grep "A HREF=\"${url}\"" "${exportfile}" | grep -io 'tags=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^tags=["'"'"']//i' -e 's/["'"'"']$//i' | tr '\n' ',')
      munged_tags=${tags%?}
      echo "{ \"url\": \"${url}\", \"tags\": \"${munged_tags}\" }${delimiter}" >> "${importfile}"
    done

    # Trim the trailing , from the last entry so the array is valid JSON (GNU sed)
    sed -i '$ s/.$//' "${importfile}"

    echo "]" >> "${importfile}"

    # Check that the JSON is valid before importing
    if jq -e . >/dev/null 2>&1 < "${importfile}"; then
      echo "Successfully parsed JSON from ${importfile}"
    else
      echo "Failed to parse JSON from ${importfile}"
      exit 1
    fi

    # Import the JSON into ArchiveBox using docker-compose
    docker-compose -f ~/archivebox/docker-compose.yml run --rm archivebox add --parser json < "${importfile}"
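
To automate the whole thing, a crontab entry along these lines runs
the import hourly (the script path and log file here are just
placeholders for wherever you keep them):

    # Run the Raindrop.io import every hour, on the hour
    0 * * * * $HOME/bin/raindrop-import.sh >> $HOME/raindrop-import.log 2>&1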

When imported this way, only those URLs are archived and the tags are
applied properly, creating a complete archived copy of the bookmarks
saved in Raindrop.io.

[ArchiveBox]

  [1]: /assets/images/posts/bookmarking/archivebox.png

ArchiveBox

Since Raindrop.io has preview/live/archive views as well as full-text
search, I probably won’t use ArchiveBox frequently, but it’s good to
have a local backup in multiple formats just in case. It also
automatically sends each page to archive.org, so there’s another
guarantee that it’s archived somewhere should the page disappear 10
years from now. By default it also saves a PDF and extracted text
using Mercury and Readability, which I may eventually set up to send
to my Kindle for offline reading.

  [Mercury]: https://github.com/postlight/mercury-parser
  [Readability]: https://github.com/mozilla/readability
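
If I do get around to the Kindle idea, the rough plan would be to pull
the extracted articles out of the archive folder and convert them. A
hypothetical sketch - the data/archive/<timestamp>/readability/ layout
is an assumption about where ArchiveBox writes its output, and
ebook-convert comes from Calibre:

    # Hypothetical: convert Readability-extracted articles for Kindle reading
    # (the readability/content.html path is an assumption about ArchiveBox's layout)
    find ~/archivebox/data/archive -type f -path '*readability/content.html' |
      while read -r article; do
        ebook-convert "${article}" "${article%.html}.mobi"
      done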

Conclusion

I now have a featureful bookmarking service that I can use almost
anywhere, and I’ve already started making extensive use of it. I wish
I had set up something like this years ago, as there are many sites
I’ve come across that I wish I could reference but have completely
forgotten how to find with a search engine. Now I can not only quickly
find them in my bookmark manager, but I also have my own local archive
I can rely on in the future.

Links

-   Raindrop.io
-   ArchiveBox
-   Web Archiving Community
-   On the Importance of Web Archiving

  [Raindrop.io]: https://raindrop.io
  [ArchiveBox]: https://archivebox.io
  [Web Archiving Community]: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
  [On the Importance of Web Archiving]: https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/