DATA IMMORTALITY

Part of my plan for my "Internet Client" computer was that it would 
help me organise my bookmarks between different machines. For 
various reasons, it hasn't. Well, it has definitely helped to an 
extent, but again I'm thinking that the only way I'll get bookmarks 
working efficiently is with a separate bookmark manager program. 
I've been looking at bookmark managers without satisfaction for 
years though, so I've finally given in and decided to come up with 
a solution of my own.

Feature summary:

* Bookmarks stored as individual files in a directory structure 
  equivalent to the bookmark menu structure - worryingly the only 
  other bookmark system developers who seem to have gone with this 
  approach were those of Microsoft Internet Explorer. That's probably 
  a bad sign, but it still seems like the most flexible solution to 
  me. Like MailDir, but for bookmarks.

* Firefox-like add-bookmark dialogue, but run in a terminal window. 
  Triggered by a keyboard combination, and automatically copies the 
  URL from the current X selection. Downloads the page itself in 
  order to grab the title (a rough sketch of how these first two 
  features might fit together follows this list).

* Statically generated HTML interface which can be accessed either 
  locally (file://) or from a local web server. Directory tree plus 
  top-level bookmarks in either a small frame or a table cell on 
  the left, and directory contents in the main view (also sketched 
  below).

* In the frame view, have an option to browse all bookmarks in the 
  small left frame, and open links in the larger frame, emulating 
  Firefox's Ctrl-B bookmark selector.

* List of all bookmarks on one page, usable with browser's page 
  search function for searching.

* Optionally save a local copy of the page being bookmarked using 
  wget, also grabbing any files linked from that page up to a 
  certain size limit (see the wget sketch after this list). This 
  goes into a separate directory tree, which I can also go into 
  manually and grab the whole site with HTTrack if desired.
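
To make the first two features concrete, here's a rough sketch of 
how the add-bookmark script might look. Everything in it is an 
assumption rather than a settled design: a ~/bookmarks directory 
whose subdirectories are the menu structure, xclip for reading the 
X selection, Python's standard library for fetching the page 
title, and a trivial one-field-per-line ".url" file format.

  #!/usr/bin/env python3
  # add-bookmark: grab the URL from the X selection, fetch the page
  # to find its <title>, then file it as a small text file in a
  # directory tree that doubles as the bookmark menu structure.
  import os, re, subprocess, urllib.request

  BOOKMARK_ROOT = os.path.expanduser("~/bookmarks")

  # Primary X selection via xclip (assumes xclip is installed).
  url = subprocess.run(["xclip", "-o", "-selection", "primary"],
                       capture_output=True, text=True).stdout.strip()

  # Download (part of) the page just to pull out the title.
  try:
      data = urllib.request.urlopen(url, timeout=10).read(65536)
      text = data.decode("utf-8", "replace")
      match = re.search(r"<title[^>]*>(.*?)</title>", text,
                        re.IGNORECASE | re.DOTALL)
      title = match.group(1).strip() if match else url
  except (OSError, ValueError):
      title = url

  # The terminal stands in for Firefox's add-bookmark dialogue.
  folder = input("Folder (under %s): " % BOOKMARK_ROOT).strip()
  name = input("Name [%s]: " % title).strip() or title

  target_dir = os.path.join(BOOKMARK_ROOT, folder)
  os.makedirs(target_dir, exist_ok=True)
  safe = re.sub(r"[^A-Za-z0-9._ -]", "_", name)[:80]
  with open(os.path.join(target_dir, safe + ".url"), "w") as f:
      f.write("URL: %s\nTITLE: %s\n" % (url, title))

Bound to a keyboard shortcut that opens it in a terminal window, 
that would cover the trigger side of things too.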
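
The static HTML view could then just be regenerated from the same 
tree whenever something changes. Again this is only a sketch, 
carrying over the assumed ".url" format above and skipping the 
frameset/table layout in favour of a bare index page per directory:

  #!/usr/bin/env python3
  # make-html: walk the bookmark tree and write a plain index.html
  # in each directory - subdirectories first, then the bookmarks.
  import html, os

  BOOKMARK_ROOT = os.path.expanduser("~/bookmarks")

  def read_url(path):
      # Pull the "URL: " line back out of a bookmark file.
      with open(path) as f:
          for line in f:
              if line.startswith("URL: "):
                  return line[len("URL: "):].strip()
      return None

  for dirpath, dirnames, filenames in os.walk(BOOKMARK_ROOT):
      items = ["<html><body><ul>"]
      for d in sorted(dirnames):
          items.append('<li><a href="%s/index.html">%s/</a></li>'
                       % (html.escape(d, quote=True), html.escape(d)))
      for name in sorted(filenames):
          if name.endswith(".url"):
              url = read_url(os.path.join(dirpath, name))
              if url:
                  items.append('<li><a href="%s">%s</a></li>'
                               % (html.escape(url, quote=True),
                                  html.escape(name[:-len(".url")])))
      items.append("</ul></body></html>")
      with open(os.path.join(dirpath, "index.html"), "w") as f:
          f.write("\n".join(items) + "\n")

The frame or table-cell layout would sit on top of these 
per-directory pages, and the all-bookmarks-on-one-page view is just 
the same walk dumped into a single file.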
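
As for the local-copy feature, wget can already do most of the 
work. A sketch of the archiving step, assuming the archive lives 
under ~/bookmark-archive and a made-up 20MB limit per bookmark:

  import os, subprocess

  def archive_page(url,
                   root=os.path.expanduser("~/bookmark-archive")):
      # --page-requisites: images/CSS needed to display the page;
      # --recursive --level=1: the page plus whatever it links to,
      #   one level deep (add --span-hosts to follow links off to
      #   other sites as well);
      # --quota=20m: stop once about 20MB have been downloaded;
      # --convert-links: rewrite links so the copy browses offline.
      subprocess.run(["wget", "--page-requisites", "--recursive",
                      "--level=1", "--convert-links", "--quota=20m",
                      "--directory-prefix=" + root, url])

HTTrack would still be the manual follow-up for the times I decide 
a whole site is worth keeping.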

The last feature is the one I really want to discuss; it has been 
whirring around in my head ever since I read this post by 
Solderpunk:
gopher://zaibatsu.circumlunar.space/0/%7esolderpunk/phlog/the-individual-archivist-and-ghosts-of-gophers-past.txt

There he proposes a Gopher client (though I'd probably try to do it 
with a Gopher proxy myself) which archives every visited page 
locally. Just recently he's come up with a new approach to the 
problem, proposing instead that sites be hosted as Git repos:
gopher://zaibatsu.circumlunar.space/0/%7esolderpunk/phlog/low-budget-p2p-content-distribution-with-git.txt

Looking back on my earlier bookmarks, this is definitely a problem 
that I do need to solve. I seem to have had a remarkable knack, 
about a decade ago, for finding websites that were about to go 
offline within the next ten years, and were obviously of so little 
interest to the world at large that the Wayback Machine often 
didn't bother archiving the images (which are kind of the key point 
if they're talking about electrical circuits) or much of the sites at 
all. Even when they did get archived, the Internet Archive is just 
another single point of failure anyway. Archive.is, for example, 
got blocked by the Australian government a few years ago for 
archiving terrorist content (the gov. did a rubbish job of it and 
you could still access the site via some of their alternative 
domains because it was done at the DNS level, but the fact that the 
people in power are idiots doesn't negate the potential of their 
power). Unfortunately I don't like either of Solderpunk's solutions.

That may be a little harsh on Solderpunk. My objection to the 
client-local mirroring approach is mainly philosophical, and the 
related practical problems are likely solvable. As for his second 
suggestion, I disagree with using Git, but do the same thing with 
Rsync (which also solves the URL problem, at the cost of losing a 
pre-baked changelog system) and I'd be happy.

The difference between us is simply whether to attribute importance 
to needless data storage.

For me, storing data is a commitment. You don't need just one copy 
of the data; the way I do things, I need at least four. One copy on 
the PC you're working from, two on your local backup drive (the 
latest backup plus the previous one in case the backup process goes 
haywire - granted, incremental backups are another approach, but I 
don't use them myself), and at least one copy off-site. I try to 
keep all the data I can't easily cope with losing on my laptop, 
with its 40GB HDD. Relying on a 20-year-old HDD probably isn't all 
that wise, but just to focus on the 40GB: that actually translates 
into up to 160GB of data stored, and 120GB (the three backup 
copies) needing to be processed to complete a full backup cycle.

Maybe that's nothing these days, but to me it's already 
inconvenient:

* It means the backup process takes a non-trivial amount of time 
  during which the laptop's performance is poor, so I leave it to run 
  overnight only once a week. That's a waste of power, and limits the 
  regularity of my backup routine.

* It means my only practical backup medium is HDDs. DVDs, CDs or 
  ZIP disks might be options otherwise. I'm not managing to pick up 
  SSDs or sufficiently large flash drives/cards in my free-to-$5 
  second-hand price range.

* It means I can't use the internet in my laptop's backup strategy, 
  because my connection is too slow and I'd have to pay a lot more 
  than for my current 3GB/month deal. That combines with the first 
  problem to make offsite backups more of a pain.

(I've got my Internet Client computer set up on a 2GB SD card. All 
important files get synced with the laptop daily, including all 
system/user configuration files, which make up only ~30MB 
compressed.)

Now back to Solderpunk's concept. You can say that Gopher content 
(or probably Gemini, though I don't look at that much) is small, so 
you might as well grab everything. But my Gopher hole currently 
totals 80MB. I've got about 70 sites bookmarked in the Gopher 
client on this PC (UMN Gopher); if I'm average (alright, I'm 
probably not, but I'm the only one I can run "du" on) then those 70 
holes come to 5.6GB - enough data to fill up over 1/8th of my 40GB 
laptop drive right there. Including my backups, that would be 
22.4GB of data sitting somewhere, regularly read and copied at the 
expense of time and energy.

Now of that data, the largest share (34MB) is my archive of Firetext
logs. I should purge that again actually - I do keep it all myself,
and it may have some use for historical purposes, but the average 
Gopher user surely doesn't give a stuff. With the caching client 
scheme, it's not a fair assumption that the hourly log you look at 
one day is going to be what you want to find later either. With the 
Git hosting scheme, someone who just wants to read this phlog post 
is obliged to pull in all that Firetext data even if they've got no 
interest in it. In fact the Photos and History Snippets sections 
make up the bulk of the rest of the data, and yet the only part 
that I've ever received feedback on is the phlog. So for all I know 
this one 700KB corner is the only bit of content that anyone 
actually wants to view, yet using Git they'd be storing 80MB of 
data in order to do so.

Should I just ditch everything but the phlog and host that with 
Git (or Gopher, for that matter - the rest is potentially just 
clogging up the Aussies.space server, which is why I cull the 
Firetext archive already)? For you, maybe. For me, the favourite 
part, the part I'd be most thrilled to find in my own browsing, is 
the History Snippets section (19MB), even though I've been 
struggling to get around to adding new entries there (by the way, 
if someone does actually like viewing it, letting me know would 
certainly help my motivation). So if I drop that, I'm dropping my 
favourite content for the sake of popularity, now embodied in the 
sheer efficiency of data storage and transfer.

At the same time I don't think the client caching approach is 
right, because everyone who drops into the History Snippets 
section, clicks a couple of links, decides it's just something some 
weirdo's needlessly put together, and leaves never to return, ends 
up pointlessly carrying around the gophermap and photos they viewed 
for as long as they can keep all their data intact. Yet the person 
who drops in, looks at a few entries, bookmarks it for later when 
they have the time (what I'd probably do), then goes away - they 
find that when they return after it's gone offline, all they can 
view is the same stuff they saw before.

As an alternative to the Git proposal, Rsync would solve the 
problem of fetching unwanted data. You just pick the directory with 
the content you're interested in and Rsync only mirrors that bit. 
Server load may be a problem, though public Rsync mirrors already 
exist for software downloads, so maybe it's practical. You could 
also just Rsync individual files while browsing around, and maybe 
before committing them to permanent storage.
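
To make that concrete, the selective pull could be as simple as 
something like this, assuming the hole were exported as a public 
Rsync module (the host and module names here are invented):

  import subprocess

  def mirror_section(section, dest):
      # Pull just the one directory you care about; --delete keeps
      # the local copy in step with the server, so files removed
      # upstream disappear locally too.
      subprocess.run(["rsync", "-av", "--delete",
                      "rsync://gopher.example.org/hole/%s/" % section,
                      dest])

  # e.g. only the phlog, leaving the photos and logs on the server:
  mirror_section("phlog", "mirror/phlog")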

But with my bookmark system, if I ever get around to creating it, 
I'll have my own equivalent of the client caching system, one which 
works with existing protocols (well, I guess most easily just with 
the web). It specifically grabs what I think I might want to look 
at. Instead of enforcing some rigid system that theoretically grabs 
all the data I'll ever want to find again, I'd rather just make 
that decision myself.

 - The Free Thinker.