Offline computing: HTTP browsing

After a stint of day work, I'm back on night shifts again.  For the
first of the three to come, let's talk a little about offline HTTP
browsing.

It frequently happens that, while I'm browsing my content offline in
my mail client (simple emails, newsletters, or RSS feeds), a URL
catches my curiosity.  Like everyone else in that situation, I set it
aside so that I can visit it when I have access to the Internet.

Unfortunately, when that time comes, I don't necessarily have all the
time I'd need to read articles that can be long.  So I needed a way
to save these URLs for offline reading.

For quite a long time, and for lack of anything better, I used the
"-dump" option of lynx(1) to save the pages I wanted in plain text
format.  To be precise, I was using this command:

 $ lynx -force_html -dump -width=72 -verbose -with_backspaces <URL>

I even made a small script to save the pages according to a
DOMAIN-TITLE.txt scheme, with a header including the original URL,
the title, and the timestamp of the dump.  But the web being what it
is, the results were often messy and I had to use my Emacs kung-fu to
clean them up.  I wasn't very happy with this solution; I wanted
another one.
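
For the record, it boiled down to something like the following rough
reconstruction (not the exact script I used: the title extraction is
naive, and everything is written to the current directory):

 #!/bin/sh
 # Rough sketch: dump a page to DOMAIN-TITLE.txt with a small header.
 # Assumes lynx(1) is installed; the title extraction is best-effort.
 url="$1"
 domain=$(printf '%s\n' "$url" | awk -F/ '{print $3}')
 title=$(lynx -source "$url" \
   | sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee]>\(.*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p' \
   | head -n 1 | tr -cs '[:alnum:]' '-')
 title=${title%-}    # drop the dash tr adds in place of the newline
 {
   printf 'URL: %s\nTitle: %s\nDate: %s\n\n' \
     "$url" "${title:-unknown}" "$(date -u '+%Y-%m-%d %H:%M:%S UTC')"
   lynx -force_html -dump -width=72 -verbose -with_backspaces "$url"
 } > "${domain}-${title:-notitle}.txt"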

***

In particular, I had heard about a method that was very popular in
the 1990s.  At the time, many people didn't have direct access to the
web; but those who had an email address could pull pages into their
mailbox through a gateway.  It seems that some old-timers (RMS, to
cite one) still actively use this approach; and I thought it would
suit me well, given the time I already spend in my email client and
the modularity it gives me.

And then, a few months ago, I had the pleasant surprise of learning
that Anirudh 'icyphox' Oppiliappan had just opened a service working
on this principle: forlater.email.  It is very simple: you send one
or more URLs in the body of an email to save@forlater.email, and the
server sends the articles back to you (from saved@forlater.email) in
their simplest form; that is, just the textual content, without the
frills (menus and the like), somewhat along the lines of Firefox's
reader mode.
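
In practice, from a shell and while online, sending links can be as
simple as this (assuming msmtp is configured with a working default
account; the URLs and the subject are of course just examples, and
when offline the same message simply waits in the queue instead, as
described below):

 $ msmtp save@forlater.email <<EOF
 To: save@forlater.email
 Subject: to read later

 https://example.org/some-long-article
 https://example.org/another-long-one
 EOF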

I would love to host this service at home, on my own server (the
sources are open).  But I lack the skills to know instinctively what
exactly I would have to do (above all regarding the self-hosted mail
server it requires), and most of all I lack the time to look into it
seriously.  So for now I am using forlater.email with pleasure: I
find it very convenient and it makes my life much easier.  When I am
offline, I just keep the links I want to fetch in my msmtp queue and
send them the next time I connect (as I've discussed in a previous
post).

***

But what if this service no longer exists tomorrow?  Well, I will
fall back on a third solution that I like and still use sometimes: a
Python script written by David Larlet.

David Larlet is a French developer who currently lives in Montreal.
He has a blog that I particularly like.  Aware that the articles he
frequently cites are likely to disappear, he wrote a script that lets
him keep a cache of the cited articles, a cache that he hosts himself
and that can be browsed (for instance, this year's cache:
https://larlet.fr/david/cache/2022/).  If you visit the link, you'll
see that the cached articles contain only the essential, namely the
body of the text (thanks to the readability Python module).

And one day it just hit me: I could use his script to host my own
offline cache of URLs.

He had the good idea to share the sources of his script at
https://git.larlet.fr/davidbgk/larlet-fr-david-cache.  Many thanks to
him.  I had to modify it a little to suit my needs, but nothing
major: change the paths, remove one or two functions I don't use, and
clean up the provided HTML templates, which are specific to his
website.

Once the requirements are installed, using it is as easy as:

 $ python /path/to/cache.py new <URL>

The script generates a clean archive of the URL's content and updates
the main index page listing all the saved links.  When I want to read
them offline, I just point lynx at the index.html to browse all the
cached articles.

Erratum: to generate the main index, one must run "cache.py gen".
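
In other words, my whole round trip looks like this (the paths are
placeholders, as above):

 $ python /path/to/cache.py new <URL>    # archive the article
 $ python /path/to/cache.py gen          # (re)generate the main index
 $ lynx /path/to/cache/index.html        # browse everything offline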

I have to say that this third solution is by far my favorite.  It's
written by someone I like and respect; even though I don't read and
write Python fluently, I understand most of the code; and I don't
rely on any external service.  So, yes, I use forlater.email because
I like the idea of having my own offline HTTP cache in my mail
client.  But that being said, and now that I think about it, I could
make a small script that would automatically email me the result
produced by David Larlet's script...

Hey, I think I just found a small project to keep me busy for a while
during this night shift!
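
A first, very rough sketch of the idea (untested, and full of
assumptions: the paths, the destination address, and the guess that
the freshest HTML file under the cache directory is the article that
was just saved):

 #!/bin/sh
 # Rough sketch: cache a URL with David Larlet's script, then mail
 # myself the generated page.  Paths and addresses are placeholders.
 url="$1"
 cachedir=/path/to/cache          # where the script writes its output
 python /path/to/cache.py new "$url"
 python /path/to/cache.py gen
 # Guess: the most recently written non-index HTML file is the new
 # article (adjust the glob to the script's actual layout).
 page=$(ls -t "$cachedir"/*/*.html | grep -v 'index\.html' | head -n 1)
 {
   printf 'To: me@example.org\nSubject: offline cache: %s\n' "$url"
   printf 'MIME-Version: 1.0\nContent-Type: text/html; charset=utf-8\n\n'
   cat "$page"
 } | msmtp me@example.org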

***

To conclude, I would like to quickly mention another solution that
may be interesting: Offpunk (https://notabug.org/ploum/offpunk).
This is a browser designed by Lionel 'Ploum' Dricot to be offline
first.  As the README in the source repository explains, "the goal of
Offpunk is to be able to synchronise your content once (a day, a
week, a month) and then browse/organise it while staying
disconnected".  It is based on the Gemini CLI browser AV-98 and
handles Gemini, Gopher, Spartan and the Web.  I have only tested it a
little, but it looks like an interesting way for me to keep an
offline cache of my Gopher browsing.  Maybe more on that later if I
end up using it for real in that scenario.

In the meantime, I wish "bon courage" to those who work, and good
rest to those who have a day off.  In any case, take care of
yourselves and your loved ones.