# Quarry gopher search engine

Quarry[1] began life sometime in the middle of 2021. After I had spent
several weeks working on it and got it reasonably functional,
something undoubtedly happened that took my focus away from it.
Since then I hadn't really invested any time in it.

I hadn't re-indexed any of the sites already in there for months,
and I had never gone back to remove any dead links. That was
something I had intended to fix some time ago but hadn't got around
to.

To cut a long story short, towards the end of last week the subject
of Quarry came up and I found some time and motivation to put into it
once more.


## Re-indexing

To begin with I ran the indexer against the existing hosts that I had
in the database. This was unbearably slow and took the best part of 3
days to complete. When it was complete I found several problems with
the data that it collected.

I had some rogue strings (\xC2\x80\xC2\x98yo...) breaking my database
inserts. I thought this might have been cured by setting up the Perl
script to use utf8 for input and output, but it doesn't appear to have
made much difference, if any. However, adjusting the sleep times that
I'd put in various parts of the indexing code made a big improvement
to the speed.
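The bytes \xC2\x80\xC2\x98 look like what is left when UTF-8 text (a curly quote is E2 80 98) has been read as Latin-1 and encoded as UTF-8 a second time, though that is only a guess from the fragment shown. Below is a minimal sketch of one way to clean such titles before an insert, written in Python rather than the Perl the indexer uses. The helper name `clean_title` is hypothetical.

```python
import unicodedata


def clean_title(raw: bytes) -> str:
    """Return a database-safe string for a menu title (hypothetical helper)."""
    # Decode leniently: invalid UTF-8 bytes become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")

    # Try to undo one round of double encoding (UTF-8 read as Latin-1 and
    # re-encoded). If the round trip fails, keep the text as it was.
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        repaired = text

    # Drop control characters (category Cc, which includes the C1 range
    # U+0080..U+009F) and any replacement characters left by the decode.
    return "".join(
        ch for ch in repaired
        if unicodedata.category(ch) != "Cc" and ch != "\ufffd"
    )
```

The repair step is a heuristic: a title that legitimately contains a sequence such as "Ã©" would be rewritten too, so it may be safer to log what changes than to trust it blindly.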

I had made a small database change to set an update date stamp
whenever a selector was added or updated, so that I could spot ones
that hadn't been updated after an indexing run and remove them. This
gets rid of stale links collected from menus, but currently there is
no checking before adding selectors to the index.
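The date stamp approach can be sketched roughly as follows, using Python's built-in sqlite3 module as a stand-in. The table and column names are assumptions, since the real Quarry schema and database engine aren't shown here.

```python
import sqlite3

# Hypothetical schema: the real Quarry tables are not shown in this post.
SCHEMA = """
CREATE TABLE selector (
    host     TEXT NOT NULL,
    selector TEXT NOT NULL,
    title    TEXT NOT NULL,
    updated  INTEGER NOT NULL,  -- Unix time of the last run that saw this row
    PRIMARY KEY (host, selector)
)
"""


def upsert_selector(db, host, selector, title, now):
    """Insert a selector, or replace it with a fresh title and date stamp."""
    db.execute(
        "INSERT OR REPLACE INTO selector (host, selector, title, updated) "
        "VALUES (?, ?, ?, ?)",
        (host, selector, title, now),
    )


def purge_stale(db, host, run_started):
    """Delete this host's selectors not seen since run_started; return count."""
    cur = db.execute(
        "DELETE FROM selector WHERE host = ? AND updated < ?",
        (host, run_started),
    )
    return cur.rowcount
```

Anything the indexer saw during the run has a date stamp at or after `run_started`, so only rows that were not touched are deleted.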


## Search results

There was a lot of junk collected during the initial re-indexing. To
deal with this I modified the rudimentary filtering that I'd created
first time around and extended it with more filters, based on what I
found while manually trawling through the database and on what I was
seeing in search results.

I also wasn't happy with what was being returned for searches. The
results didn't seem particularly relevant in spite of my using a
fulltext search index. It turns out that I had also indexed the
selector itself, which was probably not the best idea. With that
index removed and matching done only on the title, search result
relevancy improved quite a lot. This also had the added benefit of
reducing the number of results returned.
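As an illustration of the kind of title filtering mentioned above, here is a small Python sketch. The patterns are made-up examples of typical menu junk, not the filters Quarry actually uses.

```python
import re

# Hypothetical filter patterns; the post does not list the real ones.
TITLE_FILTERS = [
    re.compile(r"^\s*$"),          # empty or whitespace-only titles
    re.compile(r"^[-=_*#~. ]+$"),  # decorative separator lines
    re.compile(r"^\.\.?/?$"),      # "." and ".." navigation entries
]


def is_junk(title: str) -> bool:
    """True if any filter pattern matches the title."""
    return any(pattern.search(title) for pattern in TITLE_FILTERS)
```

Keeping the patterns in a list makes it easy to add a new one each time more junk turns up in the results.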


## Future indexing

I created a script to take all of the hosts and iterate through them,
checking they are active before running the indexer against them and
finally removing any selectors that weren't updated, meaning they
were no longer present on the site. I ran this a couple of hours ago
and the total indexing process took 6 hours and 44 mins, a bit better
than the 3 days it took previously. Now that this is working I will
set it up as a cron job and run it once a week.
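The overall shape of that script might look something like this Python sketch. The `index_host` and `purge_stale` callables are hypothetical stand-ins for the real indexer and database cleanup, and skipping the purge for unreachable hosts is an assumption about how a temporarily down site should be treated.

```python
import socket
import time


def host_is_active(host, port=70, timeout=5.0):
    """True if a TCP connection to the gopher port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reindex_all(hosts, index_host, purge_stale, is_active=host_is_active):
    """Index every reachable host, then purge selectors the run did not touch.

    index_host(host) and purge_stale(host, run_started) are hypothetical
    callables. Returns the list of hosts skipped as unreachable.
    """
    skipped = []
    for host in hosts:
        if not is_active(host):
            # Assumption: keep an unreachable host's selectors for now.
            skipped.append(host)
            continue
        run_started = int(time.time())
        index_host(host)
        purge_stale(host, run_started)
    return skipped
```

Passing the check and the indexer in as arguments keeps the loop easy to try out against fake hosts before pointing it at the real database.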

If you have a site and don't want to wait, there is an API feature
you can use to request spidering or you can add your URL through the
search interface[2]. 


## To Do

    * Fix the rogue characters breaking my inserts
    * Extend filters so they can be restricted by host
    * Verify selectors before including in index


[1](gopher://gopher.icu/0/phlog/Computing/Quarry.md)
[2](gopher://gopher.icu/1/quarry)