# Quarry (indexer, search interface and supporting tools)

> A place, cavern, or pit where stones are dug from the earth, or 
> separated, as by blasting with gunpowder, from a large mass of rock.

> Hunted or slaughtered game, or any object of eager pursuit.

Quarry contains a number of components:
  1. Crawler/indexer (quarry.pl)
  2. Gopher search, front end to search index (search.dcgi) 
  3. Wrapper for quarry.pl to process pending host index requests
      (indexPending.pl)
  4. Sitemap generator (generateSitemap.pl)
  5. Host and selector maintenance (checkHosts.pl)

Requirements:
  * Perl
  * curl
  * MariaDB/MySQL 

Try it: gopher://gopher.icu/1/quarry

Get it: git clone git://gopher.icu/quarry


## 1. Crawler/indexer
The indexer will by default visit every link on a gopher site and 
store the type, link-title, selector, hostname and port in the 
'selectors' table. It will do this only for those types defined in 
HARVEST_TYPES.
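
As a rough illustration of the fields involved, the sketch below splits one gopher menu line into type, link-title, selector, hostname and port, and drops any type not listed in HARVEST_TYPES. It is written in Python rather than Perl and is not taken from quarry.pl; the names `MenuItem` and `parse_menu_line` are hypothetical. The only thing assumed about the input is the usual tab-separated gopher menu layout.

```python
from typing import NamedTuple, Optional

HARVEST_TYPES = "10Igs9"  # default value documented for quarry.pl


class MenuItem(NamedTuple):
    """One harvested link (hypothetical structure, not Quarry's own)."""
    type: str
    title: str
    selector: str
    host: str
    port: int


def parse_menu_line(line: str,
                    harvest_types: str = HARVEST_TYPES) -> Optional[MenuItem]:
    """Return a MenuItem, or None if the line is not a wanted link."""
    line = line.rstrip("\r\n")
    if not line or line == ".":          # empty line or end-of-menu marker
        return None
    item_type, rest = line[0], line[1:]  # first character is the item type
    fields = rest.split("\t")
    if len(fields) < 4 or item_type not in harvest_types:
        return None
    title, selector, host, port = fields[:4]
    if not port.isdigit():
        return None
    return MenuItem(item_type, title, selector, host, int(port))
```

Info lines (type `i`) are skipped with the default setting because lowercase `i` is not part of '10Igs9'.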

The robots.txt standard file format is supported and honoured. A
bespoke sitemap file format is also supported and will be used to
populate the database if found.
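
A minimal sketch of what honouring robots.txt can look like, using Python's standard `urllib.robotparser` instead of Quarry's own Perl code. The user agent name `quarry` and the idea of matching rules against the bare selector are assumptions made for the example; quarry.pl may do this differently.

```python
from urllib.robotparser import RobotFileParser

CRAWL_DELAY = 2  # documented default delay in seconds


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, selector: str,
            agent: str = "quarry") -> bool:
    """True if the rules permit fetching this selector (agent name assumed)."""
    return parser.can_fetch(agent, selector)


def effective_delay(parser: RobotFileParser, agent: str = "quarry") -> int:
    """Crawl-delay from robots.txt if present, else the CRAWL_DELAY default."""
    delay = parser.crawl_delay(agent)
    return delay if delay is not None else CRAWL_DELAY
```

With a robots.txt containing `User-agent: *`, `Disallow: /private` and `Crawl-delay: 5`, selectors under /private are rejected and the delay becomes 5 seconds.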

There are a number of parameters which can be set at the top of
quarry.pl to change its behavior:

  DEBUG (Default 1)
     Display verbose status messages.

  MAX_DEPTH (Default 5) 
     Defines the maximum number of levels of recursion.

  IGNORE_ROBOTS (Default 0) 
     Ignores robots.txt and any directives therein.

  IGNORE_SITEMAP (Default 0)
     Ignores the sitemap and instead indexes the site by recursion.

  CRAWL_DELAY (Default 2)
     Default delay in seconds between requests. This parameter is
     overridden by robots.txt if found.

  HARVEST_TYPES (Default '10Igs9')
     Defines the gopher types captured.

  TRAVERSE_DOMAINS (Default 0)
     Follows links onto other hosts. *Best avoided*, as the sitemap
     and robots.txt are only parsed for the start domain. It is
     better to index each host individually and use alternative
     means of host discovery.

  REINDEX (Default 0)
     Removes all selectors for host before re-indexing.
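
To make the effect of MAX_DEPTH and TRAVERSE_DOMAINS concrete, here is a hedged Python sketch of a depth-limited recursive crawl. It is not the algorithm from quarry.pl: the `crawl` function, the `fetch_menu` callback, the link tuple layout and the choice to count the start menu as depth 1 are all assumptions for illustration.

```python
from typing import Callable, List, Set, Tuple

MAX_DEPTH = 5         # documented default
TRAVERSE_DOMAINS = 0  # documented default

# (type, selector, host, port); this tuple layout is an assumption
Link = Tuple[str, str, str, int]
FetchMenu = Callable[[str, str, int], List[Link]]


def crawl(start_host: str, start_port: int, fetch_menu: FetchMenu,
          max_depth: int = MAX_DEPTH,
          traverse_domains: bool = bool(TRAVERSE_DOMAINS)) -> List[Link]:
    """Collect links reachable from the root menu within max_depth levels."""
    visited: Set[Tuple[str, int, str]] = set()
    found: List[Link] = []

    def visit(selector: str, host: str, port: int, depth: int) -> None:
        if depth > max_depth or (host, port, selector) in visited:
            return
        visited.add((host, port, selector))
        for link in fetch_menu(selector, host, port):
            item_type, sel, link_host, link_port = link
            if not traverse_domains and link_host != start_host:
                continue  # stay on the start host
            if link not in found:
                found.append(link)
            if item_type == "1":  # type 1 is a menu, so recurse into it
                visit(sel, link_host, link_port, depth + 1)

    visit("", start_host, start_port, 1)
    return found
```

Passing the menu fetcher in as a callback keeps the sketch free of network code, so the depth and host limits can be tried against canned menus.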

Usage: quarry.pl some-gopher-domain.net

The port can optionally be specified, e.g. some-gopher-domain.net:7070.
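
A small Python sketch of splitting such an argument into host and port. The function name `parse_target` is hypothetical, the fallback to port 70 (the standard gopher port) is an assumption about quarry.pl, and IPv6 literals are not handled.

```python
from typing import Tuple


def parse_target(arg: str, default_port: int = 70) -> Tuple[str, int]:
    """Split 'host' or 'host:port' into (host, port)."""
    host, sep, port = arg.rpartition(":")
    if not sep:                      # no colon: the whole argument is the host
        return arg, default_port
    if not host or not port.isdigit():
        raise ValueError(f"expected host or host:port, got {arg!r}")
    return host, int(port)
```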

        
## 2. Gopher search 
This provides a front end to the index generated by quarry.pl.

Features include:
  * General search
  * Image search
  * Sound search
  * Video search
  * Submit site to be indexed 

The current search function is basic and tries to match the search 
string against the selector or title fields and returns any that 
match. This will change once metadata is added and implemented in 
the search.
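
The matching described above amounts to a substring test on two columns. The Python sketch below shows one way to express it, using an in-memory SQLite database as a stand-in for MariaDB/MySQL. The column names (`type`, `title`, `selector`, `host`, `port`) and the result limit are assumptions; the real schema and query in search.dcgi may differ.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny 'selectors' table (assumed columns) with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE selectors "
                 "(type TEXT, title TEXT, selector TEXT, host TEXT, port INTEGER)")
    conn.executemany("INSERT INTO selectors VALUES (?, ?, ?, ?, ?)", [
        ("1", "Contact", "/contact.dcgi", "gopher.icu", 70),
        ("1", "Gutenberg", "/gutenberg", "gopher.icu", 70),
    ])
    return conn


def search(conn: sqlite3.Connection, term: str, limit: int = 50) -> List[Tuple]:
    """Return rows whose title or selector contains the search string."""
    pattern = f"%{term}%"  # note: % and _ inside term act as wildcards here
    sql = ("SELECT type, title, selector, host, port FROM selectors "
           "WHERE title LIKE ? OR selector LIKE ? LIMIT ?")
    return conn.execute(sql, (pattern, pattern, limit)).fetchall()
```

The search string is passed as a bound parameter rather than pasted into the SQL, which keeps user input from altering the query.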


## 3. Wrapper
This program simply looks in the 'pending' database table for hosts
submitted to be indexed, via the gopher search front end, and passes
them to quarry.pl to be indexed.
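
In outline, the wrapper is a loop over pending hosts that runs the crawler once per host. The Python sketch below shows that shape only. How indexPending.pl reads the 'pending' table, and what it does with a host afterwards, is not covered by this README, so the host list is passed in directly, and the `index_pending` name, the `./quarry.pl` path and the `run` parameter are assumptions.

```python
import subprocess
from typing import Callable, Dict, Iterable


def index_pending(hosts: Iterable[str],
                  run: Callable = subprocess.run,
                  crawler: str = "./quarry.pl") -> Dict[str, int]:
    """Run the crawler once per pending host and record each exit status."""
    results: Dict[str, int] = {}
    for host in hosts:
        # The host is passed as a separate argument, so no shell is involved.
        completed = run([crawler, host], check=False)
        results[host] = completed.returncode
    return results
```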

Usage: indexPending.pl


## 4. Sitemap generator
The sitemap generator uses data from the index generated by quarry.pl.

The reasons for the sitemap are twofold:
  1. Efficiency: a single index file is downloaded rather than
      crawling the whole site.
  2. The format supports additional metadata:
      * Description
      * Categories
      * Keywords

These extra metadata fields can be used to greatly enhance search 
results.

Example of records:
```
Type: 1
Selector: /contact.dcgi
Host: gopher.icu
Port: 70
LinkName: Contact
Description: My contact details
Categories:
Keywords:
--------
Type: 1
Selector: /gutenberg
Host: gopher.icu
Port: 70
LinkName: Gutenberg (unofficial book and audio search interface)
Description: Gopher search interface to the official Gutenberg book
 repository
Categories: Books
Keywords: Books
--------
```
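
A hedged Python sketch of reading records in this layout. It assumes three things that the example suggests but the README does not spell out: each field is a `Name: value` line, a line starting with whitespace continues the previous field, and a line of dashes ends a record. The function name `parse_sitemap` is hypothetical.

```python
from typing import Dict, List, Optional


def parse_sitemap(text: str) -> List[Dict[str, str]]:
    """Turn sitemap text into a list of field dictionaries."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key: Optional[str] = None
    for line in text.splitlines():
        if line.startswith("--------"):          # record separator
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key:
            # continuation line: append to the previous field's value
            current[last_key] = (current[last_key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")  # split on the first colon only
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:                                  # file without a final separator
        records.append(current)
    return records
```

Empty fields such as `Categories:` come back as empty strings, so a caller can tell "present but blank" from "missing".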

Usage: generateSitemap.pl some-gopher-domain.net > sitemap.txt


## 5. Host and selector maintenance
Only basic host checking has been implemented using 'hostcheck'.

Currently the checkHosts.pl script checks each host twice in every
24-hour period. If the host fails two consecutive checks then it is
flagged as inactive and its selectors will not display in search
results. If on a subsequent check the host has recovered then it is
again flagged as active.
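
The rule above can be modelled as a small state machine. The Python sketch below is an illustration only; the `HostStatus` class and its fields are hypothetical and say nothing about how checkHosts.pl stores host state in the database.

```python
from dataclasses import dataclass

FAILS_BEFORE_INACTIVE = 2  # two failed checks in a row, as described above


@dataclass
class HostStatus:
    """Tracks whether a host's selectors should appear in search results."""
    active: bool = True
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Apply one check result and return the new active flag."""
        if check_ok:
            self.consecutive_failures = 0
            self.active = True       # a recovered host is reactivated
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILS_BEFORE_INACTIVE:
                self.active = False
        return self.active
```

A single failed check followed by a successful one leaves the host active, because the failure counter resets on every success.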

Hostcheck: git clone git://gopher.icu/hostcheck


## 6. IndexNow
IndexNow[1] is an easy way for website owners to instantly inform
search engines about content changes on their website. 

It has been implemented in a basic way to allow submission of a
single URI per request:
curl -s 'gopher://gopher.icu/7/quarry/indexnow.dcgi?url=<MY-URL>&key=<MY-KEY>'
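
If the submitted URL contains characters such as `&` or `?`, they would clash with the query string. The Python sketch below builds the request URL with percent-encoded values. It assumes the endpoint decodes percent-encoding, which this README does not state, and the function name `indexnow_url` is hypothetical.

```python
from urllib.parse import quote

ENDPOINT = "gopher://gopher.icu/7/quarry/indexnow.dcgi"  # from the command above


def indexnow_url(url: str, key: str, endpoint: str = ENDPOINT) -> str:
    """Build the submission URL for one URI (percent-encoding is assumed)."""
    # safe="" also encodes '/' and ':' so the value cannot break the query
    return f"{endpoint}?url={quote(url, safe='')}&key={quote(key, safe='')}"
```

The resulting string can be passed to `curl -s` in place of the hand-written URL.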

[1] https://www.indexnow.org/