So, I've been quiet lately. That's because I got distracted from the
distraction from the distraction and churned out another project real
quick that I got excited about. I've been messing about with usenet
of late and found that there really aren't a lot of good search
engines for it out there. The ones that _are_ out there and don't
completely suck usually cost money to access. That got me into this
rabbit hole of looking at usenet indexers, and trying to figure out
how to index the usenet myself.
                                 ▁▃▄▅▆▇▀█▀▜▍▀█▀▇▆▅▄▃▁
Usenet uses the NNTP protocol▁▃▆██▄▃▃▊▄▄▙ ▕▎ ▟▄▄▜▃▃▄██▆▃▁
and, like gopher, NNTP is  ▃▆▀▔▘▗▀ ▗▘  ▋  ▕▎  ▚  ▝▖ ▀▖▝▔▀▆▃
a pretty simple, plain   ▄█▛▗▝▘▆▀▘▕▛▔▔▐▎▔▔▔▎▔▔▝▍▔▔▜▎▝▀▆▝▘▖▜█▄
text, human-readable   ▄▛▔▘▗▘ ▗▘ ▁▞▁▁▁▟▂▂▂▂▂▂▂▂▙▁▁▁▚▁ ▝▖ ▝▖▝▔▜▄
protocol. So writing  ▟▊▞▀▆▀▀▜▀▔▔▐▔▔▔▔▋▔▔▔▕▎▔▔▔▐▔▔▔▔▍▔▔▀▛▀▀▆▀▚▐▙
a client is not all ▗▜▘▞ ▕▘  ▞   ▞   ▕▌   ▕▎   ▐▏   ▚   ▚  ▝▖ ▚▝▛▖
that difficult.    ▗▙▌▟▎ ▞  ▐▀▀▀▕▌   ▐▎   ▕▎   ▝▍   ▐▏▀▀▀▍  ▚ ▕▙▐▟▖
                   ▞▞▕▍ ▗▎  ▞   ▐▎   ▐▏   ▕▎   ▕▌   ▕▍   ▊  ▕▍ ▐▏▚▚
No, the problem   ▐_______ _______ _______ _______ _______ _______▎▍
with writing an   ▐   |   |     __|    ___|    |  |    ___|_     _|▉
indexer is rather ▐   |   |__     |    ___|       |    ___| |   |▏▐▐
very much a       ▐_______|_______|_______|__|____|_______| |___| ▐▐
problem of scale. ▌▌▕▍ ▕▍  ▐▏   ▊    ▐    ▕▎    ▋    ▊   ▕▍  ▐▏ ▐▏▐▐
                  ▚▂▁▌▁ ▌  ▕▍   ▐    ▐    ▕▎    ▋    ▋   ▗▎  ▐ ▁▐▁▂▉
You see, usenet   ▐▕▔▜▔▔▜▔▀▀▋▀▀▀▜▀▀▀▀▜▀▀▀▀▜▛▀▀▀▀▛▀▀▀▀▋▀▀▀▜▀▀▔▛▔▔▛▔▎▍
has a lot of posts.▚▚▕▍ ▝▎  ▚   ▐▎   ▐▏   ▕▎   ▕▌   ▕▍   ▊  ▕▍ ▐▏▞▞
Nearly 500 billion ▝▛▛▜▎ ▚▅▄▟▄▄▗▄▙▃▃▃▟▃▃▃▃▃▎▃▃▃▃▍▃▃▃▟▄▖▄▄▙▄▅▞ ▕▛▜▜▘
at this time, with  ▝▟▖▚ ▕▖  ▚   ▚   ▕▌   ▕▎   ▐▏   ▞   ▞  ▗▘ ▞▗▙▘
another 171 million   ▜▅▚▖▟▃▂▂▖▁▁▐▁▁  ▋   ▕▎   ▐  ▁▁▍▁▁▗▂▂▃▋▗▞▅▛
being added every day. ▀▙▁▖▝▖ ▝▔▔▔▜▔▔▔▜▔▔▔▐▎▔▔▔▛▔▔▔▛▔▔▔▘ ▗▘▗▁▟▀
                         ▀█▆▞▄▖▐▃▂▂▌▂▁▐▎▁▁▁▎▁▁▁▍▁▂▐▂▂▃▊▗▄▚▆█▀
This means you'll have to  ▀▜▄▂▃▝▄ ▝▖ ▔▊▔▔▔▎▔▔▞▔ ▗▘ ▄▘▃▂▄▛▀
optimize the crap out of how ▔▀▜█▙▀▇▀▆ ▝▖ ▕▎ ▗▘ ▆▀▇▀▟█▛▀▔
you're storing what you've       ▔▀▀▀▇▇▆█▆▁▎▆█▆▇▇▀▀▀▔
indexed,  as well as try everything 
you can to make it as fast as possible to search through that massive
amount of data.  I found this to be a very interesting problem,  so I 
kept at it.  The result is 'UsenetSearch'  -  yes, very inspiring and 
creative name, I know. 

  link to the code: https://tildegit.org/jns/UsenetSearch

I'll repeat some of the back-story from the readme there below:

  "
      I got interested in indexing usenet because I was
      disappointed with the existing  options.  There's
      horrible things like newznab, nntmux, etc,... all
      written  in  php,  and  seemingly  depend  on the 
      entire  body  of software ever written by mankind 
      in  the history of the world. I actually tried to 
      get  nntmux  running  on my bsd server, failed of 
      course. Why became immediately obvious as soon as
      I  looked  at  their  code.  It's php code, often 
      shelling  out  linux-only  shell commands. Ugh. I 
      was  so  appalled  I had to  go and write  my own 
      thing. After all, how hard can it be :)         
                                                      "

( If you are  the author of nntmux or newznab I apologize! - It's not 
  personal ;) -)

Anyway, ok - so how does this thing work... well, probably, for the
most part, I guess it's similar to how most search engines work, but
I will spell it out for those not familiar.

The indexer makes a connection to the NNTP server and requests a list
of all newsgroups. We then look for any newsgroups that match the
patterns we care to index. UsenetSearch lets you set regular
expressions for newsgroup names to index, rather than having to
explicitly list specific ones. If none are set, we index all of
usenet (hope you've got a lot of storage!!)
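
Here's a rough sketch of what that filtering step could look like in
C++. To be clear, the names and config handling here are mine, not
the actual UsenetSearch code:

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    // Given the group names returned by the NNTP LIST command and the
    // regular expressions from the config file, keep only the groups
    // we actually want to index. An empty pattern list means
    // "index everything".
    std::vector<std::string> filterGroups(
        const std::vector<std::string>& allGroups,
        const std::vector<std::regex>& patterns)
    {
        if (patterns.empty())
            return allGroups;              // no filters: index it all

        std::vector<std::string> selected;
        for (const auto& group : allGroups)
            for (const auto& re : patterns)
                if (std::regex_search(group, re)) {
                    selected.push_back(group);
                    break;                 // one match is enough
                }
        return selected;
    }

    int main()
    {
        std::vector<std::string> groups = {
            "comp.lang.c++", "alt.binaries.pictures",
            "rec.games.roguelike" };
        std::vector<std::regex> patterns = {
            std::regex("^comp\\."), std::regex("^rec\\.games\\.") };
        for (const auto& g : filterGroups(groups, patterns))
            std::cout << g << "\n";  // comp.lang.c++, rec.games.roguelike
    }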

For every newsgroup, we ask for all subject headers - starting at the
last indexed post. At this time we are only indexing subject headers.
I might support body indexing or storing dates and such in the future,
but I'm really trying to keep the storage costs down to a minimum
here, obviously.
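
On the wire the fetch looks roughly like this - a hand-typed
illustration with made-up article numbers and subjects; I'm not
claiming this is exactly what UsenetSearch sends, and some servers
prefer XOVER or HDR over XHDR:

    GROUP comp.lang.c++
    211 4500 100 4600 comp.lang.c++     (article count, first, last)
    XHDR Subject 4200-4600
    221 Header follows
    4200 Re: iterators vs indices
    4201 gophers are cute
    ...
    .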

These posts then get split up into batches that are fed into a thread
pool, where we start doing the indexing magic in parallel. Time to
put that CPU to work!
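
If you want a mental picture of that fan-out, here's a simplified
sketch. UsenetSearch has its own thread pool; std::async and the
indexSubject stub below are just stand-ins for illustration:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for the real per-subject work (filter, tokenize,
    // hash, store).
    void indexSubject(const std::string& subject)
    {
        std::cout << "indexing: " << subject << "\n";
    }

    // Split the fetched subjects into batches, index the batches in
    // parallel, then wait for all of them to finish.
    void indexAll(const std::vector<std::string>& subjects,
                  std::size_t batchSize)
    {
        std::vector<std::future<void>> jobs;
        for (std::size_t i = 0; i < subjects.size(); i += batchSize) {
            std::size_t end = std::min(i + batchSize, subjects.size());
            jobs.push_back(std::async(std::launch::async,
                [&subjects, i, end] {
                    for (std::size_t j = i; j < end; ++j)
                        indexSubject(subjects[j]);
                }));
        }
        for (auto& job : jobs)
            job.get();            // wait for every batch to finish
    }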

Every subject line gets filtered, then tokenized. So, how does
tokenization work, you ask? What is it? Well, basically we're trying
to generate every possible search string that would match the subject
line. So this means every possible combination of sequential words.
Ok, here's an example - say you wanted to tokenize the phrase:

    " gophers are cute "

The generated tokens would be:

 - gophers are cute
 - gophers are
 - gophers
 - are cute
 - are
 - cute

So you can see, a 3 word sentence generated us 6 tokens. In general,
n words yield n(n+1)/2 tokens, so as you can imagine, this can blow
up pretty quickly in size for long pieces of text. So we also do a
bunch of extra token filtering after the initial tokenization. We
remove duplicate tokens; some tokens may just be dropped; for others
we may only care to store them if they are an exact match to the
subject. I tried to make all that filtering configurable via the
usenetsearch configuration file, such that the end-user can make
their own tradeoff between search result quality and storage
consumption. Another important thing is that we can limit the
recursion depth of the tokenization algorithm. The algorithm itself
is basically just a loop within a loop, where the outer loop removes
a word from the start and the inner loop removes a word from the end
- each time storing what remains of the string as a token.

The actual C++ implementation of this splits the string by whitespace
into a vector of strings, and then makes use of iterators to take
subsets of word tokens. So instead of 'removing a word from the
left', you simply increment the outer iterator, and instead of
'removing a word from the right', you simply decrement the inner
iterator.
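
Here's a stripped-down sketch of that idea. It's not the actual
UsenetSearch code - the names and the way the depth limit is applied
are my own guesses:

    #include <cstddef>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    // Generate every contiguous run of words in `subject`, i.e.
    // every possible sequential-word search string. `maxDepth` caps
    // how many words the outer loop may strip from the front, which
    // bounds the blow-up for very long subjects (0 = no limit).
    std::vector<std::string> tokenize(const std::string& subject,
                                      std::size_t maxDepth = 0)
    {
        // Split on whitespace into a vector of words.
        std::istringstream iss(subject);
        std::vector<std::string> words{
            std::istream_iterator<std::string>(iss),
            std::istream_iterator<std::string>()};

        std::vector<std::string> tokens;
        std::size_t depth = 0;
        for (auto first = words.begin(); first != words.end();
             ++first, ++depth) {
            if (maxDepth != 0 && depth >= maxDepth)
                break;
            // Shrink from the right: [first, last) is the current token.
            for (auto last = words.end(); last != first; --last) {
                std::string token;
                for (auto it = first; it != last; ++it)
                    token += (it == first ? "" : " ") + *it;
                tokens.push_back(token);
            }
        }
        return tokens;
    }

    int main()
    {
        for (const auto& t : tokenize("gophers are cute"))
            std::cout << t << "\n";
        // gophers are cute / gophers are / gophers /
        // are cute / are / cute
    }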

Anyway, then each token gets hashed. We don't ever actually store the
subjects themselves, just the hashes of the subjects. This is because
a hash is a fixed size. It doesn't matter how long a sentence is, the
hash of the sentence will be a predictable number of bytes. This may
improve storage efficiency when the subjects are longer than the
hash, but may also harm it if they are shorter. In reality, they are
usually longer. But much more important than that, storing things
with a predictable size in a file makes it so you can seek to a
specific entry really quickly without having to do a linear search,
because you can calculate where something is going to be by simply
multiplying its index with the size of an entry. And what's more, we
can very quickly and efficiently search for a hash, because we can
pre-sort the hashes into a tree structure on disk. That way, we don't
have to search through an entire file of billions of hashes every
time. This is especially important because we're building a
reverse index: we're not matching some ID to a hash, we're matching
some hash to an (article) ID.
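
To make the 'seek by multiplication' bit concrete: with fixed-size
records, finding entry number i is a single seek. The entry layout
below is made up for illustration - the real UsenetSearch on-disk
format (headers, footers, tree layout) is more involved:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical fixed-size record: a 16-byte hash plus a 64-bit
    // article id. Every record has the same size, so record i lives
    // at byte offset i * sizeof(Entry).
    struct Entry {
        std::uint8_t  hash[16];
        std::uint64_t articleId;
    };

    bool readEntry(std::FILE* db, std::uint64_t index, Entry& out)
    {
        long offset = static_cast<long>(index * sizeof(Entry));
        if (std::fseek(db, offset, SEEK_SET) != 0)
            return false;
        return std::fread(&out, sizeof(Entry), 1, db) == 1;
    }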

So let's take our example string from before, and look at the hashes:

 - gophers are cute : 06e50092d57e0db6219c2ca6b80edbe5
 - gophers are      : 6f9d9543a11e72dc4f0134fd4874a666
 - gophers          : de96350d92c5eb383a0de9331933a212
 - are cute         : 7c4860f2c7ac43aab4345a34d190e2e0
 - are              : 4015e9ce43edfb0668ddaa973ebc7e87
 - cute             : 02829fb05c3076ec5a6caebd12477dec

So each of these tokens could be stored on the filesystem like:

./06/e5/00/06.db (contains the 06e50092d57e0db6219c2ca6b80edbe5 hash)
./6f/9d/95/6f.db (contains the 6f9d9543a11e72dc4f0134fd4874a666 hash)

etc... - in other words, you can take the first n digits of the hash
and build a super simple tree from them. Since these are hex digits,
the number of files on the filesystem is always tunable and
predictable - you don't want to end up building such a deep structure
that you run out of inodes. Moreover, using such a scheme one could
also distribute individual top-level folders over multiple different
filesystems, which could be useful for scalability reasons.
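
For illustration, deriving the file path from a hash could look like
this - the depth of three levels and the file naming are just my
reading of the example above, not necessarily what UsenetSearch does:

    #include <sstream>
    #include <string>

    // Turn a hex hash string into a path like ./06/e5/00/06.db by
    // using the first `levels` pairs of hex digits as directory
    // names. Tune the depth so the directory fan-out matches your
    // filesystem and inode budget.
    std::string tokenPath(const std::string& hexHash, int levels = 3)
    {
        std::ostringstream path;
        path << ".";
        for (int i = 0; i < levels; ++i)
            path << "/" << hexHash.substr(i * 2, 2);
        path << "/" << hexHash.substr(0, 2) << ".db";
        return path.str();
    }

    // tokenPath("06e50092d57e0db6219c2ca6b80edbe5")
    //   -> "./06/e5/00/06.db"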

Anyway, that's the rough outline of how it works. There's of course a
lot more to it. You don't want two threads writing to the same file,
so you have to use file locking, but then you also don't want to end
up in a situation where everything always wants to write to the same
token file, because then your program will just be waiting for locks
all the time. You also want to do whatever you can to avoid database
corruption should the daemon get killed uncleanly and whatnot. Right
now I just have a simple fixed header and footer to check for db file
integrity, although I may add a CRC later.
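
The header/footer idea boils down to something like the sketch below.
The magic values are placeholders, not the real on-disk format; the
point is just that a write which got cut short won't have a valid
footer, so the file can be flagged as suspect:

    #include <cstdint>
    #include <cstdio>

    constexpr std::uint32_t kHeaderMagic = 0x55534e31; // placeholder
    constexpr std::uint32_t kFooterMagic = 0x454e4431; // placeholder

    bool dbLooksIntact(std::FILE* db)
    {
        std::uint32_t header = 0, footer = 0;

        if (std::fseek(db, 0, SEEK_SET) != 0 ||
            std::fread(&header, sizeof(header), 1, db) != 1)
            return false;

        if (std::fseek(db, -static_cast<long>(sizeof(footer)),
                       SEEK_END) != 0 ||
            std::fread(&footer, sizeof(footer), 1, db) != 1)
            return false;

        return header == kHeaderMagic && footer == kFooterMagic;
    }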

At some point I may re-purpose some of this code and write a gopher
search engine. That would be a fun project too, but it would be a lot
more complicated, since there you'd have to index (potentially long)
bodies of text, account for endless loops, read robots.txt files, and
so on... so I might not ever get to it. I've also got a bunch of
gopher clients to finish, a gopher server, a game engine, etc etc etc.

Enough drivel for now :) Catch ya on the flip side!