Markov Chain

   A Markov chain is a relatively simple [1]stochastic (i.e. working with
   probability) mathematical model for predicting or generating sequences
   of symbols. It can be used to describe some processes happening in the
   [2]real world such as the behavior of some animals, Brownian motion or
   the structure of a language. In the world of programming Markov chains
   are pretty often used to generate texts that look like some template
   text whose structure the Markov chain has learned (Markov chains are
   one possible model used in [3]machine learning). Chatbots are just one
   example.

   There are different types of Markov chains. Here we will be focusing on
   discrete time Markov chains with finite state space as these are the ones
   practically always used in programming. They are also the simplest ones.

   Such a Markov chain consists of a finite number of states S0, S1, ...,
   Sn. Each state Si has a certain probability of transitioning to each
   state (including transitioning back to itself), i.e. P(Si,S0),
   P(Si,S1), ..., P(Si,Sn); these probabilities have to, of course, add
   up to 1, and some of them may be 0. These probabilities can
   conveniently be written down as a square matrix with one row and one
   column per state, the so called transition matrix.

   Basically a Markov chain is like a [4]finite state automaton whose
   transition arrows are labeled with probabilities instead of input
   symbols.

Example

   Let's say we want to create a simple [5]AI for an NPC in a video [6]game.
   At any time this NPC is in one of these states:

     * Taking cover (state A):
          * 50% chance to stay in cover
          * 50% chance to start looking for a target
     * Searching for a target (state B):
          * 50% chance to remain searching for a target
          * 25% chance to start shooting at what it's looking at
          * 25% chance to throw a grenade at what it's looking at
     * Shooting a bullet at the target (state C):
          * 70% chance to remain shooting
          * 10% chance to throw a grenade
          * 10% chance to start looking for another target
          * 10% chance to take cover
     * Throwing a grenade at the target (state D):
          * 50% chance to shoot a bullet
          * 25% chance to start looking for another target
          * 25% chance to take cover

   Now it's pretty clear this description gets a bit tedious; it's
   better, especially with even more states, to write the probabilities
   as a matrix (rows represent the current state, columns the next
   state):

     A    B    C    D    
   A 0.5  0.5  0    0    
   B 0    0.5  0.25 0.25 
   C 0.1  0.1  0.7  0.1  
   D 0.25 0.25 0.5  0    

   We can see a few things: the NPC can't immediately attack from cover,
   it has to search for a target first. It also can't throw two grenades
   in succession etc. Let's note that this model will now be yielding
   random sequences of actions such as [cover, search, shoot, shoot,
   cover] or [cover, search, search, grenade, shoot], but some of them
   will be more likely than others (the probability of a whole sequence
   is simply the product of the transition probabilities along it, so for
   example the first sequence above has probability 0.5 * 0.25 * 0.7 *
   0.1 = 0.00875, i.e. a bit less than 1%) and some will be downright
   impossible (e.g. two grenades in a row). Notice a similarity to for
   example natural language: some words are more likely to be followed by
   some words than others (e.g. the word "number" is more likely to be
   followed by "one" than for example "cat").
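
   Just to illustrate the idea, here is a minimal C sketch (not part of
   the original example; the function and variable names are made up)
   that simulates this NPC by repeatedly choosing its next state with a
   weighted random pick according to the rows of the matrix above:

 #include <stdio.h>
 #include <stdlib.h>

 #define STATES 4 // A = cover, B = search, C = shoot, D = grenade

 // the transition matrix from above: row = current state, column = next
 const double transition[STATES][STATES] =
 {
   {0.5,  0.5,  0,    0   },  // A: taking cover
   {0,    0.5,  0.25, 0.25},  // B: searching for a target
   {0.1,  0.1,  0.7,  0.1 },  // C: shooting
   {0.25, 0.25, 0.5,  0   }   // D: throwing a grenade
 };

 const char *stateName[STATES] = {"cover", "search", "shoot", "grenade"};

 // picks the next state: generates a random number in [0,1) and finds
 // which "slot" of the current matrix row it falls into
 int nextState(int current)
 {
   double r = rand() / (RAND_MAX + 1.0);
   double sum = 0;

   for (int i = 0; i < STATES; ++i)
   {
     sum += transition[current][i];

     if (r < sum)
       return i;
   }

   return STATES - 1; // fallback against rounding errors
 }

 int main(void)
 {
   srand(123);

   int state = 0; // start in cover

   for (int i = 0; i < 20; ++i) // print a random sequence of 20 actions
   {
     printf("%s ", stateName[state]);
     state = nextState(state);
   }

   putchar('\n');
   return 0;
 }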

Code Example

   Let's write an extremely primitive Markov bot that will work on the
   level of individual text characters. It will take a training text as
   its input, for example a book, and learn the probabilities with which
   any letter is followed by another letter. Then it will generate random
   output according to these probabilities, something that should
   resemble the training text. Yes, you may say we are doing a super
   simple form of [7]machine learning.

   Keep in mind this example is really extremely simple: it only looks
   one letter back and makes some further simplifications, for example it
   only approximates the probabilities with kind of a [8]KISS hack -- we
   won't record any numeric probabilities, we'll only hold a table of
   letters, each one having a "bucket" of letters that may possibly
   follow it. During training we'll always throw a preceding letter's
   follower to a random place in the preceding letter's bucket, with the
   idea that once we finish training, any bucket will statistically
   contain more of the letters that are likely to follow the given
   letter, simply because we threw more such letters in. Similarly when
   generating the output text we will choose a letter to follow the
   current one by looking into the table and pulling out a random
   follower from that letter's bucket, again hoping that letters with
   greater presence in the bucket will be more likely to be randomly
   selected. This approach has issues, for example regarding the question
   of ideal bucket size, and it introduces statistical biases (maximum
   probability is limited by bucket size, order matters, later letters
   are kind of privileged), but it kind of works. Try to think of how we
   could make a better text generator -- for starters it might work on
   the level of words and could take into account a history of let's say
   three words, i.e. it would record triplets of words and then a list of
   words likely to follow each triplet, along with each word's
   probability recorded as an actual number to make the probabilities
   accurate.
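
   To hint at how that might look, here is a tiny C sketch (purely
   hypothetical: the word triplet, the follower words and their counts
   are all made up) of storing exact probabilities as counts and picking
   a follower proportionally to them, for a single word triplet:

 #include <stdio.h>
 #include <stdlib.h>

 // hypothetical learned data for one word triplet, e.g. "one of the":
 // words seen to follow it in the training text, each with its count
 #define FOLLOWERS 4

 const char *follower[FOLLOWERS] = {"most", "best", "few", "main"};
 const unsigned int count[FOLLOWERS] = {7, 3, 2, 1};

 // picks a follower with probability count / total, i.e. with the exact
 // learned probability (unlike the bucket hack used below)
 const char *pickFollower(void)
 {
   unsigned int total = 0;

   for (int i = 0; i < FOLLOWERS; ++i)
     total += count[i];

   unsigned int r = rand() % total; // random number in [0, total)

   for (int i = 0; i < FOLLOWERS; ++i)
   {
     if (r < count[i])
       return follower[i];

     r -= count[i];
   }

   return follower[0]; // can't be reached
 }

 int main(void)
 {
   srand(123);

   for (int i = 0; i < 10; ++i)
     printf("one of the %s\n", pickFollower());

   return 0;
 }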

   Anyway, with all this said, below is a [9]C program implementing the
   above described text generator. To use it just pipe some input ASCII
   text to it, but keep the input reasonably sized (a few thousand lines
   maybe; please don't feed it the whole Britannica, the output won't get
   any better), and keep in mind the program always trains itself from
   scratch (in practice we might separate training from generation, as
   serious training might take very long, i.e. we would have a separate
   training program that would output a trained model, i.e. the learned
   probabilities, and then a generator that would only take the trained
   model and generate the output text). Here is the code:

 #include <stdio.h>
 #include <stdlib.h>

 #define OUTPUT_LEN 10000   // length of generated text
 #define N 16               // bucket size for each letter
 #define SEED 123456
 #define IGNORE_NEWLINES 1

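 // for every byte value a bucket of N bytes that may follow it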
 unsigned char charFollowers[256][N];

 int main(void)
 {
   srand(SEED);

   for (int i = 0; i < 256; ++i)
     for (int j = 0; j < N; ++j)
        charFollowers[i][j] = ' ';

   unsigned char prevChar = 0;

   while (1)
   {
     int c = getchar();

     if (c == EOF)
       break;

 #if IGNORE_NEWLINES
     if (c == '\n')
       c = ' ';
 #endif

     charFollowers[prevChar][rand() % N] = c; // put char at random place
     prevChar = c;
   }

   prevChar = ' ';

   for (int j = 0; j < OUTPUT_LEN; ++j) // now generate the output
   {
     prevChar = charFollowers[prevChar][rand() % N]; // take random follower
     putchar(prevChar);
   }

   puts("\n");
   return 0;
 }
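
   For example you might compile this with something like gcc -o markov
   markov.c and then run ./markov < input.txt (the file name here is of
   course just an example).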

   Trying it out on the text of [10]this wiki may output something like this:

 Ther thellialy oris threstpl/pifragmediaragmagoiby s agmexes, den
 atss pititpenaraly d thiplio s ts gs, tis wily NU gmarags gos
 aticel/EEECTherixed atstixedells, s s ores agolltixes tixe. TO: N
 s, s, TOpedatssth NUCAPorag: puffrits, pillly ars agmen No tpix abe
 aghe. aragmed ssh titixen plioix ag: Th tingoras TOD s wicipixe d
 tpllifr.edarenexeramed Thecospix ts ts s osth s pes ovipingor
 g: agors agass s TOnamand s aghech th wopipistalioiaris agontibuf
 ally Thrixtply tiaceca th oul/EEEEEEEECPU), wicth NU athed wen
 aragag athichipl Thechixthass s gmelliptilicex th ostunth gmagh
 atictpixe. ar Th on wipixexepifrag gman g: sthabopl/te.

   We see that at first glance it looks a bit like English text, even
   with some quirks specific to this wiki, for example here and there
   having FULL CAPS words (due to acronyms and also rants that often
   appear here). It even generated the word "CPU". Notice the algorithm
   correctly learned punctuation, i.e. it knows that after commas and
   periods there is almost always a space and after a space there is
   usually not another space. For comparison here is a Spanish-like text
   generated from Don Quixote (with accents removed):

 Diloma Dadro hacaci gua usta lesano strore sto do diaco; ma ro
 hiciso stue ue dita. do que menotamalmeci ma quen do gue lo;
 denestajo qucos rdo horor Da que qunca. quadombuce que queromiderbre
 hera ha rlabue F de querdos Dio macino; dombidrompo mi ste derdiba
 l, mbiolo Ferbes l ste s lolo que ha Du hano quenore Dio ueno que
 hala F uano he Dorame de qus rl, ha didesa que halanora Fla quco
 dil qucio ue do mestostaloste hados de gusta querana. stuce F s s
 Do lo dre s hal Fro gue sa sa. la sido la dico; hado mbuno Do.
 mororo; rdenaja. qunolole Diba. do. Fa gor stamestamo ha quno
 unostabro quero mue s Diado Didota. quencoralor dio sotomo Fuen
 que halora. gunore quabrbe rol gostuno hadolmbe Da que unendor
 que le di so; qunta rajos s F de qucol

   We see that some of the shorter words like lo, le, de, he, que and
   sido are real Spanish words. Though the punctuation is quite nice, the
   algorithm fails to learn that after a period the first word of the
   next sentence should start with a capital letter (it only does so
   sometimes by pure chance) -- this is due to the algorithm only seeing
   one character back; after a period there also comes a space, which
   already makes the algorithm forget about the period. This would be
   addressed by keeping a longer history, as said above. Now let's try a
   different kind of text altogether, let's try to feed in the source
   code of [11]Anarch:

   2  camechererea = 20;
 #erereppon.xereponioightFuaighe16_ARABEIUnst 
 chtreraySqua->rarepL_RCL_CL_PE;
   caminsin.yDINeramaxer = costRCL_PERCL_ditsins->pL_ime1
    = 0;

   * = RCL_dime1,y 1)
   0;
 }
 }
   ck;
   camererayDimameaxSqua ca  = ca->ra caininin.xS_UAME;
   caminstFua-> 0 0;
 }   ca->ponstramiomereaxSquts  chts 154;
   1)

   Here it's pretty clear the code won't work but its structure really does
   resemble the original source: curly brackets and semicolons are correctly
   followed by newlines, assignments look pretty correct as well, dereference
   arrows (->) appear too -- the code even generated the RCL_ prefix of the
   [12]raycastlib functions that's widely seen in the original code.

Links:
1. stochastic.md
2. real_world.md
3. machine_learning.md
4. finite_state_automaton.md
5. ai.md
6. game.md
7. machine_learning.md
8. kiss.md
9. c.md
10. LRS_wiki.md
11. anarch.md
12. raycastlib.md