Pseudorandomness

   Pseudorandom data is [1]data that appears (for example by its statistical
   properties) to have been generated by a [2]random process despite in fact
   having been generated by a [3]deterministic (i.e. non-random) process.
   I.e. it's a kind of "fake" but mostly [4]good enough randomness that
   arises from [5]chaotic systems -- systems that behave without randomness,
   by strict rules, but which scramble, shuffle, twist and transform the
   numbers in such a complicated way that they eliminate obvious patterns and
   leave the data looking very "random", though the numbers would be
   scrambled exactly the same way if the process was repeated with the same
   conditions, i.e. it is possible (if we know how the generator works) to
   exactly predict which numbers will fall out of a pseudorandom generator.
   This is in contrast to "true randomness" which (at least as most
   physicists probably believe) appears in some processes in nature (most
   notably in [6]quantum physics) and which is absolutely unpredictable,
   even in theory. Pseudorandomness is typically used to emulate true
   randomness in [7]computers because for many things ([8]games, [9]graphics,
   audio, random sampling, ...) it is absolutely sufficient, it is easy to do
   AND the repeatability of a pseudorandom sequence is actually an advantage
   to engineers, e.g. in [10]debugging in which we have to replicate bugs we
   find, or in programs that simply have to behave deterministically (e.g.
   many network games). True randomness is basically only ever needed for
   [11]cryptography/[12]security (or possibly for rare applications where we
   absolutely NEED to ensure lack of any patterns in the data), and it is a
   bit harder to achieve because we need some unbiased source of real-world
   random data. Pseudorandom generators are so common that in most contexts
   in programming the word "random" silently actually means just
   "pseudorandom".

   A saying about pseudorandom numbers states that "randomness is a task too
   important to be left to chance".

How It Works

   Firstly let's mention that we can use [13]look up tables, i.e. embed some
   high quality random data right into our program and then use that as our
   random numbers, taking one after another and getting back to start once we
   run out of them. This is for example how [14]Doom's pseudorandom generator
   worked. This is easy to do and extremely fast, but it will take up some
   memory and will offer only a sequence of quite limited length (your
   generator will have a short period), so ponder the pros and cons for your
   specific needs. From now on we'll leave this behind and focus on really
   GENERATING the pseudorandom values with some [15]algorithm, but look up
   tables may still be kept in mind (they might even perhaps be somehow
   combined with real generators).
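
   For illustration, here is a minimal sketch of such a table based
   generator -- note the table values here are just made up for the example
   (Doom shipped its own fixed 256 byte table); in practice you'd fill the
   table with some high quality random data:

 #include <stdint.h>

 // hypothetical tiny example table, fill with good random data in practice:
 const uint8_t randomTable[8] = { 47, 208, 13, 169, 91, 244, 120, 5 };

 unsigned int _randomIndex = 0;

 uint8_t randomFromTable(void)
 {
   uint8_t value = randomTable[_randomIndex];
   _randomIndex = (_randomIndex + 1) % 8; // wrap around: period = table size
   return value;
 }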

   There are possibly many ways to approach generating pseudorandom numbers
   (for example you can just run some chaotic [16]cellular automaton and then
   convert it to numbers somehow), however we'll be describing the most
   typical way, used in implementation of programming languages etc.

   Pseudorandom generators generate an infinite (but in practice eventually
   repeating) sequence of numbers from some initial starting number which we
   call a [17]seed (i.e. it's important to think in terms of sequences
   rather than of individual "random numbers" -- a number itself can hardly
   be random). For
   the same seed number the same sequence will always be generated (of course
   supposing we use the same generator). If you ask the generator for the
   next pseudorandom number, it will just generate the next number in the
   sequence and pass it to you. (In practice programs often use system time
   as the generator's seed number because that will make the program generate
   different numbers each time it is run.)
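
   As a quick demonstration of seeds, here is the idea shown with the
   standard C library generator (any other generator behaves alike): with a
   fixed seed the program prints the same numbers on every run (given the
   same C library), with time as the seed the numbers differ between runs.

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 int main(void)
 {
   srand(123); // fixed seed: same sequence on every run

   for (int i = 0; i < 3; ++i)
     printf("%d ", rand());

   srand(time(NULL)); // time as seed: different sequence (almost) every run

   for (int i = 0; i < 3; ++i)
     printf("%d ", rand());

   putchar('\n');
   return 0;
 }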

   The next number in the sequence is typically generated just from the
   previous number by performing some chaotic operations that transform it
   into a new number. However this is not entirely true, because then the
   generator would have the following weakness: if it ever generated number
   A followed by number B, then every number A would ALWAYS be followed by
   number B -- that doesn't look exactly random. For this reason the
   generator internally keeps a big number with which it operates and it only
   returns some N [18]bits (for example the highest 16 bits) of that number
   to you. This way it is possible (from your point of view) to have the same
   number be followed by different numbers (though this doesn't hold for the
   generator's internal value of course, but that doesn't bother us, we are
   only looking at the output sequence).

   The number of bits that the generator takes from its internal number and
   gives you determines the range of random values that you can get. For
   example if the generator gives you 16 bit numbers, you know the numbers
   will be in range 0 to 65535. (This can be used to also generate numbers in
   any range but we'll look at that later.)

   Now let's realize another important thing -- if the generator has some
   internal number, which is the only thing that determines the next number,
   and if its internal number has some limited size -- let's say for example
   32 bits -- then the sequence HAS TO start repeating sometimes because
   there is a limited number of values the internal number can be in and once
   we get to the same number, it will have to evolve the same way it evolved
   before (because we have a deterministic generator and the number is the
   whole generator's state). Imagine this as a [19]graph: numbers are nodes,
   the seed is the node we start in, there are finitely many nodes and each
   one leads to exactly one other node -- going through this graph you
   inevitably have to end up running in some loop. For this reason we talk
   about the period of a pseudorandom generator -- this period says after how
   many values the sequence will start to repeat. In fact it is possible that
   only a later part of the sequence repeats while the initial values never
   appear again -- if you imagine the graph, it is possible that an initial
   path leads us into some smaller loop in which we then keep cycling. This
   may depend on the
   seed, so the whole situation can get a bit messy, but we can resolve this,
   just hold on.

   It's not hard to see that the period of the generator can be at most 2 to
   the power of the number of bits of the generator's internal value (i.e.
   the number of possible distinct values the internal number can take, i.e.
   the number of nodes in
   the graph). We want to achieve this maximum period -- indeed, it is ideal
   if we can make it as long as possible, but achieving the maximum period
   will also mean the period won't depend on the initial seed! If you imagine
   the graph, having a big loop over all the values means that there are no
   other loops, there's just one long repeating sequence in which each
   internal value appears exactly once, so no matter where we start (which
   seed we set), we'll always end up being in the same big loop. In addition
   to this we ALSO get another awesome thing: the histogram (the count) of
   all values over the whole period will be absolutely uniform, i.e. every
   value generated during one period will appear exactly the same number of
   times (which is what we expect from a completely random, uniform
   generator) -- this can be seen from the fact that we are returning N bits
   of some bigger internal number of N + M bits, which will come through each
   possible value exactly once, so each possible value of the N output bits
   will have to appear once combined with each possible value of the
   remaining M bits, i.e. each output value appears exactly 2^M times.
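
   You can nicely see all of this on a toy generator whose internal number
   is small enough to examine by brute force -- the following sketch
   measures the period of an 8 bit generator (with constants chosen just for
   the example so that the full period is achieved, see the conditions
   further below) by simply stepping it until it returns to the seed:

 #include <stdio.h>
 #include <stdint.h>

 // toy 8 bit generator: x -> (5 * x + 1) mod 256; these constants give the
 // full period of 256 (the conditions for this are discussed below)
 uint8_t next(uint8_t x)
 {
   return 5 * x + 1; // uint8_t overflow does the mod 256 for us
 }

 int main(void)
 {
   uint8_t seed = 123, x = seed;
   unsigned int period = 0;

   do
   {
     x = next(x);
     period++;
   } while (x != seed); // full period => we always come back to the seed

   printf("period: %u\n", period); // prints 256, the maximum for 8 bits
   return 0;
 }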

   Now let's take a look at specific generators, i.e. types of algorithms to
   generate the numbers. There are many different kinds, for example
   [20]Mersenne Twister, middle square etc., however probably the most common
   type is the [21]linear congruential generator (LCG) -- though for many
   decades now it's been known this type of generator has some issues (for
   example less significant bits have shorter and shorter periods, which is
   why we usually want to use a very big internal value and return its
   highest bits as the result), it will likely suffice for most of your
   needs, but
   you have to be careful about choosing the generator's parameters
   correctly. It works like this: given a current (internal) number x[n]
   (which is initially set to the seed number), the next number in the
   sequence is computed as

   x[n + 1] = (A * x[n] + C) mod M

   where A, C and M are constants -- these are the parameters of the
   generator. I.e. the next number in the sequence is computed by taking the
   current number, multiplying it by some constant, adding some constant and
   then taking modulo (remainder after division) of this by some constant.
   The catch is in choosing the right constants A, C and M -- it seems there
   are only a few combinations of these constants that will yield quality
   sequences. Furthermore we try to make the equation fast to compute, so we
   often aim to choose M to be some power of two -- if we are for example
   using 32 bit numbers, then choosing M to be 2^32 is extremely convenient,
   we simply let the number overflow (the overflow does the same thing as
   modulo by 2^32) or, at worst, perform a simple logical [22]and to mask out
   the lowest bits. It would be cool to also have A be a power of two because
   then the multiplication too would be very fast and simple (just a bit
   shift), BUT this can't really be done because then the multiplication
   wouldn't actually achieve much randomness, it would just shift the number
   left, meaning the higher bits of x[n + 1] would really be the same as (or
   at least correlated with) the lower bits of x[n].

   So how to choose the A, C and M constants? Let's say we take M to be a
   power of two, for reasons stated above. Now it is proven (Hull-Dobell
   theorem) that we'll get the maximum possible period (the thing we
   absolutely WANT) if and only if all of the following conditions hold: M
   and C
   must have the highest common divisor 1 (i.e. they must be coprime) AND A -
   1 is divisible by all prime factors of M AND A - 1 is divisible by 4 if M
   is divisible by 4. So basically we want all of these conditions to always
   hold -- however even if they hold, we don't necessarily have a
   statistically good generator yet (to ensure this see the tests below). The
   value of C isn't very significant, it's enough if the above conditions
   hold for it, so many just use e.g. 1 or some prime number, but choosing A
   correctly turns out to be quite hard.
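
   Note that for M = 2^32 the conditions simplify a lot: the only prime
   factor of M is 2 and M is divisible by 4, so they reduce to just "C is
   odd and A mod 4 is 1". Here is a tiny sanity check of this for one row of
   the parameter table below:

 #include <stdio.h>
 #include <stdint.h>

 int main(void)
 {
   uint32_t A = 22695477, C = 1; // parameters from the table below

   // for M = 2^32 the Hull-Dobell conditions reduce to these two checks:
   printf("C is odd: %s\n", C % 2 == 1 ? "yes" : "no");
   printf("A mod 4 is 1: %s\n", A % 4 == 1 ? "yes" : "no");

   return 0;
 }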

   Here is a template for a [23]C linear congruential generator:

 T _currentRand; // the internal number (its type T implies the constant M)

 void randomSeed(T seed)
 {
   _currentRand = seed;
 }

 unsigned int random()
 {
   _currentRand = A * _currentRand + C; // overflow of T does the mod M
   return _currentRand >> S;            // return only the highest bits
 }

   Here T is the data type of the internal number (implying the M constant)
   -- use some fixed width type from stdint.h here. A is the multiplicative
   constant, C is the additive constant and S is a shift that implies how
   many highest bits of the internal number will be taken (and therefore what
   range of numbers we'll be getting). The following table gives you a few
   hopefully good values you can just plug into this snippet to get an
   alright generator:

   T        A                   C         S        note               
   uint32_t 32310901            37        16 or 24 A from L'Ecuyer    
   uint32_t 2891336453          123       16 or 24 A from L'Ecuyer    
   uint32_t 2147001325          715136305 16       from Entacher 1999 
   uint32_t 22695477            1         16       from Entacher 1999 
   uint64_t 2862933555777941757 12345     32 or 48 A from L'Ecuyer    
   uint64_t 3935559000370003845 1719      32 or 48 A from L'Ecuyer    

   { I pulled the above numbers from various sources I found (mentioned in
   the note column) and tried to select the ones that were allegedly good. I
   also quickly tested them myself: I checked that the period was at its
   maximum (at least for the 32 bit and smaller generators) and ran the
   output through ent, which reported good results. ~drummyfish }
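
   For example plugging the first row of the table into the template gives
   this concrete generator returning 16 bit values:

 #include <stdint.h>

 uint32_t _currentRand; // T = uint32_t, i.e. M = 2^32

 void randomSeed(uint32_t seed)
 {
   _currentRand = seed;
 }

 unsigned int random16(void)
 {
   _currentRand = 32310901 * _currentRand + 37; // A = 32310901, C = 37
   return _currentRand >> 16;                   // S = 16: highest 16 bits
 }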

   Let's also quickly mention another kind of generator as an alternative --
   the middle square plus Weyl sequence generator. The middle square
   generator was one of the first and is very simple: it starts with a number
   (the seed), squares it, takes its middle digits as the next number,
   squares that, takes its middle digits and so on. The issue with this was
   mainly getting the number 0, at which we get stuck. A 2022 paper by
   Widynski seems to have solved this issue by employing a so called Weyl
   sequence -- basically just adding some odd number in each step; though the
   theory is a bit more complex, the paper goes on to prove a high period of
   this generator. An issue seems to be with seeding the sequence -- the
   generator has three internal numbers and they can't just be blindly set to
   "anything" (the paper gives some hints on how to do this). Here is a 32
   bit variant of such a generator (the paper gives a 64 bit one):

   { I tried to make a 32 bit version of the generator, tried to choose the
   _rand3 constant well -- after quickly testing this the values of the
   generator looked alright, though I just mostly eyeballed the numbers, each
   bit separately, checked the mean of some 4000 values and the histogram of
   1 million values. I also ran this through ent and it gave good results
   except for the chi square test that reported the rarity of the sequence at
   something over 5% (i.e. not impossible, but something like 1 in 20
   chance). I'm not claiming this version to be statistically good, but it
   may be a start for implementing something nice, use at your own risk.
   ~drummyfish }

 #include <stdint.h>

 uint32_t _rand1,              // the middle square value
          _rand2,              // the Weyl sequence value
          _rand3 = 0x5e19dbae; // the Weyl sequence increment

 uint16_t random()
 {
   _rand2 += _rand3;                         // step the Weyl sequence
   _rand1 = _rand1 * _rand1 + _rand2;        // square and add the Weyl value
   _rand1 = (_rand1 >> 16) | (_rand1 << 16); // rotate by 16 bits

   return _rand1; // implicitly truncated to the lower 16 bits
 }

   NOTE on the code: multiplying _rand1 (32 bit) by itself keeps only the
   lower half of the full 64 bit result, and the (_rand1 >> 16) |
   (_rand1 << 16) rotation then moves the lower 16 bits of the square's
   middle digits into the low half of the word, which is effectively what
   the function (implicitly truncating to uint16_t) returns.

   Yet another idea might be to use some good [24]hash just on numbers 1, 2,
   3, 4 ... The difference here is we are not computing the pseudorandom
   number from the previous one, but we're computing Nth pseudorandom number
   directly from N. This will probably be slower. For example: { Again, no
   big guarantees, but I ran this through ent again and got very good
   results. ~drummyfish }

 #include <stdint.h>

 uint32_t _rand = 0;

 uint32_t random()
 {
   uint32_t x = _rand; // hash the current counter value
   _rand++;

   x = 303484085 * (x ^ (x >> 15)); // multiply-xorshift mixing rounds
   x = 985455785 * (x ^ (x >> 15));
   return x ^ (x >> 15);
 }

 void randomSeed(uint32_t seed)
 {
   _rand = seed;     // this alone just offsets the sequence
   _rand = random(); // attempt at a fix: start at the hash of the seed
 }

   How to generate a number in a certain desired range? As said, your
   generator will be giving you numbers of a certain fixed number of bits,
   usually something like 16 or 32, which means you'll be getting numbers in
   range 0
   to 2^bits - 1. But what if you want to get numbers in some specific range
   A to B (including both)? To do this you just need to generate a number in
   range 0 to B - A and then add A to it (e.g. to generate number from 20 to
   30 you generate a number from 0 to 10 and add 20). So let's just suppose
   we want a number in range 0 to N (where N can be B - A). Let's now suppose
   N is lower than the upper range of our generator, i.e. that we want to get
   the number into a small range (if this is not the case, we can arbitrarily
   increase the range of our generator simply by generating more random bits
   with it, i.e. we can join two 16 bit numbers to get a 32 bit number etc.).
   Now the most common way to get the number in the desired range is by using
   modulo (N + 1) operation, i.e. in [25]C we simply do something like int
   random0to100 = random() % 101;. This easily forces the number we get into
   the range we want. HOWEVER beware, there is one statistical trap about
   this called the modulo bias that makes some numbers slightly more likely
   to be generated than others, i.e. it biases our distribution a little bit.
   For example imagine our generator gives us numbers from 0 to 15 and we
   want to turn it into range 0 to 10 using the modulo operator, i.e. we'll
   be doing mod 11 operation -- there are two ways to get 0 (0 mod 11 and 11
   mod 11) but only one way to get 9 (9 mod 11), so number 0 is twice as
   likely here. In practice this effect isn't so strong and in many
   situations we don't really mind it, but we have to be aware of this
   effect for the sake of cases where it may matter. If necessary, the
   effect can be eliminated -- we may realize that there is no modulo bias
   if the range of our generator is a multiple of the range into which we're
   converting, so we can just repeatedly generate numbers until we get one
   below the highest such multiple that still fits into the generator's
   range, as the sketch below shows.
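
   Here is a sketch of this rejection approach, assuming the 16 bit
   random16() function from the example further above:

 // returns a uniform number in range 0 to n (inclusive), with no modulo
 // bias, assuming random16() returns uniform values in range 0 to 65535
 unsigned int randomRange(unsigned int n)
 {
   unsigned int limit = 65536 - (65536 % (n + 1)); // multiple of n + 1

   unsigned int x;

   do
     x = random16();
   while (x >= limit); // reject values above the limit, try again

   return x % (n + 1); // now each value is exactly equally likely
 }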

   What if we want [26]floating point/[27]fixed point numbers? Just convert
   the integer result to that format somehow, for example ((float) random())
   / ((float) RANDOM_MAX_VALUE) will produce a floating point number in range
   0 to 1.
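
   For example with the 16 bit random16() from above:

 float randomFloat(void)
 {
   return random16() / 65535.0f; // maps 0..65535 to 0.0..1.0 (inclusive)
 }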

   How to generate other probability distributions? Up until now we supposed
   a uniform probability distribution, i.e. the most random kind of generator
   that has an equal probability of generating any number. But sometimes we
   want a different distribution, i.e. we may want some numbers to be more
   likely to be generated than others. For this we normally start with the
   uniform generator and then convert the number into the new distribution.
   For that we may make use of the following:

     * Averaging many uniform distributions converges to normal distribution
       -- this is called central limit theorem and in fact works even more
       generally (the averaged distribution doesn't have to be uniform), but
       to us it's enough to know that if we want normally distributed random
       numbers, we can just average many uniformly distributed variables.
       Intuitively this makes sense -- averaging many numbers will likely be
       close to the mean value, it's very unlikely to be close to either end
       as to get an extreme average we would have to roll only numbers close
       to that one extreme (a code sketch of this follows after this list).
     * Elimination method: This is a very straightforward method, which
       however requires iteration (i.e. it may not be the fastest). If you
       know the probability density function of your desired distribution,
       you can randomly generate 2D points [x,y] and once one of them falls
       under the graph of the density function, you take x as the generated
       value. Again, this makes intuitive sense -- values that have low
       probability according to the distribution have also lower probability
       to be generated this way as they have a smaller range of passing
       values.
     * Inverse transform method: In this method we take the inverse function
       of our desired distribution's cumulative distribution function
       (function that for any x says the probability the outcome will be
       lower than or equal to x) and we pass to it a number in range 0 to 1
       generated with the vanilla uniform distribution -- this gives us the
       number in the desired distribution. However for this we have to know
       the cumulative distribution function's inverse function.
     * ...
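
   As an example here is a sketch of the first and the third method (using
   the randomFloat() from above; for the inverse transform method we choose
   the exponential distribution, whose cumulative distribution function
   1 - e^(-x) has the inverse function -ln(1 - u)):

 #include <math.h>

 // approximately normally distributed number (central limit theorem):
 // average 12 uniform values; the sum of 12 uniform values has mean 6 and
 // standard deviation 1, so we just shift it to get mean 0
 float randomNormal(void)
 {
   float sum = 0;

   for (int i = 0; i < 12; ++i)
     sum += randomFloat();

   return sum - 6.0f;
 }

 // exponentially distributed number via the inverse transform method
 float randomExponential(void)
 {
   // inverse of the CDF 1 - e^(-x); beware: if randomFloat() returns
   // exactly 1, log of 0 gives -infinity
   return -logf(1.0f - randomFloat());
 }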

   { If you want to look at some C code that tries to generate good
   pseudorandom numbers, take a look at gsl (GNU Scientific Library).
   ~drummyfish }

   TODO: some algorithms from the gsl library described at
   https://www.gnu.org/software/gsl/doc/html/rng.html

Quality/Testing Of Pseudorandom Sequences/Generators

   This topic alone can be extremely complex, you could probably do a [28]PhD
   on it, let's just do some overview for mere noobs like us.

   Firstly you want your generator to be simply a good program in general,
   i.e. you want it to be as fast, small and as [29]simple as possible while
   generating nice sequences. Consider that someone will want to use it to
   for example generate a white noise for a whole HD screen 60 times per
   second -- that may be something like 100 million numbers per second! We
   often judge generators by type and number of operations they need to
   generate the next number, as that will imply the speed -- as always, you
   generally want to avoid general division, branching, minimize
   multiplications and so on. Typically you'll probably want something like
   one multiplication and a few fast operations like an addition and modulo
   by power of two. That's basically what the congruential generators do. See
   also [30]optimization.

   However the core of a pseudorandom generator is the quality of the
   sequence itself, i.e. ensuring it really looks very random, that it has no
   patterns. To a noob this sounds easy, he thinks that if you just make some
   random shit with the numbers it's going to look random, he also thinks
   that the more things you do with the numbers the more random they'll look
   -- this is FALSE (typically if you just throw in more operations without
   thinking you'll make the sequence worse). It's not easy at all to make a
   random looking sequence of numbers! There is a danger in that with some
   effort you can make a sequence that will look random to you, but once you
   use the sequence to generate something -- for example some kind of noise
   -- you'll see there is actually some pattern, for example you'll see some
   straight lines in the noise etc. For some applications it may be
   sufficient to have a lower quality generator, but the danger lies in the
   cases in which you need a good generator and you think you have one while
   in fact you don't -- typically scientific simulations have to be extremely
   careful about this. For example if you're doing [31]genetic programming
   and you need to somehow randomly pick organisms so as to make the
   evolution fair -- if you have a bad generator, your program won't work
   because it may secretly be not too random in its choices, it will be
   unfair -- in this case it will be VERY hard for you to find the cause of
   why your program doesn't work (in the worst case you'll think it works and
   will trust its results when in fact the results are bad). To avoid this
   trap you need to actually test the quality of your sequence (TBH if you
   want a really good generator you should definitely use something better
   than the mentioned congruential generator). Let's see some general stuff
   you can try (considering the base case of generating uniformly distributed
   numbers):

     * Intuitive check with your senses: Obviously you'll take a look at the
       numbers and see if they look random, additionally you can also for
       example convert the numbers to another base, interpret the sequence as
       a sequence of different length words (8bit, 10bit, 16bit, ...),
       convert them to a string, reverse their bits and so on -- it should
       always look random. Next convert the data to something visual and
       maybe even auditory, your senses are trained to spot patterns in data.
       Firstly make a normal plot of the values, it should look like random
       noise, going up and down completely randomly. Then make a 2D image of
       the data -- for example plot the values as colors, plot each bit
       separately etc. -- keep changing the picture width and see if patterns
       appear. You can also try to make something like [32]Ulam spiral.
       Another way to make a picture (or a 3D point cloud) is to interpret
       the sequence as a sequence of 2D (or 3D) coordinates and plotting them
       as points. You must see no patterns in these images, such as clear
       lines, rectangles, ripples or chunks, it has to look like static
       [33]noise. You can possibly also try to play the data as a sound, you
       should hear no tones, or rhythm or anything, just static.
     * Each bit separately has to look random: Check each bit of the data
       separately, i.e. firstly take a look only at sequence of the lowest
       bits of each number -- it has to look random. If you see for example
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, ..., you see it's not really random (if
       this is the lowest bit the sequence here implies you're getting an odd
       number, then even, then odd, then even etc.). Do all the tests for
       this bit alone, then check another bit etc.
     * Check the sequence period: You want the longest possible period of
       your generator, i.e. if your generator has an N bit internal number
       and you start with number A, you get back to A after 2^N steps and no
       sooner -- this will ensure not only the maximum period length, but
       also that the period length will be the same for every starting seed!
       That's because in this ideal case you simply have a single cycle over
       all the possible internal number values.
     * Try to [34]compress the sequence: Truly random data should be
       basically impossible to compress -- you can exploit this fact and try
       to compress your sequence with some compression programs. It is ideal
       if the compression programs end up enlarging the file.
     * Statistical tests: Here you use objective mathematical tests -- there
       exist many very advanced tests, we'll only mention the very simple
       ones (a small code sketch of a few of these follows after this list).
          * [35]Histogram: Generate many numbers (but not more than the
            generator's period) and make a histogram (i.e. for every number
            count how many times it appeared) -- all numbers should appear
            roughly with the same frequency. If you make a nice generator,
            you should even see exactly the same count for every value
            generated -- this is explained above. Also count 1 and 0 bits in
            the whole sequence -- again there should be about the same number
            of them (exactly the same if you do it correctly). But keep in
            mind this only checks if you have correct frequencies of numbers,
            it says nothing about their distribution. Even a sequence 1, 2,
            3, 4, 5, ... will pass this.
          * Check statistical properties on (non-short) subintervals of the
            series: This will already take into account where the numbers
            appear in the sequence. For example check if the average value
            over some smaller intervals is always close to the middle value,
            which should hold in a series of random numbers; also the minimum
            and maximum value, histogram (distribution of the values) and
            other statistical measures should basically be close to being the
            same on smaller intervals as they are over the whole series if
            the intervals aren't very short -- i.e. just be careful about not
            picking too short intervals -- the smaller interval you pick, the
            more likely it will be (even in the ideal random sequence) that a
            statistical property will diverge from its expected value, i.e.
            don't test intervals smaller than let's say 10000 values.
          * [36]Fourier transform (and similar methods that give you the
            spectrum) -- the spectrum of the data should have equal amount of
            all frequencies, just like white noise.
          * [37]Correlation coefficients: This is kind of the real proof of
            randomness, ideally no values should be correlated in your data,
            so you can try to compute some correlation coefficients, for
            example try to compute how much correlation there is between
            consecutive numbers (it's similar to plotting the data as
            coordinates and seeing if they form a line or not) -- you should
            ideally find no significant correlations. [38]Autocorrelation may
            be a good test (basically you keep performing [39]dot product of
            the series with a shifted version of the same series -- this
            mustn't diverge too much from 0; ideal white noise has a high
            peak for 0 shift and then 0 values elsewhere).
          * Chi square test: Very common test for this kind of thing, see
            also poker test.
          * [40]Monte Carlo tests: Monte Carlo algorithms use random sampling
            to compute a certain desired value -- for example the value of
            [41]pi. These suppose that we can sample certain space randomly,
            so we can exploit this -- if we know what result we want to get
            (for example we already know the value of pi) we can test the
            algorithm with our generator and see if we get the desired result
            -- if we come close to the desired result, we can be a bit more
            confident that our sampling was random, however we cannot be
            certain of it -- like with any testing we can only ever be
            certain about the presence of an error, not about the lack of it.
            Even a very dense, regular grid of points would probably pass
            this.
          * The cool uber randomness test described in article on
            [42]randomness ;) Basically every number (and by extension any
            sequence of numbers) should be equally likely to be followed by
            any other number.
          * For the linear congruential generators there's a so called
            spectral test, it seems to be the one true test for that kind of
            generators, make sure to do it if you're aiming for the top
            generator.
          * ...
     * Test programs: There exist programs that do the automatic tests for
       you, for example [43]ent.
     * ...
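
   As promised, here is a small sketch of two of the simplest statistical
   checks mentioned above (the balance of 1 and 0 bits and a rough serial
   correlation of consecutive values), again assuming the 16 bit random16()
   function from the examples further above:

 #include <stdio.h>

 #define COUNT 1000000 // how many values to test

 int main(void)
 {
   unsigned long ones = 0; // count of 1 bits seen
   double sumX = 0, sumX2 = 0, sumXY = 0;
   unsigned int previous = random16();

   for (unsigned long i = 0; i < COUNT; ++i)
   {
     unsigned int x = random16();

     for (int b = 0; b < 16; ++b) // count the 1 bits of this value
       ones += (x >> b) & 0x01;

     sumX += x; // gather sums for the serial correlation estimate
     sumX2 += ((double) x) * x;
     sumXY += ((double) previous) * x;
     previous = x;
   }

   // the fraction of 1 bits should be very close to 0.5:
   printf("1 bits: %f\n", ((double) ones) / (16.0 * COUNT));

   // rough correlation of consecutive values, should be close to 0:
   double mean = sumX / COUNT;

   printf("serial correlation: %f\n",
     (sumXY / COUNT - mean * mean) / (sumX2 / COUNT - mean * mean));

   return 0;
 }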

See Also

     * [44]pseudo
     * [45]randomness
     * [46]noise
     * [47]bytebeat

Links:
1. data.md
2. random.md
3. determinism.md
4. good_enough.md
5. chaos.md
6. quantum.md
7. computer.md
8. game.md
9. graphics.md
10. debugging.md
11. cryptography.md
12. security.md
13. lut.md
14. doom.md
15. algorithm.md
16. cellular_automaton.md
17. seed.md
18. bit.md
19. graph.md
20. mersenne_twister.md
21. lcg.md
22. and.md
23. c.md
24. hash.md
25. c.md
26. float.md
27. fixed_point.md
28. phd.md
29. simple.md
30. optimization.md
31. genetic_programming.md
32. ulam_spiral.md
33. noise.md
34. compression.md
35. histogram.md
36. fourier_transform.md
37. correlation.md
38. autocorrelation.md
39. dot_product.md
40. monte_carlo.md
41. pi.md
42. randomness.md
43. ent.md
44. pseudo.md
45. randomness.md
46. noise.md
47. bytebeat.md