Floating Point

   In programming floating point (colloquially just float) is a way of
   representing [1]fractional numbers (such as 5.13) and approximating
   [2]real numbers (i.e. numbers with higher than [3]integer precision), one
   that's a bit more complex than simpler methods of doing so (such as
   [4]fixed point). The core idea of it is to use a radix ("decimal") point
   that's not fixed but can move around so as to allow representation of both
   very small and very big values. Nowadays floating point is the standard
   way of [5]approximating [6]real numbers in computers (floating point types
   are even called real in some programming languages, though they represent
   only [7]rational numbers -- floats can't e.g. represent [8]pi exactly).
   Basically all of the popular [9]programming languages have a floating
   point [10]data type that adheres to the IEEE 754 standard, all personal
   computers also have a floating point hardware unit ([11]FPU) and so it is
   widely used in all [12]modern programs. However most of the time a simpler
   representation of fractional numbers, such as the mentioned [13]fixed
   point, suffices, and weaker computers (e.g. [14]embedded) may lack the
   hardware support, so that floating point operations have to be emulated in
   software and are therefore slow -- remember, float rhymes with [15]bloat.
   Prefer fixed point.

   Floating point is tricky: it works most of the time, but a danger lies in
   programmers relying on this kind of [16]magic too much -- some newer
   generation programmers may not even be aware of how float works at all.
   Even though the principle is not so hard, the emergent behavior of the
   math gets really complex. One floating point expression may evaluate
   differently
   on different systems, e.g. due to different rounding settings. Floating
   point can introduce [17]chaotic behavior into linear systems as it
   inherently makes rounding errors and so becomes a nonlinear system
   (source: http://foldoc.org/chaos). One common pitfall of float is working
   with big and small numbers at the same time -- due to differing precision
   at different scales small values simply get lost when mixed with big
   numbers and sometimes this has to be worked around with tricks (see e.g.
   [18]this devlog of The Witness where a float time variable that is sent
   into a [19]shader is periodically reset so as to not grow too large and
   cause the mentioned issue). Another famous trickiness of floats is that
   you shouldn't really compare them for equality with the normal == operator
   as small rounding errors may make even mathematically equal expressions
   unequal -- instead you should use some range comparison, as shown below.
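
   A minimal C example of both issues (the tolerance of 1e-9 is an arbitrary
   choice for this example, what's appropriate depends on the application):

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
    // a small value gets lost when mixed with a big one:
    float big = 100000000.0f; // 10^8, exactly representable, spacing here is 8
    float small = 1.0f;
    float sum = big + small;

    printf("%f\n",sum); // prints 100000000.000000, the 1 got lost

    // don't test equality with ==, compare against a small range instead:
    double a = 0.1 + 0.2;

    printf("%d\n",a == 0.3);             // prints 0: mathematically equal, unequal here
    printf("%d\n",fabs(a - 0.3) < 1e-9); // range comparison: prints 1

    return 0;
  }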

   And there is more: floating point behavior really depends on the language
   you're using (and possibly even the compiler, its settings etc.) and it
   may not always be completely defined, leading to possible
   [20]nondeterministic
   behavior which can cause real trouble e.g. in physics engines.

   { Really as I'm now getting down the float rabbit hole I'm seeing what a
   huge mess it all is, I'm not nearly an expert on this so maybe I've
   written some BS here, which just confirms how messy floats are. Anyway,
   from the articles I'm reading even being an expert on this issue doesn't
   seem to guarantee a complete understanding of it :) Just avoid floats if
   you can. ~drummyfish }

   Is floating point literal evil? Well, of course not, but it is extremely
   overused. You may need it for precise scientific simulations, e.g.
   [21]numerical integration, but as our [22]small3dlib shows, you can
   comfortably do even [23]3D rendering without it. So always consider
   whether you REALLY need float. You mostly do NOT need it.

   Simple example of avoiding floating point: many noobs think that if they
   e.g. need to multiply some integer x by let's say 2.34 they have to use
   floating point. This is of course false and just proves most retarddevs
   don't know elementary school [24]math. Multiplying x by 2.34 is the same
   as (x * 234) / 100, which we can [25]optimize to an approximately equal
   division by power of two as (x * 2396) / 1024. Indeed, given e.g. x = 56
   we get the same integer result 131 in both cases, the latter just
   completely avoiding floating point.
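
   For example in C (using the numbers from the paragraph above):

  #include <stdio.h>

  int main(void)
  {
    int x = 56;

    printf("%d\n",(x * 234) / 100);   // multiply by exactly 2.34: prints 131
    printf("%d\n",(x * 2396) / 1024); // power of two approximation: prints 131
    printf("%d\n",(x * 2396) >> 10);  // same, with the division as a bit shift

    return 0;
  }

   (Of course one has to watch out for the multiplication overflowing the
   integer type for big values of x.)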

How It Works

   The very basic idea is the following: we have digits in memory and in addition
   we have a position of the radix point among these digits, i.e. both digits
   and position of the radix point can change. The fact that the radix point
   can move is reflected in the name floating point. In the end any number
   stored in float can be written with a finite number of digits with a radix
   point, e.g. 12.34. Notice that any such number can also always be written
   as a simple fraction of two integers (e.g. 12.34 = 1 * 10 + 2 * 1 + 3 *
   1/10 + 4 * 1/100 = 617/50), i.e. any such number is always a rational
   number. This is why we say that floats represent fractional numbers and
   not true real numbers (real numbers such as [26]pi, [27]e or square root
   of 2 can only be approximated).

   More precisely floats represent numbers with two main parts: the actual
   encoded digits, called mantissa (or significand etc.), and the position of
   the radix point. The position of the radix point is called the exponent
   because mathematically floating point works similarly to the scientific
   notation of extreme numbers, which uses exponentiation. For example
   instead of writing 0.0000123 scientists write
   123 * 10^-7 -- here 123 would be the mantissa and -7 the exponent.

   Though various numeric bases can be used, in [28]computers we normally use
   [29]base 2, so let's consider it from now on. Our numbers will then be of the
   format:

   mantissa * 2^exponent

   Note that besides mantissa and exponent there may also be other parts --
   typically there is also a sign bit that says whether the number is
   positive or negative.

   Let's now consider an extremely simple floating point format based on the
   above. Keep in mind this is an EXTREMELY NAIVE inefficient format that
   wastes values. We won't consider negative numbers. We will use 6 bits for
   our numbers:

     * 3 leftmost bits for mantissa: This allows us to represent 2^3 = 8 base
       values: 0 to 7 (including both).
     * 3 rightmost bits for exponent: We will encode exponent in [30]two's
       complement so that it can represent values from -4 to 3 (including
       both).

   So for example the binary representation 110011 stores mantissa 110 (6)
   and exponent 011 (3), so the number it represents is 6 * 2^3 = 48.
   Similarly 001101 represents 1 * 2^-3 = 1/8 = 0.125.
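
   A minimal C sketch of decoding this toy format (the function name here is
   just made up for the example):

  #include <stdio.h>
  #include <math.h>

  /* decode a number in the toy 6 bit format: 3 leftmost bits are the
     mantissa, 3 rightmost bits the exponent (in two's complement) */
  double toyFloatDecode(unsigned int x)
  {
    int mantissa = (x >> 3) & 0x07;
    int exponent = x & 0x07;

    if (exponent >= 4)   // two's complement: 100 to 111 are negative values
      exponent -= 8;

    return ldexp(mantissa,exponent); // mantissa * 2^exponent
  }

  int main(void)
  {
    printf("%f\n",toyFloatDecode(0x33)); // 110011: prints 48.000000
    printf("%f\n",toyFloatDecode(0x0d)); // 001101: prints 0.125000

    return 0;
  }

   Looping the input from 0 to 63 and printing the results also nicely shows
   the non-uniform distribution of values discussed below.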

   Note a few things: firstly our format is [31]shit because some numbers
   have multiple representations, e.g. 0 can be represented as 000000,
   000001, 000010, 000011 etc., in fact we have 8 zeros! That's unforgivable
   and formats used in practice address this (usually by prepending an
   implicit 1 to mantissa).

   Secondly notice the non-uniform distribution of our numbers: while we have
   a nice resolution close to 0 (we can represent 1/16, 2/16, 3/16, ...), our
   resolution in high numbers is low (the highest number we can represent is
   56 but the second highest is 48, we can NOT represent e.g. 50 exactly).
   Realize that obviously with 6 bits we can still represent only 64 numbers
   at most! So float is NOT a magical way to get more numbers, with integers
   on 6 bits we can represent numbers from 0 to 63 spaced exactly by 1 and
   with our floating point we can represent numbers spaced as close as
   1/16th, but only in the region near 0 -- we pay the price of having big
   gaps in higher numbers.

   Also notice that things like simple addition of numbers become more
   difficult and time consuming as they have to include conversions and
   [32]rounding -- while with fixed point addition is a single machine
   instruction, same as integer addition, here with a software implementation
   we might end up with dozens of instructions (specialized hardware can
   perform the addition fast but still, not all computers have that
   hardware).
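
   For illustration, here is a naive C sketch of how addition in our toy
   format might be implemented (it simply truncates bits that get shifted out
   instead of doing proper rounding and doesn't handle exponent overflow):

  #include <stdio.h>

  /* naively add two toy floats given as (mantissa, exponent) pairs, the
     result is again squeezed into a 3 bit mantissa and 3 bit exponent */
  void toyFloatAdd(int m1, int e1, int m2, int e2, int *mR, int *eR)
  {
    if (e1 < e2) // swap so that the first operand has the bigger exponent
    {
      int t;
      t = m1; m1 = m2; m2 = t;
      t = e1; e1 = e2; e2 = t;
    }

    m2 >>= e1 - e2; // align the radix points, dropped bits = rounding error

    *mR = m1 + m2;
    *eR = e1;

    while (*mR > 7) // mantissa overflowed 3 bits: renormalize
    {
      *mR >>= 1;
      *eR += 1;
    }
  }

  int main(void)
  {
    int m, e;

    toyFloatAdd(7,0,7,0,&m,&e);  // 7 + 7
    printf("%d * 2^%d\n",m,e);   // prints 7 * 2^1, i.e. 14, exact

    toyFloatAdd(6,3,1,-3,&m,&e); // 48 + 0.125
    printf("%d * 2^%d\n",m,e);   // prints 6 * 2^3, i.e. 48, the 0.125 got lost

    return 0;
  }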

   Rounding errors will appear and accumulate during computations: imagine
   the operation 48 + 1/8. Both numbers can be represented in our system but
   not the result (48.125). We have to round the result and end up with 48
   again. Imagine you perform 64 such additions in succession (e.g. in a
   loop): mathematically the result should be 48 + 64 * 1/8 = 56, which is a
   result we can represent in our system, but we will nevertheless get the
   wrong result (48) due to rounding errors in each addition. So the behavior
   of float can be unintuitive and dangerous, at least for those who don't
   know how it works.
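
   The same kind of thing happens with real floats, just at a bigger scale:
   e.g. with a typical 32 bit float the gap between consecutive values from
   2^24 upwards is already 2, so repeatedly adding 1 gets us nowhere:

  #include <stdio.h>

  int main(void)
  {
    float x = 16777216.0f; // 2^24, from here up consecutive floats are 2 apart

    for (int i = 0; i < 64; ++i)
      x += 1.0f;           // each addition rounds right back down to 2^24

    printf("%f\n",x);      // prints 16777216.000000 instead of 16777280

    return 0;
  }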

Standard Float Format: IEEE 754

   IEEE 754 is THE standard that basically all computers use for floating
   point nowadays -- it specifies the exact representation of floating point
   numbers as well as rounding rules, operations that implementations are
   required to provide etc. However note that the standard is kind of
   [33]shitty --
   even if we want to use floating point numbers there exist better ways such
   as [34]posits that outperform this standard. Nevertheless IEEE 754 has
   been established in the industry to the point that it's unlikely to go
   away anytime soon. So it's good to know how it works.

   Numbers in this standard are signed, have positive and negative zero
   (oops), and can represent plus and minus [35]infinity as well as different
   [36]NaNs (not a number). In fact, depending on the format, there are
   thousands (binary16), millions (binary32) or even many more (binary64 and
   bigger) different NaN values, which are basically wasted values. These
   inefficiencies are addressed by the mentioned [37]posits.

   Briefly the representation is the following (hold on to your chair): the
   leftmost bit is the sign bit, then the exponent follows (the number of
   bits depends on the specific format), the rest of the bits is the
   mantissa. In the mantissa an implicit 1. is considered (except when the
   exponent is all 0s), i.e. we "imagine" 1. in front of the mantissa bits
   but this 1 is not physically stored. The exponent
   is in so called biased format, i.e. we have to subtract half (rounded
   down) of the maximum possible value to get the real value (e.g. if we have
   8 bits for exponent and the directly stored value is 120, we have to
   subtract 255 / 2 = 127 to get the real exponent value, in this case we get
   -7). However two values of exponent have special meaning; all 0s signify
   so called denormalized (also subnormal) number in which we consider
   exponent to be that which is otherwise lowest possible (e.g. -126 in case
   of 8 bit exponent) but we do NOT consider the implicit 1 in front of
   mantissa (we instead consider 0.), i.e. this allows storing [38]zero
   (positive and negative) and very small numbers. All 1s in exponent signify
   either [39]infinity (positive and negative) in case mantissa is all 0s, or
   a [40]NaN otherwise -- considering here we have the whole mantissa plus
   sign bit unused, we actually have many different NaNs ([41]WTF), but
   usually we only distinguish two kinds of NaNs: quiet (qNaN) and signaling
   (sNaN, throws an [42]exception) that are distinguished by the leftmost
   bit in mantissa (1 for qNaN, 0 for sNaN).
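
   The following C sketch pulls the three fields out of a 32 bit float
   (assuming the C float type is the 32 bit IEEE 754 format, i.e. binary32
   from the table below, which is practically always the case nowadays):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void)
  {
    float f = 0.75f; // 0.75 = 1.5 * 2^-1
    uint32_t bits;

    memcpy(&bits,&f,sizeof(bits)); // reinterpret the float's bits as an integer

    unsigned sign     = bits >> 31;          // 1 bit
    unsigned exponent = (bits >> 23) & 0xff; // 8 bits, biased by 127
    unsigned mantissa = bits & 0x7fffff;     // 23 bits, implicit 1. in front

    printf("sign: %u, exponent: %d, mantissa: 0x%x\n",
      sign,(int) exponent - 127,mantissa);
    // prints: sign: 0, exponent: -1, mantissa: 0x400000 (fractional bits
    // 100...0 = 0.5, with the implicit 1. the mantissa is 1.5)

    return 0;
  }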

   The standard specifies many formats that are either binary or decimal and
   use various numbers of bits. The most relevant ones are the following:

   name                    M bits E bits smallest and biggest number      precision <= 1 up to
   binary16                10     5      2^(-24), 65504                   2048
   (half precision)
   binary32 (single        23     8      2^(-149), 2^127 * (2 - 2^-23)    16777216
   precision, float)                     ~= 3 * 10^38
   binary64 (double        52     11     2^(-1074), ~10^308               9007199254740992
   precision, double)
   binary128 (quadruple    112    15     2^(-16494), ~10^4932             ~10^34
   precision)

   Example? Let's say we have the float (binary32) value
   11000000111100000000000000000000: first bit (sign) is 1 so the number is
   negative. Then we have 8 bits of exponent: 10000001 (129) which converted
   from the biased format (subtracting 127) gives exponent value of 2. Then
   mantissa bits follow: 11100000000000000000000. As we're dealing with a
   normal number (exponent bits are neither all 1s nor all 0s), we have to
   imagine the implicit 1. in front of mantissa, i.e. our actual mantissa is
   1.11100000000000000000000 = 1.875. The final number is therefore -1 *
   1.875 * 2^2 = -7.5.
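
   The example can be quickly verified in C by reinterpreting the bit pattern
   as a float (again assuming float is binary32):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void)
  {
    uint32_t bits = 0xc0f00000; // 11000000111100000000000000000000 in binary
    float f;

    memcpy(&f,&bits,sizeof(f)); // reinterpret the bits as a float

    printf("%f\n",f); // prints -7.500000

    return 0;
  }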

See Also

     * [43]posit
     * [44]fixed point

Links:
1. rational_number.md
2. real_number.md
3. integer.md
4. fixed_point.md
5. approximation.md
6. real_number.md
7. rational_number.md
8. pi.md
9. programming_language.md
10. data_type.md
11. fpu.md
12. modern.md
13. fixed_point.md
14. embedded.md
15. bloat.md
16. magic.md
17. chaos.md
18. http://the-witness.net/news/2022/02/a-shader-trick/
19. shader.md
20. determinism.md
21. numerical_integration.md
22. small3dlib.md
23. 3d_rendering.md
24. math.md
25. optimization.md
26. pi.md
27. e.md
28. computer.md
29. binary.md
30. twos_complement.md
31. shit.md
32. rounding.md
33. shit.md
34. posit.md
35. infinity.md
36. nan.md
37. posit.md
38. zero.md
39. infinity.md
40. nan.md
41. wtf.md
42. exception.md
43. posit.md
44. fixed_point.md