Optimization

   Optimization means making a program more efficient in terms of consumption
   of some computing resource or by any similar metric, commonly aiming for
   greater execution speed or lower memory usage (but also e.g. lower power
   consumption, lower network usage etc.) while preserving how the program
   functions externally; this can be done manually (by rewriting parts of
   your program) or automatically (typically by the [1]compiler when it's
   translating your program). Unlike [2]refactoring, which aims primarily for
   a better readability of source code, optimization changes the inner
   behavior of the executed program to a more optimal one. Apart from
   optimizing programs/[3]algorithms we may also more widely talk about
   optimizing e.g. [4]data structures, file formats, [5]hardware,
   [6]protocols and so on.

Manual Optimization

   These are optimizations you do yourself by writing better code.

  General Tips'N'Tricks

   These are mainly for [7]C, but may be usable in other languages as well.

     * Tell your compiler to actually optimize (-O3, -Os flags etc.). Also
       check out further compiler flags that may help you turn off
       unnecessary things you don't need, AND try out different compilers,
       some may just produce better code. If you are brave also check even
       more aggressive flags like -Ofast and -Oz, which may be even faster
       than -O3, but may break your program too.
     * [8]gprof is a utility you can use to profile your code.
     * <stdint.h> has "fast" integer types such as uint_fast32_t, which picks
       the fastest type of at least the given width on the given platform.
     * Actually measure the performance to see if your optimizations work or
       not. Sometimes things behave counterintuitively and you end up making
       your program perform worse by trying to optimize it! Also make sure
       that you MEASURE THE PERFORMANCE CORRECTLY: many beginners for example
       just try to measure the run time of a single simple function call,
       which doesn't really work; rather measure something like a million of
       such function calls in a loop and then average the time (a sketch of
       this follows the list).
     * Keywords such as inline, static, const and register can help the
       compiler optimize well.
     * Optimize the [9]bottlenecks! Optimizing in the wrong place is a
       complete waste of time. If you're optimizing a part of code that's
       taking 1% of your program's run time, you will never speed up your
       program by more than that 1% even if you speed up the specific part by
       10000%. Bottlenecks are usually inner-most loops of the main program
       loop, you can identify them with [10]profiling. A typical bottleneck
       code is for example a [11]shader that processes millions of pixels per
       second. Generally initialization code that runs only once in a long
       time doesn't need much optimization -- no one is going to care if a
       program starts up 1 millisecond faster (but of course in special cases
       such as launching many processes this may start to matter).
     * You can almost always trade space (memory usage) for time (CPU demand)
       and vice versa and you can also fine-tune this. You typically gain
       speed by [12]precomputation ([13]look up tables, more demanding on
       memory -- a sketch of this follows the list) and save memory with
       [14]compression (more demanding on CPU).
     * [15]Static things are faster and smaller than [16]dynamic things. This
       means that things that are somehow fixed/unchangeable are better in
       terms of performance (and usually also safer and better testable) than
       things that are allowed to change during [17]run time -- for example
       calling a function directly (e.g. myVar = myFunc();) is both faster
       and requires fewer instructions than calling a function by pointer
       (e.g. myVar = myFuncPointer();): the latter is more flexible but for
       the price of performance, so if you don't need flexibility (dynamic
       behavior), use static behavior. This also applies to using
       [18]constants (faster/smaller) vs [19]variables, static vs dynamic
       [20]typing, normal vs dynamic [21]arrays etc.
     * Be smart, use [22]math, for example simplify [23]expressions using
       known formulas, minimize [24]logic circuits etc. Example: let's say
       you want to compute the radius of a zero-centered [25]bounding sphere
       of an N-point [26]point cloud. Naively you might be computing the
       Euclidean distance (sqrt(x^2 + y^2 + z^2)) to each point and taking a
       maximum of them, however you can just find the maximum of squared
       distances (x^2 + y^2 + z^2) and return a square root of that maximum.
       This saves you the computation of N - 1 square roots (a sketch of this
       follows the list).
     * Fancier algorithms will very often be slower than simpler ones, even
       if they are theoretically faster, i.e. don't get too smart and first
       try the simple algorithm, greater complexity has to be justified. This
       was commented on e.g. by [27]Rob Pike who said that "fancy algorithms
       are slow when n is small, and n is usually small", i.e. if you're
       sorting an array of 10 items, just use bubble sort, not quick sort
       etc.
     * Learn about [28]dynamic programming.
     * Avoid branches (ifs) if you can (remember [29]ternary operators, loop
       conditions etc. are branches as well). They break prediction in CPU
       pipelines and instruction preloading and are often source of great
       performance losses. Don't forget that you can many times compare and
       use the result of operations without using any branching (e.g. x = (y
       == 5) + 1; instead of x = (y == 5) ? 2 : 1;).
     * Use iteration instead of [30]recursion if possible (calling a function
       costs something). Know that it is ALWAYS possible to remove recursive
       function calls.
     * Use [31]good enough [32]approximations instead of completely accurate
       calculations, e.g. taxicab distance instead of Euclidean distance,
       capsule shape to represent the player's collision shape rather than
       the 3D model's mesh etc. With a physics engine, instead of running the
       simulation at the same FPS as rendering you may just run it at half the
       FPS and [33]interpolate between the two physics states at every other
       frame.
       Nice examples can also be found in [34]computer graphics, e.g. some
       [35]software renderers use perspective-correct texturing only for
       large near triangles and cheaper affine texturing for other triangles,
       which mostly looks OK.
     * Use quick [36]bailout conditions: many times before performing some
       expensive calculation you can quickly check whether it's even worth
       performing it and potentially skip it. For example in physics
       [37]collision detections you may first quickly check whether the
       bounding spheres of the bodies collide before running an expensive
       precise collision detection -- if the bounding spheres of the objects
       don't collide, it is not possible for the bodies to collide and so we
       can skip further collision detection (a sketch of this follows the
       list).
     * Operations on static data can be accelerated with accelerating
       structures ([38]look-up tables for functions, [39]indices for database
       lookups, spatial grids for collision checking, various [40]trees ...).
     * Use powers of 2 (1, 2, 4, 8, 16, 32, ...) whenever possible, this is
       efficient thanks to computers working in [41]binary. Not only may this
       help nice utilization and alignment of memory, but mainly
       multiplication and division can be optimized by the compiler to mere
       bit shifts, which is a tremendous speedup (a sketch of this follows the
       list).
     * Memory alignment usually helps speed, i.e. variables at "nice
       addresses" (usually multiples of the platform's native integer size)
       are faster to access, but this may cost some memory (the gaps between
       aligned data).
     * Write [42]cache-friendly code (minimize long jumps in memory).
     * Compare to [43]0 rather than other values. There's usually an
       instruction that just checks the zero flag which is faster than
       loading and comparing two arbitrary numbers.
     * Use [44]bit tricks, hacks for manipulating binary numbers in clever
       ways only using very basic operations without which one might naively
       write complex inefficient code with loops and branches. Example of a
       simple bit trick is checking if a number is power of two as !(x & (x -
       1)) && x.
     * Consider moving computation from run time to compile time, see
       [45]preprocessor, [46]macros and [47]metaprogramming. E.g. if you make
       a resolution of your game constant (as opposed to a variable), the
       compiler will be able to partially precompute expressions with the
       display dimensions and so speed up your program (but you won't be able
       to dynamically change resolution).
     * On some platforms such as [48]ARM the first arguments to a function
       may be passed via registers, so it may be better to have fewer
       parameters in functions.
     * Passing arguments costs something: passing a value to a function
       requires a push onto the stack and later its pop, so minimizing the
       number of parameters a function has, using global variables to pass
       arguments and doing things like passing structs by pointers rather
       than by value can help speed. { from Game Programming Gurus
       -drummyfish }
     * Optimize when you already have working code and when you can measure
       your optimizations. As [49]Donald Knuth put it: "premature
       optimization is the root of all evil". Nevertheless you should get
       used to simple nobrainer efficient patterns by default and just write
       them automatically. Also do one optimization at a time, don't try to
       put in multiple optimizations at once.
     * Use your own [50]caches where they help, for example if you're
       frequently working with some database item you better pull it to
       memory and work with it there, then write it back once you're done (as
       opposed to communicating with the DB there and back).
     * [51]Single compilation unit (one big program without [52]linking) can
       help compiler optimize better because it can see the whole code at
       once, not just its parts. It will also make your program compile
       faster.
     * Search literature for algorithms with better [53]complexity class
       ([54]sorts are a nice example).
     * For the sake of simple computers such as [55]embedded platforms avoid
       [56]floating point as that is often painfully slowly emulated in
       software. Use [57]fixed point, or at least offer it as a [58]fallback.
       This also applies to other hardware requirements such as [59]GPU or
       sound cards: while such hardware accelerates your program on computers
       that have the hardware, making use of it may lead to your program
       being slower on computers that lack it.
     * Factoring out invariants from loops and early branching can create a
       speedup: it's sometimes possible to factor things out of loops (or
       even long non-looping code that just repeats some things), i.e.
       instead of branching inside the loop create two versions of the loop
       and branch in front of them. This is a kind of space-time tradeoff.
       Consider e.g. while (a) if (b) func1(); else func2(); -- if b doesn't
       change inside the loop, you can rewrite this as if (b) while (a)
       func1(); else while (a) func2();. Or in while (a) b += c * d; if c and
       d don't change (are invariant), we can rewrite to cd = c * d; while
       (a) b += cd;. And so on.
     * Division can be replaced by multiplication by [60]reciprocal, i.e. x /
       y = x * 1/y. The point is that multiplication is usually faster than
       division. This may not help us when performing a single division by
       variable value (as we still have to divide 1 by y) but it does help
       when we need to divide many numbers by the same variable number OR
       when we know the divisor at compile time; we save time by precomputing
       the reciprocal before a loop or at compile time (a sketch of this
       follows the list). Of course this can also easily be done with
       [61]fixed point and integers!
     * Consider the difference between logical and bitwise operators! For
       example [62]AND and [63]OR boolean functions in C have two variants,
       one bitwise (& and |) and one logical (&& and ||) -- they behave a bit
       differently but sometimes you may have a choice which one to use, then
       consider this: bitwise operators usually translate to only a single
       fast (and small) instruction while the logical ones usually translate
       to a branch (i.e. multiple instructions with potentially slow jumps),
       however logical operators may be faster because they are evaluated as
       [64]short circuit (e.g. if first operand of OR is true, second operand
       is not evaluated at all) while bitwise operators will evaluate all
       operands.
     * Consider the pros and cons of using indices vs pointers: When working
       with arrays you usually have the choice of using either pointers or
       indices, each option has advantages and disadvantages; working with
       pointers may be faster and produce smaller code (fewer instructions),
       but array indices are portable, may be smaller and safer. E.g. imagine
       you store your game sprites as a continuous array of images in RAM and
       your program internally precomputes a table that says where each image
       starts -- here you can either use pointers (which say directly the
       memory address of each image) or indices (which say the offset from
       the start of the big image array): using indices may be better here as
       the table may potentially be smaller (an index into relatively small
       array doesn't have to be able to keep any possible memory address) and
       the table may even be stored to a file and just loaded next time
       (whereas pointers can't because on next run the memory addresses may
       be different), however you'll need a few extra instructions to access
       any image (adding the index to the array pointer), which will however
       most definitely be negligible.
     * Reuse variables to save space. A warning about this one: readability
       may suffer, mainstreamers will tell you you're going against "good
       practice", and some compilers may do this automatically anyway. Be
       sure to at least make this clear in your comments. Anyway, on a lower
       level and/or with dumber compilers you can just reuse variables that
       you used for something else rather than creating a new variable that
       takes additional RAM; of course a prerequisite for "merging" variables
       is that the variables aren't used at the same time.
     * To save memory use [65]compression techniques. Compression doesn't
       always have to mean you use a typical compression algorithm such as
       [66]jpeg or [67]LZ77, you may simply just throw in a few compression
       techniques such as [68]run length or word dictionaries into your data
       structures. E.g. in [69]Anarch maps are kept small by consisting of a
       small dictionary of tile definitions and map cells referring to this
       dictionary (which makes the cells much smaller than if each one held a
       complete tile definition).
     * What's fast on one platform may be slow on another. This depends on
       the instruction set as well as on compiler, operating system,
       available hardware, [70]driver implementation and other details. In
       the end you always need to test on the specific platform to be sure
       about how fast it will run. A good approach is to optimize for the
       weakest platform you want to support -- if it runs fast on a weak
       platform, a "better" platform will most likely still run it fast.
     * Prefer preincrement over postincrement (typically e.g. in a for loop),
       i.e. rather do ++i than i++ as the latter is a bit more complex and
       normally generates more instructions.
     * Mental calculation tricks, e.g. multiplying by one less or more than a
       power of two is equal to multiplying by power of two and
       subtracting/adding once, for example x * 7 = x * 8 - x; the latter may
       be faster because a multiplication by a power of two (a bit shift) plus
       an addition/subtraction may be cheaper than a single general
       multiplication, especially on some primitive platform without hardware
       multiplication.
       However this needs to be tested on the specific platform. Smart
       compilers perform these optimizations automatically, but not every
       compiler is high level and smart.
     * With more than two branches use switch instead of ifs (if possible) --
       it should be common knowledge but some newcomers may not know that
       switch is fundamentally different from if branches: a switch statement
       can generate a jump table that branches into one of many case labels in
       constant time, as opposed to a series of if statements which keeps
       checking conditions one by one, however switch only supports
       conditions of exact comparison. So prefer using switch when you have
       many conditions to check (but know that switch can't always be used,
       e.g. for string comparisons). Switch also allows hacks such as label
       fall through which may help some optimizations.
     * Else should be the less likely branch, try to make if conditions so
       that the if branch is the one with higher probability of being
       executed -- this can help branch prediction.
     * Similarly order if-sequences and switch cases from most probable: If
       you have a sequence of ifs such as if (x) ... else if (y) ... else if
       (z) ..., make it so that the most likely condition to hold gets
       checked first, then the second most likely etc. The compiler most
       likely can't know the probabilities of the conditions so it can't
       automatically
       help with this. Do the same with the switch statement -- even though
       switch typically gets compiled to a table of jump addresses, in which
       case order of the cases doesn't matter, it may also get compiled in a
       way similar to the if sequence (e.g. as part of size optimization if
       the cases are sparse) and then it may matter again.
     * Variable aliasing: If in a function you are often accessing a variable
       through some complex dereference of multiple pointers, it may help to
       rather load it to a local variable at the start of the function and
       then work with that variable, as dereferencing pointers costs
       something (a sketch of this follows the list). { from Game Programming
       Gurus -drummyfish }
     * You can save space by "squeezing" variables -- this is a space-time
       tradeoff, it's a no brainer but nubs may be unaware of it -- for
       example you may store 2 4bit values in a single char variable (8bit
       data type), one in the lower 4bits, one in the higher 4bits (use bit
       shifts etc.). So instead of 16 memory-aligned booleans you may create
       one int and use its individual bits for each boolean value (a sketch of
       this follows the list). This is useful in environments with extremely
       limited RAM such as 8bit Arduinos.
     * Consider [71]lazy evaluation (only evaluate what's actually needed).
     * You can optimize critical parts of code in [72]assembly, i.e. manually
       write the assembly code that takes most of the running time of the
       program, with as few and as inexpensive instructions as possible (but
       beware, popular compilers are very smart and it's often hard to beat
       them). But note that such code loses [73]portability! So ALWAYS have a
       C (or whatever language you are using) [74]fallback code for other
       platforms, use [75]ifdefs to switch to the fallback version on
       platforms with a different assembly language.
     * Loop unrolling/splitting/fusion, function inlining etc.: there are
       optimizations that compilers of high level languages usually perform
       at the [76]assembly level (e.g. loop unrolling physically replaces a
       loop by
       repeated commands which gains speed but also makes the program
       bigger). However if you're writing in assembly or have a dumb compiler
       (or are even writing your own) you may do these manually, e.g. with
       macros/templates etc. Sometimes you can hint a compiler to perform
       these optimizations, so look this up.
     * [77]Parallelism ([78]multithreading, [79]compute shaders, ...) can
       astronomically accelerate many programs, it is one of the most
       effective techniques of speeding up programs -- we can simply perform
       several computations at once and save a lot of time -- but there are a
       few notes. Firstly not all problems can be parallelized, some problems
       are sequential in nature, even though most problems can probably be
       parallelized to some degree. Secondly it is hard to do, opens the door
       for many new types of bugs, requires hardware support (software
       simulated parallelism can't work here of course) and introduces
       [80]dependencies; in other words it is huge [81]bloat, we don't
       recommend parallelization unless a very, very good reason is given.
       Optional use of [82]SIMD instructions can be a reasonable midway to
       going full parallel computation.
     * Optimizing [83]data: it's important to remember we can optimize both
       algorithm AND data, for example in a 3D game we may simplify our 3D
       models, remove parts of a level that will never be seen etc.
     * Specialized hardware (e.g. a [84]GPU) astronomically accelerates
       programs, but as with the previous point, portability and simplicity
       greatly suffer, your program becomes bloated and gains dependencies, so
       always consider carefully whether to use specialized hardware and offer
       software fallbacks.
     * Smaller code may also be faster as it allows more instructions to fit
       into the [85]cache.
     * Do not optimize everything and for any cost: optimization often makes
       the code more cryptic, it may [86]bloat it, bring in more bugs etc.
       Only optimize if it is worth the price. { from Game Programming Gurus
       -drummyfish }
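
   What follows are a few small C sketches illustrating some of the points
   above; all names in them are made up for illustration. First, measuring
   performance by timing a million calls in a loop and averaging:

     #include <stdio.h>
     #include <time.h>

     int myFunc(int x) { return x * x + 3; } /* some function to measure */

     int main(void)
     {
       clock_t t0 = clock();
       volatile int sink = 0; /* volatile keeps the loop from being removed */

       for (int i = 0; i < 1000000; ++i)
         sink += myFunc(i);

       double total = (double) (clock() - t0) / CLOCKS_PER_SEC;

       printf("total: %f s, per call: %.9f s\n", total, total / 1000000);
       return 0;
     }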
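
   The space-for-time trade with a look up table might be sketched like this
   (a precomputed sine table; the table size and the tabulated function are
   arbitrary):

     #include <math.h>

     #define SIN_TABLE_SIZE 256

     double sinTable[SIN_TABLE_SIZE]; /* costs memory, saves time later */

     void initSinTable(void) /* precompute once, e.g. at program start */
     {
       for (int i = 0; i < SIN_TABLE_SIZE; ++i)
         sinTable[i] = sin(i * 2.0 * 3.14159265358979 / SIN_TABLE_SIZE);
     }

     /* fast sine of an angle given as a fraction of the full circle
        (0 to SIN_TABLE_SIZE - 1): just one array access, no computation */
     double fastSin(unsigned int angle)
     {
       return sinTable[angle % SIN_TABLE_SIZE];
     }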
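
   The zero-centered bounding sphere trick (maximum of squared distances,
   single square root at the end), assuming the points are stored in a plain
   array of structs:

     #include <math.h>

     typedef struct { double x, y, z; } Point;

     /* radius of the smallest zero-centered sphere containing all points */
     double boundingSphereRadius(const Point *points, int n)
     {
       double maxDistSq = 0;

       for (int i = 0; i < n; ++i)
       {
         double dSq = /* squared distance, no sqrt inside the loop */
           points[i].x * points[i].x +
           points[i].y * points[i].y +
           points[i].z * points[i].z;

         if (dSq > maxDistSq)
           maxDistSq = dSq;
       }

       return sqrt(maxDistSq); /* the only square root we compute */
     }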
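
   A quick bailout in collision detection (the body struct and the expensive
   precise check are hypothetical placeholders):

     typedef struct { double x, y, z, boundingRadius; } Body;

     int preciseCollision(const Body *a, const Body *b); /* expensive */

     int bodiesCollide(const Body *a, const Body *b)
     {
       double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
       double rSum = a->boundingRadius + b->boundingRadius;

       /* cheap bailout: if even the bounding spheres don't touch, we're
          done (squared distances are compared to avoid a sqrt) */
       if (dx * dx + dy * dy + dz * dz > rSum * rSum)
         return 0;

       return preciseCollision(a, b); /* only now run the expensive check */
     }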
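
   Powers of two and a simple bit trick in action (with constant operands a
   smart compiler does the shift replacement by itself, this just shows the
   idea):

     /* multiplication/division by powers of two as mere bit shifts */
     unsigned int times8(unsigned int x) { return x << 3; } /* x * 8  */
     unsigned int div32(unsigned int x)  { return x >> 5; } /* x / 32 */
     unsigned int mod16(unsigned int x)  { return x & 15; } /* x % 16 */

     /* the bit trick mentioned above: is x a power of two? */
     int isPowerOfTwo(unsigned int x)
     {
       return x && !(x & (x - 1));
     }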
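
   Division replaced by multiplication with a precomputed reciprocal (shown
   with floats for clarity, the same idea works in fixed point):

     /* divide many numbers by the same divisor: precompute 1 / divisor */
     void divideAll(float *values, int count, float divisor)
     {
       float reciprocal = 1.0f / divisor; /* one division, before the loop */

       for (int i = 0; i < count; ++i)
         values[i] *= reciprocal;         /* only multiplications inside */
     }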
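
   Variable aliasing, i.e. pulling a value accessed through a chain of
   pointers into a local variable (the nested structs here are made up):

     struct Player { int health; };
     struct World  { struct Player *player; };
     struct Game   { struct World *world; };

     void updatePlayer(struct Game *game)
     {
       struct Player *player = game->world->player; /* dereference once */

       /* now work with the local pointer instead of repeating
          game->world->player everywhere below */
       if (player->health > 0)
         player->health -= 1;
     }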
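
   And finally squeezing two 4bit values into a single char (a sketch, the
   helper names are made up):

     typedef unsigned char TwoNibbles; /* holds two 4bit values (0 to 15) */

     TwoNibbles packNibbles(unsigned int low, unsigned int high)
     {
       return (low & 0x0f) | ((high & 0x0f) << 4);
     }

     unsigned int lowNibble(TwoNibbles n)  { return n & 0x0f; }
     unsigned int highNibble(TwoNibbles n) { return n >> 4;   }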

  When To Actually Optimize?

   Nubs often ask this and this can also be a very nontrivial question.
   Generally fine, sophisticated optimization should come as one of the last
   steps in development, when you actually have a working thing. These are
   optimizations requiring significant energy/time to implement -- you don't
   want to spend resources on them at a stage when the code they apply to may
   well be dropped in the end, or when they won't matter because they'll lie
   outside the bottleneck. However there are two "exceptions".

   The highest-level optimization is done as part of the initial design of
   the program, before any line of code gets written. This includes the
   choice of data structures and mathematical models you're going to be
   using, the very foundation around which you'll be building your castle.
   This happens in your head at the time you're forming an idea for a
   program, e.g. you're choosing between [87]server-client or [88]P2P,
   [89]monolithic or micro kernel, [90]raytraced or [91]rasterized graphics
   etc. These choices greatly affect the performance of your program but can
   hardly be changed once the program is completed, so they need to be made
   beforehand. This requires wide knowledge and experience because here you
   mostly work by intuition.

   Another kind of optimization done during development is just automatically
   writing good code, i.e. being familiar with specific patterns and using
   them without much thought. For example if you're computing some value
   inside a loop and this value doesn't change between iterations, you just
   automatically put the computation of that value before the loop (as in the
   sketch below). Without this you'd simply end up with shitty code that would
   have to be rewritten line by line at the end. Yes, compilers can often do
   this simple kind of optimization for you, but you don't want to rely on it.
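
   For example (a trivial sketch with made up names):

     /* hypothetical example: scaling an array of pixel values */

     void scaleNaive(int *pixels, int count, int brightness, int contrast)
     {
       for (int i = 0; i < count; ++i)
         pixels[i] *= brightness * contrast; /* recomputed every iteration */
     }

     void scaleBetter(int *pixels, int count, int brightness, int contrast)
     {
       int factor = brightness * contrast;  /* invariant, computed once */

       for (int i = 0; i < count; ++i)
         pixels[i] *= factor;
     }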

Automatic Optimization

   Automatic optimization is typically performed by the compiler; usually the
   programmer has the option to tell the compiler how much and in what way to
   optimize (no optimization, mild optimization, aggressive optimization,
   optimization for speed, size; check e.g. the man pages of [92]gcc where
   you can see how to turn on even specific types of optimizations). Some
   compilers perform extremely complex reasoning to make the code more
   efficient, the whole area of optimization is a huge science -- here we'll
   only take a look at the very basic techniques. We see optimizations as
   transformations of the code that keep the semantics the same but minimize
   or maximize some measure (e.g. execution time, memory usage, power usage,
   network usage etc.). Automatic optimizations are usually performed on the
   intermediate representation (e.g. [93]bytecode) as that's the ideal way
   (we only write the optimizer once), however some are specific to a
   concrete instruction set -- these are sometimes called peephole
   optimizations and have to be delayed until code generation.

   The following are some common methods of automatic optimization (also note
   that virtually any method from the above mentioned manual optimizations
   can be applied if only the compiler can detect the possibility of applying
   it); a small before/after illustration follows the list:

   { Tip: man pages of gcc or possibly other compilers detail specific
   optimizations they perform under the flags that turn them on, so see these
   man pages for a similar overview. ~drummyfish }

     * Replacing instructions with faster equivalents: we replace an
       instruction (or a series of instructions) with another one that does
       the same thing but faster (or with fewer instructions etc.). Typical
       example is replacing multiplication by power of two with a bit shift
       (e.g. x * 8 is the same as x << 3).
     * Inlining: a function call may usually (not always though, consider
       e.g. [94]recursion) be replaced by the function code itself inserted
       in the place of the call (so called inlining). This is faster but
       usually makes the code bigger, so the compiler has to somehow judge
       and decide when it's worth inlining a function -- this may be affected
       e.g. by the function size (inlining a short function won't make the
       code that much bigger), programmer's hints (inline keyword, optimize
       for speed rather than size etc.) or guesstimating how often the
       function will be called. A function that is only called in one place
       can be safely inlined.
     * Loop unrolling: duplicates the body of a loop, making the code bigger
       but increasing its speed (a condition check is saved). E.g. for (int i
       = 0; i < 3; ++i) func(); may be replaced with func(); func(); func();.
       Unrolling may be full or just partial.
     * [95]Lazy evaluation/short circuit/test reordering: the principles of
       lazy evaluation (evaluate an expression only when its value is
       actually needed) and short circuit evaluation (don't evaluate further
       operands when it's clear they won't be needed) may be automatically
       inserted into the code to make
       it more efficient. Test reordering may lead to first testing simpler
       things (e.g. equality comparison) and leaving complex tests (function
       calls etc.) for later.
     * Algebraic laws, expression evaluation: expressions may be partially
       preevaluated and manipulated to stay mathematically equivalent while
       becoming easier to evaluate, for example 1 + 3 + 5 * 3 * x / 6 may be
       transformed to just 4 + 5 * x / 2.
     * Removing instructions that cancel out: for example in [96]Brainfuck
       the series of instructions +++-- may be shortened to just +.
     * Removing instructions that do nothing: generated code may contain
       instructions that just do nothing, e.g. NOPs that were used as
       placeholders that never got replaced; these can be just removed.
     * Register allocation: most frequently used variables should be kept in
       CPU registers for fastest access.
     * Removing branches: branches are often expensive due to not being CPU
       pipeline friendly, they can sometimes be replaced by a branch-free
       code, e.g. if (a == b) c = 1; else c = 0; can be replaced with c = a
       == b;.
     * Memory alignment, reordering etc.: data stored in memory may be
       reorganized for better efficiency, e.g. an often accessed array of
       bytes may actually be made into an array of ints so that each item
       resides exactly on one address (which takes fewer instructions to
       access and is therefore faster). Data may also be reordered to be more
       [97]cache friendly.
     * Generating [98]lookup tables: if the optimizer judges some function to
       be critical in terms of speed, it may auto generate a lookup table for
       it, i.e. precompute its values and so sacrifice some memory for making
       it run extremely fast.
     * Dead code removal: parts of code that aren't used can be just removed,
       making the generated program smaller -- this includes e.g. functions
       that are present in a [99]library which however aren't used by the
       specific program or blocks of code that become unreachable e.g. due to
       some #define that makes an if condition always false etc.
     * [100]Compression: compression methods may be applied to make data
       smaller and optimize for size (for the price of increased CPU usage).
     * ...
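
   As a small illustration, here is a tiny hypothetical before/after: code as
   a programmer might write it and roughly what an optimizer may conceptually
   turn it into (in reality this happens on an intermediate representation,
   not on C source):

     /* original code as written by the programmer */
     unsigned int foo(unsigned int x)
     {
       unsigned int result;

       if (x % 2 == 0)
         result = x * 8;
       else
         result = x * 8 + 1;

       return result;
     }

     /* roughly what an optimizer may make of it: the multiplication by a
        power of two became a shift and the branch disappeared, because
        x % 2 is exactly the bit that gets ORed in */
     unsigned int fooOptimized(unsigned int x)
     {
       return (x << 3) | (x & 1);
     }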

See Also

     * [101]refactoring
     * [102]bit hacks
     * [103]fizzbuzz

Links:
1. compiler.md
2. refactoring.md
3. algorithm.md
4. data_structure.md
5. hardware.md
6. protocol.md
7. c.md
8. gprof.md
9. bottleneck.md
10. profiling.md
11. shader.md
12. precomputation.md
13. lut.md
14. compression.md
15. static.md
16. dynamic.md
17. runtime.md
18. constant.md
19. variable.md
20. typing.md
21. array.md
22. math.md
23. expression.md
24. logic_circuit.md
25. bounding_sphere.md
26. point_cloud.md
27. rob_pike.md
28. dynamic_programming.md
29. ternary_operator.md
30. recursion.md
31. good_enough.md
32. approximation.md
33. interpolation.md
34. graphics.md
35. sw_rendering.md
36. bailout.md
37. collision_detection.md
38. lut.md
39. indexing.md
40. tree.md
41. binary.md
42. cache-friendly.md
43. zero.md
44. bit_hack.md
45. preprocessor.md
46. macro.md
47. metaprogramming.md
48. arm.md
49. knuth.md
50. cache.md
51. single_compilation_unit.md
52. linking.md
53. complexity_class.md
54. sorting.md
55. embedded.md
56. floating_point.md
57. fixed_point.md
58. fallback.md
59. gpu.md
60. reciprocal.md
61. fixed_point.md
62. and.md
63. or.md
64. short_circuit_eval.md
65. compression.md
66. jpg.md
67. lz77.md
68. run_length.md
69. anarch.md
70. driver.md
71. lazy_evaluation.md
72. assembly.md
73. portability.md
74. fallback.md
75. ifdef.md
76. assembly.md
77. parallelism.md
78. multithreading.md
79. compute_shader.md
80. dependency.md
81. bloat.md
82. simd.md
83. data.md
84. gpu.md
85. cache.md
86. bloat.md
87. server_client.md
88. p2p.md
89. kernel.md
90. ray_tracing.md
91. rasterization.md
92. gcc.md
93. bytecode.md
94. recursion.md
95. lazy_eval.md
96. brainfuck.md
97. cache.md
98. lut.md
99. library.md
100. compression.md
101. refactoring.md
102. bit_hack.md
103. fizzbuzz.md