Optimizing compiler


In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory requirements, and power consumption.
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources and/or executes faster. It has been shown that some code optimization problems are NP-complete, or even undecidable. In practice, factors such as the programmer's willingness to wait for the compiler to complete its task place upper limits on the optimizations that a compiler implementer might provide. In the past, computer memory limitations were also a major factor in limiting which optimizations could be performed. Because of these factors, optimization rarely produces "optimal" output in any sense, and in fact, an "optimization" may impede performance in some cases; rather, optimizations are heuristic methods for improving resource usage in typical programs.

Types of optimization

Techniques used in optimization can be broken up among various scopes which can affect anything from a single statement to the entire program. Generally speaking, locally scoped techniques are easier to implement than global ones but result in smaller gains. Some examples of scopes include:
;Peephole optimizations: Usually performed late in the compilation process after machine code has been generated. This form of optimization examines a few adjacent instructions to see whether they can be replaced by a single instruction or a shorter sequence of instructions. For instance, a multiplication of a value by 2 might be more efficiently executed by left-shifting the value or by adding the value to itself (a sketch of this rewrite follows this list).
;Local optimizations: These only consider information local to a basic block. Since basic blocks have no control flow, these optimizations need very little analysis, but this also means that no information is preserved across jumps.
;Global optimizations: These are also called "intraprocedural methods" and act on whole functions. This gives them more information to work with but often makes expensive computations necessary. Worst case assumptions have to be made when function calls occur or global variables are accessed.
;Loop optimizations: These act on the statements which make up a loop, such as a for loop. Loop optimizations can have a significant impact because many programs spend a large percentage of their time inside loops.
;Prescient store optimizations: Allow store operations to occur earlier than would otherwise be permitted in the context of threads and locks. The process needs some way of knowing ahead of time what value will be stored by the assignment that it should have followed. The purpose of this relaxation is to allow compiler optimization to perform certain kinds of code rearrangement that preserve the semantics of properly synchronized programs.
;Interprocedural, whole-program or link-time optimization: These analyze all of a program's source code. The greater quantity of information extracted means that optimizations can be more effective compared to when they only have access to local information. This kind of optimization can also enable new techniques, such as function inlining, where a call to a function is replaced by a copy of the function body.
;Machine code optimization and object code optimizer: These analyze the executable task image of the program after all of the executable machine code has been linked. Some of the techniques that can be applied in a more limited scope, such as macro compression, are more effective when the entire executable task image is available for analysis.
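As a concrete illustration of the peephole example above, the following C sketch shows the source-level effect of rewriting a multiplication by 2 as a shift or a self-add. The function names are illustrative, a real peephole pass performs this rewrite on machine instructions rather than source code, and an unsigned type is used because left-shifting a negative signed value is undefined in C.

    #include <stdio.h>

    /* What the compiler conceptually sees before peephole optimization. */
    unsigned double_mul(unsigned x) { return x * 2; }

    /* What a peephole pass might rewrite it to: a shift or a self-add,
       both typically cheaper than a general multiply. */
    unsigned double_shift(unsigned x) { return x << 1; }
    unsigned double_add(unsigned x) { return x + x; }

    int main(void) {
        for (unsigned x = 0; x < 5; x++)
            printf("%u %u %u\n", double_mul(x), double_shift(x), double_add(x));
        return 0;
    }
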
In addition to scoped optimizations, there are two further general categories of optimization:
;Programming language–independent vs. language-dependent: Most high-level languages share common programming constructs and abstractions: decision, looping, and encapsulation. Thus similar optimization techniques can be used across languages. However, certain language features make some kinds of optimizations difficult. For instance, the existence of pointers in C and C++ makes it difficult to optimize array accesses. However, languages such as PL/1 nevertheless have available sophisticated optimizing compilers to achieve better performance in various other ways. Conversely, some language features make certain optimizations easier. For example, in some languages functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can immediately infer that the function's result need be computed only once. In languages where functions are allowed to have side effects, another strategy is possible: the optimizer can determine which functions have no side effects and restrict such optimizations to side-effect-free functions. This optimization is only possible when the optimizer has access to the called function.
;Machine independent vs. machine dependent: Many optimizations that operate on abstract programming concepts are independent of the machine targeted by the compiler, but many of the most effective optimizations are those that best exploit special features of the target platform, for example instructions that do several things at once, such as decrement register and branch if not zero.
The following is an instance of a local machine dependent optimization. To set a register to 0, the obvious way is to use the constant '0' in an instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal "immediate operand register".
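A minimal sketch in C of how such a rule might look inside a compiler. The toy instruction representation (the Insn type, its fields, and the opcode names) is hypothetical, invented purely for illustration:

    #include <stdio.h>

    typedef enum { OP_MOV_IMM, OP_XOR_REG_REG } Opcode;
    typedef struct { Opcode op; int dst, src, imm; } Insn;  /* toy instruction */

    /* Peephole rule: rewrite "mov r, #0" as "xor r, r".  On x86 the XOR
       form is shorter because it carries no immediate operand. */
    void zero_idiom(Insn *i) {
        if (i->op == OP_MOV_IMM && i->imm == 0) {
            i->op = OP_XOR_REG_REG;
            i->src = i->dst;
        }
    }

    int main(void) {
        Insn i = { OP_MOV_IMM, 0, -1, 0 };  /* mov r0, #0 */
        zero_idiom(&i);
        printf("%s\n", i.op == OP_XOR_REG_REG ? "xor r0, r0" : "mov r0, #0");
        return 0;
    }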

Factors affecting optimization

;The machine itself: Many of the choices about which optimizations can and should be done depend on the characteristics of the target machine. It is sometimes possible to parameterize some of these machine dependent factors, so that a single piece of compiler code can be used to optimize different machines just by altering the machine description parameters. GCC is a compiler which exemplifies this approach.
;The architecture of the target CPU: For example, the number of CPU registers. To a certain extent, the more registers, the easier it is to optimize for performance: local variables can be allocated in registers rather than on the stack, and temporary/intermediate results can be left in registers without writing to and reading back from memory.
;The architecture of the machine: Characteristics such as cache size and memory transfer rates influence which optimizations, such as loop nest optimization, pay off.
;Intended use of the generated code: Code compiled for debugging favors straightforward output and fast compilation; general-purpose release code favors execution speed or size; special-purpose code may permit stronger assumptions about its inputs and environment.

Common themes

To a large extent, compiler optimization techniques have the following themes, which sometimes conflict.
;Optimize the common case: The common case may have unique properties that allow a fast path at the expense of a slow path. If the fast path is taken most often, the result is better overall performance.
;Avoid redundancy: Reuse results that are already computed and store them for later use, instead of recomputing them.
;Less code: Remove unnecessary computations and intermediate values. Less work for the CPU, cache, and memory usually results in faster execution. Alternatively, in embedded systems, less code brings a lower product cost.
;Fewer jumps by using straight-line code, also called branch-free code: Less complicated code. Jumps interfere with the prefetching of instructions, thus slowing down code. Using inlining or loop unrolling can reduce branching, at the cost of increasing binary file size by the length of the repeated code. This tends to merge several basic blocks into one.
;Locality: Code and data that are accessed closely together in time should be placed close together in memory to increase spatial locality of reference.
;Exploit the memory hierarchy: Accesses to memory are increasingly more expensive for each level of the memory hierarchy, so place the most commonly used items in registers first, then caches, then main memory, before going to disk.
;Parallelize: Reorder operations to allow multiple computations to happen in parallel, either at the instruction, memory, or thread level.
;More precise information is better: The more precise the information the compiler has, the better it can employ any or all of these optimization techniques.
;Runtime metrics can help: Information gathered during a test run can be used in profile-guided optimization. Information gathered at runtime can be used by a JIT compiler to dynamically improve optimization.
;Strength reduction: Replace complex, difficult, or expensive operations with simpler ones. For example, replacing division by a constant with multiplication by its reciprocal, or using induction variable analysis to replace multiplication by a loop index with addition.
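For instance, strength reduction of the multiplication in a[i] = 4*i + 1 replaces the per-iteration multiply with a running addition, as in this C sketch (function names are illustrative):

    #include <stdio.h>

    /* Before: a multiplication by the loop index on every iteration. */
    void fill_mul(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = 4 * i + 1;
    }

    /* After: induction variable analysis lets the compiler keep a running
       value and add 4 each time instead of multiplying. */
    void fill_add(int *a, int n) {
        int j = 1;                /* 4*0 + 1 */
        for (int i = 0; i < n; i++) {
            a[i] = j;
            j += 4;
        }
    }

    int main(void) {
        int x[5], y[5];
        fill_mul(x, 5);
        fill_add(y, 5);
        for (int i = 0; i < 5; i++)
            printf("%d %d\n", x[i], y[i]);
        return 0;
    }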

Specific techniques

Loop optimizations

Some optimization techniques primarily designed to operate on loops include:
;Induction variable analysis: Roughly, if a variable in a loop is a simple linear function of the index variable, such as j := 4*i + 1, it can be updated appropriately each time the loop variable is changed. This is a strength reduction, and also may allow the index variable's definitions to become dead code. This information is also useful for bounds-checking elimination and dependence analysis, among other things.
;Loop fission or loop distribution: Loop fission attempts to break a loop into multiple loops over the same index range but each taking only a part of the loop's body. This can improve locality of reference, both of the data being accessed in the loop and the code in the loop's body.
;Loop fusion or loop combining or loop ramming or loop jamming: Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times, their bodies can be combined as long as they make no reference to each other's data.
;Loop inversion: This technique changes a standard while loop into a do/while loop wrapped in an if conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the condition check but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if guard can be skipped.
;Loop interchange: These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout.
;Loop-invariant code motion: If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins (hoisting is illustrated in the sketch after this list). This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted outside the loop.
;Loop nest optimization: Some pervasive algorithms such as matrix multiplication have very poor cache behavior and excessive memory accesses. Loop nest optimization increases the number of cache hits by performing the operation over small blocks and by using a loop interchange.
;Loop reversal: Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations. Furthermore, on some architectures, loop reversal contributes to smaller code: when the loop index is being decremented, the condition that must be met for the running program to exit the loop is a comparison with zero. This is often a special, parameter-less instruction, unlike a comparison with a number, which needs the number to compare to. Therefore, the number of bytes needed to store the comparison operand is saved by loop reversal. Additionally, if the comparison number exceeds the platform's word size, then in standard loop order multiple instructions would need to be executed to evaluate the comparison, which is not the case with loop reversal.
;Loop unrolling: Unrolling duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. A "fewer jumps" optimization. Completely unrolling a loop eliminates all overhead, but requires that the number of iterations be known at compile time.
;Loop splitting: Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
;Loop unswitching: Unswitching moves a conditional from inside a loop to outside the loop by duplicating the loop's body inside each of the if and else clauses of the conditional.
;Software pipelining: The loop is restructured in such a way that work done in an iteration is split into several parts and done over several iterations. In a tight loop, this technique hides the latency between loading and using values.
;Automatic parallelization: A loop is converted into multi-threaded or vectorized code in order to utilize multiple processors simultaneously in a shared-memory multiprocessor machine, including multi-core machines.
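The sketch referred to above illustrates two of these transformations at the source level in C: loop-invariant code motion and a factor-4 loop unroll with a scalar remainder loop. The function names are illustrative; a compiler applies these transformations to its intermediate representation.

    #include <stdio.h>

    /* Loop-invariant code motion: x*y never changes inside the loop. */
    int sum_naive(const int *a, int n, int x, int y) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i] * (x * y);      /* x*y recomputed every iteration */
        return s;
    }

    int sum_hoisted(const int *a, int n, int x, int y) {
        int xy = x * y;               /* hoisted: computed once */
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i] * xy;
        return s;
    }

    /* Loop unrolling by 4: fewer condition tests and jumps per element. */
    int sum_unrolled(const int *a, int n) {
        int s = 0, i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)            /* remainder iterations */
            s += a[i];
        return s;
    }

    int main(void) {
        int a[7] = { 1, 2, 3, 4, 5, 6, 7 };
        printf("%d %d %d\n", sum_naive(a, 7, 2, 3),
               sum_hoisted(a, 7, 2, 3), sum_unrolled(a, 7));
        return 0;
    }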

Data-flow optimizations

Data-flow optimizations, based on data-flow analysis, primarily depend on how certain properties of data are propagated by control edges in the control flow graph. Some of these include:
;Common subexpression elimination: In the expression (a + b) - (a + b)/4, "common subexpression" refers to the duplicated (a + b). Compilers implementing this technique realize that (a + b) will not change, and so calculate its value only once (see the example after this list).
;Constant folding and propagation: Replacing expressions consisting of constants with their final value at compile time, rather than doing the calculation at run time. Used in most modern languages.
;Induction variable recognition and elimination: see discussion above about induction variable analysis.
;Alias classification and pointer analysis: in the presence of pointers, it is difficult to make any optimizations at all, since potentially any variable can have been changed when a memory location is assigned to. By specifying which pointers can alias which variables, unrelated pointers can be ignored.
;Dead store elimination: removal of assignments to variables that are not subsequently read, either because the lifetime of the variable ends or because of a subsequent assignment that will overwrite the first value.
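The example referred to above, sketched in C: the first function contains the duplicated subexpression, the second shows the form the optimizer effectively produces, and the constant expression in main is folded at compile time (function names are illustrative):

    #include <stdio.h>

    /* Common subexpression elimination: (a + b) occurs twice. */
    int before(int a, int b) {
        return (a + b) - (a + b) / 4;
    }

    int after(int a, int b) {
        int t = a + b;                /* computed once, reused */
        return t - t / 4;
    }

    int main(void) {
        int seconds_per_day = 60 * 60 * 24;   /* folded to 86400 at compile time */
        printf("%d %d %d\n", before(9, 3), after(9, 3), seconds_per_day);
        return 0;
    }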

SSA-based optimizations

These optimizations are intended to be done after transforming the program into a special form called static single assignment (SSA), in which every variable is assigned in only one place. Although some of these optimizations function without SSA, they are most effective with it. Many optimizations listed in other sections also benefit with no special changes, such as register allocation.
;Global value numbering: GVN eliminates redundancy by constructing a value graph of the program, and then determining which values are computed by equivalent expressions. GVN is able to identify some redundancy that common subexpression elimination cannot, and vice versa.
;Sparse conditional constant propagation: Combines constant propagation, constant folding, and dead code elimination, and improves upon what is possible by running them separately. This optimization symbolically executes the program, simultaneously propagating constant values and eliminating portions of the control flow graph that this makes unreachable.
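A source-level sketch of sparse conditional constant propagation in C: once x = 4 is propagated, the test folds to true, the else branch becomes unreachable dead code, and the whole function reduces to a constant. In practice this is done on the SSA form of the intermediate representation, not on C source:

    #include <stdio.h>

    int before(void) {
        int x = 4;
        if (x > 3)        /* always true once x = 4 is propagated */
            x = x + 1;    /* folds to x = 5 */
        else
            x = 0;        /* unreachable: eliminated as dead code */
        return x;
    }

    int after(void) {     /* what the optimizer is entitled to produce */
        return 5;
    }

    int main(void) {
        printf("%d %d\n", before(), after());
        return 0;
    }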

Code generator optimizations

;Register allocation: The most frequently used variables should be kept in processor registers for fastest access. To find which variables to put in registers, an interference graph is created. Each variable is a vertex, and when two variables are used at the same time they have an edge between them. This graph is colored, using for example Chaitin's algorithm, with the same number of colors as there are registers. If the coloring fails, one variable is "spilled" to memory and the coloring is retried (a simplified coloring sketch follows this list).
;Instruction selection: Most architectures, particularly CISC architectures and those with many addressing modes, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to choose, overall, good instructions to implement each operator in the low-level intermediate representation. For example, on many processors in the 68000 family and on the x86 architecture, complex addressing modes can be used in instructions like "lea 25(a1,d5*4), a0", allowing a single instruction to perform a significant amount of arithmetic with less storage.
;Instruction scheduling: Instruction scheduling is an important optimization for modern pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.
;Rematerialization: Rematerialization recalculates a value instead of loading it from memory, preventing a memory access. This is performed in tandem with register allocation to avoid spills.
;Code factoring: If several sequences of code are identical, or can be parameterized or reordered to be identical, they can be replaced with calls to a shared subroutine. This can often share code for subroutine set-up and sometimes tail-recursion.
;Trampolines: Many CPUs have smaller subroutine call instructions to access low memory. A compiler can save space by using these small calls in the main body of code. Jump instructions in low memory can access the routines at any address. This multiplies space savings from code factoring.
;Reordering computations: Based on integer linear programming, restructuring compilers enhance data locality and expose more parallelism by reordering computations. Space-optimizing compilers may reorder code to lengthen sequences that can be factored into subroutines.
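The coloring sketch referred to above, in C. It uses a simple greedy scheme rather than Chaitin's full simplify-and-spill algorithm, and the interference graph is a made-up example: with two registers, variable c cannot be colored and is spilled.

    #include <stdio.h>

    #define NVARS 4   /* variables a, b, c, d */
    #define NREGS 2   /* available registers   */

    /* interferes[i][j] == 1 means variables i and j are live at the same
       time and therefore cannot share a register. */
    int interferes[NVARS][NVARS] = {
        /*        a  b  c  d */
        /* a */ { 0, 1, 1, 0 },
        /* b */ { 1, 0, 1, 0 },
        /* c */ { 1, 1, 0, 1 },
        /* d */ { 0, 0, 1, 0 },
    };

    int main(void) {
        int color[NVARS];
        for (int v = 0; v < NVARS; v++) {
            int used[NREGS] = { 0 };
            for (int u = 0; u < v; u++)   /* registers taken by colored neighbors */
                if (interferes[v][u] && color[u] >= 0)
                    used[color[u]] = 1;
            color[v] = -1;                /* -1 means spilled to memory */
            for (int r = 0; r < NREGS; r++)
                if (!used[r]) { color[v] = r; break; }
            if (color[v] >= 0)
                printf("var %c -> r%d\n", 'a' + v, color[v]);
            else
                printf("var %c -> spilled\n", 'a' + v);
        }
        return 0;
    }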

Functional language optimizations

Although many of these also apply to non-functional languages, they either originate in, are most easily implemented in, or are particularly critical in functional languages such as Lisp and ML.
;Removing recursion: Recursion is often expensive, as a function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of stack space, through a process called tail recursion elimination or tail call optimization (see the example after this list). Some functional languages mandate that tail calls be optimized by a conforming implementation, due to their prevalence in these languages.
;Deforestation (data structure fusion): Because of the high-level way in which data structures are specified in functional languages such as Haskell, it is possible to combine several recursive functions that produce and consume some temporary data structure so that the data is passed directly without wasting time constructing the data structure.
;Partial evaluation: Specializing a program in advance for inputs that are known at compile time, so that only the parts depending on unknown inputs remain to be computed at run time.
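The tail-call example referred to above, written in C for concreteness: a tail-recursive factorial, and the iterative form that tail call optimization effectively produces, which uses constant stack space and no call overhead.

    #include <stdio.h>

    /* Tail recursive: the recursive call is the last action taken. */
    unsigned long fact_tail(unsigned n, unsigned long acc) {
        if (n <= 1)
            return acc;
        return fact_tail(n - 1, acc * n);   /* no work remains after the call */
    }

    /* The loop a compiler can turn the tail call into. */
    unsigned long fact_loop(unsigned n, unsigned long acc) {
        while (n > 1) {
            acc *= n;
            n -= 1;
        }
        return acc;
    }

    int main(void) {
        printf("%lu %lu\n", fact_tail(10, 1), fact_loop(10, 1));
        return 0;
    }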

Other optimizations

;Bounds-checking elimination: Many languages, such as Java, enforce bounds checking of all array accesses. This is a severe performance bottleneck on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds checking in many situations where it can determine that the index must fall within valid bounds; for example, if it is a simple loop variable.
;Branch offset optimization: Choose the shortest branch displacement that reaches the target.
;Code-block reordering: Code-block reordering alters the order of the basic blocks in a program in order to reduce conditional branches and improve locality of reference.
;Dead code elimination: Removes instructions that will not affect the behaviour of the program, for example definitions which have no uses, called dead code. This reduces code size and eliminates unnecessary computation.
;Factoring out of invariants: If an expression is carried out both when a condition is met and when it is not, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions appear inside a loop, they can be moved out of it, because their effect will be the same whether they are executed many times or just once. This is also known as total redundancy elimination. A more powerful optimization is partial redundancy elimination.
;Inline expansion or macro expansion: When some code invokes a procedure, it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it. This saves the overhead related to procedure calls, as well as providing great opportunity for many different parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures. A "fewer jumps" optimization. The statements of imperative programming languages are also an example of such an optimization. Although statements could be implemented with function calls, they are almost always implemented with code inlining.
;Jump threading: In this pass, consecutive conditional jumps predicated entirely or partially on the same condition are merged.
;Macro compression: A space optimization that recognizes common sequences of code, creates subprograms that contain the common code, and replaces the occurrences of the common code sequences with calls to the corresponding subprograms. This is most effectively done as a machine code optimization, when all the code is present. The technique was first used to conserve space in an interpretive byte stream used in an implementation of Macro Spitbol on microcomputers. The problem of determining an optimal set of macros that minimizes the space required by a given code segment is known to be NP-complete, but efficient heuristics attain near-optimal results.
;Reduction of cache collisions: For example, by disrupting alignment within a page.
;Stack height reduction: Rearrange expression tree to minimize resources needed for expression evaluation.
;Test reordering: If we have two tests that are the condition for something, we can deal first with the simpler, cheaper tests and only then with the complex tests, so that the expensive tests run as rarely as possible. This technique complements lazy evaluation, but can be used only when the tests are not dependent on one another. Short-circuiting semantics can make this difficult.
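A C sketch of test reordering: the two tests are independent, so evaluating the cheap length comparison first lets short-circuit evaluation skip the expensive string scan on most inputs (function names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* Expensive test first: the whole string may be scanned even when the
       cheap length test would have rejected the input immediately. */
    int match_slow(const char *s, int len) {
        return strchr(s, 'x') != NULL && len > 8;
    }

    /* Cheap test first: the scan runs only when the length test passes. */
    int match_fast(const char *s, int len) {
        return len > 8 && strchr(s, 'x') != NULL;
    }

    int main(void) {
        const char *s = "example text";
        int len = (int)strlen(s);
        printf("%d %d\n", match_slow(s, len), match_fast(s, len));
        return 0;
    }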

Interprocedural optimizations

Interprocedural optimization works on the entire program, across procedure and file boundaries. It works in tandem with its intraprocedural counterparts, carried out with the cooperation of a local part and a global part. Typical interprocedural optimizations are: procedure inlining, interprocedural dead code elimination, interprocedural constant propagation, and procedure reordering. As usual, the compiler needs to perform interprocedural analysis before its actual optimizations. Interprocedural analyses include alias analysis, array access analysis, and the construction of a call graph.
Interprocedural optimization is common in modern commercial compilers from SGI, Intel, Microsoft, and Sun Microsystems. For a long time, the open source GCC was criticized for a lack of powerful interprocedural analysis and optimizations, though this is now improving. Another open source compiler with full analysis and optimization infrastructure is Open64.
Due to the extra time and space required by interprocedural analysis, most compilers do not perform it by default. Users must use compiler options explicitly to tell the compiler to enable interprocedural analysis and other expensive optimizations.

Problems with optimization

Early in the history of compilers, compiler-generated code was often worse than code optimized by hand. As compiler technology has improved, good compilers typically generate code good enough that hand-optimizing a program in assembly language is rarely worth the much higher effort.
For RISC CPU architectures, and even more so for VLIW hardware, compiler optimization is the key for obtaining efficient code, because RISC instruction sets are so compact that it is hard for a human to manually schedule or combine small instructions to get efficient results. Indeed, these architectures were designed to rely on compiler writers for adequate performance.
However, optimizing compilers are by no means perfect. There is no way that a compiler can guarantee that, for all program source code, the fastest or smallest possible equivalent compiled program is output; such a compiler is fundamentally impossible because it would solve the halting problem.
This may be proven by considering a call to a function, foo. This function returns nothing and does not have side effects. The fastest possible equivalent program would be simply to eliminate the function call. However, if the function foo in fact does not return, then the program with the call to foo would be different from the program without the call; the optimizing compiler will then have to determine this by solving the halting problem.
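A sketch of that argument in C: if foo never returns, deleting the apparently useless call changes the program's behavior, so deciding whether the deletion is safe amounts to deciding whether foo halts.

    /* foo returns nothing and has no side effects, so removing the call
       looks safe -- unless foo never returns. */
    void foo(void) {
        for (;;) { }        /* never returns */
    }

    int main(void) {
        foo();              /* deleting this call would change behavior: */
        return 0;           /* the original program never reaches here   */
    }
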
In addition, there are a number of more practical issues with optimizing compiler technology. Optimization algorithms are complicated and, especially when used to compile large, complex programming languages, can have bugs that introduce errors into the generated code or cause internal errors during compilation. Compiler errors of any kind can be disconcerting to the user, but especially so in this case, since it may not be clear that the optimization logic is at fault. In the case of internal errors, the problem can be partially ameliorated by a "fail-safe" programming technique in which the optimization logic in the compiler is coded such that a failure is trapped, a warning message is issued, and the rest of the compilation proceeds to successful completion.
Work to improve optimization technology continues. One approach is the use of so-called post-pass optimizers. These tools take the executable output by an "optimizing" compiler and optimize it even further. Post-pass optimizers usually work on the assembly language or machine code level. Their performance is limited by the fact that much of the information available in the original source code is not available to them.
As processor performance continues to improve at a rapid pace, while memory bandwidth improves more slowly, optimizations that reduce memory bandwidth requirements will become more useful. Examples of this, already mentioned above, include loop nest optimization and rematerialization.

History

Early compilers of the 1960s were often primarily concerned with simply compiling code correctly or efficiently, and compile times were a major concern. One of the earliest notable optimizing compilers was that for BLISS, which was described in The Design of an Optimizing Compiler. By the late 1980s, optimizing compilers were sufficiently effective that programming in assembly language declined. This co-evolved with the development of RISC chips and advanced processor features such as instruction scheduling and speculative execution, which were designed to be targeted by optimizing compilers rather than by human-written assembly code.
