Loop nest optimization


In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations to a loop nest for purposes such as improving locality, enabling parallelization, or reducing loop overhead. One classical usage is to reduce memory access latency, or the cache bandwidth otherwise required, by exploiting cache reuse in some common linear algebra algorithms.
The technique used to produce this optimization is called loop tiling, also known as loop blocking or strip mine and interchange.

Overview

Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure that the data used in a loop stays in the cache until it is reused. Partitioning the loop iteration space leads to partitioning a large array into smaller blocks, thus fitting the accessed array elements into the cache, enhancing cache reuse, and reducing the cache size requirement.
An ordinary loop

for (i = 0; i < N; ++i) {
    ...
}

can be blocked with a block size B by replacing it with

for (j = 0; j < N; j += B) {
    for (i = j; i < min(N, j + B); ++i) {
        ...
    }
}

where min is a function returning the minimum of its arguments.
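
C has no built-in integer min, so the blocked loops here assume a small helper along the following lines (an illustrative definition, not part of the original loops):

static inline int min(int a, int b)
{
    /* Return the smaller of the two arguments. */
    return a < b ? a : b;
}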

Example: matrix-vector multiplication

The following is an example of matrix–vector multiplication. There are three arrays: a 100×100 matrix a and 100-element vectors b and c. The code does not partition the arrays into smaller sizes.

int i, j, a[100][100], b[100], c[100];
int n = 100;

for (i = 0; i < n; i++) {
    c[i] = 0;
    for (j = 0; j < n; j++) {
        c[i] = c[i] + a[i][j] * b[j];
    }
}

After we apply loop tiling using 2×2 blocks, our code looks like:

int i, j, x, y, a[100][100], b[100], c[100];
int n = 100;

for (i = 0; i < n; i += 2) {
    c[i] = 0;
    c[i + 1] = 0;
    for (j = 0; j < n; j += 2) {
        for (x = i; x < min(i + 2, n); x++) {
            for (y = j; y < min(j + 2, n); y++) {
                c[x] = c[x] + a[x][y] * b[y];
            }
        }
    }
}

The original loop iteration space is n by n. The accessed chunk of array a is also n by n. When n is too large relative to the cache size of the machine, the array elements accessed in one loop iteration (for example, the row i = 1, columns j = 1 to 100) may cross cache lines, causing cache misses.

Tiling size

It is not always easy to decide what tile size is optimal for a loop, because it demands an accurate estimate of the array regions accessed in the loop and of the cache size of the target machine. The order of the loop nests also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, cache-oblivious algorithms are designed to make efficient use of the cache without explicit blocking.
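
As a rough sketch of the kind of estimate involved, the following hypothetical helper picks the largest square tile size B whose modeled working set fits in a given cache. Both the function and its working-set model (one B × B tile of the matrix plus a B-element strip of each vector, as in the example above) are illustrative assumptions, not a method from the text:

#include <stddef.h>

/* Largest B such that (B*B + 2*B) elements fit in cache_bytes. */
static int pick_tile_size(size_t cache_bytes, size_t elem_bytes)
{
    int b = 1;
    while (((size_t)(b + 1) * (b + 1) + 2 * (size_t)(b + 1)) * elem_bytes
               <= cache_bytes)
        b++;
    return b;
}

For a 16KB data cache this yields B = 44 with 8-byte doubles and B = 63 with 4-byte ints; in practice the estimate would also have to account for the loop order and for cache line and associativity effects.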

Example: matrix multiplication

Many large mathematical operations on computers end up spending much of their time doing matrix multiplication. The operation is:

C = A × B

where A, B, and C are N×N arrays. Subscripts, for the following description, are in the form C[row][column].
The basic loop is:

int i, j, k;

for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
        C[i][j] = 0;
        for (k = 0; k < N; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }
}

There are three problems to solve:

1. Floating-point additions take some number of cycles to complete. To keep an adder with multi-cycle latency busy, the code must update several independent accumulators in parallel.
2. Machines can typically do just one memory operation per multiply–add, so values loaded from memory must be reused at least twice.
3. Typical PC memory systems can sustain only one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.

The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four accumulators simultaneously, this code can keep a single floating-point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem.

for (i = 0; i < N; i += 2) {
    for (j = 0; j < N; j += 2) {
        acc00 = acc01 = acc10 = acc11 = 0;
        for (k = 0; k < N; k++) {
            acc00 += A[i + 0][k] * B[k][j + 0];
            acc01 += A[i + 0][k] * B[k][j + 1];
            acc10 += A[i + 1][k] * B[k][j + 0];
            acc11 += A[i + 1][k] * B[k][j + 1];
        }
        C[i + 0][j + 0] = acc00;
        C[i + 0][j + 1] = acc01;
        C[i + 1][j + 0] = acc10;
        C[i + 1][j + 1] = acc11;
    }
}

This code has had both the i and j iterations blocked by a factor of two, and had both the resulting two-iteration inner loops completely unrolled.
This code would run quite acceptably on a Cray Y-MP, which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!
A machine with a longer floating-point add latency, or with multiple adders, would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3×3 block instead of a 2×2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2×2 block requires 7 registers (four accumulators, two reused values of A, and one value of B in flight); a 3×3 block requires 13 (nine accumulators, three values of A, and one value of B), which will not work on a machine with just 8 floating-point registers in the ISA. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
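For illustration, the inner loop of the 3×3 variant might look like the following sketch (an illustrative sketch, not the article's code; N is taken to be a multiple of 3). Nine accumulators, three reused values of A, and one value of B are live at once, which is where the count of 13 registers comes from:

for (i = 0; i < N; i += 3) {
    for (j = 0; j < N; j += 3) {
        double a0, a1, a2;
        double acc00 = 0, acc01 = 0, acc02 = 0;
        double acc10 = 0, acc11 = 0, acc12 = 0;
        double acc20 = 0, acc21 = 0, acc22 = 0;
        for (k = 0; k < N; k++) {
            /* Three loaded A values, each reused three times. */
            a0 = A[i + 0][k];
            a1 = A[i + 1][k];
            a2 = A[i + 2][k];
            /* Each B value is consumed by all three rows before the
               next is needed, so only one B register is live at a time. */
            acc00 += a0 * B[k][j + 0]; acc10 += a1 * B[k][j + 0]; acc20 += a2 * B[k][j + 0];
            acc01 += a0 * B[k][j + 1]; acc11 += a1 * B[k][j + 1]; acc21 += a2 * B[k][j + 1];
            acc02 += a0 * B[k][j + 2]; acc12 += a1 * B[k][j + 2]; acc22 += a2 * B[k][j + 2];
        }
        C[i + 0][j + 0] = acc00; C[i + 0][j + 1] = acc01; C[i + 0][j + 2] = acc02;
        C[i + 1][j + 0] = acc10; C[i + 1][j + 1] = acc11; C[i + 1][j + 2] = acc12;
        C[i + 2][j + 0] = acc20; C[i + 2][j + 1] = acc21; C[i + 2][j + 2] = acc22;
    }
}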
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This register pressure is why vendors of RISC CPUs, who intended to build machines more parallel than the general purpose x86 and 68000 CPUs, adopted 32-entry floating-point register files.
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded and the entire matrix B is loaded. For the entire calculation, C is stored once and A is loaded into the cache once, but B is loaded N/ib times, where ib is the height of the stripe of the C matrix computed together, for a total of N³/ib doubleword loads from main memory. In the code above, ib is 2; for example, with N = 1000 that is 5×10⁸ doubleword loads, or 4 GB of traffic.
The next step in reducing the memory traffic is to make ib as large as possible. It needs to be larger than the machine's "balance" number (the number of multiply–adds it can sustain per memory operation, as derived from a bandwidth benchmark such as STREAM). In the case of the particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is blocked over i.

for (ii = 0; ii < N; ii += ib) {
    for (j = 0; j < N; j += 2) {
        for (i = ii; i < ii + ib; i += 2) {
            acc00 = acc01 = acc10 = acc11 = 0;
            for (k = 0; k < N; k++) {
                acc00 += A[i + 0][k] * B[k][j + 0];
                acc01 += A[i + 0][k] * B[k][j + 1];
                acc10 += A[i + 1][k] * B[k][j + 0];
                acc11 += A[i + 1][k] * B[k][j + 1];
            }
            C[i + 0][j + 0] = acc00;
            C[i + 0][j + 1] = acc01;
            C[i + 1][j + 0] = acc10;
            C[i + 1][j + 1] = acc11;
        }
    }
}

With this code, we can set ib to be anything we like, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: we are now keeping an N×ib slice of the A matrix in the cache. As long as that fits, this code will not be limited by the memory system.
So what size matrix fits? Our example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib = 20, the slice of the A matrix in this code is 20 × N doublewords, or 160·N bytes, which is larger than the primary cache when N exceeds roughly 100. For problems larger than that, we'll need another trick.
That trick is reducing the size of the stripe of the B matrix by blocking the k loop, so that the stripe is of size ib × kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of 2N³/kb memory transfers. B is still transferred N/ib times, for N³/ib transfers. So long as

2N³/kb + N³/ib < N³/balance, that is, 2/kb + 1/ib < 1/balance,

the machine's memory system will keep up with the floating-point unit and the code will run at maximum performance. The 16KB cache of the Pentium 4 is not quite big enough for that: we might choose ib = 24 and kb = 64, thus using 12KB of the cache (the ib × kb block of A is 24 × 64 doublewords, or 12,288 bytes). We don't want to completely fill the cache, since the C and B arrays have to have some room to flow through. With these numbers, 2/kb + 1/ib ≈ 1/13.7, slightly more traffic per multiply–add than a balance of 16.5 can absorb, which is why this choice comes within about 20% of the peak floating-point speed of the processor rather than reaching it.
Here is the code with loop k blocked.

for (ii = 0; ii < N; ii += ib) {
    for (kk = 0; kk < N; kk += kb) {
        for (j = 0; j < N; j += 2) {
            for (i = ii; i < ii + ib; i += 2) {
                if (kk == 0) {
                    acc00 = acc01 = acc10 = acc11 = 0;
                } else {
                    acc00 = C[i + 0][j + 0];
                    acc01 = C[i + 0][j + 1];
                    acc10 = C[i + 1][j + 0];
                    acc11 = C[i + 1][j + 1];
                }
                for (k = kk; k < kk + kb; k++) {
                    acc00 += A[i + 0][k] * B[k][j + 0];
                    acc01 += A[i + 0][k] * B[k][j + 1];
                    acc10 += A[i + 1][k] * B[k][j + 0];
                    acc11 += A[i + 1][k] * B[k][j + 1];
                }
                C[i + 0][j + 0] = acc00;
                C[i + 0][j + 1] = acc01;
                C[i + 1][j + 0] = acc10;
                C[i + 1][j + 1] = acc11;
            }
        }
    }
}

The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably split the kk == 0 iteration off from the rest of the kk iterations, in order to remove the if statement from the i loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.
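In miniature, that split looks like the following sketch (an illustrative reduction over a 1-D array with hypothetical names, rather than the matrix code): the first block, which must initialize the accumulator, is peeled off, and the steady-state loop then carries no test.

#define KB 64

/* Sum n doubles in blocks of KB, with the kk == 0 block peeled off. */
void sum_blocked(int n, const double *x, double *acc)
{
    int k, kk;
    /* Peeled first block: initializes the accumulator (and handles n < KB). */
    *acc = 0.0;
    for (k = 0; k < (n < KB ? n : KB); k++)
        *acc += x[k];
    /* Remaining blocks: no initialization test inside the loop. */
    for (kk = KB; kk < n; kk += KB)
        for (k = kk; k < (kk + KB < n ? kk + KB : n); k++)
            *acc += x[k];
}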
The blocked matrix-multiply loop above will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has a 256KB high-bandwidth level-2 cache as well as the level-1 cache. We are presented with a choice: tune the blocking again, this time for the level-2 cache size, or give up on tuning for particular cache sizes altogether.
Rather than specifically tune for one particular cache size, as in the first example,
a cache-oblivious algorithm is designed to take advantage of any available cache, no matter what its size is.
This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for matrix multiplication are known.
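
One well-known formulation divides the problem recursively: each multiplication of two square blocks is split into eight half-size multiplications until the blocks are small, so at every level of the memory hierarchy some recursion depth produces blocks that fit. The following is a minimal sketch under stated assumptions (n a power of two, row-major storage with leading dimension ld, C zeroed by the caller; matmul_rec and CUTOFF are illustrative names, not from the article):

/* Cache-oblivious C += A*B for n x n blocks inside matrices with
   leading dimension ld. */
#define CUTOFF 16

static void matmul_rec(int n, const double *A, const double *B,
                       double *C, int ld)
{
    if (n <= CUTOFF) {
        /* Small base case: plain triple loop. */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    } else {
        int h = n / 2;
        /* Quadrant origins: X11 top-left, X12 top-right, and so on. */
        const double *A11 = A,          *A12 = A + h;
        const double *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h;
        const double *B21 = B + h * ld, *B22 = B + h * ld + h;
        double *C11 = C,          *C12 = C + h;
        double *C21 = C + h * ld, *C22 = C + h * ld + h;
        /* C11 += A11*B11 + A12*B21, and similarly for each quadrant. */
        matmul_rec(h, A11, B11, C11, ld);
        matmul_rec(h, A12, B21, C11, ld);
        matmul_rec(h, A11, B12, C12, ld);
        matmul_rec(h, A12, B22, C12, ld);
        matmul_rec(h, A21, B11, C21, ld);
        matmul_rec(h, A22, B21, C21, ld);
        matmul_rec(h, A21, B12, C22, ld);
        matmul_rec(h, A22, B22, C22, ld);
    }
}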