Loop nest optimization

Loop nest optimization

Loop nest optimization (LNO) is a special case of loop transformation, dealing with nested loops, that allows large reductions in the cache bandwidth necessary for some common algorithms.

Example: Matrix multiply

Many large mathematical operations on computers end up spending much of their time doing matrix multiplication. Examining this loop nest can be quite instructive. The operation is:

C = A×B

where A, B, and C are N×N arrays. Subscripts, for the followingdescription, are in the form C [row] [column] .

The basic loop is:

for (i=0; i < N; ++i) { for (j=0; j < N; ++j) { C [i] [j] = 0; for (k=0; k < N; ++k) { C [i] [j] += A [k] [j] * B [i] [k] ; } } }

There are three problems to solve:

* Floating point additions take some number of cycles to complete. In order to keep an adder with multiple cycle latency busy, the code must update multiple accumulators in parallel.

* Machines can typically do just one memory operation per multiply-add, so values loaded must be reused at least twice.

* Typical PC memory systems can only sustain one 8-byte doubleword per 10&ndash;30 double-precision multiply-adds, so values loaded into the cache must be reused many times.

The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply-adds, thus solving problem #2. By carrying four accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)

for (i=0; i < N; i += 2) { for (j=0; j < N; j += 2) { acc00 = acc01 = acc10 = acc11 = 0; for (k=0; k < N; ++k) { acc00 += A [k] [j+0] * B [i+0] [k] ; acc01 += A [k] [j+1] * B [i+0] [k] ; acc10 += A [k] [j+0] * B [i+1] [k] ; acc11 += A [k] [j+1] * B [i+1] [k] ; } C [i+0] [j+0] = acc00; C [i+0] [j+1] = acc01; C [i+1] [j+0] = acc10; C [i+1] [j+1] = acc11; } }

This code has had both the i and j iterations blocked by a factor of two, and had both the resulting two-iteration inner loops completely unrolled.

This code would run quite acceptably on a Cray Y-MP (built in the early 1980s), which can sustain 0.8 multiply-adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply-adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!

A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 blockinstead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating point registers in the ISA. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.

Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This "register pressure" is why vendors of RISC CPUs, who intended to build machines more parallel than the general purpose x86 and 68000 CPUs, adopted 32-entry floating-point register files.

The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of B is loaded and the entire matrix A is loaded. For the entire calculation, C is stored once (that's good), B is loaded into the cache once (assuming a stripe of B fits in the cache with a stripe of A), but A is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N3/ib doubleword loads from main memory. In the code above, "ib" is 2.

The next step to reducing the memory traffic is to make ib as large aspossible. We want it to be larger than the "balance" number reportedby streams. In the case of one particular 2.8 GHz Pentium-4 systemused for this example, the balance number is 16.5. Thesecond code example above can't be extended directly, sincethat would require many more accumulator registers. Instead, we "block"the loop over i. (Technically, this is actually the second time we'veblocked i, as the first time was the factor of 2.)

for( ii=0; ii < N; ii += ib ) for( j=0; j < N; j += 2 ) for( i=ii; i < ii+ib; i += 2 ) { acc00 = acc01 = acc10 = acc11 = 0; for( k=0; k < N; k++ ) { acc00 += A [k] [j+0] * B [i+0] [k] ; acc01 += A [k] [j+1] * B [i+0] [k] ; acc10 += A [k] [j+0] * B [i+1] [k] ; acc11 += A [k] [j+1] * B [i+1] [k] ; } C [i+0] [j+0] = acc00; C [i+0] [j+1] = acc01; C [i+1] [j+0] = acc10; C [i+1] [j+1] = acc11; }

With this code, we can set ib to be anything we like, and the number of loads of the A matrix will be reduced by that factor. This freedom has a cost: we are now keeping a Nxib slice of the B matrix in the cache. So long as that fits, this code will not be limited by the memory system.

So what size matrix fits? Our example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the B matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, we'll need another trick.

That trick is reducing the size of the stripe of the B matrix by blockingthe k loop, so that the stripe is of size ib x kb. Blocking the k loopmeans that the C array will be loaded and stored N/kb times, for a totalof 2*N^3/kb memory transfers. A is still transferred N/ib times, for N^3/ibtransfers. So long as 2*N/kb + N/ib < N/balancethe machine's memory system will keep up with the floating point unit andthe code will run at maximum performance. The 16KB cache of the Pentium4 is not quite big enough: we might choose ib=24 and kb=64, thus using 12KBof the cache -- we don't want to completely fill it, since the C and Aarrays have to have some room to flow through. These numbers comes within20% of the peak floating-point speed of the processor.

Here is the code with loop k blocked.

for( ii=0; ii < N; ii += ib ) for( kk=0; kk < N; kk += kb ) for( j=0; j < N; j+= 2 ) for( i=ii; i < ii+ib; i += 2 ) { if( kk=0 ) acc00 = acc01 = acc10 = acc11 = 0; else { acc00 = C [i+0] [j+0] ; acc01 = C [i+0] [j+1] ; acc10 = C [i+1] [j+0] ; acc11 = C [i+1] [j+1] ; } for( k=kk; k < kk+kb; k++ ) { acc00 += A [k] [j+0] * B [i+0] [k] ; acc01 += A [k] [j+1] * B [i+0] [k] ; acc10 += A [k] [j+0] * B [i+1] [k] ; acc11 += A [k] [j+1] * B [i+1] [k] ; } C [i+0] [j+0] = acc00; C [i+0] [j+1] = acc01; C [i+1] [j+0] = acc10; C [i+1] [j+1] = acc11; }

The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably split the kk=0 iteration off from the rest of the kk iterations, in order to remove the if statement from the i loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.

The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the level-1 cache. We are presented with a choice:

* We can adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2 cache.

* We can block the loops again, again for the level-2 cache sizes. With a total of three levels of blocking (for the register file, for the L1 cache, and for the L2 cache), the code will minimize the required bandwidth at each level of the memory hierarchy. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time consuming than any shortcomings in the hardware's ability to stream data from the L2 cache.

ee also

*Duff's device

External links

* [http://www.cs.virginia.edu/stream/standard/Balance.html Streams benchmark results] , showing the overall balance between floating point operations and memory operations for many different computers

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Loop optimization — In compiler theory, loop optimization plays an important role in improving cache performance, making effective use of parallel processing capabilities, and reducing overheads associated with executing loops. Most execution time of a scientific… …   Wikipedia

  • Loop scheduling — In parallel computing, loop scheduling is the problem of assigning proper iterations of parallelizable loops among n processors to achieve load balancing and maintain data locality with minimum dispatch overhead.Typical loop scheduling methods… …   Wikipedia

  • Compiler optimization — is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the… …   Wikipedia

  • Ant colony optimization algorithms — Ant behavior was the inspiration for the metaheuristic optimization technique. In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems which can be… …   Wikipedia

  • Polytope model — The polyhedral model (also called the polytope method) is a mathematical framework for loop nest optimization in compiler theory. The polytope method models operations within nested manifest loops as mathematical objects called polytopes,… …   Wikipedia

  • Frameworks supporting the polyhedral model — Use of the polyhedral model within a compiler requires software to represent the objects of this framework (sets of integer valued points in regions of various spaces) and perform operations upon them (e.g., testing whether the set is empty). Two …   Wikipedia

  • Manifest expression — A manifest expression is a programming language construct that a compiler can analyse to deduce which values it can take without having to execute the program. This information can enable compiler optimizations, in particular loop nest… …   Wikipedia

  • CPU cache — Cache memory redirects here. For the general use, see cache. A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the… …   Wikipedia

  • Автоматическое распараллеливание — Автоматическое распараллеливание  оптимизация программы компилятором, состоящая в автоматическом ее преобразовании в форму, работающую на параллельном компьютере, например, на SMP или NUMA машине. Целью автоматизации распараллеливания… …   Википедия

  • Automatic parallelization — Automatic parallelization, also auto parallelization, autoparallelization, parallelization, or //ization (shorthand), the last two of which imply automation when used in context, refers to converting sequential code into multi threaded or… …   Wikipedia