# Designing fast linear algebra kernels in the presence of memory hierarchies

This lecture derived from a lecture offered in a course organized by J. Demmel at UCB, CS267 Applications of Parallel Computers. See for the original lecture.

We begin our study of memory hierarchies by asking how fast we can perform certain simple linear algebra "kernel" routines, like matrix multiply. More complicated algorithms are built from these kernels, as we will later describe, so understanding the kernels is essential to understanding the more complicated routines.

The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines or BLAS. The motivation for software and hardware vendors agreeing to a standard definition of these routines is their popularity: The BLAS appear in many applications (as we will see), so if vendors agree on a common interface, and supply an optimized BLAS library for each computer, then users can write programs calling the BLAS, and move their programs from one machine to another and still expect high performance.

For historical reasons, these BLAS as classified into three categories:

The Level 1 BLAS or BLAS1 operate mostly on vectors (1D arrays), or pairs of vectors. If the vectors have length n, these routines perform O(n) operations, and return either a vector or a scalar. Examples are

• A saxpy operation performs y(i) = a * x(i) + y(i), for i=1 to n. We will write this as y = a*x+y for short. Saxpy is an acronym for the operation. S stands for single precision, and there are also daxpy for double precision, caxpy for complex, and zaxpy for double complex, a time-honored Fortran naming convention.
• Scaling computes y = a * x,
• A rotation replaces vectors x and y by c*x+s*y and -s*x+c*y, respectively, where c and s are typically a cosine and sine.
• A dot product computes s = sum_{i=1}^n x(i)*y(i) .
• The Level 2 BLAS or BLAS2 operate mostly on a matrix (2D array) and a vector (or vectors), returning a matrix or a vector. If the array is n-by-n, O(n^2) operations are performed. Here are some examples.

• Matrix-vector multiplication computes y = y + A*x, where A is m-by-n, x is n-by-1 and y is m-by-1. The standard name for this is gemv (with the usual flavors sgemv, etc.)
• A rank-one update computes A = A + y*x', where A is m-by-n, y is m-by-1, x is n-by-1, and x' means the transpose of x. This is a short way of saying A(i,j) = A(i,j) + y(i)*x(j) for all i,j. It is called a rank-one update because y*x' is a rank-one matrix.
• A triangular solve solves y=T*x for x, where T is a triangular matrix.
• The Level 3 BLAS or BLAS3 operate on pairs or triples of matrices, returning a matrix. Here are some examples.

• Matrix-matrix multiplication computes C = C + A*B, where C is m-by-n, A is m-by-k and B is k-by-n. The BLAS routine gemm actually permits the more general operation C = b*C + a*op(A)*op(B), where a and b are scalars, and each op() is an optional transpose operation. Note that the presence of optional transposes lets this single interface supports all indicated operations whether the matrices are stored in column major order (Fortran) or row major order (C).
• A multiple triangular solve solves Y = T*X for X, where T is a triangular matrix, and X is a rectangular matrix.
• Mathematically, many these operations are essentially equivalent. For example, the dot product can be thought of as the product of a 1-by-n vector and an n-by-1 vector, and each entry of A*B is the dot product of a row of A and a column of B. So essentially any linear algebra algorithm can be expressed in terms of any of these constructs. So why distinguish among them? The answer is performance, as indicated on the following plot of BLAS performance on the IBM RS 6000/590.

The slide plots the performance in megaflops of the BLAS on the RS 6000/590, versus matrix or vector dimension. These BLAS are specially optimized by IBM to take advantage of all features of the RS 6000/590 architecture (we will discuss this in more detail later). The top curve is for matrix-matrix multiply (BLAS3), the second highest curve (for large dimension) is matrix-vector multiply (BLAS2), and the lowest curve is for saxpy (BLAS1) on the RS6000. (These plots assume the data is not initially in the cache, or fast memory.) As you can see there is potentially a big speed advantage if an algorithm can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1. The top speed of the BLAS3, about 250 Mflops, is very close to the peak machine speed of 266 Mflops. Later in the course, we will explain how to reorganize algorithms, like Gaussian elimination, so that they use BLAS3 rather than BLAS1.

Historically, the BLAS1 were invented and standardized first, because they were adequate to reach the top performance of the machines at that time (early 1970s). When they no longer reached peak performance on later machines, the BLAS2 were introduced in the mid 1980s, and the BLAS3 in 1990. As we will now explain, the reason for their performance differences is how they depend on the memory hierarchy.

To quantify our compararisons, we will use a simple model of the way the memory hierarchy works: Assume we have two levels of memory hierarchy, fast and slow, and that all data initially resides in slow memory. For each of saxpy, sgemv and sgemm, applied to vectors and matrices of dimension n, we will count

```       m = number of memory references to slow memory needed just to read
the input data from slow memory, and write the output data back
f = number of floating point operations
q = f/m = average number of flops per slow memory reference

m         justification for m                          f        q
saxpy     3*n        read each x(i), y(i) once, write y(i) once  2*n      2/3
sgemv     n^2+O(n)   read each A(i,j) once                      2*n^2      2
sgemm   4*n^2        read each A(i,j),B(i,j),C(i,j) once,       2*n^3     n/2
write each C(i,j) once
```

There is a clear difference in q for each level of BLAS, with BLAS3 the highest. The significance of q is this: for each word read from slow memory (the expensive operation), one can hope to do at most q operations on it (on average) while it resides in fast memory. The higher q is, the more the algorithm appears to operate at the top of the memory hierarchy, where it is most efficient. Another way to describe q is in terms of the miss ratio, the fraction of all memory references (to fast or slow memory) that ``miss'' the fast memory, and need data from the slow memory; a good problem has a low miss ratio. Assuming all input data initially resides in the slow memory, that the final result will also reside in the slow memory, and that each floating point operation involves 2 references to either fast or slow memory, we may bound

```                 number of references to slow memory       m       1
miss ratio = ------------------------------------- >= ----- = ----- .
total number of memory references        2*f     2*q
```
Thus the higher is q, the lower we can hope to make the miss ratio by a clever choice of algorithm.

Let us use a very simple problem to illustrate why a large q means we can hope to run faster. Suppose one can do operations at 1 Mflop on data in the fast memory, and fetching data from slow memory takes 10 times longer per word. Suppose our simple algorithm is

```   s = 0
for i = 1, n
s = s + f(x(i))
end for
```
where evaluating f(x(i)) requires q-1 operations on x(i), all of which can be done in fast memory once x(i) is available. We can assume s also resides in fast memory. So for this algorithm m=n, and f=q*n. How fast does it run? Assuming x(i) is initially in slow memory, this algorithm takes time t = 10*n microsecs (to read each x(i)) plus q*n microsecs (to do q*n flops), for a megaflop rate of f/t = q/(q+10). As q increases, this megaflop rate approaches the top speed of 1 Mflop.

Here are some more details of our simple memory hierarchy model. We will use them to analyze several different algorithms for matrix multiplication (sgemm) in more detail, and in fact design one that uses the memory hierarchy ``optimally''.

1. There are just two levels in the hierarchy, fast and slow.
2. The small, fast memory has size M words, where M << n^2, so we can only fit a small part of an entire n-by-n matrix, but M >= 4*n, so we can fit several whole rows or columns.
3. Each word is read from slow memory individually (in practice, larger groups of words are read, such as cache lines or memory pages, but this doesn't change the basic analysis).
4. We have complete control over which words are transferred between the two levels. This is a best-case assumption, since often the hardware (cache or virtual memory system) makes this decision for us. In parallel computing, however, when the two levels are local processor memory and remote processor memory, we often have explicit control, whether we want it or not.

We consider 3 implementations of matrix multiply and compute q=f/m for each.

### Algorithm 1: Unblocked matrix multiply

```
for i=1 to n
{Read row i of A into fast memory}
for j=1 to n
{Read column j of B into fast memory}
for k=1 to n
C(i,j)=C(i,j) + A(i,k)*B(k,j)
end for
{Write C(i,j) back to slow memory}
end for
end for
```
Note that the innermost loop is doing a dot product of row i of A and column j of B; this is indicated by the following figure:

Also, one can describe two innermost loops (on j and k) as doing a vector-matrix multiplication of the ith row of A times the matrix B, to get the i-th row of C. This is a hint that we will not perform any better than these operations, since they are within the inner loop. Here is the detailed accounting of memory references:

```     m = # slow memory refs =   n^3     read each column of B n times
+ n^2     read each row of A once for each i,
and keep it in fast memory during
the execution of the two inner loops
+ 2*n^2   read/write each entry of C once
= n^3 + 3*n^2
```

Thus q = f/m = (2*n^3)/(n^3 + 3*n^2) ~ 2. This is the same q as for sgemv, which is no surprise as mentioned above. It is far below the upper bound of n/2. We leave it to the reader to show that reading B into fast memory differently cannot significantly improve m and q.

### Algorithm 2: Column blocked matrix multiply

We need some more notation to state the algorithm. Consider the matrix C = [C1, C2, ... , CN] as a set of N blocks, each consisting of n/N complete columns. Partition B the same way. We also need "Matlab" notation for submatrices: A(i:j,k:l) refers to the submatrix in rows i through j and columns k through l; A(i,k:l) is columns k through l of row i, and A(:,j) is all of column j.
```       for j=1 to N
for k=1 to n
{Read column k of A into fast memory}
Cj = Cj + A( :,k ) * Bj( k,: )  ...  rank-1 update of Cj
end for
{Write Cj back to slow memory}
end for
```
Here is a pictorial representation of the algorithm, for N=4:

The inner loop does a rank-one update of Cj, multiplying column k of A times row k of Bj. We assume that the fast memory is large enough to let us keep Bj, Cj and one column of A in it at a time; this means M >= 2*n^2/N + n, or N >= 2*n^2/(M-n) ~ 2*n^2/M, so that N is bounded below. Here is the detailed accounting of memory references:

```     m = # memory refs =   n^2     read each Bj once
+ N*n^2   read each column of A N times
+ 2*n^2   read/write each Cj once
= (N+3)*n^2
```
Thus q = f/m = (2*n^3)/((N+3)*n^2) ~ 2*n/N. We can increase q, and so accelerate the algorithm, by making N as small as possible. But N is bounded below by the memory constraint M >= 2*n^2/N + n. If we choose N as small as possible, so that N ~ 2*n^2/M, then q ~ M/n. Thus we need M to grow proportionally to n to get a reasonable q. So for a fixed M, as n grows, we would expect the algorithm to slow down. This is not an attractive property.

### Algorithm 3: Square blocked matrix multiply (also called 2D blocked, or panel blocked)

Now consider C to be an N-by-N matrix of n/N-by-n/N subblocks Cij, with A and B similarly partitioned.
```        for i=1 to N
for j=1 to N
for k=1 to N
Cij = Cij + Aik * Bkj
end for
{Write Cij back to slow memory}
end for
end for
```

Here is a pictorial representation of the algorithm, for N=4:

Thus, the inner loop is an n/N-by-n/N matrix multiply. We assume that the fast memory is large enough so that the 3 subblocks Cij, Aik and Bkj fit in the fast memory; this means that M >= 3*(n/N)^2, or N >= sqrt(3/M)*n, so that N is bounded below. Here is the detailed accounting of memory references

```     m = # memory refs =   N*n^2   read each Bkj N^3 times
+ N*n^2   read each Aik N^3 times
+ 2*n^2   read/write each Cij once
= (2*N+2)*n^2
```

Thus q = f/m = (2*n^3)/((2*N+2)*n^2) ~ n/N. As before, we can increase q, and so accelerate the algorithm, by making N as small as possible. But N is bounded below by the memory constraint N >= sqrt(M/3)*n. If we choose N as small as possible, so that N ~ sqrt(M/3)*n, then q ~ sqrt(M/3). Now q is independent of n, and in fact essentially optimal:

Theorem: ("I/O Complexity: the red blue pebble game", X. Hong and H. T. Kung, 13th Symposium on the Theory of Computing, ACM, 1981). Any blocked version of this matrix multiplication algorithm has q = Big-Omega( sqrt(M) ), i.e. growing at least as fast as a constant multiple of sqrt(M).

There is a lot more to matrix multiplication, both theoretically and practically.

### Fast serial matrix multiplication algorithms

The algorithm we just described takes 2*n^3 flops. The first asymptotically faster algorithm was discovered by V. Strassen in 1970's, and takes O(n^(log_2 (7))) ~ O(n^2.81) flops (``Gaussian elimination is not optimal'', V. Strassen, Numer. Math. v. 13, 1969). For large enough n, O(n^2.81) is arbitrarily smaller than O(n^3). Here is the algorithm:
```
M = Strassen_Matrix_Multiply(A,B,n)
/* Return M=A*B, where A and B are n-by-n;
Assume n is a power of 2 */
if n=1,
return M=A*B   /* scalar multiplication */
else
Partition A = [ A11  A12 ] , B = [ B11  B12 ]
[ A21  A22 ]       [ B21  B22 ]
where the subblocks Aij and Bij are n/2-by-n/2

M1 = Strassen_Matrix_Multiply( A12 - A22, B21 + B22, n/2 )
M2 = Strassen_Matrix_Multiply( A11 + A22, B11 + B22, n/2 )
M3 = Strassen_Matrix_Multiply( A11 - A21, B11 + B12, n/2 )
M4 = Strassen_Matrix_Multiply( A11 + A12, B22, n/2 )
M5 = Strassen_Matrix_Multiply( A11, B12 - B22, n/2 )
M6 = Strassen_Matrix_Multiply( A22, B21 - B11, n/2 )
M7 = Strassen_Matrix_Multiply( A21 + A22, B11, n/2 )

M11 = M1 + M2 - M4 + M6
M12 = M4 + M5
M21 = M6 + M7
M22 = M2 - M3 + M5 - M7

return M = [ M11  M12 ]
[ M21  M22 ]
end
```
It is straightforward but tedious to show (by induction on n) that this correctly computes the product A*B. If we let T(n) denote the number of floating point operations required to execute Strassen_Matrix_Multiply, then we may write down the simple recurrence:
```       T(n) =    7*T(n/2)   ... the cost of the 7 recursive calls
+ 18*(n/2)^2  ... the cost of the 18 n/2-by-n/2 matrix additions
```
Changing variables from n to m = log_2 n changes this to a simpler recurrence for T'(m) = T(n):
```       T'(m) =    7*T'(m-1) + 18*2^(2*m-2)
```
which can be solved to show that T(n) = O( n^(log_2 (7)) ) ~ n^2.81. Strassen's method is numerically stable, but not always as accurate as the straightforward algorithm (``Exploiting fast matrix multiplication within the Level 3 BLAS'', N. Higham, ACM Trans. Math. Soft., v. 16, 1990, "Stability of block algorithms with fast Level 3 BLAS", J. Demmel and N. Higham, ACM Trans. Math. Soft., v 18, n 3, 1992). Furthermore, it requires a significant amount of workspace to store intermediate results, which is a disadvantage compared to the conventional algorithm. A long sequence of ever more complicated algorithms has appeared (``How Can We Speed Up Matrix Multiplication'' by V. Pan, SIAM Review, v 26, 1984; ``How to multiply matrices faster'' by V. Pan, Lecture Notes in Mathematics v. 179, Springer, 1984; ``Polynomial and matrix computations,'' by V. Pan & D. Bini, Birkhauser, 1994) culminating in the current record algorithm of O(n^2.376...), due to Winograd and Coppersmith ["Matrix multiplication via arithmetic progressions". In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp 1-6, New York City, 25-27 May 1987]. All but the simplest of these algorithms are far too complicated to implement efficiently. IBM's ESSL library includes an implementation of Strassen and of an algorithm by Winograd which trades some of the multiplies in the usual algorithm for more additions, which helps on some architectures. Strassen's method has also been implemented on the Cray 2 ("Extra high speed matrix multiplication on the Cray-2", D. Bailey, SIAM J. Sci. Stat. Comput., v 9, May 1988; "Using Strassen's algorithm to accelerate the solution of linear systems", D. Bailey, K. Lee, H. Simon, J. Supercomputing, v 4, 1991). A potential attraction of Strassen is its natural block decomposition; instead of continuing the recursion until n=1 (as in the above code), one could instead use a more standard matrix multiply as soon as n is small enough to fit all the data in the fast memory.

We will discuss matrix multiplication again later, for distributed memory parallel machines, where part of the algorithm design problem is to decide which processor memories store which parts of A, B and C, and which processors are responsble for computing which parts of the product A*B.

# The IBM RS6000/590 - architecture and algorithms

## Introduction

We continue our study of memory hierarchies and parallelism by considering some details of the RS6000/590 (also called POWER2) architecture, and by showing how to exploit them in order to write an efficient matrix-vector multiplication algorithm. (This will illustrate the issues you need to consider for Assignment 1, for which you will write a matrix-matrix multiplication algorithm for this machine.)

More important than the particular details of the RS6000/590 is that they illustrate many of the issues we will encounter over and over in other aspects of parallel computing. In order to write fast code on the RS6000/590, or any machine, one should exploit:

• Parallelism - Do as many operations simultaneously as possible. For example, the RS6000/590 is a superscalar processor, which means it can execute up to six instructions simultaneously, provided they do not conflict: 2 floating point instructions, 2 fixed point (integer) instructions, a condition register instruction (such as a comparison), and a branch.
• Pipelining - Do long sequences of identical operations by processing them in a pipeline. On the RS6000/590, a single floating point instruction can do a multiply and an add in a pipeline of length 2, yielding a factor 2 in speed up if a long sequence of such operations is available. Pipelining is a special case of parallelism, one that appears very often.
• Locality - Reuse data which is "close to" other data you are using. On the RS6000, when data is fetched from slow memory to fast memory (cache), it comes in 256 byte chunks called cache lines, and a good algorithm will use all this data at the same time, to minimize the number of times the cache line has to be fetched from slow memory.
• Much of the detail of the following discussion is taken from "Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms", R. Agarwal, F. Gustavson, M. Zubair, IBM J. of R&D, v. 38, n. 5, Sept 94 (see reader). The whole issue of this journal discusses details of the RS6000 architecture, and how to optimize programs for it. Many of these articles are on-line, including a description of IBM's numerical subroutine library ESSL. There is also a great deal of material available on tuning the performance of code. For the truly ambitious, a performance monitor is available, which uses special purpose hardware to measure how well each part of the central processing unit (CPU) is performing (ask if you are interested).

The table below shows the stages one might go through in attempting to write a fast matrix multiplication routine for the RS6000. The first stage is to figure out the maximum speed at which the machine runs. We will measure success by how close to this peak speed we can get. The first line in the table is the clock rate, and the second is the peak speed. Since each floating point operation takes 1 cycle, and there is 4-fold parallelism in the machine (2 floating point instructions per cycle times 2-fold pipeline parallelism, as we explain further below), the maximum machine speed is 4 x 66.5 Mflops = 266 Mflops. The other lines give the computation rate for 64-by-64 matrix multiply, computed using several algorithms and compiler options.

Unblocked matmul refers to Algorithm 1 in DMR is a public-domain routine optimized for the RS6000. Finally, ESSL's dgemm is IBM's proprietary optimized matrix-multiply (for double precision data). f77 is the name of the Fortran compiler, and the strings following f77 are various compiler optimization flags.

When unblocked matmul with no optimization is used, the speed is a disappointing 4.5 Mflops, a small fraction of peak speed. Optimization level "-O2" attains 65 Mflops, which is approximately the clock speed. This means that the machine is doing just 1 floating point operation per cycle, rather than the maximum 4. Optimization level "-O3 -qhot" without the "-qstrict" option is one of the most aggressive levels of optimization; when using it, the compiler prints a warning that it may change the semantics of the code in its pursuit of a fast version of the code. DMR compiled this way gave erroneous answers for n>64, and still got only 186/266=70% of the peak machine speed (this was a bug, not a feature, and is meant as an early warning that leading edge technology is not always as reliable as less ambitious technology!).

```          Clock rate:                       66.5 MHz
Peak speed:                      266   Mflops

Unblocked matmul, f77:             4.5 Mflops
Unblocked matmul, f77 -O2:        65   Mflops
Unblocked matmul, f77 -O3 -qhot:  92   Mflops

DMR             , f77 -O2:       152   Mflops
DMR             , f77 -O3 -qhot: 186   Mflops (buggy for n>64)

ESSL dgemm      , f77 -O3 -qhot: 240   Mflops
```

There is yet another optimization level, not illustrated here, which would "pattern match" to discover that the unblocked implementation was really doing matrix multiplication, and replace everything by a call to ESSL's dgemm, which then nearly attains peak speed. This kind of optimization technique is clearly limited in generality.

This table illustrates that one cannot yet expect most compilers to take naive code and produce object code that fully exploits the speed of the machine, expect for a very few special cases where the compiler can recognize an entire algorithm (such as matrix multiply) and replace the user's program with a call to a preexisting library subroutine. Compilers for high-performance machines are a current subject of research, to which we will return in a later chapter. Some research compilers do attack the problem of automatically optimizing code for particular cache organizations. For example, see the SUIF compiler work at Stanford, such as "The Cache Performance and Optimizations of Blocked Algorithms" by M. Lam, E. Rothberg and M. Wolf. See also the work on tiling at UCSD and elsewhere, such as "Hierarchical Tiling for Improved Superscalar Performance", by L. Carter, J. Ferrante and S. Flynn Hummel.

Now we begin discussing the RS6000/590 in more detail, and explain how one would write a matrix-vector multiply routine. Along the way we will learn more about how memory hierarchies work, and how to best use them. Yet more details about the RS6000/590 may be found in the IBM document Algorithm and Architecture Aspects of Producing ESSL BLAS on POWER2, especially the section on BLAS-2 Implementation.

## Details of the RS6000/590 architecture

The Power2 CPU consists of 8 chips: an FPU (Floating Point Unit), FXU (Fixed Point Unit), ICU (Instruction Cache Unit), SCU (Storage Control Unit), and 4 DCU (Data Cache Units).

### Floating Point Unit.

The FPU has 32 64-bit floating point registers, and two independent MAF (Fused-multiply-add) units, which may operate independently and in parallel. The registers form the top level of the memory hierarchy: and floating point instruction take their operands from registers, and return results to registers. The hardware automatically tries to schedule independent instructions to use the next MAF unit available, to exploit parallelism. A fused-multiply-add instruction performs a=b*c+d in two cycles, doing b*c on the first cycle and (b*c)+d on the second. There are separate adders and multipliers, both of which may work simultaneously on different instructions. This is called pipelining, with a 2-stage pipe (multiply and add). To illustrate, consider a sequence of operations ai=bi*ci+di, for i=1 to 4. We assume these operation are independent, i.e. that no output ai is used as input to any other operation. This sequence would be performed in 5 cycles as follows:
```Cycle              1        2           3           4           5

Multiplication   b1*c1    b2*c2       b3*c3       b4*c4
```
Thus, a 2-stage pipeline can perform n operations in n+1 steps. Since 1 "operation" in this case consists of 2 floating point operations, this means 2*n flops are performed in n+1 cycles. Without pipelining, it would take 2*n cycles to perform the 2*n flops, meaning that the speedup is 2*n/(n+1), or nearly 2 for large n.

For this to work, it is important that these operations be independent, i.e. that none of bi, ci, or di is computed as a(i-1) in the previous operation. In this case, the data would not be available in time for the pipeline to use it, and there would be delays (unused cycles).

More generally, an s-stage pipeline can work on as many as s independent operations at a time, taking s+n-1 cycles to finish n operations:

```Cycle      1       2       3  ...   s         s+1     ...  n        ...   s+n-1

Stage 1   op1     op2     op3 ...  ops       op(s+1)  ... opn       ...
Stage 2           op1     op2 ...  op(s-1)   op(s+2)  ... op(n-1)   ...
...                         ...
Stage s                       ...  op1       op2      ... op(n-s+1) ...   opn
```
Without pipelining, the same sequence of operations would take s*n cycles, meaning that the speedup is s*n/(s+n-1), or nearly s when n is much larger than s. Pipelining is a very common way to get speedup, because it does not require multiple copies of the hardware (in this case, one adder and one multipler are all that is needed), and so is ubiquitous in the RS6000/590, in other machines ranging from PCs to supercomputers, and in many factories since the days of Henry Ford.

We just said that when the number of operations n is large compared to the pipeline length s, the speedup s*n/(s+n-1) is nearly s. But for smaller n, the speedup drops. A common way to express this fact is with the quantity n_{1/2}, read "n-one-half", which is the minimum value of n for which the speedup is half its peak value s:

```    s * n_{1/2} / (s + n_{1/2} - 1) = s/2      or    n_{1/2} = s-1
```
This is shown in the following plot:

The concept of "minimum problem size to reach half-peak performance" has become popular for other problems not solved by a single pipelined operation, because their speedup plots often appear as in the above graph: speed increases for larger problem sizes. For example the LINPACK Benchmark, defines n_{1/2} to be the smallest matrix size for which Gaussian elimination runs at half its peak performance.

The fused-multiply-add instruction has one other interesting property: it is performed with one round-off error. In other words, in a=b*c+d, b*c is first computed exactly to quadruple (128 bit) precision, if b and c are double (64 bit) precision, then d is added, and the sum rounded to a. This use of very high precision is used by the RS6000 to implement division, which still takes about 19 times longer then either multiply or add. Square root takes 27 times longer. The fused multiply-add may be used to simulate higher precision cheaply. All arithmetic conforms to the IEEE Floating Point Arithmetic standard which we will discuss later (see the notes on floating point by W. Kahan.

To summarize: to get speed out of the RS6000/590 FPU, one's algorithm needs to execute a sequence of independent operations, each of which is a multiply-add operation a=b*c+d.

### Data Cache Unit.

The cache is the second level in the memory hierarchy of the RS6000/590, and the Data Cache Unit (DCU) manages data movement between the cache and the third level in the hierarchy, which is main memory. We will only concern ourselves with these top three levels: registers, cache and main memory. We will discuss the DCU in some detail, because it illustrates common properties of memory hierarchies, which are important for designing high-performance algorithms.

Here is a brief summary of the main properties of the RS6000/590 cache; the details follow. The cache is 256 KB long, with 256 bytes (32 8-byte double-words) per cache line. It is 4-way set associative. Its write policy is store-back with LRU. It is dual-ported and nonblocking. A cache miss has a latency of 14-20 cycles and takes 8 cycles to fill a cache line. There is a quad load/store instruction.

Here are the details. The cache, or fast memory, is 256 KB long. It is read or written to slow memory in 256 byte chunks called cache lines. A load instruction trying to fetch the data stored in a memory location and put it into a register first checks to see if the data is in the cache. If it is, the cache supplies the requested data in one cycle; this is called a cache hit. Otherwise, there is a cache miss, and a delay of 14-20 cycles before an entire cache line containing the desired data starts arriving, 32 bytes/cycle, from main memory (another pipeline!). Therefore it is much more efficient if an algorithm processes all 256 consecutive bytes (32 consecutive double-words) in a cache-line, since it pays for them to be moved from main memory anyway, rather than only using one or a few words and ignoring the rest. This is called exploiting spatial locality, or using all the data located close together in memory. Movement of data between other levels in the memory hierarchy usually occurs in large chunks (such as pages between disk and main memory), so spatial locality is an important aspect of algorithm design.

As long as there is space in the cache when there is a cache-miss, cache-lines are fetched and stored in cache. The cache is organized in 256 sets with 4 cache-lines per set. 8 consecutive bits of the memory address being fetched are used to pick which of the 256 sets a cache line is to occupy. Since there are 4 lines in a set (4-way set associativity) as many as 4 lines with the same 8 bits in their addresses can simultaneously be in cache.

When a set fills up, and a fifth cache line is fetched, one of the lines currently in the set must be selected to be evicted, or removed from the cache. The one selected is the one Least Recently Used (LRU) - a counter is associated with each cache line to keep track of this - and it is then written back into slow memory. This is an illustration of why an algorithm should exhibit temporal locality, or reuse of data recently used: as long a datum is used frequently enough, it is guaranteed to remain in cache via the LRU mechanism.

When a word is written, only the cache is updated, not the slow memory. The modified cache line is marked as dirty by setting a bit. When a dirty cache line is evicted, it must be moved back to slow memory (clean cache lines can just be overwritten, which is faster). This is called a write-back policy, since slow memory is updated only when a dirty cache line is evicted, in contrast to a write-through policy, where slow memory is updated on every write (this is slower, since it accesses slow memory more often).

The cache is dual-ported and nonblocking. This means that if one cache read misses, and a 14-20 cycle delay ensues while the missing cache line is fetched from slow memory, then there is a second port on the cache which can continue functioning independently and keep the FPU supplied with data.

A load instruction moves an 8-byte word from the cache (if there is a cache-hit) to a register in one cycle. A store instruction moves the data in the opposite direction. A quad load/store instruction moves 2 consecutive 8-byte words from cache to 2 consective memory locations in 1-cycle, which is twice as fast as a standard load/store.

We illustrate the importance of spatial locality with the following example. Consider the following two algorithms to add two 1024-by-1024 double precision arrays, both stored in row-major order (C-style):
```        Algorithm 1                                Algorithm 2

for i = 1 to 1024                           for j = 1 to 1024
for j = 1 to 1024                           for i = 1 to 1024
A(i,j) = A(i,j) + B(i,j)                   A(i,j) = A(i,j) + B(i,j)
end for                                    end for
end for                                    end for
```
These algorithms differ only in the order of their two nested loops, but they differ greatly in performance because of differing spatial locality. Initially, arrays A and B are stored in main memory. A and B each occupy 8*1024^2 = 2^23 bytes, whereas the cache is only 2^18 bytes long, so they do not fit in cache. Since the matrices are stored by rows, Algorithm 1 accesses consecutive memory elements in the innermost loop, thus exhibiting good spatial locality, whereas Algorithm 2 accesses memory elements 8*1024 = 2^11 bytes apart, and so in different cache lines. Thus, in Algorithm 2 there will be 2 cache-misses on every pass through the inner loop (one for A(i,j) and one for B(i,j)), whereas for Algorithm 1 there will only be 2 cache misses for every 32 consecutive values of j in the innermost loop. In other words, Algorithm 2 will probably have 32 times as many cache misses as Algorithm 1, and in fact twice as many cache misses as floating point operations. Since one cache miss takes at least 14 cycles, or 14 times as long as a floating point operation, Algorithm 2 will be many times slower than Algorithm 1.

### Example: Matrix-Vector Multiplication

Here is (slightly simplied) inner loop for double precision matrix-vector multiplication (dgemv), y=y+A*x on the RS6000/590, where A is M-by-N. This code has been designed taking many features of the Data Cache Unit and Floating Point Unit into account. A detailed explanation follows.
```T0 = Y(I) 	! load 24  Y elements
T1 = Y(I+1) 	! in 24 FPRs (floating point registers)
...
T23 = Y(I+23)
DO J = J1, J1+JBLK-1
XJ = X(J)	! load an element of X
F1 = A(I+1,J)	! both F0 and F1 in one instruction
T0 = T0 + XJ*F0	! update Y(I)
T1 = T1 + XJ*F1	! update Y(I+1)
...		! update other 22 Y(.) values
F0 = A(I+22,J)
F1 = A(I+23,J)
T22 = T22 + XJ*F0
T23 = T23 + XJ*F1
ENDDO
Y(I) = T0 	! store Y elements
Y(I+1) = T1 	! after the loop
...
Y(I+23) = T23
```
Let us explain what is going on:
• A is stored in column major format (Fortran).
• Only part of the operation y = y + A*x is shown here. The inside of the J-loop implements a saxpy:
```      Y( I:I+23 ) = Y( I:I+23 ) + A( I:I+23, J ) * X( J )
```
We have used Matlab, or Fortran 90, notation: Y(I:I+23) refers to the subvector consisting of entries I to I+23 of vector Y, and A( I:I+23, J ) refers to the submatrix consisting of rows I to I+23 in column J of A. Therefore, the overall J-loop implements the operation
```      Y( I:I+23 ) = Y( I:I+23 ) + A( I:I+23, J1:J1+JBLK-1 ) * X( J1:J1+JBLK-1 )
```
where A( I:I+23, J1:J1+JBLK-1 ) is the submatrix in rows I to I+23, and columns J1 to J1+JBLK-1, of A. In other words, the J-loop does a 24-by-JBLK-matrix times JBLK-vector multiplication.
• There are also outer loops (not shown) from J1 = 1 to M step JBLK, and I = 1 to N step 24. JBLK is chosen so the block of A fits in cache, along with Y(I:I+23), and a little more:
```     24*JBLK + 25 <= 2^15, or JBLK <= 1364
```
• T0=Y(I) through T23=Y(I+23) loads 24 FPRs (floating point registers) with Y(I:I+23). This reuses the cache line occupied by Y(I). These remain in registers throughout the inner loop.
• In the loop, XJ, F0 and F1 also occupy registers.
• Two consecutive entries of A, say A(I,J) and A(I+1,J), are loaded at the same time with a single quad-load instruction. Only one FXU is needed to get this information.
• Tj and Tj+1 are updated at the same time with two fused-multiply-add instructions.
• The other FXU can be used to prefetch the next column of A, by inserting
```           d = A(I+23,J+2)
```
in the inner loop to get column J+2 ahead of time, completely hiding cache miss time.
• One should choose I carefully so all subblocks begin on cache-line boundaries, to minimize cache misses.