To achieve the necessary reuse of data in local memory, researchers have developed many new methods for computation involving matrices and other data arrays [6, 7, 16]. Typically an algorithm that refers to individual elements is replaced by one that operates on subarrays of data, which are called blocks in the matrix computing field. The operations on subarrays can be expressed in the usual way. The advantage of this approach is that the small blocks can be moved into the fast local memory and their elements can then be repeatedly used.
The standard example is matrix multiplication. The usual program is
mmmmmmmmmm¯The entire computation involves arithmetic operations (counting additions and multiplications separately), but produces and consumes only data values. As a whole, the computation exhibits admirable reuse of data. In general, however, an entire matrix will not fit in a small local memory. The work must therefore be broken into small chunks of computation, each of which uses a small enough piece of the data. Note that for each iteration of the outer loop (i.e., for a given value of i) operations are done and data is referred to -- no reuse. For fixed values of i and j, n computation and n data referred too -- again, no reuse.
for i = 1 to n do
for j = 1 to n do
for k = 1 to n do
c[i,j] = c[i,j] + a[i,k] * b[k,j] ; od
Now consider a blocked matrix-multiply algorithm.
mmmmmmmmmmmmmmmm¯First, note that in this program exactly the same operations are done on the same data; even round-off error is identical. Only the sequence in which independent operations are performed is different from the unblocked program. There is still reuse in the whole program of order n. But if we consider one iteration with fixed i0, j0, and k0, we see that operations are performed (by the three inner loops) and data are referred to. Now we can choose b small enough so that these data will fit in the local memory and thus achieve b-fold reuse. (If this isn't enough -- if b < B in other words -- then the machine is poorly designed and needs more local memory.) Put the other way, if we require B-fold reuse, we choose the block size b = B.
for i0 = 1 to n step b do
for j0 = 1 to n step b do
for k0 = 1 to n step b do
for i = i0 to do
for j = j0 to do
for k = k0 to do
c[i,j] = c[i,j] + a[i,k] * b[k,j] ; od
The subject of this paper is the of ordinary programs to blocked form.
Our motivation for seeking such a capability is as follows. Many algorithms can be blocked. Indeed, recent work by numerical analysts has shown that the most important computations for dense matrices are blockable. A major software development of blocked algorithms for linear algebra has been conducted as a result . Further examples, in particular in the solution of partial differential equations by difference and variational methods, are abundant. Indeed, many such codes have also been rewritten as block methods to better use the small main memory and large solid-state disk on Cray supercomputers . All experience with these techniques has shown them to be enormously effective at squeezing the best possible performance out of advanced architectures.
On the other hand, blocked code is much more difficult to write and to understand than point code. Writing it is a difficult and error-prone job. Blocking introduces block size parameters that have nothing to do with the problem being solved and which must be adjusted for each computer and each algorithm if good performance is to be achieved. Unfortunately, the alternative to having blocked code is worse: poor performance on important computations with the most powerful computers. For these reasons, Kennedy has stated that compiler management of memory hierarchies is the most important and most difficult task facing the writers of compilers for high-performance machines .