For the past 15 years or so, there has been a great deal of
activity in the area of algorithms and software for solving linear
algebra problems.
The linear algebra community has long recognized the need for help
in developing algorithms into software libraries, and
several years ago,
as a community effort, put together a *de facto* standard for
identifying basic operations required in linear algebra algorithms and software.
The hope was that the routines making up this standard, known collectively as
the Basic Linear Algebra Subprograms (BLAS), would be efficiently
implemented on advanced-architecture computers by many manufacturers,
making it possible to reap the portability benefits of having them efficiently
implemented on a wide range of machines. This goal has been largely realized.

The key insight of this approach to designing linear algebra algorithms for advanced architecture computers is that the frequency with which data are moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly-tuned kernels for performing matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, the use of block-partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed to move the data.
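The data-reuse idea behind block-partitioned algorithms can be illustrated with a minimal sketch (this is not the BLAS themselves, and the block size and function names are illustrative): the product C = A*B is computed block by block, so each block is reused for many arithmetic operations once it has been brought into fast memory, rather than being streamed through one vector or scalar at a time.

```python
import numpy as np

def blocked_matmul(A, B, r=64):
    """Block-partitioned matrix multiply of square matrices.

    Each innermost update is a small matrix-matrix product
    (a Level 3 BLAS-style operation), so an r x r block of A, B,
    and C is reused across ~r multiply-adds per element moved.
    The block size r is the tuning parameter discussed in the text.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, r):
        for j in range(0, n, r):
            for k in range(0, n, r):
                # One block update: C_ij += A_ik * B_kj
                C[i:i+r, j:j+r] += A[i:i+r, k:k+r] @ B[k:k+r, j:j+r]
    return C
```

The result is identical to an unblocked multiply; only the order of memory traffic changes, which is precisely why the block size can be tuned per machine without affecting correctness.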

A second key idea is that the performance of an algorithm can be tuned by a user by varying the parameters that specify the data layout. On shared memory machines, this is controlled by the block size, while on distributed memory machines it is controlled by the block size and the configuration of the logical process mesh.

The way in which an algorithm's data are distributed over the processors of
a parallel computer has a major impact on the load balance and communication
characteristics of the parallel algorithm, and hence largely determines
its performance and scalability. The block scattered (or block cyclic)
decomposition provides a simple, yet general-purpose, way of distributing a
block-partitioned matrix on distributed memory parallel computers.
In the block scattered decomposition, described in detail in [26],
a matrix is partitioned into blocks of size r × s,
and blocks separated by a fixed stride in the column and row
directions are assigned to the same processor. If the stride in the
column and row directions is **P** and **Q** blocks respectively, then
we require that P × Q equals the number of processors, N_p. Thus,
it is useful to imagine the processors arranged as a P × Q mesh,
or template. Then the processor at position (p,q) (0 ≤ p < P,
0 ≤ q < Q) in the template is assigned the blocks indexed by

(p + iP, q + jQ),

where i = 0, ..., ⌊(M_b − p − 1)/P⌋, j = 0, ..., ⌊(N_b − q − 1)/Q⌋, and M_b × N_b is the size of the matrix in blocks.
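The block scattered map described above is simple to state in code. The following sketch (with illustrative names, not ScaLAPACK's own routines) assigns global block (I, J) of an M_b × N_b block grid to processor (I mod P, J mod Q), and enumerates the blocks held by a given processor:

```python
def owner(I, J, P, Q):
    """Processor coordinates (p, q) owning global block (I, J)
    under a block-cyclic (block scattered) decomposition on a
    P x Q process template."""
    return (I % P, J % Q)

def blocks_of(p, q, Mb, Nb, P, Q):
    """All global blocks (p + i*P, q + j*Q) assigned to processor
    (p, q), for i = 0..floor((Mb-p-1)/P), j = 0..floor((Nb-q-1)/Q)."""
    return [(p + i * P, q + j * Q)
            for i in range((Mb - p - 1) // P + 1)
            for j in range((Nb - q - 1) // Q + 1)]
```

Every block belongs to exactly one processor, and each processor holds roughly M_b·N_b / (P·Q) blocks, which is the source of the load balance discussed next.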

Blocks are scattered in this way so that good load balance can be maintained in algorithms, such as LU factorization [27,28], in which rows and/or columns of blocks of a matrix are eliminated as the algorithm progresses. However, for some of the distributed Level 3 BLAS routines a scattered decomposition does not improve load balance, and may result in higher concurrent overhead. The general matrix-matrix multiplication routine xGEMM is an example of a routine for which a pure block (i.e., nonscattered) decomposition is optimal when the routine is considered in isolation. However, xGEMM may be used in an application for which, overall, a scattered decomposition is best.
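The load-balance argument can be made concrete with a small illustrative count (the numbers below are an example, not from the paper): after the first k block columns of an LU-style sweep have been eliminated, we tally how many of the remaining block columns each of the Q process columns still holds, under a cyclic map versus a pure block (contiguous) map.

```python
def remaining_per_process_column(Nb, Q, k, cyclic=True):
    """Count active block columns per process column after the first
    k of Nb block columns have been eliminated.

    cyclic=True : block column J lives on process column J mod Q.
    cyclic=False: a pure block map, contiguous slabs of Nb/Q columns.
    """
    counts = [0] * Q
    for J in range(k, Nb):           # block columns still active
        q = J % Q if cyclic else J * Q // Nb
        counts[q] += 1
    return counts
```

With Nb = 12, Q = 4, and k = 6, the cyclic map leaves every process column with one or two active block columns, while the pure block map leaves the first two process columns completely idle, which is why elimination algorithms favor the scattered decomposition.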

The underlying concept of the implementations we have chosen for dense matrix computations is the use of block-partitioned algorithms to minimize data movement between different levels in hierarchical memory. The ideas discussed here for dense linear algebra computations are applicable to any computer with a hierarchical memory that (1) imposes a sufficiently large startup cost on the movement of data between different levels in the hierarchy, and for which (2) the cost of a context switch is too great to make fine-grain multithreading worthwhile. These ideas have been exploited by the software packages LAPACK [7] and ScaLAPACK [29]. The PARKBENCH suite includes five matrix kernels.

- Dense matrix multiply. Communication involves broadcast of data along rows of the mesh, and a periodic shift along the column direction (or vice versa).
- Transpose. Matrix transpose is an important benchmark because it exercises the communications of a computer heavily on a realistic problem where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network.
- Dense LU factorization with partial pivoting. Searching for a pivot is basically a reduction operation within one column of the processor mesh. Exchange of pivot rows is a point-to-point communication. Update phase requires data to be broadcast along rows and columns of the processor mesh.
- QR decomposition. In this benchmark, parallelization is achieved by distributing rows over a logical grid of processors using block interleaving.
- Matrix tridiagonalization, for eigenvalue computations of symmetric matrices.

Tue Nov 14 15:43:14 PST 1995