The increasing availability of advanced-architecture computers is having a significant effect on all areas of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra--in particular, the solution of linear systems of equations--lies at the heart of most calculations in scientific computing. This paper discusses some of the recent developments in linear algebra designed to exploit these advanced-architecture computers. Particular attention will be paid to dense factorization routines, such as the Cholesky and LU factorizations, and these will be used as examples to highlight the most important factors that must be considered in designing linear algebra software for advanced-architecture computers. We use these factorization routines for illustrative purposes not only because they are relatively simple, but also because of their importance in several scientific and engineering applications that make use of boundary element methods. These applications include electromagnetic scattering and computational fluid dynamics problems.

Much of the work in developing linear algebra software for advanced-architecture computers is motivated by the need to solve large problems on the fastest computers available. In this paper, we focus on four basic issues: (1) the motivation for the work; (2) the development of standards for use in linear algebra and the building blocks for a library; (3) aspects of algorithm design and parallel implementation; and (4) future directions for research.

For the past 15 years or so, there has been a great deal of
activity in the area of algorithms and software for solving linear
algebra problems.
The linear algebra community has long recognized the need for help
in developing algorithms into software libraries, and
several years ago,
as a community effort, put together a *de facto* standard for
identifying basic operations required in linear algebra algorithms and software.
The hope was that the routines making up this standard, known collectively as
the Basic Linear Algebra Subprograms (BLAS), would be efficiently
implemented by many manufacturers, so that software built on them
would be both portable and fast across a wide range of machines.
This goal has been largely realized.

The key insight of our approach to designing linear algebra algorithms for advanced-architecture computers is that the frequency with which data are moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly tuned kernels for performing matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, the use of block-partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed to move the data.
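As an illustration of block partitioning, the following pure-Python sketch (an unoptimized stand-in for a tuned kernel) organizes the matrix product C += A*B over square blocks of side nb, so that each block is brought into fast memory once and reused for an entire block-block product:

```python
def blocked_matmul(A, B, C, nb):
    """Compute C += A*B for n x n matrices, traversed in nb x nb blocks.

    In a tuned library the innermost three loops would be a single
    Level 3 BLAS (matrix-matrix) kernel call on the current blocks.
    """
    n = len(A)
    for ii in range(0, n, nb):          # block row of C
        for jj in range(0, n, nb):      # block column of C
            for kk in range(0, n, nb):  # inner block dimension
                # Multiply block A[ii:ii+nb, kk:kk+nb] by B[kk:kk+nb, jj:jj+nb]
                # and accumulate into C[ii:ii+nb, jj:jj+nb].
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

The same total number of multiply-adds is performed as in the unblocked triple loop; what changes is the order of the memory traffic, with data moving as nb x nb blocks rather than as individual scalars.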

A second key idea is that the performance of an algorithm can be tuned by the user by varying the parameters that specify the data layout. On shared-memory machines, performance is controlled by the block size, while on distributed-memory machines it is controlled by the block size and the configuration of the logical process mesh, as described in more detail in Section 5.
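One common layout that fits this description is the block-cyclic distribution. As a concrete illustration (the function name and interface below are our own, for illustration only), the one-dimensional mapping sends global index g, with block size nb on p processes, to a process rank and a local index; applying the same formula independently to rows and columns gives a two-dimensional block-cyclic layout over a logical process mesh:

```python
def block_cyclic(g, nb, p):
    """Map global index g to (process, local_index) for block size nb on p processes.

    Contiguous blocks of nb indices are dealt out to processes round-robin,
    so each process owns every p-th block.
    """
    block = g // nb            # which global block g falls in
    proc = block % p           # blocks are dealt out cyclically
    local_block = block // p   # how many earlier blocks this process owns
    return proc, local_block * nb + (g % nb)
```

Varying nb trades off load balance (small blocks spread work evenly) against per-block efficiency (large blocks make better use of Level 3 kernels), which is precisely the tuning knob the text describes.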

In Section 1, we first give an overview of some of the major
software projects aimed at solving dense linear algebra problems. Next, we
describe the types
of machine that benefit most from the use of block-partitioned algorithms, and
discuss what is meant by high-quality, reusable software for
advanced-architecture computers.
Section 2 discusses the role of the BLAS
in portability and performance on high-performance computers.
We discuss the
design of these building blocks, and their use in block-partitioned
algorithms, in Section 3.
Section 4 focuses on the design of a
block-partitioned algorithm for LU factorization, and
Sections 5, 6, and 7
use this example to illustrate the most important factors in implementing
dense linear algebra routines on MIMD distributed-memory concurrent computers.
Section 5 deals with the issue of mapping the data onto
the hierarchical memory of a concurrent computer.
The layout of an application's data is crucial in determining the performance
and scalability of the parallel code.
In Sections 6 and 7,
details of the parallel implementation and optimization issues are discussed.
Section 8 presents some future directions for investigation.
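To make the block-partitioned approach concrete before Section 4 develops it in detail, the following pure-Python sketch outlines a right-looking blocked LU factorization (pivoting is omitted for brevity; a production routine would include partial pivoting, and the panel and update steps would be calls to tuned Level 2 and 3 BLAS kernels):

```python
def blocked_lu(A, nb):
    """Overwrite A in place with its LU factors (unit-diagonal L below, U above).

    Right-looking variant: at each step, factor an nb-wide panel, then
    update the trailing submatrix with matrix-matrix operations.
    No pivoting is performed, so A must not produce zero pivots.
    """
    n = len(A)
    for kk in range(0, n, nb):
        kend = min(kk + nb, n)
        # 1. Unblocked LU of the panel: columns kk..kend-1, rows kk..n-1.
        for k in range(kk, kend):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, kend):
                    A[i][j] -= A[i][k] * A[k][j]
        # 2. Triangular solve with the unit-lower diagonal block:
        #    rows kk..kend-1, columns kend..n-1 become the U12 block.
        for k in range(kk, kend):
            for i in range(k + 1, kend):
                for j in range(kend, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # 3. Rank-nb update of the trailing submatrix: A22 -= L21 * U12.
        #    This matrix-matrix step dominates the work and maps to Level 3 BLAS.
        for i in range(kend, n):
            for k in range(kk, kend):
                for j in range(kend, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

For any block size nb the same factors are produced as by the unblocked algorithm; the difference is that most of the arithmetic is concentrated in step 3, the matrix-matrix update, which is exactly where a tuned Level 3 kernel pays off.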
