ScaLAPACK is a library of high performance linear algebra routines
for distributed memory MIMD computers. It is a continuation of the
LAPACK project, which designed and produced analogous software
for workstations, vector supercomputers, and shared memory parallel
computers. Both libraries contain routines for solving systems of
linear equations, least squares problems, and eigenvalue problems.
The goals of both projects are efficiency (to run as fast as
possible), scalability (as the problem size and number of
processors grow), reliability (including error bounds),
portability (across all important parallel machines),
flexibility (so users can construct new routines from
well-designed parts), and ease-of-use (by making LAPACK
and ScaLAPACK look as similar as possible). Many of these goals,
particularly portability, are aided by developing and promoting
*standards*, especially for low-level communication and
computation routines. We have been successful in attaining these
goals, limiting most machine dependencies to two standard
libraries: the BLAS, or Basic Linear Algebra Subprograms
[16][14][7][6], and the BLACS, or Basic Linear Algebra
Communication Subprograms [10][8]. LAPACK and
ScaLAPACK will run on any machine where the BLAS and the BLACS
are available.
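
To make the role of these standards concrete, here is a minimal sketch that calls the BLAS matrix-multiply routine DGEMM from C through the CBLAS bindings. It assumes a CBLAS implementation and its header are installed; the BLAS themselves expose a Fortran 77 interface, and the CBLAS layer shown here is one common way to reach it from C.

```c
#include <stdio.h>
#include <cblas.h>  /* C bindings to the BLAS; assumes a CBLAS installation */

int main(void) {
    /* Compute C = alpha*A*B + beta*C for 2x2 matrices in row-major order. */
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0, 0.0,
                   0.0, 0.0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K     */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb      */
                0.0, C, 2);     /* beta, C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```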

The first part of this paper presents the design of ScaLAPACK. After a brief discussion of the BLAS and LAPACK, we describe the block cyclic data layout, the BLACS, the PBLAS (Parallel BLAS), and the algorithms used. We also outline the difficulties encountered in producing correct code for networks of heterogeneous processors, difficulties we believe are little recognized by other practitioners.
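
As a preview of the block cyclic layout discussed there, the following sketch computes the standard one-dimensional block cyclic mapping from a global matrix index to an owning process and a local index; the two-dimensional layout used by ScaLAPACK applies this mapping independently to the row and column dimensions of the process grid. The names used here (nb, nprocs) are illustrative, not identifiers from ScaLAPACK itself.

```c
#include <stdio.h>

/* One-dimensional block cyclic mapping: global index -> (process, local).
 * Global blocks of size nb are dealt out to nprocs processes in turn,
 * like cards; each process stores its blocks contiguously. */
static void block_cyclic(int g, int nb, int nprocs, int *proc, int *loc) {
    int block = g / nb;            /* which global block the index is in     */
    *proc = block % nprocs;        /* blocks are assigned to processes cyclically */
    *loc  = (block / nprocs) * nb  /* full local blocks stored before this one */
          + g % nb;                /* offset within the current block        */
}

int main(void) {
    const int nb = 2, nprocs = 3;  /* block size 2 distributed over 3 processes */
    for (int g = 0; g < 12; g++) {
        int p, l;
        block_cyclic(g, nb, nprocs, &p, &l);
        printf("global %2d -> process %d, local %d\n", g, p, l);
    }
    return 0;
}
```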

The second part of this paper presents a theoretical model of parallel computers dedicated to dense linear algebra: the Distributed Linear Algebra Machine (DLAM). This ideal model provides a convenient framework for developing parallel algorithms. Moreover, it can be applied to obtain theoretical performance bounds and to analyze the scalability and programmability of parallel algorithms.
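
The DLAM itself is defined later in the paper; as a stand-in, the following sketch shows the general shape of such a cost model for LU factorization, splitting the predicted time into an arithmetic term, a bandwidth term, and a latency term. The coefficients and communication counts below are assumptions made for the sketch, not the DLAM's definitions.

```c
#include <stdio.h>
#include <math.h>

/* Illustrative latency-bandwidth cost model for parallel LU factorization
 * (an assumed model for this sketch, not the paper's actual DLAM):
 *   t_f : time per floating-point operation
 *   t_v : time per word communicated (inverse bandwidth)
 *   t_m : startup (latency) cost per message
 */
static double lu_time(double n, double p,
                      double t_f, double t_v, double t_m) {
    double flops = 2.0 * n * n * n / 3.0;  /* LU factorization flop count  */
    return flops / p * t_f                 /* arithmetic, evenly divided   */
         + n * n / sqrt(p) * t_v           /* volume along a grid row/col  */
         + n * t_m;                        /* O(n) messages, critical path */
}

int main(void) {
    /* Example machine parameters (illustrative only). */
    double t_f = 1e-8, t_v = 8e-7, t_m = 1e-4;
    for (int p = 1; p <= 256; p *= 4)
        printf("n = 2000, p = %3d: predicted time %.3f s\n",
               p, lu_time(2000.0, (double)p, t_f, t_v, t_m));
    return 0;
}
```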

Finally, the paper discusses the performance of ScaLAPACK.
Extensive results on various platforms are presented. One
of our goals is to model and predict the performance of each
routine as a function of a few problem and machine parameters.
We show how the DLAM can be used to express this function,
identify performance bottlenecks during development, and
help users to choose various implementation parameters
(like the number of processors) to optimize performance.
One interesting result is that for some algorithms, speed is
*not* a monotonically increasing function of the number of
processors. In other words, speed can be increased by letting
some processors remain idle.
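
To see how a model of this kind can produce that behavior, the following sketch extends the cost model above with a latency term that grows with the processor count (each of the O(n) messages on the critical path is assumed to take part in a broadcast of about log2(p) steps, an assumption made purely for illustration). The modeled speed rises, peaks at an interior value of p, then falls; none of the numbers are measured ScaLAPACK results.

```c
#include <stdio.h>
#include <math.h>

/* Modeled speed versus processor count under an illustrative model:
 * the latency term grows with p, so past some point adding processors
 * adds more communication time than it removes arithmetic time. */
static double modeled_speed(double n, double p) {
    const double t_f = 1e-8;   /* s per flop (illustrative)    */
    const double t_v = 8e-7;   /* s per word (illustrative)    */
    const double t_m = 1e-4;   /* s per message (illustrative) */
    double flops = 2.0 * n * n * n / 3.0;
    double time  = flops / p * t_f
                 + n * n / sqrt(p) * t_v
                 + n * (log2(p) + 1.0) * t_m;  /* grows with p */
    return flops / time;       /* flop/s achieved under the model */
}

int main(void) {
    for (int p = 1; p <= 4096; p *= 2)
        printf("p = %4d: modeled speed %8.1f Mflop/s\n",
               p, modeled_speed(1000.0, (double)p) / 1e6);
    return 0;
}
```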
