Current advanced architecture computers are NUMA (Non-Uniform Memory Access) machines. They possess hierarchical memories, in which accesses to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) are faster than those in lower levels (shared or off-processor memory). One technique to more efficiently exploit the power of such machines is to develop algorithms that maximize reuse of data in the upper levels of memory. This can be done by partitioning the matrix or matrices into blocks and by performing the computation with matrix-vector or matrix-matrix operations on the blocks. A set of BLAS (Level 2 and 3 BLAS)  were proposed for that purpose. The Level 3 BLAS have been successfully used as the building blocks of a number of applications, including LAPACK , which is the successor to LINPACK  and EISPACK . LAPACK is a software library that uses block-partitioned algorithms for performing dense and banded linear algebra computations on vector and shared memory computers.
The scalable library we are developing for distributed-memory concurrent computers will also use block-partitioned algorithms and be as compatible as possible with the LAPACK library for vector and shared memory computers. It is therefore called ScaLAPACK (``Scalable LAPACK'') , and can be used to solve ``Grand Challenge'' problems on massively parallel, distributed-memory, concurrent computers .
The Basic Linear Algebra Communication Subprograms (BLACS)  provide ease-of-use and portability for message-passing in parallel linear algebra applications. The Parallel BLAS (PBLAS) assume a block cyclic data distribution and are functionally an extended subset of the Level 1, 2, and 3 BLAS for distributed memory systems. They are based on previous work with the Parallel Block BLAS (PB-BLAS) . The current model implementation relies internally on the PB-BLAS, as well as the BLAS and the BLACS. The ScaLAPACK routines consist of calls to the sequential BLAS, the BLACS, and the PBLAS modules. ScaLAPACK can therefore be ported to any machine on which the BLAS and the BLACS are available.
This paper presents the implementation details, performance, and scalability of the ScaLAPACK routines for the LU, QR, and Cholesky factorization of dense matrices. These routines have been studied on various parallel platforms by many other researchers . We maintain compatibility between the ScaLAPACK codes and their LAPACK equivalents by isolating as much of the distributed memory operations as possible inside the PBLAS and ScaLAPACK auxiliary routines. Our goal is to simplify the implementation of complicated parallel routines while still maintaining good performance.
Currently the ScaLAPACK library contains Fortran 77 subroutines for the analysis and solution of systems of linear equations, linear least squares problems, and matrix eigenvalue problems. ScaLAPACK routines to reduce a real general matrix to Hessenberg or bidiagonal form, and a symmetric matrix to tridiagonal form are considered in .
The design philosophy of the ScaLAPACK library is addressed in Section 2. In Section 3, we describe the ScaLAPACK factorization routines by comparing them with the corresponding LAPACK routines. Section 4 presents more details of the parallel implementation of the routines and performance results on the Intel family of computers: the iPSC/860, the Touchstone Delta, and the Paragon. In Section 5, the scalability of the algorithms on the systems is demonstrated. Conclusions and future work are presented in Section 6.