Linear Algebra

Jack Dongarra (director), Herb Keller, Roldan Pozo, Danny Sorensen, and Eric Van de Velde

Several areas in linear algebra must be addressed to allow wide use of parallel computing. These areas include the development of LAPACK for distributed-memory machines, dense nonsymmetric eigenvalue problems, parallel algorithms for large-scale eigenvalue problems, sparse linear least squares, multigrid algorithms, sparse linear systems, and linear algebra for signal processing.

Jack Dongarra holds a joint appointment with the University of Tennessee (UT) and Oak Ridge National Laboratory (ORNL) under the UT/ORNL Science Alliance Program. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computers, programming methodology, and tools for parallel computers. Current research also involves the development, testing, and documentation of high-quality mathematical software. He was involved in the design and implementation of the software packages EISPACK, LINPACK, the BLAS, LAPACK, and PVM and is currently involved in the design of algorithms and techniques for high-performance computer architectures. He is also co-editor-in-chief of the International Journal of Supercomputer Applications and an editor for Parallel Computing, the Journal of Distributed and Parallel Computing, the Journal of Supercomputing, IMPACT of Computing in Scientific Applications, Communications of the ACM, Research Monographs on Parallel and Distributed Computing, and the Journal of Numerical Linear Algebra with Applications.

Framework for Library Development on Advanced Architectures

One of the important areas for software development is the design and implementation of a mathematical software library that is scalable on parallel computers. Researchers are developing a prototype scalable library for solving the major problems of numerical linear algebra. They are initially looking at dense matrix algorithms and will later focus on sparse structures.

LAPACK for Distributed-memory Machines

The module approach of LAPACK for shared-memory MIMD architectures is being adapted for distributed-memory MIMD architectures. It would be beneficial to have the same software superstructure for the solution of linear equations on both classes of parallel machines. The Linear Algebra group plans to use algorithms based upon blocking techniques that take advantage of surface-to-volume effects and allow effective reuse of data in local memory. This approach can lead to optimal algorithms both on supercomputers with modest amounts of parallelism and on distributed-memory machines, such as the hypercube, where the levels of parallelism are much greater. Portability will be achieved without sacrificing performance by expressing the algorithms in terms of a collection of kernel routines that are available and efficient across a wide range of architectures.
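As a rough illustration of the surface-to-volume effect these blocked algorithms exploit, the sketch below (illustrative C++, not LAPACK code; the routine name and the row-major layout are assumptions) updates C += A*B one block at a time. Each b-by-b block update performs O(b^3) arithmetic on O(b^2) data, so larger blocks mean more reuse of the data held in local memory.

    // Illustrative sketch only: a blocked matrix-matrix multiply C += A*B.
    // Each b-by-b block update does O(b^3) arithmetic on O(b^2) data, so the
    // ratio of computation to data movement grows with the block size b.
    #include <vector>
    #include <algorithm>

    void blocked_gemm(int n, int b,
                      const std::vector<double>& A,   // n*n, row-major
                      const std::vector<double>& B,   // n*n, row-major
                      std::vector<double>& C)         // n*n, row-major
    {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    // Update block C(ii:ii+b, jj:jj+b); on a distributed-memory
                    // machine these three blocks are the only data a processor
                    // needs to hold for this O(b^3) piece of work.
                    for (int i = ii; i < std::min(ii + b, n); ++i)
                        for (int k = kk; k < std::min(kk + b, n); ++k)
                            for (int j = jj; j < std::min(jj + b, n); ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }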

Previous efforts have focused on the development of a set of linear algebra communication routines for specifying and manipulating the distributed data structures that arise in these algorithms. The result has been a proposed standard for the implementation of a scalable communication library, which is the first step in seeking a community consensus on such standards.

Message-passing Standards

Researchers are involved in the formation of a standard for performing message passing between pairs of processes in a MIMD distributed-memory computing system. A small set of typed message-passing routines forms the core of the standard and is augmented by support for features such as non-contiguous messages, communication contexts, and process groups.

The project provides parallel hardware vendors with a clearly defined base set of routines that can be efficiently implemented. As a result, the vendors can build upon this collection of standard low-level routines to create higher-level routines for the distributed-memory communication environments supplied with their parallel machines. Researchers are developing a simple-to-use interface that is powerful enough to allow programmers to use the high-performance message-passing operations available on advanced machines.
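As a minimal sketch of the flavor of such an interface, the example below uses the C bindings of MPI, the standard that emerged from this effort, to pass one typed message between two processes inside a communicator (a communication context bound to a process group). It illustrates only the core point-to-point routines and is not CRPC code.

    // Minimal example of typed point-to-point message passing with MPI.
    // Process 0 sends a typed message to process 1 within the default
    // communicator, which combines a context with the group of all processes.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank within the process group

        double payload[4] = {1.0, 2.0, 3.0, 4.0};
        const int tag = 99;

        if (rank == 0) {
            // Typed send: the count/datatype pair describes the message.
            MPI_Send(payload, 4, MPI_DOUBLE, /*dest=*/1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(payload, 4, MPI_DOUBLE, /*source=*/0, tag, MPI_COMM_WORLD,
                     &status);
            std::printf("rank 1 received %g %g %g %g\n",
                        payload[0], payload[1], payload[2], payload[3]);
        }

        MPI_Finalize();
        return 0;
    }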

In an effort to create a "true" standard for message passing, researchers incorporated into the message-passing interface the most useful features of several existing systems. Rather than adopting one system as the standard, they drew features from the vendor systems of IBM, Intel, and nCUBE, as well as from Express, p4, and PARMACS. The message-passing paradigm is attractive because it is widely portable: it can be used for communication on distributed-memory and shared-memory multiprocessors, on networks of workstations, and on any combination of these elements. The paradigm will not be made obsolete by increases in network speeds or by architectures combining shared- and distributed-memory components.

Large-scale Eigenvalue Problems

Iterative methods for solving very large-scale algebraic eigenvalue problems and linear systems are essential to the numerical solution of many very large-scale applications. Basic research on algorithms for the iterative solution of eigenproblems and linear systems is still needed, and nonsymmetric problems clearly require the most attention. During the first three years after the establishment of the CRPC, an approach to the numerical solution of large-scale eigenproblems was developed based upon a new variant of Arnoldi's method. This approach has provided a unified computational framework for the development of numerical software for the iterative solution of large-scale eigenproblems.

The Arnoldi process is a well-known technique for approximating a few eigenvalues and corresponding eigenvectors of a general square matrix. Researchers have developed ARPACK, a software package based upon an implicit restarting scheme that may be viewed as a truncation of the standard implicitly shifted QR iteration for dense problems. The numerical difficulties and storage problems normally associated with the Arnoldi and Lanczos processes are avoided. The algorithm is capable of computing a few (k) eigenvalues with user-specified features, such as largest real part or largest magnitude, using only 2nk + O(k^2) storage. No auxiliary storage is required. A set of Schur basis vectors for the k-dimensional eigenspace is computed that is numerically orthogonal to working precision.
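For readers unfamiliar with the underlying factorization, the following sketch builds a k-step Arnoldi factorization A*V_k = V_k*H_k + f_k*e_k^T with dense storage and modified Gram-Schmidt. It is illustrative C++ only (hypothetical names, no breakdown handling) and omits the implicit restarting, deflation, and reorthogonalization that ARPACK adds on top of this basic process.

    // Illustrative k-step Arnoldi factorization.  Eigenvalues of the leading
    // k-by-k part of H approximate eigenvalues of A.
    #include <vector>
    #include <cmath>
    #include <cstddef>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;            // Mat[j] is column j

    // y = A*x for a dense n-by-n matrix stored by rows (stand-in for the
    // user-supplied matrix-vector product).
    Vec matvec(const std::vector<double>& A, const Vec& x)
    {
        const int n = static_cast<int>(x.size());
        Vec y(n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                y[i] += A[i * n + j] * x[j];
        return y;
    }

    double dot(const Vec& a, const Vec& b)
    {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Build k+1 orthonormal basis vectors V and a (k+1)-by-k Hessenberg
    // matrix H (stored by columns) such that A*V(:,0:k-1) = V*H.
    void arnoldi(const std::vector<double>& A, const Vec& v0, int k,
                 Mat& V, Mat& H)
    {
        const int n = static_cast<int>(v0.size());
        V.assign(k + 1, Vec(n, 0.0));
        H.assign(k, Vec(k + 1, 0.0));

        double beta = std::sqrt(dot(v0, v0));
        for (int i = 0; i < n; ++i) V[0][i] = v0[i] / beta;

        for (int j = 0; j < k; ++j) {
            Vec w = matvec(A, V[j]);
            for (int i = 0; i <= j; ++i) {   // modified Gram-Schmidt step
                H[j][i] = dot(V[i], w);
                for (int l = 0; l < n; ++l) w[l] -= H[j][i] * V[i][l];
            }
            // Breakdown (an invariant subspace was found) is not handled here.
            H[j][j + 1] = std::sqrt(dot(w, w));
            for (int l = 0; l < n; ++l) V[j + 1][l] = w[l] / H[j][j + 1];
        }
    }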

ARPACK, which is based upon an implementation of this algorithm, has been designed to be efficient on a variety of high-performance computers. Parallelism within the scheme is obtained primarily through the matrix-vector operations that comprise the majority of the work in the algorithm. The software is capable of solving large-scale symmetric, nonsymmetric, and generalized eigenproblems from significant application areas. ARPACK has been used to solve problems from quantum chemistry, reactive scattering, stability of fluid flow, design of dielectric wave guides, and structural analysis with some problems as large as 250K degrees of freedom.

Continuing research activities involve adapting these techniques to the solution of large, sparse nonsymmetric and symmetric-indefinite linear systems. Researchers are also developing preconditioning techniques for eigenvalue calculations. The use of preconditioners has proven highly effective in accelerating iterative methods for sparse linear systems; however, it has not received the attention it deserves in the context of large-scale nonsymmetric eigenvalue problems. Polynomial preconditioning has been one popular alternative for eigenproblems, but recent work indicates that other kinds of preconditioners, such as relaxation, incomplete factorization, or multigrid, might also be very useful in solving some nonsymmetric eigenvalue problems.
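As a hedged illustration of the polynomial idea, the sketch below applies a Chebyshev polynomial filter p(A)v through the standard three-term recurrence: eigencomponents belonging to eigenvalues inside the damped interval [a, b] are suppressed relative to those outside it, which can accelerate convergence toward the wanted part of the spectrum. The function names, the dense matrix-vector product, and the choice of filter are assumptions for illustration, not the specific preconditioners studied in this project.

    // Illustrative Chebyshev polynomial filter y = T_m(l(A)) v, where
    // l(t) = (t - c)/e maps the unwanted interval [a, b] onto [-1, 1].
    // Chebyshev polynomials stay bounded on [-1, 1] and grow rapidly
    // outside it, so components outside [a, b] are amplified.
    #include <vector>

    using Vec = std::vector<double>;

    // Matrix-vector product supplied by the application; a dense matrix
    // stored by rows is assumed purely for illustration.
    Vec apply(const std::vector<double>& A, const Vec& x)
    {
        const int n = static_cast<int>(x.size());
        Vec y(n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                y[i] += A[i * n + j] * x[j];
        return y;
    }

    Vec chebyshev_filter(const std::vector<double>& A, const Vec& v,
                         double a, double b, int degree)
    {
        const int n = static_cast<int>(v.size());
        const double e = (b - a) / 2.0;      // half-width of damped interval
        const double c = (b + a) / 2.0;      // its center

        Vec y_prev = v;                      // T_0(l(A)) v
        Vec y = apply(A, v);                 // T_1(l(A)) v = l(A) v
        for (int i = 0; i < n; ++i) y[i] = (y[i] - c * v[i]) / e;

        for (int j = 2; j <= degree; ++j) {  // three-term recurrence
            Vec t = apply(A, y);
            for (int i = 0; i < n; ++i)
                t[i] = 2.0 * (t[i] - c * y[i]) / e - y_prev[i];
            y_prev = y;
            y = t;
        }
        return y;                            // caller renormalizes as needed
    }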

ScaLAPACK++

ScaLAPACK++ is an object-oriented extension of the LAPACK++ library designed to support distributed dense, banded, and sparse matrix operations for the symmetric, positive-definite, and nonsymmetric cases. Researchers have initially focused on the most common factorizations for dense systems: LU, QR, and Cholesky. The intent is that, for large-scale problems, these ScaLAPACK++ routines should effectively exploit the computational hardware of medium-grained multicomputers with up to a few thousand processors, such as the Intel Paragon and Thinking Machines Corporation's CM-5.

Among the important design goals of ScaLAPACK++ are scalability, portability, flexibility, and ease of use. Such goals present serious challenges, as the layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. To enhance the programmability of the library, details of the parallel implementation will be hidden as much as possible, while the user retains the ability to control the data distribution. The ScaLAPACK++ release will include a general two-dimensional matrix decomposition that supports the most common block/scattered schemes currently in use. ScaLAPACK++ can be extended to support arbitrary matrix decompositions by providing a parallel BLAS library that operates on matrices stored in that decomposition.

Decoupling the matrix operations from the details of the decomposition not only simplifies the encoding of an algorithm but also allows the possibility of postponing the decomposition until runtime. Often, the optimal matrix decomposition is dependent on how the matrix is utilized in other parts of the driving application code. Furthermore, it may be necessary to dynamically alter the matrix decomposition at runtime to accommodate special routines.

The currently supported decomposition scheme defines global matrix objects that are distributed across a logical grid of processors. Matrices are mapped to processors using a block-scattered class of decompositions that allows a wide variety of matrix mappings while enhancing scalability and maintaining good load balance for various dense factorization algorithms. At each node of the multicomputer, a sequential LAPACK++ library provides the object-oriented framework for describing block algorithms on the conventional (local) matrices held by that processor.
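A minimal sketch of the block-scattered idea, assuming the simplest block-cyclic layout: global block (I, J) of a matrix is assigned to process (I mod P, J mod Q) of a logical P-by-Q grid and stored there as local block (I / P, J / Q). The struct and function names below are hypothetical, not the ScaLAPACK++ interface.

    // Illustrative block-scattered (block-cyclic) mapping of global matrix
    // blocks onto a logical P-by-Q process grid.
    #include <cstdio>

    struct Owner {
        int prow, pcol;      // coordinates in the process grid
        int lrow, lcol;      // local block coordinates on that process
    };

    Owner owner_of_block(int I, int J, int P, int Q)
    {
        return Owner{ I % P, J % Q, I / P, J / Q };
    }

    int main()
    {
        const int P = 2, Q = 3;          // 2-by-3 process grid
        const int nblocks = 4;           // 4-by-4 grid of matrix blocks

        for (int I = 0; I < nblocks; ++I)
            for (int J = 0; J < nblocks; ++J) {
                Owner o = owner_of_block(I, J, P, Q);
                std::printf("block (%d,%d) -> process (%d,%d), local (%d,%d)\n",
                            I, J, o.prow, o.pcol, o.lrow, o.lcol);
            }
        return 0;
    }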

Parallelism is exploited through the use of distributed-memory versions of the Basic Linear Algebra Subprograms (BLAS), which perform the basic computational units of the block algorithms. Thus, at the higher level, the block algorithms look the same in the parallel and sequential versions, and only one version of each needs to be maintained.
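A hedged sketch of that design point: the block-level driver below is written against a small BLAS-like interface and is templated on the back end that supplies it. Only a naive sequential back end is shown; a distributed back end would keep the same interface while operating on distributed blocks, leaving the driver unchanged. All names here are illustrative, not the ScaLAPACK++ or BLAS interfaces.

    // One block algorithm, two BLAS back ends: the driver never changes.
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    struct Block {                       // a dense b-by-b block, row-major
        int b;
        std::vector<double> a;
        Block(int bb) : b(bb), a(bb * bb, 0.0) {}
        double& operator()(int i, int j)       { return a[i * b + j]; }
        double  operator()(int i, int j) const { return a[i * b + j]; }
    };

    struct SequentialBlas {              // a distributed back end would mirror this
        // C := C - A * B (the basic computational unit of the block algorithms)
        void gemm_sub(const Block& A, const Block& B, Block& C) const {
            for (int i = 0; i < C.b; ++i)
                for (int k = 0; k < C.b; ++k)
                    for (int j = 0; j < C.b; ++j)
                        C(i, j) -= A(i, k) * B(k, j);
        }
    };

    // Block algorithm written only in terms of the BLAS-like interface:
    // the trailing-submatrix update C[i][j] -= A[i] * B[j] of one
    // elimination step of a blocked factorization.
    template <class Blas>
    void trailing_update(const Blas& blas,
                         const std::vector<Block>& Acol,
                         const std::vector<Block>& Brow,
                         std::vector<std::vector<Block>>& C)
    {
        for (std::size_t i = 0; i < Acol.size(); ++i)
            for (std::size_t j = 0; j < Brow.size(); ++j)
                blas.gemm_sub(Acol[i], Brow[j], C[i][j]);
    }

    int main()
    {
        const int b = 2;
        std::vector<Block> Acol(2, Block(b)), Brow(2, Block(b));
        std::vector<std::vector<Block>> C(2, std::vector<Block>(2, Block(b)));
        Acol[0](0, 0) = 1.0; Brow[0](0, 0) = 2.0;            // tiny smoke test
        trailing_update(SequentialBlas{}, Acol, Brow, C);
        std::printf("C[0][0](0,0) = %g\n", C[0][0](0, 0));   // prints -2
        return 0;
    }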