University of Tennessee, Knoxville Progress Report 95/96

ScaLAPACK Software for Linear Equations

The research project ``A scalable parallel library for numerical linear algebra'' comprised a number of closely related topics and involved researchers at several institutions. ScaLAPACK is being developed at Oak Ridge National Laboratory, the University of Tennessee at Knoxville, and the University of California at Berkeley. It is a library of software for performing dense linear algebra computations on message-passing computers.

This research led to a number of important new software tools and standards. Recognizing that a message-passing standard was necessary to ensure the easy portability of the prototype libraries, the project helped initiate and promote the development of MPI (the Message Passing Interface). Standards specific to parallel linear algebra were also developed. The PBLAS (Parallel Basic Linear Algebra Subprograms) are message-passing versions of most of the sequential BLAS routines and have a similar interface. The BLACS (Basic Linear Algebra Communication Subprograms) are a set of routines for communicating rectangular and trapezoidal sub-matrices between processes. The BLACS and PBLAS are important building blocks of the ScaLAPACK library.
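The BLACS and PBLAS operate on matrices distributed over a logical process grid in block-cyclic fashion. As a rough illustration (a sketch, not actual BLACS/PBLAS code), the one-dimensional block-cyclic mapping from a global index to an owning process and local index can be written as:

```python
def block_cyclic_owner(i, nb, p):
    """Map global index i (0-based) to (owning process, local index)
    under a 1-D block-cyclic distribution with block size nb over p processes."""
    block = i // nb            # global block number containing i
    proc = block % p           # blocks are dealt to processes round-robin
    local_block = block // p   # index of this block among the owner's blocks
    offset = i % nb            # position of i within its block
    return proc, local_block * nb + offset

# Example: 10 elements, block size 2, 3 processes.
# Blocks {0,1}, {2,3}, {4,5}, {6,7}, {8,9} go to processes 0, 1, 2, 0, 1.
owners = [block_cyclic_owner(i, 2, 3)[0] for i in range(10)]
```

ScaLAPACK generalizes this to two dimensions (a block-cyclic distribution over both rows and columns of a process grid), which balances load while keeping blocks large enough for efficient local BLAS operations.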

The initial public release of ScaLAPACK occurred last year, and included routines for the solution of a general system of linear equations via LU and Cholesky factorizations, orthogonal factorizations (QR, RQ, LQ, and QL), reduction to condensed form (upper Hessenberg, tridiagonal, and bidiagonal), and the symmetric eigenproblem.
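The mathematics underlying the first of these drivers, solving a general system via LU factorization with partial pivoting, can be sketched sequentially in plain Python (a serial sketch for illustration only, not ScaLAPACK's distributed implementation):

```python
def lu_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (the sequential analogue of ScaLAPACK's parallel LU solve)."""
    n = len(A)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry in column k
        # to the diagonal for numerical stability.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# 2x + y = 3 and x + 3y = 5 have the solution x = 0.8, y = 1.4.
x = lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

In the parallel library the same elimination is blocked and distributed, so that the bulk of the work becomes matrix-matrix updates performed by the PBLAS.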

A subsequent release earlier this year increased the functionality of the package with routines for symmetric positive definite banded linear systems of equations; condition estimation and iterative refinement for LU and Cholesky factorizations; full-rank linear least squares problems; generalized orthogonal factorizations; reduction of a symmetric-definite generalized eigenproblem to standard form; and the generalized symmetric and nonsymmetric eigenproblems.

Since these releases, work has been under way on many improvements to increase the flexibility and functionality of the library. Flexibility is being increased by expanding the spectrum of PBLAS operations. For instance, the ability to operate on non-aligned data not only greatly simplifies the writing of ScaLAPACK codes but also makes it possible to express divide-and-conquer algorithms in terms of calls to ScaLAPACK and PBLAS routines. The functionality of ScaLAPACK is being expanded by the introduction of new routines for rank-deficient linear least squares, generalized linear least squares, the singular value decomposition, general banded systems, and tridiagonal solvers.

A number of new scalable algorithms have been developed. A new version of the dense symmetric eigenvalue routine PDSYEVX is faster in most common cases and eliminates the need to reorthogonalize in the clustered eigenvalue case. A second eigensolver was also developed that is slower than PDSYEVX, but is more reliable.
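As a self-contained illustration of what a dense symmetric eigensolver does, here is a sketch of the classical cyclic Jacobi method, which repeatedly annihilates off-diagonal entries with plane rotations. (This is illustrative only; it is not the algorithm PDSYEVX uses.)

```python
import math

def jacobi_eigenvalues(A, sweeps=20):
    """Eigenvalues of a symmetric matrix by the classical cyclic Jacobi method.
    Each rotation zeroes one off-diagonal pair; sweeps drive the matrix
    toward diagonal form, whose entries are the eigenvalues."""
    n = len(A)
    A = [row[:] for row in A]  # work on a copy
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                # Rotation angle chosen to annihilate A[p][q].
                theta = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # apply rotation to rows p and q
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
                for k in range(n):   # apply rotation to columns p and q
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
    return sorted(A[i][i] for i in range(n))

# [[2, 1], [1, 2]] has eigenvalues 1 and 3.
vals = jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]])
```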

We continued our investigation of new parallel algorithms for the nonsymmetric eigenvalue problem. The matrix sign function yields an algorithm for the dense nonsymmetric eigenvalue problem. It is relatively straightforward to parallelize compared to the conventional sequential algorithm, but performs several times as many floating-point operations. It is the only scalable algorithm for this problem currently available. We have explored its numerical properties, performance, algorithmic variations, and suitability for mixed parallelism; this work was reported in a number of publications.
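One common way to compute the matrix sign function is the Newton iteration S ← (S + S⁻¹)/2, which maps eigenvalues in the right half-plane to +1 and those in the left half-plane to −1; the resulting projectors split the spectrum. The following is a small pure-Python sketch of that iteration (for illustration, not the project's actual implementation):

```python
def mat_inv(A):
    """Invert a small matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity: [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        piv = M[k][k]
        M[k] = [v / piv for v in M[k]]
        for i in range(n):
            if i != k and M[i][k] != 0.0:
                f = M[i][k]
                M[i] = [a - f * b for a, b in zip(M[i], M[k])]
    return [row[n:] for row in M]  # right half is A^-1

def matrix_sign(A, iters=50):
    """Newton iteration S <- (S + S^-1)/2, converging to sign(A)
    for matrices with no purely imaginary eigenvalues."""
    n = len(A)
    S = [row[:] for row in A]
    for _ in range(iters):
        Sinv = mat_inv(S)
        S = [[(S[i][j] + Sinv[i][j]) / 2 for j in range(n)] for i in range(n)]
    return S

# diag(2, -3) has sign diag(+1, -1).
S = matrix_sign([[2.0, 0.0], [0.0, -3.0]])
```

Because each step is dominated by a matrix inversion (in practice an LU factorization and solve), the iteration maps naturally onto existing parallel dense kernels, which is what makes the approach scalable despite its higher operation count.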

Some of the ScaLAPACK technology for dense linear algebra computations is being transferred to the commercial marketplace. IBM Corporation and Cray Research have incorporated parts of ScaLAPACK into their parallel scientific subroutine libraries, and NAG Inc. has incorporated parts of ScaLAPACK into its parallel library.

Sparse and Parallel BLAS

The existing BLAS have proven very effective in the development of portable, efficient software for sequential machines and some of the current class of high-performance computers. In November 1995, we organized a workshop along with colleagues from the Rutherford Appleton Laboratory and Cray Research. The purpose of the workshop was to investigate the possibility of extending the currently accepted standards to provide greater coverage of sparse matrices and additional facilities for parallel computing. In particular, we considered standardizing a set of Parallel BLAS along the lines of the existing BLAS for both the dense and sparse cases.

The goal of the workshop was to stimulate thought, discussion, and comment on the future development of a set of standards for basic matrix data structures, both dense and sparse, as well as calling sequences for a set of low-level computational kernels for the parallel and sequential settings. These new ``standards'' are needed to complement and supplement the existing ones for sparse and parallel computation. One of the major aims of these standards will be to enable linear algebra libraries (both public domain and commercial) to interoperate efficiently and easily.
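A concrete example of what such a standard must cover is the storage format of a sparse matrix together with the kernels defined on it. The sketch below shows a matrix-vector product over the widely used compressed sparse row (CSR) layout (an illustrative kernel, not a proposed interface from the workshop):

```python
def csr_matvec(indptr, indices, data, x):
    """y = A x for a sparse matrix in compressed sparse row (CSR) form:
    indptr[i]:indptr[i+1] delimits row i's nonzeros, whose column indices
    and values are stored in indices and data."""
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
y = csr_matvec([0, 2, 3, 5], [0, 2, 1, 0, 2],
               [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
```

Agreeing on such data structures and calling sequences is precisely what would let public domain and commercial libraries interoperate without copying or converting matrices at every interface.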

The meeting was attended by 52 people. There were eight sessions of talks with a total of 28 talks. The principal observations and conclusions reached in this workshop were as follows:

Additional information, including many of the presentations for the workshop, can be found here.

As a result of this meeting and a follow-up meeting held at Supercomputing '95 in San Diego, it was decided to organize a series of meetings, similar to the MPI Forum, to continue the BLAS standardization work. We expect this activity to be ongoing through FY 1997. Refer to the BLAS Technical Forum homepage for further information.

Templates for Eigenvalue Computations

A new generation of even more massively parallel computers will soon emerge. Concurrent with the development of these more powerful parallel systems is a shift in the computing practices of many scientists and researchers. Increasingly, the tendency is to use a variety of distributed computing resources, with each individual task assigned to the most appropriate architecture, rather than a single, monolithic machine. The pace of these two developments, the emergence of highly parallel machines and the move to a more distributed computing environment, has been so rapid that software developers have been unable to keep up. Part of the problem has been that supporting system software has inhibited this development. Consequently, exploiting the power of these technological advances has become more and more difficult. Much of the existing reusable scientific software, such as that found in commercial libraries and in public domain packages, is no longer adequate for the new architectures. If the full power of these new machines is to be realized, then scalable libraries, comparable in scope and quality to those that currently exist, must be developed.

One of our goals as software designers is to communicate to the high-performance computing community algorithms and methods for the solution of systems of linear equations. In the past we have provided black-box software in the form of mathematical software libraries, such as LAPACK, LINPACK, NAG, and IMSL. These software libraries provide for:

On the other hand, the high-performance computing community, which wants to solve complex, large-scale problems as quickly as possible, seems to want

These differing priorities make for different approaches to algorithms and software. The first set of priorities leads us to produce ``black boxes'' for general problem classes. The second set of priorities seems to lead us to produce ``template codes'' or ``toolboxes'' from which users can assemble, modify, and tune building blocks starting from well-documented subparts. This leads to software that is not guaranteed to be reliable on all problems and requires extensive user tuning, which is just what the black-box users do not want.

Our eigenvalue template design will expect that a non-expert user confronted with an eigenvalue problem would be willing to supply

In return, the user would want

In the (likely) case that no single best algorithm exists, we expect the user would want a list of reasonable alternatives and a discussion of the tradeoffs in time, space and accuracy among them. For example, it might be reasonable to use a dense algorithm on a sparse problem if high accuracy is desired, the problem is not too large, and/or the problem is to be solved just once. Much of our effort will center on educating the user as to which facts must be supplied to make a decision. To this end we have decided to categorize available methods along five (mostly independent) axes:

  1. Mathematical Properties of the Problem,
  2. Desired Spectral Information,
  3. Problem Representation,
  4. Available Transformations, and
  5. Costs (including dependence on accuracy, computer type).

Ideally, the user would supply a ``coordinate'' for each of these five axes, thus uniquely identifying a problem class for which we could identify a ``best algorithm,'' software implementing it, a performance model to predict its execution time, and a method to assess the accuracy.
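Such a coordinate-to-algorithm lookup might be sketched as follows. All axis values, method names, and table entries here are hypothetical placeholders for illustration; they are not the report's actual decision table.

```python
# Hypothetical five-axis classification: each coordinate tuple is
# (mathematical properties, desired spectral information,
#  problem representation, available transformations, cost constraints).
# The entries are illustrative placeholders only.
RECOMMENDATIONS = {
    ("symmetric", "all eigenvalues", "dense", "similarity", "moderate"):
        "tridiagonal reduction + QR iteration",
    ("symmetric", "few extreme eigenvalues", "sparse", "matvec only", "low"):
        "Lanczos iteration",
    ("nonsymmetric", "all eigenvalues", "dense", "similarity", "high"):
        "Hessenberg reduction + QR iteration",
}

def recommend(coordinate):
    """Return the catalogued method for a coordinate, or None if that
    region of the five-dimensional space has no catalogued algorithm."""
    return RECOMMENDATIONS.get(coordinate)

method = recommend(("symmetric", "few extreme eigenvalues", "sparse",
                    "matvec only", "low"))
```

The sparsity of such a table is exactly the point made below: most coordinates have no useful entry, which is why a decision-tree presentation that guides the user toward the populated regions is more practical than an exhaustive map.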

Realistically, only a few regions of this five-dimensional space are populated with interesting or useful algorithms, and we expect to take a more decision-tree-like approach to guide the user's expectations. This work will continue into FY 1997.