next up previous contents index
Next: Sparse BLAS Up: Matrix-Vector and Matrix-Matrix Multiplications Previous: Matrix-Vector and Matrix-Matrix Multiplications   Contents   Index


For the past 15 years or so, there has been a great deal of activity in the area of algorithms and software for solving linear algebra problems. The linear algebra community has long recognized the need for help in developing algorithms into software libraries, and several years ago, as a community effort, put together a de facto standard for identifying basic operations required in linear algebra algorithms and software. The hope was that the routines making up this standard, known collectively as the Basic Linear Algebra Subprograms (BLAS), would be efficiently implemented on advanced-architecture computers by many manufacturers, making it possible to reap the portability benefits of having them efficiently implemented on a wide range of machines. This goal has been largely realized.

The key insight of our approach to designing linear algebra algorithms for advanced architecture computers is that the frequency with which data are moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly tuned kernels for performing matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, the use of block-partitioned algorithms requires data to be moved as blocks, rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed to move the data.

A second key idea is that the performance of an algorithm can be tuned by a user by varying the parameters that specify the data layout. On shared memory machines, this is controlled by the block size, while on distributed memory machines it is controlled by the block size and the configuration of the logical process mesh.

The question arises, ``How can we achieve sufficient control over these various factors that effect performance to obtain the levels of performance that machines can offer?'' The answer is through use of the BLAS.

There are now three levels of BLAS:

Level 1 BLAS [288]: for vector operations, such as $y \leftarrow \alpha x + y$;

Level 2 BLAS [133]: for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$;

Level 3 BLAS [134]: for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$.

Here, $A$, $B$, and $C$ are matrices; $x$ and $y$ are vectors; and $\alpha$ and $\beta$ are scalars.

The Level 1 BLAS are used in packages like LAPACK, for convenience rather than for performance: they perform an insignificant fraction of the computation, and they cannot achieve high efficiency on most modern supercomputers.

The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single processor of a CRAY X-MP or Y-MP, or a Convex C-2 machine. However, on other vector processors such as a CRAY-2 or an IBM 3090 VF, the performance of the Level 2 BLAS is limited by the rate of data movement between different levels of memory.

The Level 3 BLAS overcome this limitation. This third level of BLAS performs $O(n^3)$ floating point operations on $O(n^2)$ data, whereas the Level 2 BLAS perform only $O(n^2)$ operations on $O(n^2)$ data. The Level 3 BLAS also allow us to exploit parallelism in a way that is transparent to the software that calls them. While the Level 2 BLAS offer some scope for exploiting parallelism, greater scope is provided by the Level 3 BLAS.

To a great extent, the user community embraced the BLAS, not only for performance reasons, but also because developing software around a core of common routines like the BLAS is good software engineering practice. Highly efficient machine-specific implementations of the BLAS are available for most modern high-performance computers. The BLAS have enabled software to achieve high performance with portable code.

In the spirit of the earlier BLAS meetings and the standardization efforts of the Massage-Passing Interface (MPI) and high-performance Fortran (HPF) forums, a technical forum was established to consider expanding the BLAS in the light of modern software, language, and hardware developments. The BLAS Technical Forum meetings began with a workshop in November 1995 at the University of Tennessee. Meetings were hosted by universities and software and hardware vendors.

Various working groups were established to consider issues such as overall functionality, language interfaces, sparse BLAS, distributed memory dense BLAS, extended and mixed precision BLAS, interval BLAS, and extensions to the existing BLAS. The rules of the forum were adopted from those used for the MPI and HPF forums.

A major aim of the standards defined in this document are to enable linear algebra libraries (both public domain and commercial) to interoperate efficiently, reliably, and easily. We believe that hardware and software vendors, higher level library writers, and application programmers all benefit from the efforts of this forum and are the intended end users of these standards.

Further information about the BLAS and BLAS Technical Forum can be obtained on the book's homepage, ETHOME. The BLAS Technical Forum webpage can be found at the URL:

next up previous contents index
Next: Sparse BLAS Up: Matrix-Vector and Matrix-Matrix Multiplications Previous: Matrix-Vector and Matrix-Matrix Multiplications   Contents   Index
Susan Blackford 2000-11-20