
The BLAS as the Key to Portability

How then can we hope to achieve sufficient control over vectorization, data movement, and parallelism in portable Fortran code to obtain the levels of performance that machines can offer?

The LAPACK strategy for combining efficiency with portability is to construct the software as much as possible out of calls to the BLAS (Basic Linear Algebra Subprograms); the BLAS are used as building blocks.

The efficiency of LAPACK software depends on efficient implementations of the BLAS being provided by computer vendors (or others) for their machines. Thus the BLAS form a low-level interface between LAPACK software and different machine architectures. Above this level, almost all of the LAPACK software is truly portable.

There are now three levels of BLAS:

Level 1 BLAS [78]:
for vector operations, such as $y \leftarrow \alpha x + y$

Level 2 BLAS [42]:
for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$

Level 3 BLAS [40]:
for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$

Here, A, B and C are matrices, x and y are vectors, and $\alpha$ and $\beta$ are scalars.

The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they perform an insignificant fraction of the computation, and they cannot achieve high efficiency on most modern supercomputers.

The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single processor of a CRAY Y-MP, CRAY C90, or CONVEX C4 machine. However, on other vector processors, such as a CRAY 2, or on a RISC workstation or PC with one or more levels of cache, their performance is limited by the rate of data movement between different levels of memory.

This limitation is overcome by the Level 3 BLAS, which perform $O(n^3)$ floating-point operations on $O(n^2)$ data, whereas the Level 2 BLAS perform only $O(n^2)$ operations on $O(n^2)$ data.

The BLAS also allow us to exploit parallelism in a way that is transparent to the software that calls them. Even the Level 2 BLAS offer some scope for exploiting parallelism, but greater scope is provided by the Level 3 BLAS, as Table 3.1 illustrates.

Table 3.1: Speed in megaflops of Level 2 and Level 3 BLAS operations on an SGI Origin 2000
(all matrices are of order 1000; U is upper triangular)

                                                Number of processors
                                                  1     2     4     8    16
Level 2: $y \leftarrow \alpha A x + \beta y$    210   154   385   493   163
Level 3: $C \leftarrow \alpha A B + \beta C$    555  1021  1864  3758  4200
Level 2: $x \leftarrow U x$                     162   152   148   153   137
Level 3: $B \leftarrow U B$                     518   910  1842  3519  5171
Level 2: $x \leftarrow U^{-1} x$                172    80   140   168    17
Level 3: $B \leftarrow U^{-1} B$                519   967  1859  3507  5078

Susan Blackford