The total number of floating-point operations performed by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C_f N^3$, where $C_f$ is a constant and *N* is the order of the largest matrix operand. For solving linear equations or linear least squares problems, $C_f$ is a constant depending solely on the selected algorithm. The algorithms used to find eigenvalues and singular values are iterative; hence, for these operations, the constant $C_f$ also depends on the input data. It is, however, customary to consider "standard" values of the constants $C_f$ for a fixed number of iterations. The "standard" constants $C_f$ range from 1/3 to 27, as shown in Table 4.

The performance of the ScaLAPACK drivers is thus bounded above by the performance of a computation that could be partitioned into *p* independent chunks of $C_f N^3 / p$ flops each. This upper bound is referred to hereafter as the *peak performance* and can be computed as the product of *p* and the highest reachable local processor flop rate. Hence, for a given problem size *N* and assuming a uniform distribution of the computational tasks, the most important factors determining the overall performance are the number *p* of processors involved in the computation and the local processor flop rate.
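As a rough illustration of this bound, the sketch below evaluates the approximate flop count, the peak performance, and the resulting lower bound on run time. The function names are hypothetical, and the sample values (a constant of 2/3, the usual leading-order count for LU factorization, and a local rate of 50 Mflop/s) are assumptions chosen only for illustration.

```python
def flop_count(c_f: float, n: int) -> float:
    """Approximate operation count C_f * N^3 for a dense driver routine."""
    return c_f * n ** 3


def peak_performance(p: int, local_rate: float) -> float:
    """Upper bound on the aggregate flop rate: p times the best local rate."""
    return p * local_rate


def min_time(c_f: float, n: int, p: int, local_rate: float) -> float:
    """Lower bound on run time: p processes each execute C_f*N^3/p flops
    at the highest reachable local rate, with no communication cost."""
    return flop_count(c_f, n) / peak_performance(p, local_rate)


if __name__ == "__main__":
    # Assumed sample values, not taken from the text.
    c_f, n, p, rate = 2.0 / 3.0, 10_000, 64, 50e6
    print(f"flops  : {flop_count(c_f, n):.3e}")
    print(f"peak   : {peak_performance(p, rate):.3e} flop/s")
    print(f"time >= {min_time(c_f, n, p, rate):.1f} s")
```

Any measured run time can be compared against `min_time` to see how far a given configuration is from the peak; the gap reflects communication, load imbalance, and local rates below the best achievable one.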

In a serial computational environment, *transportable efficiency* is the essential motivation for developing blocking strategies and block-partitioned algorithms [2, 3, 14, 27]. The linear algebra package LAPACK [3] is the archetype of such a strategy. The LAPACK software is constructed as much as possible out of calls to the Basic Linear Algebra Subprograms (BLAS). These kernels confine the impact of machine architecture differences to a small number of routines. The efficiency and portability of the LAPACK software are then achieved by combining native, efficient BLAS implementations with portable high-level components.

The BLAS are subdivided into three levels, each of which offers increased scope for exploiting parallelism. This subdivision corresponds to three different kinds of basic linear algebra operations:

- Level 1 BLAS [29]: for vector operations, such as $y \leftarrow \alpha x + y$;
- Level 2 BLAS [16]: for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$;
- Level 3 BLAS [15]: for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$.
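To make the three kinds of operation concrete, here is a minimal pure-Python sketch of one representative per level, assuming the conventional forms y ← αx + y, y ← αAx + βy, and C ← αAB + βC. The function names mirror the usual BLAS routine names but are plain illustrative helpers, not calls into any BLAS library, and they are written for clarity rather than speed.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha*A*x + beta*y.
    A is stored as a list of rows."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha*A*B + beta*C.
    All matrices are stored as lists of rows."""
    b_cols = list(zip(*b))  # columns of B, so each entry is a dot product
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(b_cols, c_row)]
            for row, c_row in zip(a, c)]
```

Each level adds one loop nest over the operands: a single loop for the vector update, two for the matrix-vector product, and three for the matrix-matrix product.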

The performance potential of the three levels of BLAS is strongly related to the ratio of floating-point operations to memory references, as well as to the reuse of data when it is stored in the higher levels of the memory hierarchy. Consequently, the Level 1 BLAS cannot achieve high efficiency on most modern supercomputers. The Level 2 BLAS can achieve near-peak performance on many vector processors. On RISC microprocessors, however, their performance is limited by the memory access bandwidth bottleneck. The greatest scope for exploiting the highest levels of the memory hierarchy as well as other forms of parallelism is offered by the Level 3 BLAS [3].
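The ratio argument can be illustrated numerically. The sketch below uses the usual leading-order counts for operands of order n: 2n flops and 3n memory references for a vector update, 2n² flops and n² references for a matrix-vector product, and 2n³ flops and 4n² references for a matrix-matrix product. These counts and the helper name `blas_ratios` are assumptions introduced here for illustration; they are not stated in the text above.

```python
def blas_ratios(n: int) -> dict:
    """Return the flops-per-memory-reference ratio for one representative
    operation of each BLAS level, using leading-order counts only."""
    counts = {
        # name: (floating-point operations, memory references)
        "Level 1 (axpy)": (2 * n, 3 * n),
        "Level 2 (gemv)": (2 * n * n, n * n),
        "Level 3 (gemm)": (2 * n ** 3, 4 * n * n),
    }
    return {name: flops / refs for name, (flops, refs) in counts.items()}


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, blas_ratios(n))
```

Under these counts the Level 1 and Level 2 ratios stay fixed at 2/3 and 2 regardless of n, while the Level 3 ratio grows as n/2. That growth is what allows data held in cache or registers to be reused many times.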

The previous reasoning applies to distributed-memory computational environments in two ways. First, in order to achieve overall high performance, it is necessary to express the bulk of the computation local to each process in terms of Level 3 BLAS operations. Second, designing and developing a set of parallel BLAS (PBLAS) for distributed-memory concurrent computers should lead to an efficient and straightforward port of the LAPACK software. This is the path followed by the ScaLAPACK project [8, 18] as well as others [1, 7, 12, 20]. As part of the ScaLAPACK project, a set of PBLAS was designed and developed early on [11, 9].

Sat Feb 1 08:18:10 EST 1997