An important difference between ScaLAPACK and LAPACK is that a parallel computing environment, possibly consisting of a heterogeneous collection of processors, introduces new sources of possible errors not found in the serial environment in which LAPACK runs. These errors could indeed afflict any parallel algorithm that uses floating-point arithmetic. For example, consider the following pseudocode, executed in parallel by several processors:

    s = global_sum(x)   ... each processor receives the sum s of global array x
    if s < thresh then
       return my part of answer 1
    else
       do more computations
       return my part of answer 2
    end if

It is possible for the value of *s* to differ from processor to processor;
we call this *incoherence*.
This can happen if the floating-point arithmetic varies
from processor to processor (we call this *heterogeneity*), since
processors may not even share the same set of floating-point numbers.
The value of *s* can also vary if global_sum accumulates the sum
in different orders on different processors,
since floating-point addition is not associative.
In either case, the
test *s* < *thresh* may be true on one processor but not on another, so
that the program may inconsistently return answer 1 on some processors
and answer 2 on others. If the ``more computations'' include
communication with synchronization, even deadlock could result.
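A minimal illustration of how summation order alone can flip such a branch (this is not ScaLAPACK code; the values and the threshold are chosen purely for demonstration):

```python
# Floating-point addition is not associative: summing the same
# values in two different orders can give different results.
x = [1.0, 1e16, -1e16]

left_to_right = (x[0] + x[1]) + x[2]   # 1.0 + 1e16 rounds to 1e16, so the 1.0 is lost
right_to_left = x[0] + (x[1] + x[2])   # 1e16 - 1e16 is exact, so the 1.0 survives

print(left_to_right)    # 0.0
print(right_to_left)    # 1.0

# Two processors accumulating "the same" global sum in these two
# orders would therefore disagree on a test such as s < thresh:
thresh = 0.5
print(left_to_right < thresh)   # True
print(right_to_left < thresh)   # False
```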

Deadlock can also result if the floating-point numbers communicated from one processor to another cause fatal floating-point errors on the receiving processor. For example, if an IBM RS/6000, running in its default mode, sends a message containing a denormalized number [7, 8] to a DEC Alpha running in its default mode, then the DEC Alpha aborts [19].

It is also possible for global_sum to compute the same *s* on all processors
but compute a different *s* from run to run of the program,
for example, if global_sum computes the sum in a nondeterministic order
on one processor and broadcasts the result to all processors.
We call this *nonrepeatability*.
If this happens, debugging the overall code can be more difficult.

Coherence and repeatability are independent properties of an algorithm. It is possible in principle for an algorithm running on a particular platform to be incoherent and repeatable, coherent and nonrepeatable, or any other combination. On a different platform, the same algorithm may have different properties.

Reference [19] contains a more extensive discussion of these possible errors.

A single run of a ScaLAPACK routine is designed to be as reliable as a run
of its LAPACK counterpart, so that errors due to incoherence cannot occur,
as long as ScaLAPACK is executed on a *homogeneous network* of processors,
defined by the following conditions:

- The processors are completely identical. This also means that relevant flags, like those controlling the way overflow and underflow are handled in IEEE floating-point arithmetic, must be identical.
- The communication library used by the BLACS may only ``copy bits'' and not modify any floating-point numbers (by translation to a different internal floating-point format, as XDR [111] may do).
- The identical ScaLAPACK object code must be executed by each processor.

The above conditions guarantee that a single ScaLAPACK call is as reliable
as its LAPACK counterpart.
If, in addition, identical answers from one run to the next
are desired (i.e., *repeatability*),
this can be guaranteed at runtime by calling
BLACS_SET, which forces the BLACS, and the ScaLAPACK
routines that use them, to use a repeatable topology
(see the BLACS users guide [54] for details).

Maintaining coherence on a heterogeneous network is harder, and not always possible. If floating-point formats differ (say, between a Cray C90 and an IBM RS/6000, which uses IEEE arithmetic), there is no cost-effective way to guarantee coherence. If floating-point formats are the same, however, operations such as global sums can guarantee coherence by accumulating the result on one processor and broadcasting it to the others (except for the problem of DEC Alphas and denormalized numbers mentioned above). The BLACS do this, except when using the ``bidirectional exchange'' topology. One can avoid ``bidirectional exchange'', and so guarantee coherence whenever possible, by calling BLACS_SET to enforce coherence (see the BLACS users guide [54] for details).
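The accumulate-and-broadcast technique can be sketched as follows (an illustrative simulation, not BLACS code; the per-processor data are made up):

```python
# Sketch: when all processors share a floating-point format, a global
# sum can be made coherent by accumulating it on a single processor
# in one fixed order and broadcasting the result, so every processor
# receives the identical bit pattern.
def coherent_global_sum(local_values_per_proc):
    # "Processor 0" gathers every local contribution and sums them
    # in one fixed order...
    total = 0.0
    for local in local_values_per_proc:
        for v in local:
            total += v
    # ...then "broadcasts": each processor gets the same bits.
    return [total] * len(local_values_per_proc)

locals_ = [[1.0], [1e16], [-1e16], [1e-16]]
s = coherent_global_sum(locals_)
assert all(v == s[0] for v in s)   # every processor agrees exactly
```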

Still other ScaLAPACK routines (PxGESVD and PxSYEV) are guaranteed to work only on homogeneous networks. These routines perform large numbers of redundant calculations on all processors and depend on the results of those calculations being identical everywhere. There are too many of these calculations to compute them all on one processor and broadcast the results cost-effectively.

The user may wonder why ScaLAPACK and the BLACS are not designed to
guarantee coherence and repeatability in the most general possible situations,
so that calling BLACS_SET would not be necessary.
The reason is that the possible bugs described above are quite rare,
and so ScaLAPACK and the BLACS were designed to maximize performance instead.
Provided the mere sending of floating-point numbers does not cause a
fatal error, these bugs cannot occur at all in most ScaLAPACK routines,
because branches depending on a supposedly identical floating-point value
like *s* do not occur.
For most other ScaLAPACK routines where such branches do occur,
we have not seen these bugs despite extensive testing, including attempts
to cause them to occur.
Complete understanding and cost-effective
elimination of such possible bugs are future work.

In the meantime, to get repeatability when running on a homogeneous network, we recommend calling BLACS_SET as described above when using the following ScaLAPACK drivers: PxGESVX, PxPOSVX, PxSYEV, PxSYEVX, PxGESVD, and PxSYGVX.

Tue May 13 09:21:01 EDT 1997