Heterogeneous Networks

There are special challenges associated with writing reliable numerical software on networks containing heterogeneous processors, i.e., processors that may do floating point arithmetic differently. This includes not just machines with completely different floating point formats and semantics (e.g., Cray versus workstations running IEEE standard floating point arithmetic), but even supposedly identical machines running different compilers or merely different compiler options. The basic problem arises from data dependent branches: the same comparison may come out differently on different processors, so that one processor executes a completely different section of code than the others expect. We give three examples of this below.

The simplest example is an iteration whose stopping criterion depends on the machine precision. If the precision varies from processor to processor, different processors will satisfy significantly different stopping criteria. In particular, the criterion on the most accurate processor may never be satisfied if it depends on data computed less accurately by other processors. Many problems of this kind can be eliminated by using the largest machine epsilon among all participating processors. The routine PDLAMCH returns this largest value, replacing the uniprocessor DLAMCH. Similarly, one would use the smallest overflow threshold and largest underflow threshold for other calculations. But this is not a panacea, as the subsequent examples show.
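
The idea of agreeing on the largest machine epsilon can be sketched with a simple max reduction over the locally computed machine parameters. The fragment below is a minimal illustration in C with MPI; it is not the ScaLAPACK implementation, which obtains the value through PDLAMCH and the BLACS rather than MPI.

/* Sketch: agree on "worst case" machine parameters across a
 * heterogeneous group, in the spirit of PDLAMCH.  Uses MPI for
 * illustration only; ScaLAPACK builds on the BLACS instead.        */
#include <float.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double eps_local = DBL_EPSILON;   /* this processor's machine epsilon */
    double eps_global;

    /* Every processor obtains the LARGEST epsilon in the group, so a
     * convergence test such as  error <= tol * eps_global  can be
     * satisfied everywhere, not only on the most accurate machine.   */
    MPI_Allreduce(&eps_local, &eps_global, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);

    printf("local eps = %.3e, group eps = %.3e\n", eps_local, eps_global);

    MPI_Finalize();
    return 0;
}

The smallest overflow threshold and largest underflow threshold would be obtained the same way, with MPI_MIN and MPI_MAX reductions respectively.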

Next, consider the situation where processors sharing a distributed vector compute its two-norm and, depending on the value, either scale the vector by a constant far from 1 or do not. This happens in the inner loop of the QR decomposition, for example. The two-norm is computed by the ScaLAPACK routine PDNRM2, which computes local norms and then performs a reduction. If the participating processors have different floating point formats, they may receive different values of the two-norm on return, simply because the same floating point number cannot be represented exactly on all machines. This two-norm is then compared to a threshold, and scaling takes place only if it exceeds the threshold. Since both the two-norm and the threshold may differ from processor to processor, the result of the comparison could also differ, so that one processor would scale the subvector it owns and another would not. This would very likely lead to erroneous results. It could in principle be corrected by extending the reduction operation in PDNRM2 to broadcast a discrete value (such as the boolean result of the comparison); then all participating processors would agree with the decision made at the root of the reduction tree.
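
The remedy just described can be sketched as follows: reduce the partial sums of squares to a single root processor, let that processor alone compare the norm against the threshold, and broadcast the resulting integer flag. The fragment below is a minimal illustration in C with MPI; decide_scaling is a hypothetical helper, not a ScaLAPACK routine, and PDNRM2 itself works through the BLACS and also guards against overflow in the sum of squares, which this sketch omits.

/* Sketch: make the scale/no-scale decision once, at the root of the
 * reduction, and broadcast it, so heterogeneous processors cannot
 * disagree about whether to scale.                                  */
#include <math.h>
#include <mpi.h>

/* Decide consistently whether a distributed vector should be scaled.
 * x_local holds this processor's piece of the vector (n_local entries). */
int decide_scaling(const double *x_local, int n_local,
                   double threshold, MPI_Comm comm)
{
    double sumsq_local = 0.0, sumsq_global = 0.0;
    for (int i = 0; i < n_local; i++)
        sumsq_local += x_local[i] * x_local[i];

    /* Reduce the partial sums of squares to rank 0 only.            */
    MPI_Reduce(&sumsq_local, &sumsq_global, 1, MPI_DOUBLE,
               MPI_SUM, 0, comm);

    int rank, do_scale = 0;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)               /* the root alone compares against the threshold */
        do_scale = (sqrt(sumsq_global) > threshold);

    /* Broadcast the discrete decision; an integer flag is represented
     * identically on all machines, unlike the floating point norm.  */
    MPI_Bcast(&do_scale, 1, MPI_INT, 0, comm);
    return do_scale;
}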

However, there are still harder problems. Consider bisection for finding the eigenvalues of symmetric matrices. In this algorithm, the real axis is broken into disjoint intervals, which are searched in parallel by different processors for the eigenvalues they contain. The algorithm depends on a function, call it count(a,b), that counts the number of eigenvalues in the half-open interval [a, b). Using count, intervals can be subdivided into smaller intervals containing eigenvalues until the intervals are narrow enough to declare the eigenvalues they contain "found". One problem is that two processors with different floating point formats cannot even agree on the boundary between their intervals, because they cannot store the same floating point number. This could result in multiple copies of eigenvalues if intervals overlap, or in missing eigenvalues if there are gaps between intervals. Furthermore, the count function may count differently on different processors, so an interval [a, b) may be considered to contain 1 eigenvalue by processor A but 0 eigenvalues by processor B, which has been given the interval by processor A during load balancing. This can happen even if processors A and B are identical, if their compilers generate slightly different code sequences for count. We have not yet decided what to do about all of these problems, so we currently guarantee correctness of PDSYEVX only for networks of processors with identical floating point formats (slightly different floating point operations are acceptable). See [4] for details.
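
To make the role of count concrete, the following is a minimal sketch of a Sturm sequence count for a symmetric tridiagonal matrix, written in C for illustration; the names negcount and count_interval are hypothetical, and ScaLAPACK's bisection code is considerably more careful about pivot perturbation, scaling, and overflow. The point of the sketch is that count is built from a long chain of subtractions and divisions, so tiny rounding differences between processors, or between compilers on identical processors, can change its value when an interval endpoint lies very close to an eigenvalue.

/* Sketch: count(a,b) for a symmetric tridiagonal matrix via the
 * standard Sturm sequence (LDL^T inertia) count.  Illustrative only. */
#include <float.h>
#include <math.h>

/* Number of eigenvalues of the tridiagonal matrix with diagonal
 * d[0..n-1] and off-diagonal e[0..n-2] that are less than sigma.     */
int negcount(const double *d, const double *e, int n, double sigma)
{
    int count = 0;
    double t = d[0] - sigma;          /* first pivot of LDL^T of T - sigma*I */
    for (int i = 0; ; i++) {
        if (t < 0.0)                  /* each negative pivot is one eigenvalue below sigma */
            count++;
        if (i == n - 1)
            break;
        if (t == 0.0)                 /* guard against division by zero (crude perturbation) */
            t = DBL_EPSILON * (fabs(e[i]) + 1.0);
        t = d[i + 1] - sigma - e[i] * e[i] / t;
    }
    return count;
}

/* count(a,b): number of eigenvalues in the half-open interval [a, b). */
int count_interval(const double *d, const double *e, int n,
                   double a, double b)
{
    return negcount(d, e, n, b) - negcount(d, e, n, a);
}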


