
Processor Performance and Total Performance

The ScaLAPACK routines have been specifically designed to distribute the computational load evenly and thus to achieve the highest possible performance. Consequently, the overall execution time is largely determined by the floating-point rate (flop/s) that the slowest processor in the machine configuration can achieve.

This behavior is easy to observe when some factor slows down one particular processor in the system. Consider, for instance, a ten-processor configuration in which nine of the processors can deliver a peak performance of 100 megaflop/s (Mflop/s) but the tenth can achieve only 20 Mflop/s. (Even on a homogeneous system, different versions of the operating system or different memory capacities, I/O traffic, or simply another user's program can easily cause such a performance degradation.) Because the work is divided evenly, every processor ends up waiting for the tenth one to finish its share, so the overall ScaLAPACK peak performance of this ten-processor machine is limited to 10 × 20 = 200 Mflop/s, whereas the nine 100-Mflop/s processors alone would deliver 900 Mflop/s. In other words, the most heavily loaded processor controls the execution time. The implications are clear: if a user's code runs on nine unloaded processors and one processor with a load factor of 5, one can observe a speedup of no more than a factor of 10/5 = 2.
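The arithmetic above can be reproduced with a small model. This is a minimal sketch under simplifying assumptions (perfectly even work split, no communication cost, a processor with load factor k delivering 1/k of its unloaded rate); the function names `effective_rate` and `speedup_with_load` are hypothetical and not part of ScaLAPACK.

```python
from fractions import Fraction


def effective_rate(rates, total_work=Fraction(1)):
    """Aggregate rate when total_work is split evenly over len(rates) processors.

    Assumption: no communication or synchronization cost. Each processor gets
    total_work / p, and the run ends when the slowest share finishes.
    """
    p = len(rates)
    share = Fraction(total_work) / p
    elapsed = max(share / Fraction(r) for r in rates)
    return Fraction(total_work) / elapsed


def speedup_with_load(load_factors):
    """Speedup over one unloaded processor.

    Assumption: a processor with load factor k gives the program only 1/k of
    its cycles (k = 1 means unloaded). The unloaded rate is normalized to 1,
    so the serial time equals the total work (1) and the speedup is 1 / elapsed.
    """
    rates = [Fraction(1, k) for k in load_factors]
    return effective_rate(rates)


print(effective_rate([100] * 9 + [20]))  # 200 (Mflop/s)
print(effective_rate([100] * 9))         # 900 (Mflop/s)
print(speedup_with_load([1] * 9 + [5]))  # 2
```

Exact fractions are used so that the bounds come out as clean integers rather than floating-point approximations.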

Similarly, some systems allow multiple processes to be spawned on a single processor. In that case, performance is limited by the most heavily loaded processor, presumably the one running the most processes. For example, if 10 processes are spawned on 9 identical processors, one processor must run two of them, and the speedup is limited to 10/2 = 5.
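The same reasoning applies when processes outnumber processors. The sketch below assumes equal work per process, placement as even as possible, and perfectly fair time-sharing with no context-switch overhead, so it gives an upper bound rather than a prediction. The function name `speedup_oversubscribed` is hypothetical.

```python
from fractions import Fraction


def speedup_oversubscribed(num_processes, num_processors):
    """Upper bound on speedup when equal-work processes share identical processors."""
    # With the most even placement, the busiest processor hosts ceil(n / p) processes.
    busiest = -(-num_processes // num_processors)
    # Each process carries 1/n of the work, so the busiest processor needs
    # busiest / n of the serial time; the speedup is the reciprocal.
    return Fraction(num_processes, busiest)


print(speedup_oversubscribed(10, 9))  # 5
print(speedup_oversubscribed(9, 9))   # 9
print(speedup_oversubscribed(18, 9))  # 9
```

Note that adding a tenth process to nine processors drops the bound from 9 to 5, while doubling to 18 processes restores it to 9 because the load is balanced again.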

Beyond the direct effect of giving a program only a portion of the total cycles, machine load can have several indirect effects. If each processor is scheduled independently, performance can be arbitrarily poor, because significant progress is possible only when all processes are scheduled concurrently. A loaded machine may also cause a program's data to be swapped out to disk, which can greatly reduce the performance actually achieved.
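Why independent scheduling can make performance arbitrarily poor can be illustrated with a deliberately crude toy model, which is an assumption and not a description of any real scheduler: in each time slice, every process independently holds its CPU with probability q, and a slice counts as useful only if all processes are running. The useful fraction is then q to the power of the number of processes, which shrinks toward zero as the process count grows. The function names are hypothetical.

```python
import random


def all_scheduled_fraction(num_processes, q):
    """Toy model: probability that all processes run in the same time slice,
    assuming each one independently holds its CPU with probability q."""
    return q ** num_processes


def simulate(num_processes, q, slices=100_000, seed=0):
    """Monte Carlo check of the same toy model."""
    rng = random.Random(seed)
    useful = 0
    for _ in range(slices):
        # A slice is useful only if every process happens to be scheduled.
        if all(rng.random() < q for _ in range(num_processes)):
            useful += 1
    return useful / slices


for p in (4, 16, 64):
    print(p, round(all_scheduled_fraction(p, 0.9), 4), simulate(p, 0.9))
```

Real message-passing programs block on communication rather than simply losing slices, so the numbers are only indicative; the point is that the penalty compounds with the number of processes.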



Jack Dongarra
Sat Feb 1 08:18:10 EST 1997