The total number of floating-point operations performed by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C N^3$, where $C$ is a constant and N is the order of the largest matrix operand. For solving linear equations or linear least squares problems, $C$ depends solely on the selected algorithm. The algorithms used to find eigenvalues and singular values are iterative; hence, for these operations, the constant also depends on the input data. It is, however, customary to quote ``standard'' values of the constant that assume a fixed number of iterations. The ``standard'' constants range from 1/3 to 27, as shown in Table 4.
The performance of the ScaLAPACK drivers is thus bounded above by the performance of a computation that could be partitioned into p independent chunks of $C N^3 / p$ flops each. This upper bound is referred to hereafter as the peak performance and can be computed as the product of p and the highest reachable local processor flop rate. Hence, for a given problem size N and assuming a uniform distribution of the computational tasks, the most important factors determining the overall performance are the number p of processors involved in the computation and the local processor flop rate.
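The bound above can be illustrated with a small calculation. The following is only a sketch: it takes the flop count to be $C N^3$ (with $C$ the algorithm-dependent constant), and the function names, the two example constants (well-known leading-order counts, not values copied from Table 4), and the machine parameters are all assumptions chosen for illustration.

```python
# Leading-order flop constants C for a dense problem of order N.
# Illustrative textbook values, not a reproduction of the document's table.
FLOP_CONSTANTS = {
    "cholesky": 1.0 / 3.0,  # Cholesky factorization: about N^3 / 3 flops
    "lu": 2.0 / 3.0,        # LU factorization: about 2 N^3 / 3 flops
}


def total_flops(c, n):
    """Approximate flop count C * N^3."""
    return c * n ** 3


def peak_performance(p, local_rate):
    """Upper bound on the aggregate rate: p times the local flop rate (flop/s)."""
    return p * local_rate


def min_time(c, n, p, local_rate):
    """Lower bound on run time: total flops divided by the peak performance."""
    return total_flops(c, n) / peak_performance(p, local_rate)


if __name__ == "__main__":
    # Hypothetical machine: 64 processes, each sustaining 1 Gflop/s.
    n, p, rate = 10_000, 64, 1.0e9
    for name, c in FLOP_CONSTANTS.items():
        print(name, total_flops(c, n), min_time(c, n, p, rate))
```

Any measured run time can then be compared against `min_time` to see how far a run is from the peak performance.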
In a serial computational environment, transportable efficiency is the essential motivation for developing blocking strategies and block-partitioned algorithms [2, 3, 14, 27]. The linear algebra package (LAPACK) is the archetype of such a strategy. The LAPACK software is built, as far as possible, from calls to the Basic Linear Algebra Subprograms (BLAS). These kernels confine the impact of machine architecture differences to a small number of routines. The efficiency and portability of the LAPACK software are then achieved by combining native, efficient BLAS implementations with portable high-level components.
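The BLAS are subdivided into three levels, each of which offers increased scope for exploiting parallelism. This subdivision corresponds to three different kinds of basic linear algebra operations: vector operations (Level 1), matrix-vector operations (Level 2), and matrix-matrix operations (Level 3).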
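As a minimal sketch, one representative operation per level can be written as plain loops. The function names below borrow the usual BLAS routine stems (`axpy`, `gemv`, `gemm`), but the signatures are simplified assumptions for illustration and do not reproduce the actual BLAS calling interface.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha * A x + beta * y."""
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha * A B + beta * C."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a dot product
    return [
        [
            alpha * sum(aik * bkj for aik, bkj in zip(row_a, col_b)) + beta * cij
            for col_b, cij in zip(cols_b, row_c)
        ]
        for row_a, row_c in zip(a, c)
    ]
```

Matrices are represented here as lists of rows; an optimized BLAS would instead operate on contiguous arrays and block the loops for the memory hierarchy.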
The performance potential of the three levels of BLAS is strongly related to the ratio of floating-point operations to memory references, as well as to the reuse of data when it is stored in the higher levels of the memory hierarchy. Consequently, the Level 1 BLAS cannot achieve high efficiency on most modern supercomputers. The Level 2 BLAS can achieve near-peak performance on many vector processors. On RISC microprocessors, however, their performance is limited by the memory access bandwidth bottleneck. The greatest scope for exploiting the highest levels of the memory hierarchy as well as other forms of parallelism is offered by the Level 3 BLAS.
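The ratio of floating-point operations to memory references can be made concrete with leading-order counts for one representative operation per level. The counts below are commonly quoted estimates for operands of order n (a vector update, a matrix-vector product, and a matrix-matrix product); the function names are hypothetical and the figures ignore lower-order terms.

```python
def flops_and_refs(level, n):
    """Return leading-order (flops, memory references) for order-n operands.

    level 1: y <- alpha*x + y            2n flops,    3n references
    level 2: y <- alpha*A*x + beta*y     2n^2 flops,  n^2 references
    level 3: C <- alpha*A*B + beta*C     2n^3 flops,  4n^2 references
    """
    if level == 1:
        return 2 * n, 3 * n
    if level == 2:
        return 2 * n ** 2, n ** 2
    if level == 3:
        return 2 * n ** 3, 4 * n ** 2
    raise ValueError("level must be 1, 2, or 3")


def flops_per_ref(level, n):
    """Ratio of floating-point operations to memory references."""
    flops, refs = flops_and_refs(level, n)
    return flops / refs


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, flops_per_ref(level, 1000))
```

Under these estimates the ratio stays constant for Levels 1 and 2 but grows linearly with n for Level 3, which is why only the Level 3 operations leave room to reuse data held in cache or registers.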
The previous reasoning applies to distributed-memory computational environments in two ways. First, in order to achieve high overall performance, it is necessary to express the bulk of the computation local to each process in terms of Level 3 BLAS operations. Second, designing and developing a set of parallel BLAS (PBLAS) for distributed-memory concurrent computers should lead to an efficient and straightforward port of the LAPACK software. This is the path followed by the ScaLAPACK project [8, 18] as well as others [1, 7, 12, 20]. As part of the ScaLAPACK project, a set of PBLAS was designed and developed early on [11, 9].
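The first point can be sketched with a serial simulation in which each of p processes owns a block of rows of A and performs one local matrix-matrix product. This is an assumption-laden illustration only: the one-dimensional row-block layout, the replicated copy of B, and all function names are hypothetical and are not claimed to match the data distribution or the PBLAS interface used by ScaLAPACK.

```python
def local_gemm(a_rows, b):
    """Local matrix-matrix product on the rows owned by one process."""
    cols_b = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a_rows]


def row_blocks(n_rows, p):
    """Split range(n_rows) into p contiguous, nearly equal index ranges."""
    base, extra = divmod(n_rows, p)
    start = 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        yield range(start, start + size)
        start += size


def simulated_parallel_gemm(a, b, p):
    """Compute A B by giving each of p simulated processes a block of rows of A."""
    c = []
    for rows in row_blocks(len(a), p):
        a_local = [a[i] for i in rows]     # rows owned by this process
        c.extend(local_gemm(a_local, b))   # one local Level 3-style operation
    return c
```

In a real distributed-memory code the loop over blocks would run concurrently, one iteration per process, and `local_gemm` would be replaced by a call to an optimized Level 3 BLAS routine.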