The ScaLAPACK software assumes that the user's input data has been distributed on a two-dimensional grid of processes according to the block cyclic scheme. For a given number of processes, the parameters of this family of data distributions are the shape of the process grid and the size of the block used to partition and distribute the matrix entries over the process grid. These parameters affect the number of messages exchanged during the operation, the aggregated volume of data communicated, and the computational load balance. These factors have a significant impact on the efficiency achieved by a ScaLAPACK driver routine.
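The mapping itself is simple integer arithmetic. The following sketch (Python, 0-based indices, whereas ScaLAPACK itself is Fortran and 1-based; the function names are ours, though `numroc` mirrors the logic of ScaLAPACK's NUMROC tool routine with source process 0 assumed) shows which process owns a given matrix entry and how many rows or columns each process holds under a two-dimensional block-cyclic distribution:

```python
def owner(i, j, mb, nb, prow, pcol):
    """Process coordinates owning global entry (i, j) of a matrix
    distributed with mb x nb blocks over a prow x pcol process grid."""
    return (i // mb) % prow, (j // nb) % pcol

def numroc(n, nb, iproc, nprocs):
    """Number of rows (or columns) of an n-row (n-column) matrix owned
    by process coordinate iproc, out of nprocs processes in that grid
    dimension; same logic as ScaLAPACK's NUMROC, source process 0."""
    nblocks = n // nb              # number of full blocks
    num = (nblocks // nprocs) * nb # full blocks every process receives
    extra = nblocks % nprocs       # leftover full blocks, dealt cyclically
    if iproc < extra:
        num += nb                  # one extra full block
    elif iproc == extra:
        num += n % nb              # the final, possibly partial, block
    return num
```

For example, a 10 x 10 matrix with 2 x 2 blocks on a 2 x 3 grid gives `numroc(10, 2, 0, 3) == 4` for the first process column: the six column blocks are dealt out cyclically, so processes in column 0 receive two of them.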
Most linear algebra algorithms perform a succession of elementary transformations on the rows or columns of a matrix. Consequently, most of the communication operations performed within a ScaLAPACK routine involve process rows or columns. If one assumes that roughly the same number of communication operations occurs in each dimension of the process grid, then a square two-dimensional process grid clearly offers the greatest scope for parallelizing these communication operations.
Nevertheless, a few exceptions to this rule exist. First, if the user's interconnection network physically supports only one processor communicating at a time (e.g., Ethernet), then better performance will be achieved on a one-dimensional process grid. Indeed, the smaller number of larger messages exchanged on a one-dimensional grid prevents the competition for network resources from becoming a critical performance factor. Second, for a small number (say, eight) of processes, it is often slightly preferable to select a one-dimensional process grid, simply because there are too few processes for a two-dimensional arrangement to yield a noticeable benefit.
Most of the computation in the ScaLAPACK routines is performed in a blocked fashion using Level 3 BLAS, as is done in LAPACK. The logical computational blocking factor used within the Level 3 PBLAS may differ from the distribution block size. Consequently, the performance of the ScaLAPACK library is not very sensitive to the physical distribution block size, as long as extreme cases are avoided: a very large distribution blocking factor leads to computational load imbalance. The chosen logical block size determines the amount of workspace needed on every process. This workspace is typically large enough to contain a logical block of rows or columns of the distributed matrix operand. Therefore, the larger the logical block size, the greater the necessary workspace or, equivalently, the smaller the problem that can be solved on a given grid of processes. For Level 3 BLAS block-partitioned algorithms, one dimension of the matrix operands is locally equal to the logical block size. It is therefore good practice to choose the logical block size to be the problem size for which the BLAS matrix-multiply routine achieves approximately 90% of its peak performance.
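The effect of an overly large distribution block size on load balance can be quantified with a small experiment. The sketch below (Python; the matrix size, grid shape, and function names are made up for illustration, though the per-process block count follows the same logic as ScaLAPACK's NUMROC routine) computes the ratio of the largest local submatrix to the average local submatrix for several distribution block sizes:

```python
def local_count(n, nb, p, nprocs):
    """Rows (or columns) of an n-row matrix owned by process coordinate
    p under a block-cyclic distribution with block size nb (NUMROC logic,
    source process 0 assumed)."""
    nblocks = n // nb
    num = (nblocks // nprocs) * nb
    extra = nblocks % nprocs
    if p < extra:
        num += nb
    elif p == extra:
        num += n % nb
    return num

def imbalance(n, nb, prow, pcol):
    """Largest local submatrix divided by the average local submatrix
    for an n x n matrix on a prow x pcol grid; 1.0 is perfect balance."""
    loads = [local_count(n, nb, p, prow) * local_count(n, nb, q, pcol)
             for p in range(prow) for q in range(pcol)]
    return max(loads) * (prow * pcol) / (n * n)

n, prow, pcol = 4000, 2, 4            # hypothetical 4000 x 4000 matrix, 2 x 4 grid
for nb in (32, 64, 256, 1000, 2000):
    print(f"NB = {nb:5d}  imbalance = {imbalance(n, nb, prow, pcol):.2f}")
```

With NB = 2000 on this grid, each process row holds one 2000-row block but only two of the four process columns receive any columns at all, so half the processes are idle and the imbalance ratio reaches 2.0; moderate block sizes keep it close to 1.0.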