The best grid shape is determined by the algorithm implemented in the library and the underlying physical network. A one link physical network will favor or , where is the process grid. This affects the scalability of the algorithm, but reduces the overhead due to message collisions. It is possible to predict the best grid shape given the number of processes available. The current algorithms for the factorization or reduction routines can be split into two categories.
If at every step of the algorithm a block of columns and/or rows needs to be broadcast, as in the LU or QR factorizations, it is possible to pipeline this communication phase and overlap it with some computation. The direction of the pipeline determines the shape of the grid. For example, the LU, QR and QL factorizations perform better for ``flat'' process grids ( ). These factorizations share a common bottleneck of performing a reduction operation along each column (for pivoting in LU, and for computing a norm in QR and QL). The first implication of this observation is that large latency message passing perform better on a ``flat'' grid than on a square grid. Secondly, after this reduction has been performed, it is important to update the next block of columns as fast as possible. This is done by broadcasting the current block of columns using a ring topology, i.e, feeding the ongoing communication pipe. Similarly, the performance of the LQ and RQ factorizations take advantage of ``tall'' grids ( ) for the same reasons, but transposed.
The theoretical efficiency of the LU factorization can be estimated by (3):
For large n, the last term in the denominator dominates, and it is minimized by choosing a slightly smaller than . works well on Intel machines. For smaller n, the middle term dominates, and it becomes more important to choose a small . Suppose that we keep the ratio constant as P increases, thus we have and , where u and v are constant . Moreover, let us ignore the factor for a moment. In this case, and are proportional to and must grow with P to maintain efficiency. For sufficiently large , the factor cannot be ignored, and the performance will slowly degrade with the number of processors P. This phenomenon is observed in practice as shown in Figure 10 for the efficiency of the LU factorization on the Intel Paragon.
The second group of routines are two-sided algorithms as opposed to one-sided algorithms. In these cases, it is not usually possible to maintain a communication pipeline, and thus square or near square grids are more optimal. This is the case for the algorithms used for implementing the Cholesky factorization, the matrix inversion and the reduction to bidiagonal form (BRD), Hessenberg form (HRD), tridiagonal form (TRD) and the nonsymmetric QR eigenvalue algorithm (HQR). For example, the update phase of the Cholesky factorization of a lower symmetric matrix physically transposes the current block of columns of the lower triangular factor.
Assume now that at most P processes are available. A natural question arising is: could we decide what process grid should be used? Similarly, depending on P, it is not always possible to factor to create the appropriate grid. For example, if P is prime, the only possible grids are and . If such grids are particularly bad for performance, it may be beneficial to let some processors remain idle, so the remainder can be formed into a ``squarer'' grid . These problems can be analyzed by a complicated function of the machine and problem parameters. It is possible to develop models depending on the machine and problem parameters which accurately estimate the impact of modifying the shape of the grid on the total execution time, as well as predicting the necessary amount of extra memory required for each routine.
Figure: Intel Paragon LU and QR Performance
Figure: IBM SP-2 and Intel Paragon Cholesky Factorization
Figure: Execution speed in Mflop/s for the band algorithm on the IBM SP-2.
Figure: Performance of the Parallel QR Algorithm on the Intel MP Paragon for a 2x2 up to a 9x9 grid of processes.
Figure: Performance with different blocksize NB (MB=NB) and prediction of percentage of time to performance operations.
Figure: IBM SP-2 and Intel Paragon Performance