next up previous
Next: Optimality and Pipelining Tradeoffs Up: OptimizationTuning, and Trade-offs Previous: Layout of Local Process

Tradeoffs between Load Balance and Communication Latency

  We have discussed the mapping of the logical hierarchical memory to physical memory. In addition, we have pointed out the importance of maintaining long inner loops to get good sequential performance for each process, and the desirability of sending a few large messages rather than many smaller ones. We next consider load balance issues. Assuming that equal numbers of processes have been assigned to each processor, load imbalance arises in two phases of the parallel LU factorization algorithm; namely, in factoring each column block, which involves only P processes, and in solving the lower triangular system to evaluate each row block of U, which involves only Q processes. If the time for data movement is negligible, the aspect ratio of the template that minimizes load imbalance in step k of the algorithm is,
 eqnarray596
where tex2html_wrap_inline2349 is the matrix size in blocks, and r the block size. Thus, the optimal aspect ratio of the template should be the same as the aspect ratio of the matrix, i.e., tex2html_wrap_inline2645 in blocks, or M/N in elements. If the effect of communication time is included then we must take into account the relative times taken to locate and broadcast the pivot information, and the time to broadcast the lower triangular matrix, tex2html_wrap_inline1999, along a row of the template. For both tasks the communication time increases with the number of processes involved, and since the communication time associated with the pivoting is greater than that associated with the triangular solve, we would expect the optimum aspect ratio of the template to be less than M/N. In fact, for our runs on the Intel Delta system we found an aspect ratio, P/Q, of between 1/4 and 1/8 to be optimal for most problems with square matrices, and that performance depends rather weakly on the aspect ratio, particularly for large grain sizes. Some typical results are shown in Figure 19 for 256 processors, which show a variation of less than 20% in performance as P/Q varies between 1/16 and 1 for the largest problem.

  figure605
Figure 19: Performance of LU factorization on the Intel Delta as a function of square matrix size for different processor templates containing approximately 256 processors. The best performance is for an aspect ratio of 1/4, though the dependence on aspect ratio is rather weak.

The block size, r, also affects load balance. Here the tradeoff is between the load imbalance that arises as rows and columns of the matrix are eliminated as the algorithm progresses, and communication startup costs. The block cyclic decomposition seeks to maintain good load balance by cyclically assigning blocks to processes, and the load balance is best if the blocks are small. On the other hand, cumulative communication startup costs are less if the block size is large since, in this case, fewer messages must be sent (although the total volume of data sent is independent of the block size). Thus, there is a block size that optimally balances the load imbalance and communication startup costs.


next up previous
Next: Optimality and Pipelining Tradeoffs Up: OptimizationTuning, and Trade-offs Previous: Layout of Local Process

Jack Dongarra
Sun Feb 9 10:05:05 EST 1997