
Optimization, Tuning, and Trade-offs

  In this section, we examine techniques for optimizing the basic LU factorization code presented in Section 4.1. Among the issues to be considered are the assignment of processes to physical processors, the arrangement of the data in the local memory of each process, the trade-off between load imbalance and communication latency, the potential for overlapping communication and calculation, and the type of algorithm used to broadcast data. Many of these issues are interdependent. In addition, portability and the ease of code maintenance and use must be taken into account. For further details of the optimization of parallel LU factorization algorithms for specific concurrent machines, together with timing results, the reader is referred to the work of Chu and George [12], Geist and Heath [32], Geist and Romine [33], Van de Velde [48], Brent [8], Hendrickson and Womble [35], Lichtenstein and Johnsson [41], and Dongarra and co-workers [10, 24].

Jack Dongarra
Sun Feb 9 10:05:05 EST 1997