Achieving High Performance on a Distributed-Memory Computer
 
 
- Use an efficient data distribution.
- Block size (i.e., MB and NB) = 64.
- Square processor grid, Pr = Pc.
 
- Use an efficient machine-specific BLAS (not the Fortran 77 reference implementation from netlib) and a nondebug BLACS (BLACSDBGLVL=0 in Bmake.inc).
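
The data-distribution advice above can be illustrated with a small sketch. It is not part of ScaLAPACK: the function names `near_square_grid` and `local_count` are hypothetical, and the second one only mirrors the kind of calculation ScaLAPACK's NUMROC utility performs, under the assumption that the first block lives on process 0. It picks a process grid as close to square as the process count allows, then reports how many rows or columns each process owns under a block-cyclic distribution.

```python
import math


def near_square_grid(p):
    """Return (pr, pc) with pr * pc == p and pr <= pc, as close to square as possible.

    Hypothetical helper: assumes p >= 1.
    """
    pr = math.isqrt(p)
    # Walk down from floor(sqrt(p)) until we hit a divisor of p.
    while p % pr != 0:
        pr -= 1
    return pr, p // pr


def local_count(n, nb, iproc, nprocs):
    """Rows (or columns) of an n-long dimension owned by process iproc.

    Assumes a block-cyclic distribution with block size nb and the first
    block assigned to process 0.
    """
    nblocks, extra = divmod(n, nb)       # full blocks, plus a trailing partial block
    count = (nblocks // nprocs) * nb     # every process gets this many full rounds
    rem = nblocks % nprocs               # leftover full blocks go to processes 0..rem-1
    if iproc < rem:
        count += nb
    elif iproc == rem:
        count += extra                   # the partial block lands on the next process
    return count


if __name__ == "__main__":
    pr, pc = near_square_grid(16)
    print("grid:", pr, "x", pc)
    # Rows of a 1000-row matrix held by each process row, with MB = 64.
    print([local_count(1000, 64, i, pr) for i in range(pr)])
```

With 16 processes this yields a 4 x 4 grid. A process count with no divisor near its square root (a prime, for instance) forces a 1 x P grid, which is one reason the choice of process count matters.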