The layout of matrix blocks in the local memory of a process, and the arrangement of matrix elements within each block, can also affect performance, and several competing factors must be balanced. When communicating matrix blocks, for example in the broadcast steps of the factorization, we would like the data in each block to be contiguous in physical memory so that there is no need to pack them into a communication buffer before sending. On the other hand, when updating the trailing submatrix, E, each process multiplies a column of blocks by a row of blocks to perform a rank-r update on the part of E that it contains. If this were done as a series of separate block-block matrix multiplications, as shown in Figure 18(a), performance would be poor except for sufficiently large block sizes, r, since the vector and/or pipeline units on most processors would not be fully utilized, as may be seen in Figure 17 for the i860 processor. Instead, we arrange the loops of the computation as shown in Figure 18(b). Now, if the data are laid out in physical memory first by running over the i index and then over the d index, the inner two loops can be merged, so that the length of the inner loop becomes r times the number of row blocks held by the process. This generally results in much better vector/pipeline performance. The b and j loops in Figure 18(b) can also be merged, giving the algorithm shown in Figure 18(c). This is just the outer product form of the multiplication of a tall, narrow matrix (r columns) by a short, wide matrix (r rows), and would usually be performed by a call to the Level 3 BLAS routine xGEMM, of which an assembly-coded sequential version is available on most machines. Note that in Figure 18(c) the order of the inner two loops is appropriate for a Fortran implementation; for the C language this order should be reversed, and the data should be stored in each process by rows instead of by columns.
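The loop merging described above can be sketched in C (so, per the remark above, with the inner loop order reversed relative to the Fortran pseudocode of Figure 18). The block counts, block size, and function names below are illustrative, not taken from the original code: a minimal sketch of versions (a) and (c) of the rank-r update, assuming a square arrangement of blocks per process.

```c
#include <assert.h>

/* Illustrative sizes (not from the paper): a 3x3 arrangement of
   blocks per process, block size r = 4. */
#define NB 3          /* blocks per side held by this process */
#define R  4          /* block size r */
#define N  (NB * R)   /* side of the local trailing submatrix */

/* Version (a) of Figure 18: the rank-r update E -= A*B done as a
   series of block-block multiplications.  A holds the local column
   of blocks (N x R), B the local row of blocks (R x N).  The
   innermost loop has length R only. */
static void update_blocked(double E[N][N], double A[N][R], double B[R][N]) {
    for (int bi = 0; bi < NB; bi++)
        for (int bj = 0; bj < NB; bj++)
            for (int i = bi * R; i < (bi + 1) * R; i++)
                for (int j = bj * R; j < (bj + 1) * R; j++)
                    for (int k = 0; k < R; k++)
                        E[i][j] -= A[i][k] * B[k][j];
}

/* Version (c): the same update with the block loops merged away,
   as a single outer-product loop nest over the whole local
   submatrix; the innermost loop now has length N = NB*R.  (The two
   inner loops are ordered for C's row-major storage; a Fortran
   version would interchange them.)  This loop nest is the
   computation that a call to the Level 3 BLAS routine xGEMM would
   perform. */
static void update_merged(double E[N][N], double A[N][R], double B[R][N]) {
    for (int k = 0; k < R; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                E[i][j] -= A[i][k] * B[k][j];
}
```

Both versions visit each (i, j, k) triple exactly once and so produce identical results; only the loop structure, and hence the vector length seen by the hardware, differs.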
Figure 17: Performance of the assembly-coded Level 3 BLAS matrix multiplication routine DGEMM on one i860 processor of the Intel Delta system. Results for square and rectangular matrices are shown. Note that the peak performance of about 35 Mflops is attained only for matrices whose smallest dimension exceeds 100. Thus, performance is improved if a few large matrices are multiplied by each process, rather than many small ones.
Figure 18: Pseudocode for different versions of the rank-r update performed by one process. Each process holds a fixed number of row blocks and of column blocks; r is the block size. Blocks are indexed by (b,d), and elements within a block by (i,j). In version (a) the blocks are multiplied one at a time, giving an inner loop of length r. (b) shows the loops rearranged before merging the i and d loops, and the j and b loops. This leads to the outer product form of the algorithm shown in (c), in which the length of the inner loop is r times the number of row blocks held by the process.
We have found in our work on the Intel iPSC/860 hypercube and the Delta system that it is better to optimize the sequential matrix multiplication with an (i,d,j,b) ordering of memory in each process than to adopt an (i,j,d,b) ordering in order to avoid buffer copies when communicating blocks. There is a further reason for this choice. On most distributed memory computers the message startup cost is large enough that it is preferable, wherever possible, to send data as one large message rather than as several smaller ones. Thus, the blocks to be broadcast would be amalgamated into a single message in any case, which requires a buffer copy. The emerging Message Passing Interface (MPI) standard provides support for noncontiguous messages, so in the future the need to avoid buffer copies will be of less concern to the application developer.
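MPI expresses such noncontiguous messages through derived datatypes: a strided column of blocks can be described once with MPI_Type_vector(count, blocklength, stride, MPI_DOUBLE, &newtype), committed, and handed directly to a broadcast, so any gathering happens inside the library. The plain-C sketch below (no MPI calls, with a hypothetical helper name and illustrative sizes) shows the buffer copy that the application must otherwise perform by hand, and whose semantics MPI_Type_vector captures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.  Pack `count`
   segments of `blocklen` doubles, taken every `stride` doubles
   from `src`, into the contiguous buffer `dst`.  This is exactly
   the copy that MPI_Type_vector(count, blocklen, stride,
   MPI_DOUBLE, &newtype) lets the MPI library perform internally,
   sparing the application an explicit packing step. */
static void pack_strided(double *dst, const double *src,
                         int count, int blocklen, int stride) {
    for (int c = 0; c < count; c++)
        for (int k = 0; k < blocklen; k++)
            dst[c * blocklen + k] = src[(size_t)c * stride + k];
}
```

For example, sending the top r rows of each of n columns of a column-major local array with leading dimension lda corresponds to count = n, blocklen = r, stride = lda; with a derived datatype the same description is passed to the library and the explicit copy disappears.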