On a distributed memory computer the application programmer is responsible for decomposing the data over the processes of the computer. The way in which a matrix is distributed over the processes has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The current implementation of ScaLAPACK assumes that matrices are distributed according to the block-cyclic decomposition scheme. The block-cyclic distribution provides a simple, yet general-purpose way of distributing a block-partitioned matrix on distributed memory concurrent computers. The High Performance Fortran standard [23, 25] provides a block-cyclic data distribution as one of its basic data distributions.
Assuming a two-dimensional block-cyclic data distribution, an M-by-N matrix is first decomposed into MB-by-NB blocks starting at its upper left corner. These blocks are then uniformly distributed across the process grid. Thus every process owns a collection of blocks, which are contiguously stored in a local two-dimensional ``column major'' array. We present in Fig. 2 the partitioning of a 9-by-9 matrix into 2-by-2 blocks. Then in Fig. 3 we show how these 2-by-2 blocks are mapped onto a 2-by-3 process grid, i.e., M=N=9, MB=NB=2. The local entries of every matrix column are contiguously stored in the processes' memories.
Figure: Matrix partitioning
Figure: Mapping of matrix onto 2-by-3 process grid
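The mapping described above can be sketched as follows. This is a small illustrative helper, not part of ScaLAPACK: under a two-dimensional block-cyclic distribution with MB-by-NB blocks on a P-by-Q process grid, global entry (i, j) (zero-based) is owned by process (⌊i/MB⌋ mod P, ⌊j/NB⌋ mod Q), and its position inside that process's local array follows from counting the local blocks that precede it. The function names `owner` and `local_index` are hypothetical.

```python
def owner(i, j, mb, nb, p, q):
    """Process (row, col) owning global entry (i, j) under a 2-D
    block-cyclic distribution with mb-by-nb blocks on a p-by-q grid."""
    return (i // mb) % p, (j // nb) % q

def local_index(i, j, mb, nb, p, q):
    """Zero-based (row, col) of entry (i, j) inside its owner's local array."""
    # Local offset = (index of the local block holding the entry) * block size
    #                + offset within that block.
    li = (i // (p * mb)) * mb + i % mb
    lj = (j // (q * nb)) * nb + j % nb
    return li, lj

# Reproduce the example of Figs. 2 and 3: a 9-by-9 matrix, 2-by-2 blocks,
# mapped onto a 2-by-3 process grid.  Each printed pair is the owning
# (process row, process column) of the corresponding matrix entry.
M = N = 9
MB = NB = 2
P, Q = 2, 3
for i in range(M):
    print(" ".join("(%d,%d)" % owner(i, j, MB, NB, P, Q) for j in range(N)))
```

Note that the owner of an entry depends only on its global block coordinates, so consecutive blocks of a row wrap around the Q process columns cyclically, which is what produces the load balance the text refers to.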
For further details on data distributions, refer to .