On a distributed-memory computer the application programmer is responsible for decomposing the data over the processes of the computer. The way in which a matrix is distributed over the processes has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The current implementation of ScaLAPACK assumes that matrices are distributed according to the block-cyclic decomposition scheme. The block-cyclic distribution provides a simple yet general-purpose way of distributing a block-partitioned matrix on distributed-memory concurrent computers. The High Performance Fortran standard [23, 25] provides a block-cyclic data distribution as one of its basic data distributions.

Assuming a two-dimensional block cyclic data distribution,
an `M` by `N` matrix is first
decomposed into `MB` by `NB` blocks
starting at its upper left corner. These blocks
are then uniformly distributed across the process
grid. Thus every process owns a collection of
blocks, which are locally and contiguously stored
in a two dimensional ``column major'' array.
Fig. 2 shows the partitioning of a 9-by-9 matrix into 2-by-2 blocks, and Fig. 3 shows how these 2-by-2 blocks are mapped onto a 2-by-3
process grid, i.e., `M` = `N` = 9 and `MB` = `NB` = 2.
The local entries of every matrix column are
contiguously stored in the processes' memories.
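The mapping just described can be sketched one dimension at a time: a global index is grouped into blocks of size `MB` (or `NB`), and the blocks are dealt out round-robin over the process rows (or columns). The following Python sketch is illustrative only; the function name and 0-based indexing are our assumptions, not ScaLAPACK's actual API.

```python
def owner_and_local(g, nb, p):
    """One dimension of a block-cyclic mapping.

    g  -- global 0-based index
    nb -- block size (MB for rows, NB for columns)
    p  -- number of processes along this dimension

    Returns (process coordinate, local 0-based index).
    """
    block = g // nb           # which global block the index falls in
    proc = block % p          # blocks are distributed cyclically
    local_block = block // p  # how many earlier blocks this process holds
    return proc, local_block * nb + g % nb

# Example matching the text: 9-by-9 matrix, 2-by-2 blocks, 2-by-3 process grid.
M = N = 9
MB = NB = 2
P, Q = 2, 3
pr, li = owner_and_local(4, MB, P)  # row index 4
pc, lj = owner_and_local(7, NB, Q)  # column index 7
# Global entry (4, 7) lives on process (pr, pc) at local position (li, lj).
```

Applying the row formula and the column formula independently gives the 2D owner: the process coordinates are the two `proc` values, and the local storage position is the pair of local indices within that process's column-major array.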

**Figure:** Mapping of matrix onto 2-by-3 process grid

For further details on data distributions, refer to [17].

Thu Jul 25 15:38:00 EDT 1996