In this section, we consider the data layout of dense matrices on distributed-memory machines, with the goal of making dense matrix computations as efficient as possible. We shall discuss a sequence of data layouts, starting with the simplest, most obvious, and least efficient one and working up to the more complicated but efficient layout that ScaLAPACK ultimately uses. Even though our justification is based on Gaussian elimination, the analysis of many other algorithms has led to the same set of layouts. As a result, these layouts have been standardized as part of the High Performance Fortran standard [91], with corresponding data declarations as part of that language.

The two main issues in choosing a data layout for dense matrix computations are

- load balance, or splitting the work reasonably evenly among the processors throughout the algorithm, and
- use of the Level 3 BLAS during computations on a single processor, to account for the memory hierarchy on each processor.

It will help to remember the pictorial representation of Gaussian elimination below. As the algorithm proceeds, it works on successively smaller square southeast corners of the matrix. In addition, there is extra Level 2 BLAS work to factorize the submatrix *A*(*ib*:*n*, *ib*:*end*).

**Figure 4.2:** Gaussian elimination using Level 3 BLAS
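The blocked, right-looking elimination of figure 4.2 can be written out in a few lines. Below is a pure-Python sketch of LU factorization without pivoting (a simplifying assumption for illustration); a real implementation would call the Level 2 and Level 3 BLAS, e.g. via LAPACK's DGETRF, rather than scalar loops.

```python
# Sketch of right-looking blocked Gaussian elimination (LU, no pivoting).
# At each step the panel A[ib:n, ib:end] is factorized (Level 2 BLAS work
# in a real code) and the square southeast corner A[end:n, end:n] is then
# updated with a rank-NB matrix multiply (Level 3 BLAS work).
def blocked_lu(A, NB):
    """Overwrite the dense list-of-lists matrix A with its LU factors."""
    n = len(A)
    for ib in range(0, n, NB):
        end = min(ib + NB, n)
        # Panel factorization of A[ib:n, ib:end] (BLAS 2 in practice).
        for k in range(ib, end):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]            # multipliers: a column of L
                for j in range(k + 1, end):
                    A[i][j] -= A[i][k] * A[k][j]
        # Triangular solve for the U block row A[ib:end, end:n].
        for k in range(ib, end):
            for i in range(k + 1, end):
                for j in range(end, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # Rank-NB update of the southeast corner (BLAS 3 in practice).
        for i in range(end, n):
            for j in range(end, n):
                for k in range(ib, end):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

In exact arithmetic this produces the same *L* and *U* as the unblocked algorithm; the point of blocking is only to recast most of the floating-point work as matrix-matrix multiplication.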

For convenience we will number the processes from 0 to *P*-1, and matrix columns (or rows) from 1 to *N*. The following two figures show the sequence of data layouts we will consider. In all cases, each submatrix is labeled with the number of the process (from 0 to 3) that contains it. Process 0 owns the shaded submatrices.

**Figure 4.3:** The one-dimensional block and cyclic column distributions

Consider the layout illustrated on the left of figure 4.3, the **one-dimensional block column distribution**. This distribution assigns a block of contiguous columns of the matrix to successive processes. Each process receives only one block of columns of the matrix. Column *k* is stored on process ⌊(*k*-1)/*tc*⌋, where *tc* = ⌈*N*/*P*⌉ is the maximum number of columns stored per process. In the figure, *N*=16 and *P*=4. This layout does not permit good load balancing for the above Gaussian elimination algorithm because as soon as the first *tc* columns are complete, process 0 is idle for the rest of the computation. The transpose of this layout, the **one-dimensional block row distribution**, has a similar shortcoming for dense computations.

The second layout, illustrated on the right of figure 4.3, the **one-dimensional cyclic column distribution**, addresses this problem by assigning column *k* to process (*k*-1) mod *P*. In the figure, *N*=16 and *P*=4. With this layout, each process owns approximately 1/*P* of the square southeast corner of the matrix, so the load balance is good. However, since single columns (rather than blocks) are stored, we cannot use the Level 2 BLAS to factorize *A*(*ib*:*n*, *ib*:*end*) and may not be able to use the Level 3 BLAS to update *A*(*ib*:*n*, *end*+1:*n*). The transpose of this layout, the **one-dimensional cyclic row distribution**, has a similar shortcoming.
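The two column-to-process maps above can be compared directly. Here is a small sketch using the figure's values *N*=16 and *P*=4 (so *tc*=4); the helper names are ours, not from any library.

```python
# Owner of matrix column k (numbered 1..N) under the two one-dimensional
# layouts of figure 4.3, for N = 16 columns and P = 4 processes.
N, P = 16, 4
tc = (N + P - 1) // P        # ceil(N/P): max columns stored per process

def block_owner(k):          # one-dimensional block column distribution
    return (k - 1) // tc

def cyclic_owner(k):         # one-dimensional cyclic column distribution
    return (k - 1) % P

print([block_owner(k) for k in range(1, N + 1)])
# -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print([cyclic_owner(k) for k in range(1, N + 1)])
# -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

Once elimination has passed column *tc* = 4, `block_owner` never assigns remaining work to process 0, while `cyclic_owner` keeps all four processes busy until the last few columns.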

The third layout, shown on the left of figure 4.4, the **one-dimensional block-cyclic column distribution**, is a compromise between the two distribution schemes shown in figure 4.3. We choose a block size *NB*, divide the columns into groups of size *NB*, and distribute these groups in a cyclic manner. This means column *k* is stored in process ⌊(*k*-1)/*NB*⌋ mod *P*. In fact, this layout includes the first two as the special cases *NB*=*tc* and *NB*=1, respectively. In the figure, *N*=16, *P*=4, and *NB*=2. For *NB* larger than 1, this layout has a slightly worse load balance than the one-dimensional cyclic column distribution, but can use the Level 2 BLAS and Level 3 BLAS for the local computations. For *NB* less than *tc*, it has a better load balance than the one-dimensional block column distribution, but can call the BLAS only on smaller subproblems; hence, it takes less advantage of the local memory hierarchy. Moreover, this layout has the disadvantage that the factorization of *A*(*ib*:*n*, *ib*:*end*) will take place on one process (in the natural situation where column blocks in the layout correspond to column blocks in Gaussian elimination), thereby representing a serial bottleneck.
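The block-cyclic map, and the way it degenerates to the two earlier layouts, can be sketched as follows (figure values *N*=16, *P*=4, *NB*=2; the function name is ours):

```python
N, P = 16, 4

def block_cyclic_owner(k, NB):
    # One-dimensional block-cyclic column distribution:
    # group columns into blocks of NB, deal the blocks out cyclically.
    return ((k - 1) // NB) % P

print([block_cyclic_owner(k, 2) for k in range(1, N + 1)])
# -> [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]

# NB = 1 recovers the cyclic layout; NB = tc = 4 recovers the block layout.
assert all(block_cyclic_owner(k, 1) == (k - 1) % P for k in range(1, N + 1))
assert all(block_cyclic_owner(k, 4) == (k - 1) // 4 for k in range(1, N + 1))
```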

**Figure 4.4:** The one-dimensional block-cyclic column- and the two-dimensional
block-cyclic distributions

This serial bottleneck is eased by the fourth layout, shown on the right of figure 4.4, the **two-dimensional block cyclic distribution**. Here, we think of our *P* processes arranged in a rectangular *Pr* × *Pc* array of processes, indexed in a two-dimensional fashion by (*pi*, *pj*), with 0 ≤ *pi* < *Pr* and 0 ≤ *pj* < *Pc*. All the processes with a fixed *pj* are referred to as process column *pj*. All the processes with a fixed *pi* are referred to as process row *pi*. Thus, this layout includes all the previous layouts, and their transposes, as special cases. In the figure, *N*=16, *P*=4, *Pr*=*Pc*=2, and *MB*=*NB*=2. This layout permits *Pr*-fold parallelism in any column, and calls to the Level 2 BLAS and Level 3 BLAS on local subarrays. Finally, this layout also features good scalability properties, as shown in [61].

The two-dimensional block cyclic distribution scheme is the data layout that is used in the ScaLAPACK library for dense matrix computations.
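Entry ownership under the two-dimensional map simply applies the one-dimensional block-cyclic formula independently to rows and columns. A sketch with the figure's parameters (a 2 × 2 process grid with row and column block sizes *MB*=*NB*=2; the names are ours):

```python
# Two-dimensional block cyclic distribution: matrix entry (i, j), with
# i, j numbered 1..N, lives on process (pi, pj) in a Pr x Pc grid, where
# rows are blocked by MB and columns by NB.
Pr, Pc = 2, 2
MB, NB = 2, 2

def owner(i, j):
    pi = ((i - 1) // MB) % Pr    # block-cyclic over rows
    pj = ((j - 1) // NB) % Pc    # block-cyclic over columns
    return (pi, pj)

# Top-left 4 x 4 corner of the N = 16 example in figure 4.4:
for i in range(1, 5):
    print([owner(i, j) for j in range(1, 5)])
# -> [(0, 0), (0, 0), (0, 1), (0, 1)]
#    [(0, 0), (0, 0), (0, 1), (0, 1)]
#    [(1, 0), (1, 0), (1, 1), (1, 1)]
#    [(1, 0), (1, 0), (1, 1), (1, 1)]
```

Because the block pattern repeats cyclically in both dimensions, every process row touches every block row of the matrix, which is what provides the parallelism within each column as the factorization shrinks toward the southeast corner.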

Tue May 13 09:21:01 EDT 1997