For the sake of illustration, let the size of A and B be
(i.e., **n = 8**), and let the number of (virtual) processors be **p = 4**.
The following is a possible sequence of actions that the programmer could do
using the tool.

After examining the data dependences within the program segment as reported
by the tool, let us assume that the programmer decides to partition A
by column and B by row. The tool computes the internal mapping:\
A$(1) = A(1:8, 1:2) and B$(1) = B(1:2, 1:8).

A$(2) = A(1:8, 3:4) and B$(2) = B(3:4, 1:8).

A$(3) = A(1:8, 5:6) and B$(3) = B(5:6, 1:8).

A$(4) = A(1:8, 7:8) and B$(4) = B(7:8, 1:8).

To determine the communication necessary, the tool uses Algorithm COMM, shown in Figure 13.8. For simple partitioning schemes as found in many applications, the communication computed by algorithm COMM can be parameterized by processor number, that is, evaluated once for an arbitrary processor. In addition, we are also investigating other methods to speed up the algorithm.

**Figure 13.8:** Algorithm to Determine the Communication Induced by the Data
Partitioning Scheme

Consider program P1 for example. According to algorithm COMM, when the
**k**th processor executes the first statement, the required communication is
given by

where the range of **i** and **j** are determined by the section of the LHS
owned by processor **k**, in this case and (since A
is partitioned columnwise). But the partitioning of A ensures that
, the data is always local to
**k**. The set of pairs will, therefore, be an empty set for
any **k**. Thus, the execution of the first statement with A partitioned by
column requires no communication.

When the **k**th processor executes the second statement, the communication as
computed by algorithm COMM is given by

The ranges of **i** and **j** are determined by the section of the LHS that is
owned by processor **k**: in this case and (since B
is partitioned rowwise). The second and third terms will be ,
because the row partitioning of B ensures that , the data
is always local to **k**. The first term can be
a nonempty set, because processor **k** owns a column of A (i.e., **j** in
the range ), while the range of **j** in the first term is .
Thus, communication may be required to get the nonlocal element of A before
the **k**th processor can proceed with the computation of its .
The dependence from the definition of to its use is
loop-independent. Algorithm COMM therefore computes `commlevel`, the
common nesting level of the source and sink of the dependence, to be the
level of the inner **i** loop. The section translated to the
level of the inner **i** loop is simply the single element .
Thus, each message communicates this single element and the communication
occurs within the inner **i** loop.

The execution of program P1 results in a large number of messages because each message only communicates a single element of A, and the communication occurs within the inner loop. Message startup and transmission costs are specified by the target machine parameters, and the average cost of each message is determined from the performance model. The tool computes the communication cost by multiplying the number of messages by the average cost of sending a single element message. This cost estimate is returned to the programmer.

Now consider the program P2, with the same partitioning scheme for A and B.
When the **k**th processor executes the first statement, the required
communication as determined by algorithm COMM is given by

where the range of **j** is determined by the section of the LHS
owned by processor **k**, in this case (since A is
partitioned columnwise). Note that in this case, . This is because `commlevel` is now the level of the
outer **j** loop, so that the section must be translated to
the level of the **j** loop. In other words, the reference to )
in the first statement results in an access of the first seven elements of
the **j**th column of A, during each iteration of the **j** loop. Since A is
partitioned columnwise, this section will always be available locally in
each processor, so that the above set is empty and no communication is
required.

When processor **k** executes the second statement, the communication required
is given by

The second and third terms will be empty sets since the required part of B is
local to each **k** (because B is partitioned rowwise). The first term will
be nonempty, because each processor owns , and
the range of **j** in the first term is outside the range . The
data required by processor **k** from processor **q** will therefore be a strip
, from each .

This data can be communicated between the two inner `do i` loops. Each
message will communicate a size strip of A. Fewer exchanges will
be required compared to program P1, because each exchange now communicates a
strip of A, and the communication occurs outside the inner loop. Once again,
the performance model and target machine parameters are used by the tool to
estimate the total communication cost, and this cost is returned to the
programmer.

For most target machines, the communication cost in program P2 will be considerably less than in program P1, because of larger message size and fewer messages.

Next, let us consider program P3. Assuming that the same partitioning scheme
is used for A and B, the execution of the first loop by the *k*th processor
will require communication given by

But this is an empty set because of the column partitioning of A. Here
, because `commlevel`
for this case is the level of the subroutine that contains the two loops.
The section is, therefore, translated to this level by substituting the
appropriate bounds for **i** and **j**. The translated section indicates that the
reference in the first statement results in an access of the
section during all iterations of the outer
**j** loop that are executed by processor **k**.

When the **k**th virtual processor executes the second loop, the required
communication is

The second and third terms will be empty sets because of the row
partitioning of B. The first term will be nonempty, and the data required by
processor **k** from processor **q** will be the block ,
, for each . This block can be communicated between
the two `do j` loops.

This communication can be done between the two loops, allowing computation within each of the two loops to proceed in parallel. The number of messages is the fewest for this case because a block of A is communicated during each exchange. Program P3 is thus likely to give superior performance compared to P1 or P2, on most machines. We ran programs P1, P2 and P3 with A partitioned by column and B by row, on 16 processors of the nCUBE at Caltech. The functions and consisted of one and two double-precision floating-point operations, respectively. The results of the experiment are shown in Figure 13.9. The graphs clearly illustrate the performance improvement that occurs due to reduction in number of messages and increase in length of each message.

**Figure 13.9:** Timing Results for Programs P1, P2 and P3 on the nCUBE, Using 16
Processors.

Wed Mar 1 10:19:35 EST 1995