next up previous contents index
Next: 13.2.2 Overview of the Up: A Software Tool Previous: A Software Tool

13.2.1 Is Any Assistance Really Needed?

The ultimate goal of the programmer is peak performance on the target computer. The realization of peak performance requires the understanding of many subtle relationships between the algorithm, the program, and the target machine architecture. Factors such as input data size, data dependences in the code, target machine characteristics, and the data partitioning scheme are related in very nonintuitive ways, and jointly determine the performance of the program. Thus, a data partitioning scheme that is chosen purely on the basis of some algorithmic property, may not always be the best choice.

Let us examine the relationship between these aspects more closely, to illustrate the subtle complexities that are involved in choosing the partitioning of the data domain. Consider the following program:\

 MMMMMMMMMMMMMMMMMMMMMMMMMM¯example (A, B, N)

double precision A(N, N), B(N, N) *

do k=1, cycles

do j=1,N

do i=2,N-1

A(i, j) = ( B(i-1, j), B(i+1, j) )

enddo

enddo

do j=2,N-1

do i=2,N-1

B(i,j) = ( A(i-1, j), A(i+1, j), A(i, j),

A(i, j-1), A(i, j+1) )

enddo

enddo

enddo

end

and represent functions with 4 and 10 double-precision floating-point operations, respectively. This program segment does not represent any particular realistic computation; rather, it was chosen to illustrate all the aspects of our argument using a small piece of code. The program segment was executed on 64 processors of an nCUBE,  with array sizes ranging from N = 64 to N = 320. A and B were first partitioned as columns, so that each processor was assigned successive columns. The program was then run once again, this time with A and B partitioned as blocks, so that each processor was assigned a block of elements. The resulting execution and communication times for column and block partitioning schemes are shown in Figure 13.6. The communication time was measured by removing all computation in the loops.

  
Figure 13.6: Timing results on an nCUBE, using 64 processors.

When employing a column partitioning scheme for arrays A and B, communication is only necessary after the first j loop. Each processor has to exchange boundary values with its left and right neighbor. In a block partitioning scheme, each processor has to communicate with its four neighbors after the first loop and with its north and south neighbors after the second loop. For small message lengths, the communication cost is dominated by the message startup time, whereas the transmission cost begins to dominate as the messages get longer (i.e., more data is exchanged at each communication step). This explains why communication cost for the column partition is greater than for the block partition for array sizes larger than . It is clear from the graph that column partitioning is preferable when the array sizes are less than , and block partitioning is preferable for larger sizes.

The steps in the execution time graphs are caused mainly by load imbalance effects. For example, the step between N = 128 and N = 129 for the column partition is due to the fact that for size 129 one subdomain has an extra column, so that the processor assigned to that subdomain is still busy after all the others have finished, causing load imbalance in the system. Similar behavior can be observed for the block partition, but here the steps occur at smaller increments of the array size N. The steps in the communication time graphs are due to the fact that the packet size on the nCUBE is , so that messages that are even a few bytes longer need an extra packet to be transmitted.

The above example indicates that several factors contribute to the observed performance of a chosen partitioning scheme, making it difficult for a human to predict this behavior statically. Our aim is to make the programmer aware of these performance effects without having to run the program on the target computer. We hope to do this by providing an interactive tool, that can give performance estimates in response to a data layout specification. The tool's performance estimates will allow the programmer to gauge the effect of a data partitioning scheme and thus provide some guidance in making a better choice.



next up previous contents index
Next: 13.2.2 Overview of the Up: A Software Tool Previous: A Software Tool



Guy Robinson
Wed Mar 1 10:19:35 EST 1995