MULTI PROCESSOR LOW-LEVEL BENCHMARKS

The PARKBENCH suite provides low-level benchmarks that characterize the basic communication properties of an MPP by measuring the parameters RINF and NHALF for communication (COMMS1, COMMS2, COMMS3). The ratio of arithmetic speed to communication speed (the hardware+compiler parameter FHALF for communication) is measured by the POLY3 benchmark. The ability to synchronize the processors of a large MPP in an acceptable time is a key characteristic of such computers, and the SYNCH1 benchmark measures the number of barrier statements that can be executed per second as a function of the number of processors taking part in the barrier.

All of these low-level kernels are available in the current distribution from the netlib repository.

Communication benchmarks (COMMS1 and COMMS2). The COMMS1, or pingpong, benchmark measures the basic communication properties of a message-passing MIMD computer. A message of variable length, N, is sent from a master node to a slave node. The slave node receives the message into a Fortran data array, and immediately returns it to the master. Half the time for this message pingpong is recorded as the time, T, to send a message of length N. In the COMMS2 benchmark there is a message exchange in which two nodes simultaneously send messages to each other and return them. In this case advantage can be taken of bidirectional links, and a greater bandwidth can be obtained than is possible with COMMS1. In both benchmarks, the time as a function of message length is fitted by least squares, using the parameters RINF and NHALF, to the following linear timing model:
    T = (N + NHALF)/RINF
where the communication rate is given by
    R = RINF/(1 + NHALF/N) = PI0*N/(1 + N/NHALF)
and the startup time is
    T0 = NHALF/RINF = 1/PI0
PI0 is known as the specific performance. In general, we may say that RINF characterizes the long-message performance and PI0 the short-message performance. The COMMS1 benchmark computes all four of the above parameters, RINF, NHALF, T0, and PI0, because each emphasizes a different aspect of performance. However, only two of them are independent. If there are different modes of transmission for messages shorter or longer than a certain length, the benchmark can read in this breakpoint and perform a separate least-squares fit for each of the two regions. An example is the Intel iPSC/860, which uses different message protocols for messages shorter than and longer than 100 bytes.
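The fit described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: it assumes an ordinary unweighted least-squares fit of the straight line T = T0 + N/RINF, and the function name `fit_comms`, the units (byte and second), and the synthetic timings are all hypothetical choices made for the example.

```python
def fit_comms(lengths, times):
    """Fit T = T0 + N/RINF by unweighted least squares.

    lengths: message lengths N (assumed here to be in bytes)
    times:   one-way times T (assumed here to be in seconds)
    Returns the four parameters named in the text.
    """
    count = len(lengths)
    mean_n = sum(lengths) / count
    mean_t = sum(times) / count
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx                 # slope of the line is 1/RINF
    t0 = mean_t - slope * mean_n      # intercept is T0 = NHALF/RINF
    rinf = 1.0 / slope
    nhalf = t0 * rinf
    pi0 = 1.0 / t0
    return {"RINF": rinf, "NHALF": nhalf, "T0": t0, "PI0": pi0}


# Synthetic timings generated from the model itself (not measured data).
true_rinf, true_nhalf = 2.0e6, 400.0
lengths = [8, 64, 512, 4096, 32768]
times = [(n + true_nhalf) / true_rinf for n in lengths]

params = fit_comms(lengths, times)
for name, value in params.items():
    print(name, value)
```

Only the slope and intercept are fitted; NHALF and PI0 follow from them, which is the sense in which only two of the four parameters are independent. Note also that R = RINF/2 when N = NHALF, so NHALF is the message length at which half the asymptotic rate is reached.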
Total saturation bandwidth (COMMS3). To complement the above communication benchmarks, there is a need for a benchmark to measure the total saturation bandwidth of the complete communication system, and to see how this scales with the number of processors. A natural generalization of the COMMS2 benchmark is made as follows, and called the COMMS3 benchmark: Each processor of a P-processor system sends a message of length N to each of the other (P-1) processors. Each processor then waits to receive the (P-1) messages directed at it. The timing of this generalized pingpong ends when all messages have been successfully received by all processors. The process is repeated many times to obtain an accurate measurement, and the overall time is divided by the number of repeats. The time for the generalized pingpong is the time to send P(P-1) messages of length N, and it can be analysed in the same way as for COMMS1 and COMMS2 to give values of RINF and NHALF. The value obtained for RINF is the required total saturation bandwidth, and we are interested in how this scales up as the number of processors P increases and with it the number of available links in the system.
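As a worked example of the bookkeeping involved, the sketch below converts one COMMS3-style measurement into an aggregate transfer rate. The function name `comms3_observed_rate`, the byte/second units, and the input numbers are made up for illustration. The result is the observed rate at a single message length, not RINF itself, which comes from fitting times over a range of lengths.

```python
def comms3_observed_rate(num_procs, msg_bytes, total_time, repeats):
    """Aggregate rate (byte/s) for one message length in a COMMS3-style run.

    total_time is the wall-clock time for `repeats` generalized pingpongs.
    """
    time_per_exchange = total_time / repeats        # average time of one all-to-all
    num_messages = num_procs * (num_procs - 1)      # P(P-1) messages per exchange
    bytes_moved = num_messages * msg_bytes
    return num_messages, time_per_exchange, bytes_moved / time_per_exchange


# Hypothetical numbers: 8 processors, 1024-byte messages,
# 100 repeats taking 0.5 s in total.
messages, exchange_time, rate = comms3_observed_rate(8, 1024, 0.5, 100)
print(messages, exchange_time, rate)
```

Because the message count grows as P(P-1), doubling the number of processors roughly quadruples the traffic in each exchange, which is why this pattern can saturate the communication system.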
Communication bottleneck (POLY3). POLY3 assesses the severity of the communication bottleneck. It is the same as the POLY1 benchmark except that the data for the polynomial evaluation is stored on a neighbouring processor. The value of FHALF obtained therefore measures the ratio of arithmetic to communication performance. The computational intensity of the calculation must be significantly greater than FHALF (say 4 times greater) if communication is not to be a bottleneck. In this case the computational intensity is the ratio of arithmetic performed on a processor to words transferred to/from it over communication links. In the common case that the amount of arithmetic is proportional to the volume of a region, and the data communicated is proportional to the surface of the region, the computational intensity is increased as the size of the region (or granularity of the decomposition) is increased. Then the FHALF obtained from this benchmark is directly related to the granularity that is required to make communication time unimportant.
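The link between FHALF and granularity can be illustrated with a small calculation. The sketch below rests on assumptions that are not part of the benchmark: a cubic subdomain of side `side` grid points, a fixed number of flops per point, and one word exchanged per point on each of the six faces. It also assumes that performance follows the same pipeline form as the communication model, with computational intensity f in place of N, that is r = RINF/(1 + FHALF/f). All function names are hypothetical.

```python
import math


def intensity_cube(side, flops_per_point, words_per_face_point=1.0):
    """Computational intensity (flop/word) of a cubic subdomain."""
    arithmetic = flops_per_point * side ** 3          # work scales with volume
    words = words_per_face_point * 6 * side ** 2      # traffic scales with surface
    return arithmetic / words


def fraction_of_asymptotic_rate(intensity, fhalf):
    """Assumed pipeline model: r/RINF = 1/(1 + FHALF/f)."""
    return 1.0 / (1.0 + fhalf / intensity)


def min_side(fhalf, flops_per_point, safety=4.0, words_per_face_point=1.0):
    """Smallest cube side whose intensity is at least safety * FHALF."""
    target = safety * fhalf
    side = max(1, math.ceil(target * 6 * words_per_face_point / flops_per_point))
    # Guard against floating-point rounding in the closed-form estimate.
    while intensity_cube(side, flops_per_point, words_per_face_point) < target:
        side += 1
    return side


# Hypothetical machine and kernel: FHALF = 10 flop/word, 8 flop per grid point.
side = min_side(fhalf=10.0, flops_per_point=8.0)
print(side, intensity_cube(side, 8.0), fraction_of_asymptotic_rate(40.0, 10.0))
```

Under the assumed model, an intensity of 4 times FHALF corresponds to 80% of the asymptotic rate, which gives one way to read the "say 4 times greater" guideline.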
Synchronization benchmarks (SYNCH1). SYNCH1 measures the time to execute a barrier synchronization statement as a function of the number of processes taking part in the barrier. The practicability of massively parallel computation with thousands or tens of thousands of processors depends on this barrier time not increasing too fast with the number of processors. The results are quoted both as a barrier time, and as the number of barrier statements executed per second (barr/s).
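The shape of such a measurement can be sketched with Python's standard `threading.Barrier`. This is only an analogy: threads on a single machine stand in for the processors of an MPP, the function name `barrier_rate` and the repeat count are hypothetical, and the numbers it produces say nothing about any real parallel computer.

```python
import threading
import time


def barrier_rate(num_threads, repeats=200):
    """Return (barrier time in seconds, barriers per second) for num_threads."""
    barrier = threading.Barrier(num_threads)
    elapsed = [0.0]

    def worker(rank):
        barrier.wait()                      # line everyone up before timing starts
        start = time.perf_counter()
        for _ in range(repeats):
            barrier.wait()                  # the operation being measured
        if rank == 0:
            elapsed[0] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    barrier_time = elapsed[0] / repeats
    return barrier_time, 1.0 / barrier_time


# Barrier time and barr/s as a function of the number of participants.
for count in (1, 2, 4):
    seconds, per_second = barrier_rate(count)
    print(count, seconds, per_second)
```

Reporting both quantities mirrors the text: the barrier time shows how the cost grows with the number of participants, and its reciprocal gives the barr/s figure.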

Last Modified May 14, 1996