MULTI PROCESSOR LOW-LEVEL BENCHMARKS
The PARKBENCH suite of benchmark programs provides low-level benchmarks
to characterize the basic communication properties of an MPP by measuring
the parameters RINF and NHALF for communication (COMMS1, COMMS2, COMMS3). The
ratio of arithmetic speed to communication speed (the hardware+compiler
parameter FHALF for communication) is measured by the POLY3 benchmark.
The ability to synchronize the processors in a large MPP, in an acceptable
time, is a key characteristic of such computers, and the SYNCH1 benchmark
measures the number of barrier statements that can be executed per second
as a function of the number of processors taking part in the barrier.
All of these low-level kernels are available in the current
Communication benchmarks (COMMS1 and COMMS2).
The COMMS1, or pingpong, benchmark
measures the basic communication properties of a message-passing MIMD
computer. A message of variable length, N, is sent from a master node to
a slave node. The slave node receives the message into a Fortran data array,
and immediately returns it to the master. Half the time for this message
pingpong is recorded as the time, T, to send a message of length N.
In the COMMS2 benchmark there is a message exchange in which two nodes
simultaneously send messages to each other and return them. In this case
advantage can be taken of bidirectional links, and a greater bandwidth
can be obtained than is possible with COMMS1. In both benchmarks, the time
as a function of message length is fitted by least squares using the
parameters RINF and NHALF to the following linear timing model:
    T = (N + NHALF)/RINF
where the communication rate is given by
    R = RINF/(1 + NHALF/N) = PI0*N/(1 + N/NHALF)
and the startup time is
    T0 = NHALF/RINF = 1/PI0
PI0 is known as the specific performance. In general,
we may say that RINF characterizes the long-message
performance and PI0 the short-message performance. The COMMS1 benchmark
computes all four of the above parameters, RINF, NHALF, T0, and PI0,
because each emphasizes a different aspect of performance. However only
two of them are independent. In the case that there are different modes
of transmission for messages shorter or longer than a certain length, the
benchmark can read in this breakpoint and perform a separate least-squares
fit for the two regions. An example is the Intel iPSC/860 which has a different
message protocol for messages shorter than and longer than 100 bytes.
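
The fit described above can be sketched in a few lines of Python. This is only an illustration, not the PARKBENCH source: it assumes an ordinary (unweighted) least-squares fit of the straight line T = T0 + N/RINF, and the function names fit_comms and fit_with_breakpoint, as well as the synthetic timings, are hypothetical.

```python
def fit_comms(lengths, times):
    """Fit T = (N + NHALF)/RINF by ordinary least squares.

    lengths: message lengths N (bytes); times: one-way times T (seconds).
    Assumption: an unweighted straight-line fit of T against N.
    """
    count = len(lengths)
    mean_n = sum(lengths) / count
    mean_t = sum(times) / count
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx                 # slope of the line is 1/RINF
    t0 = mean_t - slope * mean_n      # intercept is the startup time T0
    rinf = 1.0 / slope
    return {"RINF": rinf, "NHALF": t0 * rinf, "T0": t0, "PI0": 1.0 / t0}


def fit_with_breakpoint(lengths, times, breakpoint):
    """Fit short and long messages separately (each side needs >= 2 points)."""
    short = [(n, t) for n, t in zip(lengths, times) if n <= breakpoint]
    long_ = [(n, t) for n, t in zip(lengths, times) if n > breakpoint]
    return fit_comms(*zip(*short)), fit_comms(*zip(*long_))


if __name__ == "__main__":
    # Synthetic data: T0 = 50 microseconds, RINF = 2e6 byte/s, so NHALF = 100 bytes.
    lengths = [8, 64, 512, 4096, 32768]
    times = [50e-6 + n / 2e6 for n in lengths]
    print(fit_comms(lengths, times))
```

Only RINF and NHALF come out of the fit itself; T0 and PI0 are derived from them, which is the sense in which only two of the four parameters are independent.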
Total saturation bandwidth (COMMS3).
To complement the above communication benchmarks, there is a need for
a benchmark to measure the total saturation bandwidth of the complete
communication system, and to see how this scales with the number
of processors. A natural generalization of the COMMS2 benchmark
is made as follows, and called the COMMS3 benchmark:
Each processor of a P-processor system sends a message of length N
to the other (P-1) processors. Each processor then waits to receive
the (P-1) messages directed at it. The timing of this generalized
pingpong ends when all messages have been successfully received by all
processors. The process is repeated many times to obtain an accurate
measurement, and the overall time is divided by the number of repeats.
The time for the generalized pingpong is the time to
send P(P-1) messages of length N and can be analysed in the same way
as COMMS1 and COMMS2 into values of RINF and NHALF. The value obtained for
RINF is the required total saturation bandwidth, and we are interested in how
this scales up as the number of processors P increases and with it the
number of available links in the system.
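
As a rough sketch of the COMMS3 analysis, the same straight-line fit can be applied to the total data volume moved in one generalized pingpong. The source does not say exactly which quantity is regressed, so fitting T against the total volume P(P-1)N is an assumption made here; the function name comms3_saturation and the timings are hypothetical.

```python
def comms3_saturation(num_procs, lengths, times):
    """Estimate total saturation bandwidth from generalized-pingpong timings.

    Assumption: T is fitted against the total volume M = P*(P-1)*N,
    using the same model as COMMS1/COMMS2, T = (M + NHALF)/RINF.
    Returns (RINF, NHALF) in byte/s and bytes of total volume.
    """
    volumes = [num_procs * (num_procs - 1) * n for n in lengths]
    count = len(volumes)
    mean_m = sum(volumes) / count
    mean_t = sum(times) / count
    sxx = sum((m - mean_m) ** 2 for m in volumes)
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in zip(volumes, times))
    slope = sxy / sxx                  # 1/RINF
    intercept = mean_t - slope * mean_m
    rinf = 1.0 / slope
    return rinf, intercept * rinf


if __name__ == "__main__":
    # Synthetic timings for P = 4: startup 100 microseconds, 8e6 byte/s in total.
    lengths = [64, 512, 4096, 32768]
    times = [100e-6 + 4 * 3 * n / 8e6 for n in lengths]
    rinf, nhalf = comms3_saturation(4, lengths, times)
    print(f"total saturation bandwidth ~ {rinf:.3g} byte/s")
```

Repeating the fit for several values of P and comparing the resulting RINF values shows how the saturation bandwidth scales with the number of processors.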
Communication bottleneck (POLY3).
POLY3 assesses the severity of the communication bottleneck.
It is the same as the POLY1 benchmark except that the data for the
polynomial evaluation is stored on a neighbouring processor. The value of
FHALF obtained therefore measures the ratio of arithmetic to communication
performance. The computational intensity
of the calculation must be significantly greater than FHALF (say 4 times
greater) if communication is not to be a bottleneck. In this case
the computational intensity is the ratio of arithmetic performed on a
processor to words transferred to/from it over communication links.
In the common case that the amount of arithmetic is proportional to
the volume of a region, and the data communicated is proportional to
the surface of the region, the computational intensity is increased as
the size of the region (or granularity of the decomposition) is increased.
Then the FHALF obtained from this benchmark is directly related
to the granularity that is required to make communication time unimportant.
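
The volume-to-surface argument can be made concrete with a small calculation. The sketch below assumes a cubic region of side n grid points, arithmetic proportional to n**3, and communication proportional to the 6*n**2 surface points; the per-point costs, the function names, and the FHALF value are all hypothetical, and FHALF is taken to be in the same units as the computational intensity (arithmetic operations per word transferred).

```python
def computational_intensity(side, flops_per_point, words_per_face_point):
    """Arithmetic per word communicated for a cubic region of the given side."""
    arithmetic = flops_per_point * side ** 3           # proportional to volume
    words = words_per_face_point * 6 * side ** 2       # proportional to surface
    return arithmetic / words


def min_side(fhalf, flops_per_point, words_per_face_point, safety=4.0):
    """Smallest side whose intensity is at least `safety` times FHALF."""
    side = 1
    while computational_intensity(side, flops_per_point,
                                  words_per_face_point) < safety * fhalf:
        side += 1
    return side


if __name__ == "__main__":
    # Hypothetical values: FHALF = 10, 8 operations per point, 1 word per face point.
    print(min_side(10.0, 8, 1))
```

With these assumed costs the intensity grows linearly with the side (4n/3), so the required region size is directly proportional to FHALF.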
Synchronization benchmarks (SYNCH1).
SYNCH1 measures the time to execute a barrier synchronization statement
as a function of the number of processes taking part in the barrier.
The practicability of massively parallel computation with thousands
or tens of thousands of processors depends on this barrier time not
increasing too fast with the number of processors. The results are quoted
both as a barrier time, and as the number of barrier statements
executed per second (barr/s).
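
The measurement and the conversion between the two quoted figures can be illustrated as follows. This is a thread-based analogue using Python's threading.Barrier, not the SYNCH1 code, and it says nothing about barrier performance on a real MPP; the function name measure_barrier and the repeat count are hypothetical.

```python
import threading
import time


def measure_barrier(num_threads, repeats=1000):
    """Return (barrier time in seconds, barriers per second) for num_threads."""
    barrier = threading.Barrier(num_threads)
    elapsed = [0.0] * num_threads

    def worker(rank):
        barrier.wait()                      # line everyone up before timing
        start = time.perf_counter()
        for _ in range(repeats):
            barrier.wait()
        elapsed[rank] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    barrier_time = max(elapsed) / repeats   # slowest participant sets the time
    return barrier_time, 1.0 / barrier_time


if __name__ == "__main__":
    for count in (2, 4, 8):
        t, rate = measure_barrier(count)
        print(f"{count} threads: {t:.3e} s per barrier, {rate:.0f} barr/s")
```

The barr/s figure is simply the reciprocal of the barrier time, so the two results carry the same information in different units.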
Last Modified May 14, 1996