SINGLE PROCESSOR LOW-LEVEL BENCHMARKS
The single-processor low-level benchmarks provided by PARKBENCH aim to
measure performance parameters that characterise the basic architecture of
the computer, and the compiler software through which it is used. For this
reason, such benchmarks have also been called appropriately basic
Following the methodology of Euroben,
the aim is that these hardware/compiler parameters will be used in performance
formulae that predict the timing and performance of the more complex
They are therefore a set of synthetic benchmarks
to measure theoretical parameters that describe the severity of some overhead
or potential bottleneck, or the properties of some item of hardware.
The fundamental measurement in any benchmarking is the measurement of elapsed
wall-clock time. Because the computer clocks on each node of a multi-node MPP
are not synchronized, all benchmark time measurements must be made with a
single clock on one node of the system. The benchmarks TICK1 and TICK2 have,
respectively, been designed to measure the resolution and to check the absolute
value of this clock. These benchmarks should be run with satisfactory results
before any further benchmark measurements are made.
All of these low-level kernels are available in the current
Timer resolution (TICK1).
TICK1 measures the interval between ticks of the clock being used
in the benchmark measurements, that is to say, the resolution of
the clock. A succession of calls to the timer routine is inserted
in a loop and executed many times. The differences between successive
values given by the timer are then examined. If the changes in the
clock value (or ticks) occur less frequently than the time taken to
enter and leave the timer routine, then most of these differences
will be zero. When a tick takes place, however, a difference equal
to the tick value will be recorded, surrounded by many zero differences.
This is the case with clocks of poor resolution, for example most UNIX
clocks, which typically tick every 10 ms. Such poor UNIX clocks can still
be used for low-level benchmark measurements if the benchmark is
repeated, say, 10,000 times, and the timer calls are made outside this
repeat loop. With some computers, such as the
CRAY series, the clock ticks every cycle of the computer, that is to say
every 6 ns on the Y-MP. The resolution of the CRAY clock is therefore
more than a million times better than that of a UNIX clock, and that is
quite a difference! If TICK1 is used on such a computer, the difference
between successive values of the timer is a very accurate measure of
how long it takes to execute the instructions of the timer routine, and
therefore is never zero. TICK1 takes the minimum of all such differences,
and all it is possible to say is that the clock tick is less than or
equal to this value. Typically this minimum will be several hundred
clock ticks. With a clock ticking every computer cycle, we can make
low-level benchmark measurements without a repeat loop. Such measurements
can even be made on a busy timeshared system (where many users are
contending for memory access) by taking the minimum time recorded from
a sample of, say, 10,000 single execution measurements. In this case,
the minimum can usually be said to apply to a case when there was no
memory access delay caused by other users.
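
The measurements described above can be sketched as follows. This is only an illustration in Python, not the TICK1 source: the function names, the default counts, and the use of time.perf_counter as the clock under test are all assumptions made for the example.

```python
import time

def timer_differences(clock=time.perf_counter, calls=100_000):
    """Call the clock back to back and return the successive differences."""
    stamps = [clock() for _ in range(calls)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

def estimate_tick(diffs):
    """Return (upper bound on the tick, fraction of zero differences).

    With a coarse clock most differences are zero and the smallest
    nonzero difference is the tick itself. With a fine clock no
    difference is zero and the minimum is only an upper bound: it is
    the cost of one timer call.
    """
    nonzero = [d for d in diffs if d > 0]
    zero_fraction = 1.0 - len(nonzero) / len(diffs)
    tick_bound = min(nonzero) if nonzero else None
    return tick_bound, zero_fraction

def time_with_repeat_loop(work, repeats=10_000, clock=time.perf_counter):
    """Coarse clock: time the whole repeat loop, timer calls outside it."""
    start = clock()
    for _ in range(repeats):
        work()
    return (clock() - start) / repeats

def time_by_minimum(work, samples=10_000, clock=time.perf_counter):
    """Fine clock: time single executions and keep the minimum."""
    best = float("inf")
    for _ in range(samples):
        start = clock()
        work()
        best = min(best, clock() - start)
    return best
```

A large fraction of zero differences suggests a coarse clock, for which time_with_repeat_loop is the appropriate strategy; a fraction of zero suggests a clock fine enough for time_by_minimum.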
Timer value (TICK2).
TICK2 confirms that the absolute values returned by the computer clock
are correct, by comparing its measurement of a given time interval
with that of an external wall-clock (actually the benchmarker's
wristwatch). Parallel benchmark performance can only be measured
using the elapsed wall-clock time, because the objective of
parallel execution is to reduce this time. Measurements made with a
CPU-timer (which only records time when its job is executing in the CPU)
are clearly incorrect, because the clock does not record waiting time
when the job is out of the CPU. TICK2 will immediately detect the
incorrect use of a CPU-time-for-this-job-only clock. An example
of a timer that claims to measure elapsed time but is actually a
CPU-timer is the value returned by the popular Sun UNIX timer ETIME.
TICK2 also checks that the correct multiplier is being used in the
computer system software to convert clock ticks to true seconds.
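
The difference between an elapsed-time clock and a CPU-timer can be seen in a short sketch. It is not the TICK2 program: here a sleep of known length stands in for the interval read off the wristwatch, and the function names and the 5% tolerance are assumptions chosen for illustration.

```python
import time

def compare_clocks(interval=0.5):
    """Idle for `interval` seconds and report what each clock recorded."""
    wall_start = time.perf_counter()   # elapsed wall-clock time
    cpu_start = time.process_time()    # CPU time of this process only
    time.sleep(interval)               # the job is out of the CPU here
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def agrees_with_reference(measured, reference, tolerance=0.05):
    """True if the measured interval matches the external reference
    to within the given relative tolerance."""
    return abs(measured - reference) <= tolerance * reference
```

The wall-clock value returned by compare_clocks should be close to the requested interval, while the CPU-time value should be close to zero, because the CPU-timer does not record the time spent waiting. A timer that fails agrees_with_reference against an independently measured interval is either a CPU-timer or is using the wrong multiplier.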
Basic arithmetic operations (RINF1).
This benchmark takes a set of common Fortran DO-loops and analyzes
their time of execution in terms of two parameters, RINF and NHALF.
RINF is the asymptotic performance rate in Mflop/s, which
is approached as the loop (or vector) length, n, becomes longer. NHALF
(the half-performance length) expresses how rapidly, in terms of
increasing vector length, the actual performance, r, approaches RINF.
It is defined as the vector length required to achieve a performance
of one half of RINF.
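
One common way to obtain the two parameters, assumed here because the text above does not spell out the formula, is the linear timing model t = (n + NHALF)/RINF, which gives r = RINF/(1 + NHALF/n) and hence r = RINF/2 at n = NHALF. Under that assumption, a least-squares straight line through measured (n, t) points yields both parameters. The sketch below is illustrative Python, not the RINF1 code, and its function and argument names are hypothetical.

```python
def fit_rinf_nhalf(lengths, times, flops_per_element=1):
    """Fit t = intercept + slope * n and return (r_inf, n_half).

    r_inf is in flop/s (divide by 1e6 for Mflop/s); n_half is the
    vector length at which the rate reaches r_inf / 2.
    """
    m = len(lengths)
    mean_n = sum(lengths) / m
    mean_t = sum(times) / m
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx                    # seconds per additional element
    intercept = mean_t - slope * mean_n  # startup time at n = 0
    r_inf = flops_per_element / slope
    n_half = intercept / slope
    return r_inf, n_half

def rate(n, r_inf, n_half):
    """Performance predicted by the model for vector length n."""
    return r_inf / (1.0 + n_half / n)
```

The intercept of the fitted line is the startup time and the inverse slope is the asymptotic rate, so NHALF is simply the startup time expressed as an equivalent number of vector elements.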
Memory bottleneck benchmarks (POLY1 and POLY2).
Even if the vector lengths are long enough to overcome the vector
startup overhead, the peak rate of the arithmetic pipelines may not
be realised because of the delays associated with obtaining data from
the cache or main memory of the computer. The POLY1 and POLY2 benchmarks
quantify this dependence of computer performance on memory access
The POLY1 benchmark repeats the polynomial evaluation for each order,
typically 1000 times, for vector lengths up to 10,000, which would
normally fit into the cache of a cache-based processor. Except for
the first evaluation, the data will therefore be found in the cache.
POLY1 is therefore an in-cache test of the memory bottleneck
between the arithmetic registers of the processor and its cache.
POLY2, on the other hand, flushes the cache prior to each different
order and then performs only one polynomial evaluation,
for vector lengths from 10,000 up to 100,000, which would normally
exceed the cache size. Data will have to be brought from off-chip
memory, and POLY2 is an out-of-cache test of the memory
bottleneck between off-chip memory and the arithmetic registers.
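
The kind of kernel being timed can be sketched as below. The text above says only that a polynomial is evaluated for each order; the use of Horner's rule, the function names, and the way the arithmetic is counted are assumptions made for this illustration, which is written in Python and is not the POLY1 or POLY2 source.

```python
def poly_eval(coeffs, x):
    """Evaluate sum(coeffs[k] * xi**k) for every xi in x by Horner's rule.

    coeffs[0] is the constant term, so the order is len(coeffs) - 1.
    """
    result = []
    for xi in x:
        acc = coeffs[-1]
        for c in reversed(coeffs[:-1]):
            acc = acc * xi + c   # one multiply and one add per order
        result.append(acc)
    return result

def flops_per_memory_reference(order):
    """Arithmetic per vector element moved, assuming the coefficients
    stay in registers: 2 * order flops for one load and one store."""
    return (2 * order) / 2
```

Raising the order increases the arithmetic done per element fetched, so a sweep over orders shows how much computation is needed before memory access stops limiting the rate.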
Last Modified May 14, 1996