Even if the vector lengths are long enough to overcome the vector
startup overhead, the peak rate of the arithmetic pipelines may not
be realised because of the delays associated with obtaining data from
the cache or main memory of the computer. The POLY1 and POLY2 benchmarks
quantify this dependence of computer performance on memory access
bottlenecks. The computational intensity, **f**, of a DO-loop is defined
as the number of floating-point operations performed per memory
reference to an element of a vector variable [19].
The asymptotic performance, \rinf, of a computer is observed to increase
with the computational intensity, because as **f** becomes larger the
effects of memory access delays become negligible compared to the time
spent on arithmetic. This effect is characterised
by the two parameters (\rhat, \fhalf), where \rhat is the peak hardware
performance of the arithmetic pipeline, and \fhalf is the computational
intensity required to achieve half this rate. That is to say the
asymptotic performance is given by:

\rinf = \frac{\rhat}{(1+\fhalf/f)} (1)

If memory access and arithmetic are not overlapped, then \fhalf can
be shown to be the ratio of arithmetic speed (in Mflop/s) to memory
access speed (in Mword/s) [19]. The parameter \fhalf,
like \nhalf, measures an unwanted overhead
and should be as small as possible. In order to vary
**f** and allow the peak performance to be approached, we choose a kernel
loop that can be computed with maximum efficiency on any hardware. This
is the evaluation of a polynomial by Horner's rule, in which case the
computational intensity is the order of the polynomial, and both the
multiply and add pipelines can be used in parallel. To measure \fhalf,
the order of the polynomial is increased from one to ten, and the measured
performance for long vectors is fitted to Eqn.(1).

The POLY1 benchmark repeats the polynomial evaluation for each order
typically 1000 times for vector lengths up to 10,000, which would
normally fit into the cache of a cache-based processor. Except for
the first evaluation, the data will therefore be found in the cache.
POLY1 is therefore an *in-cache* test of the memory bottleneck
between the arithmetic registers of the processor and its cache.

POLY2, on the other hand, flushes the cache prior to each different
order and then performs only one polynomial evaluation,
for vector lengths from 10,000 up to 100,000, which would normally
exceed the cache size. Data will have to be brought from off-chip
memory, and POLY2 is an * out-of-cache* test of the memory
bottleneck between off-chip memory and the arithmetic registers.

The POLY1 benchmark exists as MOD1G of the EuroBen benchmarks [20]. POLY2 exists as part of the Hockney benchmarks.

Tue Nov 14 15:43:14 PST 1995