From owner-pbwg-method@CS.UTK.EDU  Wed Mar 24 07:15:16 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA10727; Wed, 24 Mar 93 07:15:16 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA11943; Wed, 24 Mar 93 07:15:06 -0500
X-Resent-To: pbwg-method@CS.UTK.EDU ; Wed, 24 Mar 1993 07:15:01 EST
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA11895; Wed, 24 Mar 93 07:14:59 -0500
Via: uk.ac.southampton.ecs; Wed, 24 Mar 1993 12:14:16 +0000
From: R.Hockney@parallel-applications-centre.southampton.ac.uk
Via: calvados.pac.soton.ac.uk (plonk); Wed, 24 Mar 93 12:05:17 GMT
Date: Wed, 24 Mar 93 11:57:31 GMT
Message-Id: <3358.9303241157@calvados.pac.soton.ac.uk>
Apparently-To: pbwg-method@cs.utk.edu

TO METHODOLOGY SUBCOMMITTEE
---------------------------
The following message is my contribution to the draft report of
Chapter2 - Methodology, for consideration. Please let me have
your comments, so that I may modify it.
         Roger Hockney
~r method3.tex
From owner-pbwg-method@CS.UTK.EDU  Wed Mar 24 07:47:54 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA11899; Wed, 24 Mar 93 07:47:54 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA12970; Wed, 24 Mar 93 07:47:47 -0500
X-Resent-To: pbwg-method@CS.UTK.EDU ; Wed, 24 Mar 1993 07:47:46 EST
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA12961; Wed, 24 Mar 93 07:47:38 -0500
Via: uk.ac.southampton.ecs; Wed, 24 Mar 1993 12:41:29 +0000
Via: brewery.ecs.soton.ac.uk; Wed, 24 Mar 93 12:34:15 GMT
From: Prof Roger Hockney <R.W.Hockney@ecs.soton.ac.uk>
Received: from alithea.ecs.soton.ac.uk by brewery.ecs.soton.ac.uk;
          Wed, 24 Mar 93 12:42:39 GMT
Date: Wed, 24 Mar 93 12:42:43 GMT
Message-Id: <271.9303241242@alithea.ecs.soton.ac.uk>
To: pbwg-method@cs.utk.edu
Subject: Methodology Chapter 2 contribution

This is my first draft of a proposed Chapter2 for your comments
and amalgamation with contributions from other members of the
subcommittee. Please let me have your Comments, particularly
any parts that you cannot accept (I hope none, however), so that
I may edit it. I particularly wish to hear from our leader, David.
             Best regards   Roger Hockney.
You should be able to print it using LATEX, and the files benrep1.tex
benref1.bib sent to pbwg-comm alredy.
\chapter{Methodology}

The conclusions drawn from a benchmark study of computer performance
depend not only on the basic timing results obtained, but also on
the way these are interpreted and converted into performance figures.
The choice of the performance metric, may itself influence
the conclusions. For example, do we want the computer that generates
the most megaflops (or has the highest Speedup), or the computer 
that solves the problem in the least time? It is now well known
that high values of the first metrics do not necessarily imply the 
second property. This confusion can be avoided by choosing a more 
suitable metric that reflects solution time directly, for example 
either the Temporal or Simulation performance, defined below. This
issue of the sensible choice of performance metric is becoming
increasing important with the advent of massively parallel computers
which have the potential of very high megaflop rates, but have 
much more limited potential for reducing solution time. 


\section{Time Measurement}

The fundamental measurement made in any benchmark is the time to 
complete some specified task. All other performance figures are derived 
from this basic timing measurement. 
The benchmark time, $T(N;p)$, will be a function of the problem size, $N$,
and the number of processors, $p$. Here, the problem size is 
represented by the vector variable, $N$, 
which stands for a set of parameters characterising the size of the 
problem: e.g. the number of mesh points in each dimension, and the
number of particles in a particle-mesh simulation. Benchmark problems of
different sizes can be created by multiplying all the size parameters by
suitable powers of a single scale factor, thereby increasing the spatial and
particle resolution in a sensible way, and reducing the size parameters to
a single size factor (here called $\alpha$). 

We believe that it is most important
to regard execution time and performance as a function of at least the two
variables $(N,p)$, which define a parameter plane. Much confusion has arisen
in the past by attempts to treat performance as a function of a single
variable, by taking a particular path through this plane, and not stating
what path is taken. Many different paths may be taken, and hence many different
conclusions can be drawn. It is important, therefore, always to define the 
path through the performance plane, or better as we do here, to study the 
shape of the two-dimensional performance hill. In some cases there may 
even be an optimum path up this hill.


\section{Units and Symbols}

A rational set of units and symbols is essential for any numerate
science including benchmarking. The following extension of the
internationally agreed SI system of physical units \cite{SI75} is 
made to accommodate the needs of computer benchmarking.

\medskip
New symbols and units: 
\begin{enumerate}
\item flop : number of floating-point operations
\item mref : number of memory references (reads or writes)
\item barr : number of barrier operations
\item b    : number of binary digits (bits)
\item B    : number of bytes (groups of 8 bits)
\item ${\rm w}_{32}$ : number of words (number of bits per word as 
             subscript, here 32).
\end{enumerate}
Note that flop and mref are both inseparable four-letter symbols.
The character case is significant in all unit symbols so that e.g. Flop, 
Mref, $W_{64}$ are incorrect. Unit symbols should always be printed in 
roman type, to contrast with variables names which are printed in italic.

\medskip
SI provides the standard prefixes:
\begin{enumerate}
\item k    : kilo meaning $10^3$
\item M    : mega meaning $10^6$
\item G    : giga meaning $10^9$
\item T    : tera meaning $10^{12}$
\end{enumerate}
This means that we cannot use M to mean $1024^2$ (the binary mega) as is 
often done in describing computer memory capacity, e.g. 256 MB. We can 
however introduce the new prefix:
\begin{enumerate}
\item K    : meaning 1024, then use a subscript 2 to indicate the binary
             versions
\item ${\rm M}_2$    : binary mega $1024^2$
\item ${\rm G}_2$    : binary giga $1024^3$
\item ${\rm T}_2$    : binary tera $1024^4$
\end{enumerate}
In most cases the difference between the mega and the binary mega (4\%) 
is probably unimportant, but it is important to be unambiguous. In this
way one can continue with existing practice if the difference doesn't 
matter, and have an agreed method of being more exact when necessary.
For example, the above memory capacity was probably intended to mean
$256 {\rm M_2 B}$.

As a consequence of the above, an amount of computational work involving
$4.5 \times 10^{12}$ floating-point operations is correctly written as 
4.5 Tflop. Note that the unit symbol Tflop is never pluralised with an
added 's', and it is therefore incorrect to write the above as 4.5 Tflops 
which could be confused with a rate per second. The most frequently used 
unit of performance, millions of floating-point operations per second 
is correctly written Mflop/s, in analogy to km/s. The slash is necessary 
and means 'per',  because the 'p' is an integral part of the unit symbol 
'flop' and cannot also be used for this purpose.  


\section{Floating-Point Operation Count}

Although we discourage the use of millions of floating-point
operations per second as a performance metric, it can be a useful 
measure if the number of floating-point operations, $F(N)$, 
needed to solve the benchmark problem is carefully defined.

For simple problems (e.g. matrix multiply) it is sufficient to use a 
theoretical value for the floating-point operation count (in this case $2n^3$ 
flop, for nxn matrices) obtained by inspection of the 
code or consideration of the arithmetic in the algorithm. For more complex
problems containing data-dependent conditional statements, an empirical method
may have to be used.  The sequential version of the benchmark code defines
the problem and the algorithm to be used to solve it. Counters can be inserted 
into this code or a hardware monitor used to count the number of floating-point
operations. The latter is the procedure followed by the {\sc PERFECT} Club 
\cite{Berr89}. In either case a decision has to be made regarding the number
of flop that are to be credited for different types of floating-point 
operations, and we see no good reason to deviate from those chosen by 
McMahon \cite{Ma88} when the Mflop/s measure was originally defined. 
These are:

\begin{table}[h]
\centering
\begin{tabular}{ll}
add, subtract, multiply		& 1 flop \\
divide, square-root		& 4 flop \\
exponential, sine etc.		& 8 flop \\
{\sc IF(X .REL. Y)}		& 1 flop \\
\end{tabular}
\end{table}

We distinguish two types of operation count. The first is the nominal 
benchmark floating-point operation count, $F_B(N)$, which is found in the 
above way from the defining Fortran77 sequential code. The other is the
actual number of floating-point operations performed by the hardware
when executing the distributed multi-node version, $F_H(N)$, which may be 
greater than the nominal benchmark count, due to the distributed version 
performing redundant arithmetic operations.


\section{Performance Metrics}

Given the time of execution $T(N;p)$ and the flop-count $F(N)$ several different
performance measures can be defined. Each metric has its own uses, and gives 
different information about the computer and algorithm used in the benchmark.
It is important therefore to distinguish the metrics with different names,
symbols and units, and to understand clearly the difference between them.
Much confusion and wasted work can arise from optimising a benchmark with
respect to an inappropriate metric. The principal performance metrics are:

\subsection{Temporal Performance}

If we are interested in comparing the
performance of different algorithms for the solution of the same problem, then
the correct performance metric to use is the {\it Temporal Performance},
$R_T$, which is defined as the inverse of the execution time
\begin{equation}
                            R_T(N;p)=T^{-1}(N;p)              \label{Eqn(1)}
\end{equation}
The units of temporal performance are, in general, solutions per second
(sol/s), or some more appropriate absolute unit such as 
timesteps per second (tstep/s). With this metric we can be sure
that the algorithm with the highest performance executes in the least time,
and is therefore the best algorithm. We note that the number of flop does not
appear in this definition, because the objective of algorithm design is not
to perform the most arithmetic per second, but rather it is to solve a given
problem in the least time, regardless of the amount of arithmetic involved.
For this reason the temporal performance is also the metric that a 
computer user should employ to select the best algorithm to solve his problem, 
because his objective is also to solve the problem in the least time, and he 
does not care how much arithmetic is done to achieve this.

\subsection{Simulation Performance}

A special case of temporal performance occurs for simulation programs in which
the benchmark problem is defined as the simulation of a certain period of
physical time, rather than a certain number of timesteps. In this case we speak
of the {\em Simulation Performance} and use units such as {\em simulated days
per day} (written sim-d/d or 'd'/d) in weather forecasting, where the 
apostrophe is used to indicate 'simulated'; or {\em simulated
pico-seconds per second} (written sim-ps/s or 'ps'/s) in electronic device
simulation. It is important to use simulation performance rather than
timestep/s if one is comparing different simulation algorithms which may 
require different sizes of timestep for the same accuracy (for example an
implicit scheme that can use a large timestep, compared with an explicit
scheme that requires a much smaller step). In order to maintain numerical
stability, explicit schemes also require the use of a smaller timestep as 
the spatial grid is made finer. For such schemes the simulation performance 
falls off dramatically as the problem size is increased by introducing 
more mesh points in order to refine the spatial resolution: the doubling 
of the number of mesh-points in each of three dimensions can reduce the 
simulation performance by a factor near 16 because the timestep must also 
be approximately halved. Even though the larger problem will generate more 
Megaflop per second, in forecasting, it is the simulated days per day 
(i.e. the simulation performance) and not the Mflop/s, that matter to the user.

As we see below, benchmark performance is also measured in terms of the amount
of arithmetic performed per second or Mflop/s. However it is important to
realise that it is incorrect to compare the Mflop/s achieved by two algorithms 
and to conclude that the algorithm with the highest Mflop/s rating is the best
algorithm. This is because the two algorithms may be performing quite 
different amounts of arithmetic during the solution of the same problem.
The temporal performance metric, $R_T$, defined above, has been introduced
to overcome this problem, and provide a measure that can be used to compare
different algorithms for solving the same problem. However, it should be 
remembered that the temporal performance only has the same meaning within the 
confines of a fixed problem, and no meaning can be attached to a
comparison of the temporal performance on one problem with the temporal
performance on another.

\subsection{Benchmark Performance}

In order to compare the performance of a computer on one benchmark with 
its performance on another, account must be taken of the different amounts of 
work (measured in flop) that the different problems require for their solution.
Using the flop-count for the benchmark, $F_B(N)$, we can
define the {\em Benchmark Performance} as
\begin{equation}
                            R_B(N;p)=F_B(N)/{T(N;p)}           \label{Eqn(2)}
\end{equation}
The units of benchmark performance are Mflop/s (benchmark name), where we 
include the name of the benchmark in parentheses to emphasise that the 
performance may depend strongly on the problem being solved, and to emphasise 
that the values are based on the nominal benchmark flop-count. In other 
contexts such performance figures would probably be quoted as examples of the 
so-called {\em sustained} performance of a computer. We feel that the use of 
this term is meaningless unless the problem being solved and the degree of 
code optimisation is quoted, because the performance is so varied across 
different benchmarks and different levels of optimisation. Hence we favour 
the quotation of a selection of benchmark performance figures, rather than a 
single sustained performance, because the latter implies that the quoted 
performance is maintained over all problems.

Note also that the flop-count $F_B(N)$ is that for the defining sequential 
version of the benchmark, and that the same count is used to calculate $R_B$ 
for the distributed-memory (DM) version of the program, even though the DM 
version may actually perform
a different number of operations.  It is usual for DM programs to perform more
arithmetic than the defining sequential version, because often numbers are
recomputed on the nodes in order to save communicating their values from a
master processor. However such calculations are redundant (they have already
been performed on the master) and it would be incorrect to credit them to the
flop-count of the distributed program. 

Using the sequential flop-count in the
calculation of the DM programs benchmark performance has the additional 
advantage that it is possible to conclude that, for a given benchmark,
the implementation that has the highest benchmark performance is the best 
because it executes in the least time.  This would not necessarily be the 
case if a different $F_B(N)$ were used for different implementations of the 
benchmark. For example, the use of a better algorithm which obtains the
solution with less than $F_B(N)$ operations will show up as higher benchmark
performance. For this reason it should cause no surprise if the benchmark 
performance occasionally exceeds the maximum possible hardware performance. 
To this extent benchmark performance Mflop/s must be understood
to be nominal values, and not necessarily exactly the number of operations
executed per second by the hardware, which is the subject of the next
metric. The purpose of benchmark performance is to compare different
implementations and algorithms on different computers for the solution of
the same problem, on the basis that the best performance means the least
execution time. For this to be true $F_B(N)$ must be kept the same for
all implementations and algorithms.  


\subsection{Hardware Performance}

If we wish to compare the observed performance with the theoretical 
capabilities of the computer hardware, we must compute the actual number of
floating-point operations performed, $F_H(N;p)$, and from it the actual
{\em Hardware Performance} 
\begin{equation}
                            R_H(N;p)=F_H(N)/{T(N;p)}             \label{Eqn(3)}
\end{equation}
The hardware performance also has the units Mflop/s, and will have the same 
value as the benchmark performance for the sequential version of the benchmark. 
However, the hardware performance may be higher than the benchmark performance 
for the distributed version, because the hardware performance gives credit for 
redundant arithmetic operations, whereas the benchmark performance does not.
Because the hardware performance measures the actual floating-point operations
performed per second, unlike the benchmark performance, it can never exceed
the theoretical peak performance of the computer.

Assuming a computer with multiple-CPUs each with multiple arithmetic pipelines,
delivering a maximum of one flop per clock period, the theoretical peak value
of hardware performance is
\begin{equation}
   r^*= \frac{fl.pt.pipes/CPU}{clock.period}\times number.CPUs   \label{Eqn(4)}
\end{equation}
with units of Mflop/s if the clock period is expressed in microseconds. By 
comparing the measured hardware performance, $R_H(N;p)$, with the theoretical 
peak performance, we can assess the fraction of the available performance that 
is being realised by a particular implementation of the benchmark.


\subsection{Speedup, Efficiency and Performance per Node}

We do not favour the use of any of the popular performance metrics: 
speedup, efficiency or performance per node; because all these are 
either easily misinterpreted or obscure important effects. 
The speedup of a benchmark code 
is defined as the ratio of the $p$-processor temporal performance to the 
single-processor temporal performance. It is a very useful and convenient 
measure if we are concerned with the optimisation of a particular code in 
isolation, because its value can easily be compared with the maximum possible 
speedup, namely the number of processors being used. We can thereby assess
how much of the potential hardware performance is being utilised. However
benchmarking is to do with comparing the performance of different computers,  
and all the above three metrics are unsuitable for this purpose.

Speedup compares the performance of a code with itself, and might therefore 
be called an introspective, or even incestuous measure. Problems can therefore 
arise (see below), and incorrect conclusions can be drawn, if we try to use 
speedup to compare different algorithms on the same computer, or the same 
algorithm on different computers. This is because speedup is a relative 
measure (it is defined as the ratio of two performances), and therefore all 
knowledge of the absolute performance has been lost. 
Benchmarking, however, is concerned with the comparison of the absolute 
performance of computers, and therefore the use of a relative measure like
speedup is not very useful, and can be positively misleading.  
For example, we do not wish to conclude that a computer with a large number of
slow processors and therefore high value of speedup, is faster than another 
with fewer processors and therefore with a lower speedup, if in fact the 
reverse is the case, because the processors on the second computer are so much 
faster. Only by adopting absolute measures of performance with physical units 
involving inverse time, can one avoid this type of false conclusion.

Speedup is not even useful for comparing the performance of one algorithm with 
another on the same computer, because it is not necessarily true that the 
algorithm with the highest speedup executes in the least time (see, e.g. 
~\cite{Cvet90}). One can only be sure that this is the case if the 
single-processor temporal performance of both algorithms is the same, which 
is most unlikely. If the single-processor performances of the two algorithms 
are different and we compare the speedups of the two algorithms, then we are 
comparing the performance of the two algorithms measured in different units.
This is like comparing the speeds of two cars, one measured in m.p.h. and the
other in cm/s. Such a comparison has no validity either for cars or 
algorithms. Computers and algorithms can only safely be compared in terms
of their absolute performance in solving a problem. The most unambiguous
measure is the temporal performance, which is the inverse of 
the time of execution, or the related simulation performance. 

The benchmark performance per node might seem to be an attractive 
metric because it is an absolute measure which
can be related directly to the hardware performance of node. However
it has the major defect that it hides the point at which the absolute 
performance begins to decrease as the number of processors increases. If we
plot benchmark performance against number of processors, this point is
clearly visible as a maximum, however if the same data is plotted as
performance per node, all we see is a very uninteresting monotonically 
falling line, and the important maximum has disappeared. The efficiency, 
which is defined as the speedup divided by the number of processors, is 
doubly condemned because it is a relative measure and hides the maximum.

\section{Performance Database}

The database of benchmark performance results should be based on an
extension of the excellent X-window display demonstrated by Jack Dongarra
at the March 1993 PBWG meeting. The basic data held would be values of
$T(N;p)$ for at least 4 values of problem size $N$, each for sufficient
$p$-values (say 5 to 10) to determine the trend of variation of performance
with number of processors for constant problem size. It is important that
there be enough $p$-values to see Amdahl saturation, if present, or any 
peak in performance followed by degradation. A graphical interface is
really essential to allow this multidimensional data to be viewed in any
of the metrics defined above, as chosen interactively by the user.
The user could also be offered (by suitable interpolation) a display of 
the results in various scaled metrics, in which the problem size is 
expanded with the number of processors.

In order to encompass as wide a range of performance and number of 
processors as possible, a log-scale on both axes is unavoidable, and
the format and scale range should be kept fixed as long as possible
to enable easy comparison between graphs. A three-cycle by three-cycle
log-log graph with range 1 to 1000 in both $p$ and Mflop/s would cover
most needs in the immediate future. Examples of such graphs are to be
found in \cite{Hoc92,Add93}.

% ----------------------------------------------------------------------------

From owner-pbwg-method@CS.UTK.EDU  Fri Mar 26 08:48:10 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA13935; Fri, 26 Mar 93 08:48:10 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA20932; Fri, 26 Mar 93 08:47:49 -0500
X-Resent-To: pbwg-method@CS.UTK.EDU ; Fri, 26 Mar 1993 08:47:46 EST
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA20915; Fri, 26 Mar 93 08:47:44 -0500
Via: uk.ac.southampton.ecs; Fri, 26 Mar 1993 13:46:58 +0000
From: R.Hockney@parallel-applications-centre.southampton.ac.uk
Via: calvados.pac.soton.ac.uk (plonk); Fri, 26 Mar 93 13:31:08 GMT
Date: Fri, 26 Mar 93 13:37:18 GMT
Message-Id: <7983.9303261337@calvados.pac.soton.ac.uk>
Apparently-To: pbwg-method@cs.utk.edu

Ignore this message, I'm just testing the pwg-method mail reflector
      Roger Hockney
From owner-pbwg-method@CS.UTK.EDU  Fri Apr 30 12:31:41 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA18429; Fri, 30 Apr 93 12:31:41 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA22989; Fri, 30 Apr 93 12:31:01 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Fri, 30 Apr 1993 12:30:59 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA22968; Fri, 30 Apr 93 12:30:51 -0400
Via: uk.ac.southampton.ecs; Fri, 30 Apr 1993 16:12:44 +0100
From: R.Hockney@parallel-applications-centre.southampton.ac.uk
Via: calvados.pac.soton.ac.uk (plonk); Fri, 30 Apr 93 16:04:49 BST
Date: Fri, 30 Apr 93 15:11:59 GMT
Message-Id: <15933.9304301511@calvados.pac.soton.ac.uk>
To: pbwg-method@cs.utk.edu
Subject: Contribution to Chapter-1 (Methodology)

\chapter{Methodology}

The conclusions drawn from a benchmark study of computer performance
depend not only on the basic timing results obtained, but also on
the way these are interpreted and converted into performance figures.
The choice of the performance metric, may itself influence
the conclusions. For example, do we want the computer that generates
the most megaflops (or has the highest Speedup), or the computer 
that solves the problem in the least time? It is now well known
that high values of the first metrics do not necessarily imply the 
second property. This confusion can be avoided by choosing a more 
suitable metric that reflects solution time directly, for example 
either the Temporal or Simulation performance, defined below. This
issue of the sensible choice of performance metric is becoming
increasing important with the advent of massively parallel computers
which have the potential of very high megaflop rates, but have 
much more limited potential for reducing solution time. 


\section{Time Measurement}

The fundamental measurement made in any benchmark is the time to 
complete some specified task. All other performance figures are derived 
from this basic timing measurement. 
The benchmark time, $T(N;p)$, will be a function of the problem size, $N$,
and the number of processors, $p$. Here, the problem size is 
represented by the vector variable, $N$, 
which stands for a set of parameters characterising the size of the 
problem: e.g. the number of mesh points in each dimension, and the
number of particles in a particle-mesh simulation. Benchmark problems of
different sizes can be created by multiplying all the size parameters by
suitable powers of a single scale factor, thereby increasing the spatial and
particle resolution in a sensible way, and reducing the size parameters to
a single size factor (here called $\alpha$). 

We believe that it is most important
to regard execution time and performance as a function of at least the two
variables $(N,p)$, which define a parameter plane. Much confusion has arisen
in the past by attempts to treat performance as a function of a single
variable, by taking a particular path through this plane, and not stating
what path is taken. Many different paths may be taken, and hence many different
conclusions can be drawn. It is important, therefore, always to define the 
path through the performance plane, or better as we do here, to study the 
shape of the two-dimensional performance hill. In some cases there may 
even be an optimum path up this hill.


\section{Units and Symbols}

A rational set of units and symbols is essential for any numerate
science including benchmarking. The following extension of the
internationally agreed SI system of physical units \cite{SI75} is 
made to accommodate the needs of computer benchmarking.

\medskip
New symbols and units: 
\begin{enumerate}
\item flop : number of floating-point operations
\item mref : number of memory references (reads or writes)
\item barr : number of barrier operations
\item b    : number of binary digits (bits)
\item B    : number of bytes (groups of 8 bits)
\item ${\rm w}_{32}$ : number of words (number of bits per word as 
             subscript, here 32).
\end{enumerate}
Note that flop and mref are both inseparable four-letter symbols.
The character case is significant in all unit symbols so that e.g. Flop, 
Mref, $W_{64}$ are incorrect. Unit symbols should always be printed in 
roman type, to contrast with variables names which are printed in italic.

\medskip
SI provides the standard prefixes:
\begin{enumerate}
\item k    : kilo meaning $10^3$
\item M    : mega meaning $10^6$
\item G    : giga meaning $10^9$
\item T    : tera meaning $10^{12}$
\end{enumerate}
This means that we cannot use M to mean $1024^2$ (the binary mega) as is 
often done in describing computer memory capacity, e.g. 256 MB. We can 
however introduce the new prefix:
\begin{enumerate}
\item K    : meaning 1024, then use a subscript 2 to indicate the binary
             versions
\item ${\rm M}_2$    : binary mega $1024^2$
\item ${\rm G}_2$    : binary giga $1024^3$
\item ${\rm T}_2$    : binary tera $1024^4$
\end{enumerate}
In most cases the difference between the mega and the binary mega (4\%) 
is probably unimportant, but it is important to be unambiguous. In this
way one can continue with existing practice if the difference doesn't 
matter, and have an agreed method of being more exact when necessary.
For example, the above memory capacity was probably intended to mean
$256 {\rm M_2 B}$.

As a consequence of the above, an amount of computational work involving
$4.5 \times 10^{12}$ floating-point operations is correctly written as 
4.5 Tflop. Note that the unit symbol Tflop is never pluralised with an
added 's', and it is therefore incorrect to write the above as 4.5 Tflops 
which could be confused with a rate per second. The most frequently used 
unit of performance, millions of floating-point operations per second 
is correctly written Mflop/s, in analogy to km/s. The slash is necessary 
and means 'per',  because the 'p' is an integral part of the unit symbol 
'flop' and cannot also be used for this purpose.  


\section{Floating-Point Operation Count}

Although we discourage the use of millions of floating-point
operations per second as a performance metric, it can be a useful 
measure if the number of floating-point operations, $F(N)$, 
needed to solve the benchmark problem is carefully defined.

For simple problems (e.g. matrix multiply) it is sufficient to use a 
theoretical value for the floating-point operation count (in this case $2n^3$ 
flop, for nxn matrices) obtained by inspection of the 
code or consideration of the arithmetic in the algorithm. For more complex
problems containing data-dependent conditional statements, an empirical method
may have to be used.  The sequential version of the benchmark code defines
the problem and the algorithm to be used to solve it. Counters can be inserted 
into this code or a hardware monitor used to count the number of floating-point
operations. The latter is the procedure followed by the {\sc PERFECT} Club 
\cite{Berr89}. In either case a decision has to be made regarding the number
of flop that are to be credited for different types of floating-point 
operations, and we see no good reason to deviate from those chosen by 
McMahon \cite{Ma88} when the Mflop/s measure was originally defined. 
These are:

\begin{table}[h]
\centering
\begin{tabular}{ll}
add, subtract, multiply		& 1 flop \\
divide, square-root		& 4 flop \\
exponential, sine etc.		& 8 flop \\
{\sc IF(X .REL. Y)}		& 1 flop \\
\end{tabular}
\end{table}

We distinguish two types of operation count. The first is the nominal 
benchmark floating-point operation count, $F_B(N)$, which is found in the 
above way from the defining Fortran77 sequential code. The other is the
actual number of floating-point operations performed by the hardware
when executing the distributed multi-node version, $F_H(N)$, which may be 
greater than the nominal benchmark count, due to the distributed version 
performing redundant arithmetic operations.


\section{Performance Metrics}

Given the time of execution $T(N;p)$ and the flop-count $F(N)$ several different
performance measures can be defined. Each metric has its own uses, and gives 
different information about the computer and algorithm used in the benchmark.
It is important therefore to distinguish the metrics with different names,
symbols and units, and to understand clearly the difference between them.
Much confusion and wasted work can arise from optimising a benchmark with
respect to an inappropriate metric. The principal performance metrics are:

\subsection{Temporal Performance}

If we are interested in comparing the
performance of different algorithms for the solution of the same problem, then
the correct performance metric to use is the {\it Temporal Performance},
$R_T$, which is defined as the inverse of the execution time
\begin{equation}
                            R_T(N;p)=T^{-1}(N;p)              \label{Eqn(1)}
\end{equation}
The units of temporal performance are, in general, solutions per second
(sol/s), or some more appropriate absolute unit such as 
timesteps per second (tstep/s). With this metric we can be sure
that the algorithm with the highest performance executes in the least time,
and is therefore the best algorithm. We note that the number of flop does not
appear in this definition, because the objective of algorithm design is not
to perform the most arithmetic per second, but rather it is to solve a given
problem in the least time, regardless of the amount of arithmetic involved.
For this reason the temporal performance is also the metric that a 
computer user should employ to select the best algorithm to solve his problem, 
because his objective is also to solve the problem in the least time, and he 
does not care how much arithmetic is done to achieve this.

\subsection{Simulation Performance}

A special case of temporal performance occurs for simulation programs in which
the benchmark problem is defined as the simulation of a certain period of
physical time, rather than a certain number of timesteps. In this case we speak
of the {\em Simulation Performance} and use units such as {\em simulated days
per day} (written sim-d/d or 'd'/d) in weather forecasting, where the 
apostrophe is used to indicate 'simulated'; or {\em simulated
pico-seconds per second} (written sim-ps/s or 'ps'/s) in electronic device
simulation. It is important to use simulation performance rather than
timestep/s if one is comparing different simulation algorithms which may 
require different sizes of timestep for the same accuracy (for example an
implicit scheme that can use a large timestep, compared with an explicit
scheme that requires a much smaller step). In order to maintain numerical
stability, explicit schemes also require the use of a smaller timestep as 
the spatial grid is made finer. For such schemes the simulation performance 
falls off dramatically as the problem size is increased by introducing 
more mesh points in order to refine the spatial resolution: the doubling 
of the number of mesh-points in each of three dimensions can reduce the 
simulation performance by a factor near 16 because the timestep must also 
be approximately halved. Even though the larger problem will generate more 
Megaflop per second, in forecasting, it is the simulated days per day 
(i.e. the simulation performance) and not the Mflop/s, that matter to the user.

As we see below, benchmark performance is also measured in terms of the amount
of arithmetic performed per second or Mflop/s. However it is important to
realise that it is incorrect to compare the Mflop/s achieved by two algorithms 
and to conclude that the algorithm with the highest Mflop/s rating is the best
algorithm. This is because the two algorithms may be performing quite 
different amounts of arithmetic during the solution of the same problem.
The temporal performance metric, $R_T$, defined above, has been introduced
to overcome this problem, and provide a measure that can be used to compare
different algorithms for solving the same problem. However, it should be 
remembered that the temporal performance only has the same meaning within the 
confines of a fixed problem, and no meaning can be attached to a
comparison of the temporal performance on one problem with the temporal
performance on another.

\subsection{Benchmark Performance}

In order to compare the performance of a computer on one benchmark with 
its performance on another, account must be taken of the different amounts of 
work (measured in flop) that the different problems require for their solution.
Using the flop-count for the benchmark, $F_B(N)$, we can
define the {\em Benchmark Performance} as
\begin{equation}
                            R_B(N;p)=F_B(N)/{T(N;p)}           \label{Eqn(2)}
\end{equation}
The units of benchmark performance are Mflop/s (benchmark name), where we 
include the name of the benchmark in parentheses to emphasise that the 
performance may depend strongly on the problem being solved, and to emphasise 
that the values are based on the nominal benchmark flop-count. In other 
contexts such performance figures would probably be quoted as examples of the 
so-called {\em sustained} performance of a computer. We feel that the use of 
this term is meaningless unless the problem being solved and the degree of 
code optimisation is quoted, because the performance is so varied across 
different benchmarks and different levels of optimisation. Hence we favour 
the quotation of a selection of benchmark performance figures, rather than a 
single sustained performance, because the latter implies that the quoted 
performance is maintained over all problems.

Note also that the flop-count $F_B(N)$ is that for the defining sequential 
version of the benchmark, and that the same count is used to calculate $R_B$ 
for the distributed-memory (DM) version of the program, even though the DM 
version may actually perform
a different number of operations.  It is usual for DM programs to perform more
arithmetic than the defining sequential version, because often numbers are
recomputed on the nodes in order to save communicating their values from a
master processor. However such calculations are redundant (they have already
been performed on the master) and it would be incorrect to credit them to the
flop-count of the distributed program. 

Using the sequential flop-count in the
calculation of the DM programs benchmark performance has the additional 
advantage that it is possible to conclude that, for a given benchmark,
the implementation that has the highest benchmark performance is the best 
because it executes in the least time.  This would not necessarily be the 
case if a different $F_B(N)$ were used for different implementations of the 
benchmark. For example, the use of a better algorithm which obtains the
solution with less than $F_B(N)$ operations will show up as higher benchmark
performance. For this reason it should cause no surprise if the benchmark 
performance occasionally exceeds the maximum possible hardware performance. 
To this extent benchmark performance Mflop/s must be understood
to be nominal values, and not necessarily exactly the number of operations
executed per second by the hardware, which is the subject of the next
metric. The purpose of benchmark performance is to compare different
implementations and algorithms on different computers for the solution of
the same problem, on the basis that the best performance means the least
execution time. For this to be true $F_B(N)$ must be kept the same for
all implementations and algorithms.  


\subsection{Hardware Performance}

If we wish to compare the observed performance with the theoretical 
capabilities of the computer hardware, we must compute the actual number of
floating-point operations performed, $F_H(N;p)$, and from it the actual
{\em Hardware Performance} 
\begin{equation}
                            R_H(N;p)=F_H(N)/{T(N;p)}             \label{Eqn(3)}
\end{equation}
The hardware performance also has the units Mflop/s, and will have the same 
value as the benchmark performance for the sequential version of the benchmark. 
However, the hardware performance may be higher than the benchmark performance 
for the distributed version, because the hardware performance gives credit for 
redundant arithmetic operations, whereas the benchmark performance does not.
Because the hardware performance measures the actual floating-point operations
performed per second, unlike the benchmark performance, it can never exceed
the theoretical peak performance of the computer.

Assuming a computer with multiple-CPUs each with multiple arithmetic pipelines,
delivering a maximum of one flop per clock period, the theoretical peak value
of hardware performance is
\begin{equation}
   r^*= \frac{fl.pt.pipes/CPU}{clock.period}\times number.CPUs   \label{Eqn(4)}
\end{equation}
with units of Mflop/s if the clock period is expressed in microseconds. By 
comparing the measured hardware performance, $R_H(N;p)$, with the theoretical 
peak performance, we can assess the fraction of the available performance that 
is being realised by a particular implementation of the benchmark.


\subsection{Speedup, Efficiency and Performance per Node}

We do not favour the use of any of the popular performance metrics: 
speedup, efficiency or performance per node; because all these are 
either easily misinterpreted or obscure important effects. 
The speedup of a benchmark code 
is defined as the ratio of the $p$-processor temporal performance to the 
single-processor temporal performance. It is a very useful and convenient 
measure if we are concerned with the optimisation of a particular code in 
isolation, because its value can easily be compared with the maximum possible 
speedup, namely the number of processors being used. We can thereby assess
how much of the potential hardware performance is being utilised. However
benchmarking is to do with comparing the performance of different computers,  
and all the above three metrics are unsuitable for this purpose.

Speedup compares the performance of a code with itself, and might therefore 
be called an introspective, or even incestuous measure. Problems can therefore 
arise (see below), and incorrect conclusions can be drawn, if we try to use 
speedup to compare different algorithms on the same computer, or the same 
algorithm on different computers. This is because speedup is a relative 
measure (it is defined as the ratio of two performances), and therefore all 
knowledge of the absolute performance has been lost. 
Benchmarking, however, is concerned with the comparison of the absolute 
performance of computers, and therefore the use of a relative measure like
speedup is not very useful, and can be positively misleading.  
For example, we do not wish to conclude that a computer with a large number of
slow processors and therefore high value of speedup, is faster than another 
with fewer processors and therefore with a lower speedup, if in fact the 
reverse is the case, because the processors on the second computer are so much 
faster. Only by adopting absolute measures of performance with physical units 
involving inverse time, can one avoid this type of false conclusion.

Speedup is not even useful for comparing the performance of one algorithm with 
another on the same computer, because it is not necessarily true that the 
algorithm with the highest speedup executes in the least time (see, e.g. 
~\cite{Cvet90}). One can only be sure that this is the case if the 
single-processor temporal performance of both algorithms is the same, which 
is most unlikely. If the single-processor performances of the two algorithms 
are different and we compare the speedups of the two algorithms, then we are 
comparing the performance of the two algorithms measured in different units.
This is like comparing the speeds of two cars, one measured in m.p.h. and the
other in cm/s. Such a comparison has no validity either for cars or 
algorithms. Computers and algorithms can only safely be compared in terms
of their absolute performance in solving a problem. The most unambiguous
measure is the temporal performance, which is the inverse of 
the time of execution, or the related simulation performance. 

The benchmark performance per node might seem to be an attractive 
metric because it is an absolute measure which
can be related directly to the hardware performance of node. However
it has the major defect that it hides the point at which the absolute 
performance begins to decrease as the number of processors increases. If we
plot benchmark performance against number of processors, this point is
clearly visible as a maximum, however if the same data is plotted as
performance per node, all we see is a very uninteresting monotonically 
falling line, and the important maximum has disappeared. The efficiency, 
which is defined as the speedup divided by the number of processors, is 
doubly condemned because it is a relative measure and hides the maximum.

\section{Performance Database}

The database of benchmark performance results should be based on an
extension of the excellent X-window display demonstrated by Jack Dongarra
at the March 1993 PBWG meeting. The basic data held would be values of
$T(N;p)$ for at least 4 values of problem size $N$, each for sufficient
$p$-values (say 5 to 10) to determine the trend of variation of performance
with number of processors for constant problem size. It is important that
there be enough $p$-values to see Amdahl saturation, if present, or any 
peak in performance followed by degradation. A graphical interface is
really essential to allow this multidimensional data to be viewed in any
of the metrics defined above, as chosen interactively by the user.
The user could also be offered (by suitable interpolation) a display of 
the results in various scaled metrics, in which the problem size is 
expanded with the number of processors.

In order to encompass as wide a range of performance and number of 
processors as possible, a log-scale on both axes is unavoidable, and
the format and scale range should be kept fixed as long as possible
to enable easy comparison between graphs. A three-cycle by three-cycle
log-log graph with range 1 to 1000 in both $p$ and Mflop/s would cover
most needs in the immediate future. Examples of such graphs are to be
found in \cite{Hoc92,Add93}.

% ----------------------------------------------------------------------------

From owner-pbwg-method@CS.UTK.EDU Tue May 18 09:41:04 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-UTK)
	id AA23468; Tue, 18 May 93 09:41:04 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA07846; Tue, 18 May 93 09:41:43 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Tue, 18 May 1993 09:41:41 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from ben.uknet.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA07829; Tue, 18 May 93 09:41:36 -0400
Message-Id: <9305181341.AA07829@CS.UTK.EDU>
Received: from eros.uknet.ac.uk by ben.uknet.ac.uk via UKIP with SMTP (PP) 
          id <sg.12676-0@ben.uknet.ac.uk>; Tue, 18 May 1993 14:41:31 +0100
Received: from newton.npl.co.uk by eros.uknet.ac.uk via PSS with NIFTP (PP) 
          id <6811-0@eros.uknet.ac.uk>; Tue, 18 May 1993 14:41:22 +0100
Date: Tue, 18 May 93 14:41 GMT
From: Trevor Chambers <THEC@newton.npl.co.uk>
To: PBWG-METHOD <PBWG-METHOD@EDU.UTK.CS>


COMMITTEE: methodology

TOPIC: definition of popular metrics

CONTENT: They are here to stay. Should we attempt to strengthen the 
definitions?

ACTUAL MESSAGE: Can PBWG realistically dispense with commonly used measures 
such as speed-up and performance per node?

There is the danger that PBWG could isolate itself from the main stream of 
computer users, or that others will use the PBWG benchmarks and report the 
popular metrics anyway.

One possibility is to acknowledge these metrics and to attempt to strengthen 
the definitions.

Relevant to this matter is the degree to which the suggested PBWG metrics are 
used and understood by potential users and people involved in procurement.
What information do we have on this matter? 

I propose that the committee takes a considered view on the use of popular 
metrics.


Trevor Chambers

pp Ed Brocklehurst
From owner-pbwg-method@CS.UTK.EDU Fri May 21 07:17:47 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-UTK)
	id AA03541; Fri, 21 May 93 07:17:47 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA11317; Fri, 21 May 93 07:17:43 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Fri, 21 May 1993 07:17:42 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from ben.uknet.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA11309; Fri, 21 May 93 07:17:37 -0400
Message-Id: <9305211117.AA11309@CS.UTK.EDU>
Received: from newton.npl.co.uk by ben.uknet.ac.uk via PSS with NIFTP (PP) 
          id <sg.28497-0@ben.uknet.ac.uk>; Fri, 21 May 1993 12:17:09 +0100
Date: Fri, 21 May 93 12:17 GMT
From: CBF@newton.npl.co.uk
To: PBWG-METHOD <PBWG-METHOD@EDU.UTK.CS>

To the Methodology Committee.

I'm afraid I'm a little confused at the intent of 2.4.3
the Benchmark Performance section as stated. If we are comparing the same
benchmark on different machines do we need Fb(N) in the equations?

For instance R1b(N;p) > R2b(N;p) where R1b is the rate of machine 1
and R2b is the rate of machine 2.

This implies Fb(N)/T1(N;p) > Fb(N)/T(N;p) as Fb(N) does not depend on 
the machine but only on the defining serial code.

This implies 1/T1(N;p) > 1/T(N;p) (or T(N;p) > T1(N;p) ).

In other words we get a straight timing result anyway.

If we are not comparing the same benchmark is the flop a suitable 
representation of the work that a benchmark does. Is it not possible that 
memory accessing and I/O usage could outstrip floating point operations in 
terms of importance?

From owner-pbwg-method@CS.UTK.EDU Fri May 21 13:12:51 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-UTK)
	id AA06602; Fri, 21 May 93 13:12:51 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA29733; Fri, 21 May 93 12:05:30 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Fri, 21 May 1993 12:05:25 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA29698; Fri, 21 May 93 12:05:18 -0400
Via: uk.ac.southampton.ecs; Fri, 21 May 1993 16:42:15 +0100
From: R.Hockney@parallel-applications-centre.southampton.ac.uk
Via: calvados.pac.soton.ac.uk (plonk); Fri, 21 May 93 16:34:34 BST
Date: Fri, 21 May 93 15:41:53 GMT
Message-Id: <3128.9305211541@calvados.pac.soton.ac.uk>
To: pbwg-method@cs.utk.edu
Subject: Speedup and R_B Mflop/s

May I respond to some comments on my submission on metrics:
(1) To Trevor Chambers - Speedup and Performance per Node
I think you may have talked yourself into a job! Although I personally
do not think that Speedup is a useful metric, I acknowledge its widespread
use, and agree that the best policy is to try to tighten up the definitions.
Please put your suggestions to the 24th May meeting. My problem with the
use of Speedup is that I don't think that it is possible to produce a usefull
definition, because all the knowledge of absolute performance has been
thrown away by taking the ratio to some (what) one-processor time. Its
much more rational to keep the knowledge of absolute performance by
using e.g. Temporal or Benchmark performance. How can you compare 
different computers if you don't use absolute measures, which involve
seconds^-1 in their units? Lets discuss this on Monday.
(2) To CBF at NPL - Do we need F_B(N)
The answer is NO we don't, in which case the only metric to use is the
Temporal Performance, which is the same as comparing executions times,
but the result is expressed more naturally as a performance. The definition
of Benchmark performance was made inorder to keep the units of Mflop/s
which people widely use but ensure that the highest R_B implies the least
execution time.  This can only be done if F_B(N) is kept the same for
all implementations and treated as a nominal figure.  If you want real
Mflop/s you use R_H the hardware performance.  Of course F_B is only
appropriate for problems dominated by floating-point arithmetic, for
other problems the Benchmark writyer would be expected to define an
appropriate measure of work (I/O references e.g.) or else simply fall
back on Temporal performance and time alone. The other reason for R_B
is to allow some (albeit approximate) comparison of all arithmetic
benchmarks in the same units (i.e.Mflop/s). Temporal Performances
cannot be compared across benchmarks.
           All Good points for disussion on Monday, preferable as
suggested editting to my draft submission, with your agreement of course
David. Are you there
                        Roger Hockney
From owner-pbwg-method@CS.UTK.EDU Mon May 24 03:26:04 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-UTK)
	id AA19684; Mon, 24 May 93 03:26:04 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA13266; Mon, 24 May 93 03:26:34 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Mon, 24 May 1993 03:26:33 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from sun2.nsfnet-relay.ac.uk by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA13258; Mon, 24 May 93 03:26:28 -0400
Via: uk.ac.southampton.ecs; Mon, 24 May 1993 08:25:54 +0100
Via: brewery.ecs.soton.ac.uk; Mon, 24 May 93 08:18:26 BST
From: Vladimir Getov <vsg@ecs.soton.ac.uk>
Received: from beluga.ecs.soton.ac.uk by brewery.ecs.soton.ac.uk;
          Mon, 24 May 93 08:27:00 BST
Date: Mon, 24 May 93 08:26:59 BST
Message-Id: <25360.9305240726@beluga.ecs.soton.ac.uk>
To: pbwg-method@cs.utk.edu
Subject: a note on metrics


Here follows my rough note on metrics, which you may like to consider
in today's meeting. I wish you fruitful and successful discussions.

Best regards,

Vladimir Getov

		    A Few Notes on Metrics
		    ----------------------


1. The elapsed time is the only prime metric.

1.1. It must be the wall-clock time, otherwise there will be some hidden
delays and the interpretation of results may not be correct.

1.2. Do we need to know the components of the wall-clock time? The
computer user will probably answer: - `No, I am only interested in the
total elapsed time to complete my task.' However, if one would like to
improve the application performance as much as possible, the knowledge
of all components of the wall-clock time is necessary. This information
is essential for computer architects, but is also very useful for
computer users. As usual, these components include the calculation time,
the communication time, the memory-access time, the I/O time, the system
time, the wait time, etc. 

2. In order to define the secondary (derivative) metrics I would suggest
answering the question: -`What is our goal? Why we need benchmark
measurements?'  There is a few answers to this question and I believe 
every answer defines its appropriate derivative (secondary) performance 
metrics. My current `goals' list includes:

     - Comparative benchmarking analysis of different parallel machines;

     - Analysis and comparison of different algorithms, methods and
     program implementations for a given application;

     - Benchmarking analysis and identification of bottlenecks for
     particular parallel architecture (shell we include the compiler
     benchmarks, etc. here?).

In my opinion the most useful derivative benchmarking metrics are the 
benchmark performance, the speed-up and the efficiency. Concerning
the speed-up and the efficiency I would think about different
definitions. For example, it seems very useful to calculate the speed-up
as the ratio of the elapsed time of two different algorithms on the same
computer. This defines straight-away which one is faster for given
problem size and number of processors. The efficiency could be defined
as the ratio of the benchmark performance to the peak performance. This
gives a clear picture of how far the measurement results are from the
theoretical peak. In fact the list of derivative metrics may be quite
long and it would be difficult to decide which are the important and
useful metrics to be included in the PBWG documents.


vsg@ecs.soton.ac.uk                  Dr Vladimir Getov 
Tel: +44 703 593368                  Dept of Electronics and Computer Science
Fax: +44 703 593045                  University of Southampton 
                                     Southampton SO9 5NH, U.K.

From owner-pbwg-method@CS.UTK.EDU Wed Sep  1 22:04:37 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA12004; Wed, 1 Sep 93 22:04:37 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA25195; Wed, 1 Sep 93 22:03:06 -0400
X-Resent-To: pbwg-method@CS.UTK.EDU ; Wed, 1 Sep 1993 22:03:04 EDT
Errors-To: owner-pbwg-method@CS.UTK.EDU
Received: from BERRY.CS.UTK.EDU by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA25189; Wed, 1 Sep 93 22:03:03 -0400
Received: from LOCALHOST.cs.utk.edu by berry.cs.utk.edu with SMTP (5.61++/2.7c-UTK)
	id AA08100; Wed, 1 Sep 93 22:03:01 -0400
Message-Id: <9309020203.AA08100@berry.cs.utk.edu>
To: pbwg-method@cs.utk.edu
Subject: method5.tex
Date: Wed, 01 Sep 1993 22:03:01 -0400
From: "Michael W. Berry" <berry@cs.utk.edu>

This is a revision of the methodology chapter of the PARKBENCH
report submitted by David Bailey.


Regards,

Mike
------- Forwarded Message

Return-Path: <dbailey@nas.nasa.gov>
Received: from wk49.nas.nasa.gov by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA19544; Wed, 1 Sep 93 20:24:25 -0400
Received: by wk49.nas.nasa.gov (5.67-NAS.4/5.67-NAS-1.1(SGI))
	id AA02573; Wed, 1 Sep 93 17:24:22 -0700
Date: Wed, 1 Sep 93 17:24:22 -0700
From: dbailey@nas.nasa.gov (David H. Bailey)
Message-Id: <9309020024.AA02573@wk49.nas.nasa.gov>
To: berry@cs.utk.edu
Subject: Revised Methodology section.  Please distribute to method group

%file method5.tex

% Compiled by David Bailey for methodology subcommittee.
% Text below submitted by Roger Hockney to methodology subcommittee.
% This text includes revision by Bailey on 01 Sep 1993.

\chapter{Methodology}

\section{Philosophy}

One might ask why anyone should care about developing a standardized,
rigorous and scientifically tenable methodology for studying the
performance of high-performance computer systems.  There are several
reasons why this is an important undertaking:

\begin{enumerate}

\item To establish and maintain high standards of honesty and
integrity in our profession.

\item To improve the status of supercomputer performance analysis as
a rigorous scientific discipline.

\item To reduce confusion in the high-performance computing literature.

\item To increase understanding of these systems, both at a low-level
hardware or software level and at a high-level, total system
performance level.

\item To assist the purchasers of high-performance computing equipment
in selecting systems best suited to their needs.

\item To reduce the amount of time and resources vendors must expend in
implementing multiple, redundant benchmarks.

\item To provide valuable feedback to vendors on bottlenecks that can
be alleviated in future products.

\end{enumerate}

It is important to note that researchers in many scientific
disciplines have found it necessary to establish and refine standards
for performing experiments and reporting the results.  Many scientists
have learned the importance of standard terminology and notation.
Chemists, physicists and biologists long ago discovered the importance
of ``controls'' in their experiments.  The issue of repeatability
proved crucial in the recent ``cold fusion'' episode.  Medical
researchers have found it necessary to perform ``double-blind''
experiments in their field.  Psychologists and sociologists have
developed highly refined experimental methodologies and advanced data
analysis techniques.  Political scientists have found that subtle
differences in the phrasing of a question can affect the results of a
poll.  Researchers in many fields have found that environmental
factors in their experiments can significantly influence the measured
results; thus they must carefully report all such factors in their
papers.

If supercomputer performance analysis and benchmarking is ever to be
taken seriously as a scientific discipline, certainly its
practitioners should be expected to adhere to standards that prevail
in other disciplines.  This document is dedicated to promoting these
standards in our field.

\section{Fundamental Metrics}

The conclusions drawn from a benchmark study of computer performance
depend not only on the basic timing results obtained, but also on
the way these are interpreted and converted into performance figures.
The choice of the performance metric, may itself influence the 
conclusions. For example, do we want the computer that generates the 
most megaflop per second (or has the highest Speedup), or the computer 
that solves the problem in the least time? It is now well known
that high values of the first metrics do not necessarily imply the 
second property. This confusion can be avoided by choosing a more 
suitable metric that reflects solution time directly, for example 
either the Temporal, Simulation or Benchmark performance, defined below. 
This issue of the sensible choice of performance metric is becoming
increasing important with the advent of massively parallel computers
which have the potential of very high megaflop rates, but have 
much more limited potential for reducing solution time. 

\section{Time Measurement}

Before other issues can be considered, we must discuss the measurement
of run time.  In recent years a consensus has been reached among many
scientists in the field that the most relevant measure of run time is
actual wall-clock elapsed time.  This measure of time will be required
for all ParkBench results that are posted to the database.

Elapsed wall-clock time means the time that would be measured on an
external clock that records the time-of-day or even Greenwich mean
time (GMT), between the start and finish of the benchmark.  We are not
concerned with the origin of the time measurement, since we are taking
a difference, but it is important that the time measured would be the
same as that given by a difference between two measurements of GMT, if
it were possible to make them.  It is important to be clear about
this, because many computer clocks (e.g. Sun Unix function ETIME)
measure elapsed CPU time, which is the total time that the process or
job which calls it has been executing in the CPU.  Such a clock does
not record time (i.e. it stops ticking) when the job is swapped out of
the CPU.  It does not record, therefore, any wait time which must be
included if we are to assess correctly the performance of a parallel
program.  On some systems, scientists have found that even for
programs that perform no explicit I/O, considerable ``system'' time is
nonetheless involved, for example in fetching certain library routines
or other data.

Only timings actually measured may be cited for ParkBench benchmarks
(and we strongly recommend this practice for other benchmarks as
well).  Extrapolations and projections, for instance to a larger
number of nodes, may not be employed for any reason.  Also, in the
interests of repeatability it is highly recommended that timing runs
be repeated, several times if possible.

Two low-level benchmarks are provided in the ParkBench~ suite to test
the precision and accuracy of the clock that is to be used in the
benchmarking.  These should be run first, before any benchmark
measurements are made.  They are:
\begin{enumerate}

\item TICK1 - measures the precision of the clock by measuring the time 
              interval between ticks of the clock. A clock is said to
              tick when it changes its value.

\item TICK2 - measures the accuracy of the clock by comparing a given
              time interval measured by an external wall-clock (the
              benchmarker's wrist watch is adequate) with the same
              interval measured by the computer clock. This tests the
              scale factor used to convert computer clock ticks to seconds, 
              and immediately detects if a CPU-clock is incorrectly being 
              used.
\end{enumerate} 

The fundamental measurement made in any benchmark is the elapsed
wall-clock time to complete some specified task. All other performance
figures are derived from this basic timing measurement. The benchmark
time, $T(N;p)$, will be a function of the problem size, $N$, and the
number of processors, $p$. Here, the problem size is represented by
the vector variable, $N$, which stands for a set of parameters
characterising the size of the problem: e.g. the number of mesh points
in each dimension, and the number of particles in a particle-mesh
simulation.  Benchmark problems of different sizes can be created by
multiplying all the size parameters by suitable powers of a single
scale factor, thereby increasing the spatial and particle resolution
in a sensible way, and reducing the size parameters to a single size
factor (usually called $\alpha$).

We believe that it is most important to regard execution time and
performance as a function of at least the two variables $(N,p)$, which
define a parameter plane. Much confusion has arisen in the past by
attempts to treat performance as a function of a single variable, by
taking a particular path through this plane, and not stating what path
is taken.  Many different paths may be taken, and hence many different
conclusions can be drawn.  It is important, therefore, always to
define the path through the performance plane, or better as we do
here, to study the shape of the two-dimensional performance hill.  In
some cases there may even be an optimum path up this hill.

\section{Units and Symbols}

A rational set of units and symbols is essential for any numerate
science including benchmarking. The following extension of the
internationally agreed SI system of physical units \cite{SI75} is 
made to accommodate the needs of computer benchmarking.

The value of a variable comprises a pure number stating the number of
units which equal the value of the variable, followed by a unit
symbol specifying the unit in which the variable is being measured.
A new unit is required whenever a quantity of a new nature arises,
such as e.g. the first appearance of vector operations, or message 
sends. Generally speaking a unit symbol should be as short as
possible, consistent with being easily recognised and not already
used. The following have been found necessary in the characterisation
of computer and benchmark performance in science and engineering. 
No doubt more will have to be defined as benchmarking enters new areas.

\medskip
New unit symbols and their meaning: 
\begin{enumerate}
\item \flop : floating-point operation [latex \verb1\1flop]
\item \inst : instruction of any kind [latex \verb1\1inst]
\item \inop : integer operation [latex \verb1\1inop]
\item \vecop: vector operation [latex \verb1\1vecop]
\item \send: message send operation [latex \verb1\1send]
\item \iter : iteration of loop [latex \verb1\1iter]
\item \mref : memory reference (read or write) [latex \verb1\1mref]
\item \barr : barrier operation [latex \verb1\1barr]
\item \bit  : binary digit (bit) [latex \verb1\1bit]
\item \B    : byte (groups of 8 bits) [latex \verb1\1B]
\item \sol  : solution or single execution of benchmark 
              [latex \verb1\1sol]
\item \w    : computer word. Symbol is lower case (W means watt) 
              [latex \verb1\1w]
\end{enumerate}
When required a subscript may be used to show the number of bits
involved in the unit. For example: a 32-bit floating-point operation
${\flop}_{32}$, a 64-bit word ${\w}_{64}$, also we have
${\bit}={\w}_1$, ${\B}={\w}_8$, ${\w}_{64}= 8 {\B}$.

Note that flop, mref and other multi-letter symbols are inseparable
four or five-letter symbols. The character case is significant in all
unit symbols so that e.g. Flop, Mref, $W_{64}$ are incorrect. Unit
symbols should always be printed in roman type, to contrast with
variables names which are printed in italic. To aid in the use of
roman type, especially within Latex maths mode, Latex commands have
been defined for each unit, these commands being a backslash followed
by the unit symbol (except for '\inop' and '\bit' whose names are
changed in the command to avoid a clash with already defined system
commands). Such commands will print in roman type wherever they
occur. Because 's' is the SI unit for seconds, unit symbols like
'sheep' do not take 's' in the plural. Thus one counts: one flop, two
flop, ..., one hundred flop etc. This is especially important when the
unit symbol is used in ordinary text as a useful abbreviation, as
often, quite sensibly, it is.

\medskip
SI provides the standard prefixes:
\begin{enumerate}
\item k    : kilo meaning $10^3$
\item M    : mega meaning $10^6$
\item G    : giga meaning $10^9$
\item T    : tera meaning $10^{12}$
\end{enumerate}
This means that we cannot use M to mean $1024^2$ (the binary mega) as is 
often done in describing computer memory capacity, e.g. 256 MB. We can 
however introduce the new prefix:
\begin{enumerate}
\item K    : meaning 1024, then use a subscript 2 to indicate the binary
             versions
\item $\mbox{\rm M}_2$    : binary mega $1024^2$
\item $\mbox{\rm G}_2$    : binary giga $1024^3$
\item $\mbox{\rm T}_2$    : binary tera $1024^4$
\end{enumerate}
In most cases the difference between the mega and the binary mega (4\%) 
is probably unimportant, but it is important to be unambiguous. In this
way one can continue with existing practice if the difference doesn't 
matter, and have an agreed method of being more exact when necessary.
For example, the above memory capacity was probably intended to mean
$256 \mbox{\rm M}_2 \B$.

As a consequence of the above, an amount of computational work involving
$4.5 \times 10^{12}$ floating-point operations is correctly written as 
4.5 Tflop. Note that the unit symbol Tflop is never pluralised with an
added 's', and it is therefore incorrect to write the above as 4.5 Tflops 
which could be confused with a rate per second. The most frequently used 
unit of performance, millions of floating-point operations per second 
is correctly written Mflop/s, in analogy to km/s. The slash is necessary 
and means 'per',  because the 'p' is an integral part of the unit symbol 
'flop' and cannot also be used to mean 'per'.  

\section{Floating-Point Operation Count}

Although we discourage the use of millions of floating-point
operations per second as a performance metric, it can be a useful
measure if the number of floating-point operations, $F(N)$, needed to
solve the benchmark problem is carefully defined.

For simple problems (e.g. matrix multiply) it is sufficient to use a
theoretical value for the floating-point operation count (in this case
$2n^3$ flop, for nxn matrices) obtained by inspection of the code or
consideration of the arithmetic in the algorithm. For more complex
problems containing data-dependent conditional statements, an
empirical method may have to be used.  The sequential version of the
benchmark code defines the problem and the algorithm to be used to
solve it. Counters can be inserted into this code or a hardware
monitor used to count the number of floating-point operations. The
latter is the procedure followed by the {\sc PERFECT} Club
\cite{Berr89}. In either case a decision has to be made regarding the
number of flop that are to be credited for different types of
floating-point operations, and we see no good reason to deviate from
those chosen by McMahon \cite{Ma88} when the Mflop/s measure was
originally defined.  These are:

\begin{table}[h]
\centering
\begin{tabular}{ll}
add, subtract, multiply		& 1 flop \\
divide, square-root		& 4 flop \\
exponential, sine etc.		& 8 flop (this figure will be adjusted) \\   
{\sc IF(X .REL. Y)}		& 1 flop \\
\end{tabular}
\end{table}

Some members of the committee felt that these numbers, derived in the
1970s, no longer correctly reflected the situation on current
computers.  However, since these numbers are only used to calculate a
nominal benchmark flop-count, it is not so important that they be
accurate. The important thing is that they do not change, otherwise
all previous flop-counts would have to be renormalised.  In any case,
it is not possible for a single set of ratios to be valid for all
computers and library software. I (rwh) suggest the committee stays
with the above ratios until such time as they become wildly wrong and
extensive research provides us with a more realistic set.

We distinguish two types of operation count. The first is the nominal
benchmark floating-point operation count, $F_B(N)$, which is found in
the above way from the defining Fortran77 sequential code. The other
is the actual number of floating-point operations performed by the
hardware when executing the distributed multi-node version,
$F_H(N,p)$, which may be greater than the nominal benchmark count, due
to the distributed version performing redundant arithmetic
operations. Because of this, the hardware flop count may also depend
on the number of processors on which the benchmark is run, as shown in
its argument list.

\section{Performance Metrics}

Given the time of execution $T(N;p)$ and the flop-count $F(N)$ several
different performance measures can be defined.  Each metric has its
own uses, and gives different information about the computer and
algorithm used in the benchmark.  It is important therefore to
distinguish the metrics with different names, symbols and units, and
to understand clearly the difference between them.  Much confusion and
wasted work can arise from optimising a benchmark with respect to an
inappropriate metric. The principal performance metrics are:

\subsection{Temporal Performance}

If we are interested in comparing the performance of different
algorithms for the solution of the same problem, then the correct
performance metric to use is the {\it Temporal Performance}, $R_T$,
which is defined as the inverse of the execution time
\begin{equation}
                            R_T(N;p)=T^{-1}(N;p)              \label{Eqn(1)}
\end{equation}
The units of temporal performance are, in general, solutions per
second (sol/s), or some more appropriate absolute unit such as
timesteps per second (tstep/s). With this metric we can be sure that
the algorithm with the highest performance executes in the least time,
and is therefore the best algorithm. We note that the number of flop
does not appear in this definition, because the objective of algorithm
design is not to perform the most arithmetic per second, but rather it
is to solve a given problem in the least time, regardless of the
amount of arithmetic involved.  For this reason the temporal
performance is also the metric that a computer user should employ to
select the best algorithm to solve his problem, because his objective
is also to solve the problem in the least time, and he does not care
how much arithmetic is done to achieve this.

\subsection{Simulation Performance}

A special case of temporal performance occurs for simulation programs
in which the benchmark problem is defined as the simulation of a
certain period of physical time, rather than a certain number of
timesteps. In this case we speak of the {\em Simulation Performance}
and use units such as {\em simulated days per day} (written sim-d/d or
'd'/d) in weather forecasting, where the apostrophe is used to
indicate 'simulated'; or {\em simulated pico-seconds per second}
(written sim-ps/s or 'ps'/s) in electronic device simulation. It is
important to use simulation performance rather than timestep/s if one
is comparing different simulation algorithms which may require
different sizes of timestep for the same accuracy (for example an
implicit scheme that can use a large timestep, compared with an
explicit scheme that requires a much smaller step). In order to
maintain numerical stability, explicit schemes also require the use of
a smaller timestep as the spatial grid is made finer. For such schemes
the simulation performance falls off dramatically as the problem size
is increased by introducing more mesh points in order to refine the
spatial resolution: the doubling of the number of mesh-points in each
of three dimensions can reduce the simulation performance by a factor
near 16 because the timestep must also be approximately halved. Even
though the larger problem will generate more Megaflop per second, in
forecasting, it is the simulated days per day (i.e. the simulation
performance) and not the Mflop/s, that matter to the user.

As we see below, benchmark performance is also measured in terms of the amount
of arithmetic performed per second or Mflop/s. However it is important to
realise that it is incorrect to compare the Mflop/s achieved by two algorithms 
and to conclude that the algorithm with the highest Mflop/s rating is the best
algorithm. This is because the two algorithms may be performing quite 
different amounts of arithmetic during the solution of the same problem.
The temporal performance metric, $R_T$, defined above, has been introduced
to overcome this problem, and provide a measure that can be used to compare
different algorithms for solving the same problem. However, it should be 
remembered that the temporal performance only has the same meaning within the 
confines of a fixed problem, and no meaning can be attached to a
comparison of the temporal performance on one problem with the temporal
performance on another.

\subsection{Benchmark Performance}

In order to compare the performance of a computer on one benchmark with 
its performance on another, account must be taken of the different amounts of 
work (measured in flop) that the different problems require for their solution.
Using the flop-count for the benchmark, $F_B(N)$, we can
define the {\em Benchmark Performance} as
\begin{equation}
                            R_B(N;p)=F_B(N)/{T(N;p)}           \label{Eqn(2)}
\end{equation}
The units of benchmark performance are Mflop/s (benchmark name), where we 
include the name of the benchmark in parentheses to emphasise that the 
performance may depend strongly on the problem being solved, and to emphasise 
that the values are based on the nominal benchmark flop-count. In other 
contexts such performance figures would probably be quoted as examples of the 
so-called {\em sustained} performance of a computer. We feel that the use of 
this term is meaningless unless the problem being solved and the degree of 
code optimisation is quoted, because the performance is so varied across 
different benchmarks and different levels of optimisation. Hence we favour 
the quotation of a selection of benchmark performance figures, rather than a 
single sustained performance, because the latter implies that the quoted 
performance is maintained over all problems.

Note also that the flop-count $F_B(N)$ is that for the defining sequential 
version of the benchmark, and that the same count is used to calculate $R_B$ 
for the distributed-memory (DM) version of the program, even though the DM 
version may actually perform
a different number of operations.  It is usual for DM programs to perform more
arithmetic than the defining sequential version, because often numbers are
recomputed on the nodes in order to save communicating their values from a
master processor. However such calculations are redundant (they have already
been performed on the master) and it would be incorrect to credit them to the
flop-count of the distributed program. 

Using the sequential flop-count in the
calculation of the DM programs benchmark performance has the additional 
advantage that it is possible to conclude that, for a given benchmark,
the implementation that has the highest benchmark performance is the best 
because it executes in the least time.  This would not necessarily be the 
case if a different $F_B(N)$ were used for different implementations of the 
benchmark. For example, the use of a better algorithm which obtains the
solution with less than $F_B(N)$ operations will show up as higher benchmark
performance. For this reason it should cause no surprise if the benchmark 
performance occasionally exceeds the maximum possible hardware performance. 
To this extent benchmark performance Mflop/s must be understood
to be nominal values, and not necessarily exactly the number of operations
executed per second by the hardware, which is the subject of the next
metric. The purpose of benchmark performance is to compare different
implementations and algorithms on different computers for the solution of
the same problem, on the basis that the best performance means the least
execution time. For this to be true $F_B(N)$ must be kept the same for
all implementations and algorithms.  


\subsection{Hardware Performance}

If we wish to compare the observed performance with the theoretical 
capabilities of the computer hardware, we must compute the actual number of
floating-point operations performed, $F_H(N;p)$, and from it the actual
{\em Hardware Performance} 
\begin{equation}
                            R_H(N;p)=F_H(N,p)/{T(N;p)}             \label{Eqn(3)}
\end{equation}
The hardware performance also has the units Mflop/s, and will have the same 
value as the benchmark performance for the sequential version of the benchmark. 
However, the hardware performance may be higher than the benchmark performance 
for the distributed version, because the hardware performance gives credit for 
redundant arithmetic operations, whereas the benchmark performance does not.
Because the hardware performance measures the actual floating-point operations
performed per second, unlike the benchmark performance, it can never exceed
the theoretical peak performance of the computer.

Assuming a computer with multiple-CPUs each with multiple arithmetic pipelines,
delivering a maximum of one flop per clock period, the theoretical peak value
of hardware performance is
\begin{equation}
   r^*= \frac{fl.pt.pipes/CPU}{clock.period}\times number.CPUs   \label{Eqn(4)}
\end{equation}
with units of Mflop/s if the clock period is expressed in microseconds. By 
comparing the measured hardware performance, $R_H(N;p)$, with the theoretical 
peak performance, we can assess the fraction of the available performance that 
is being realised by a particular implementation of the benchmark.


\subsection{Speedup, Efficiency and Performance per Node}

``Parallel speedup'' is a popular metric that has been used for many
years in the study of parallel computer performance.  However, its
definition is open to ambiguity and misuse because it always begs the
question ``speedup over what?''  

Speedup is usually defined as
\begin{equation} 
       \frac{T_1}{T_p}
\end{equation}

\noindent 
where $T_p$ is the $p$-processor time to perform some benchmark, and
$T_1$ is the one-processor time.  There is no doubt about the meaning
of $T_p$ --- this is the measured time $T(N;p)$ to perform the
benchmark.  There is often considerable dispute over the meaning of
$T_1$: should it be the time for the parallel code running on one
processor, which probably contains unnecessary parallel overhead, or
should it be the best serial code (possibly using a different
algorithm) running on one processor?  Many scientists feel the latter
is a more responsible choice, but this requires research to determine
the best practical serial algorithm for the given application.  If at
a later time a better algorithm is found, current speedup figures
might be considered obsolete.  An additional difficulty with this
definition is that even if a meaning for $T_1$ is agreed to, there may
be insufficient memory on a single node to store an entire large
problem.  Thus in many cases it may be impossible to measure $T_1$
using this definition.

One principal objective in the field of performance analysis is
benchmarking: to compare the performance of different computers.  It
is generally agreed that the best performance corresponds to the least
wall-clock execution time.  In order to adapt the speedup statistic
for benchmarking, it is thus necessary to define a single reference
value of $T_1$ to be used for all calculations.  It does not matter
how $T_1$ is defined, or what its value is, only that the same value
of $T_1$ is used to calculate all speedup values used in the
comparison.

However, defining $T_1$ as a reference time unrelated to the parallel
computer being benchmarked unfortunately has the consequence that many
properties that many people regard as essential to the concept of
parallel speedup are lost:

\begin{enumerate}

\item It is no longer necessarily true that the speedup of the
parallel code on one processor is unity.  It may be, but only
by chance.

\item It is no longer true that the maximum speedup using
$p$-processors is $p$.

\item Because of the last item, efficiency figures computed as speedup
divided by $p$ are no longer a meaningful measure of processor
utilization.

\end{enumerate}

There are other difficulties with this formulation of speedup.  If we
use $T_1$ as the run time on a very fast single processor (currently,
say, a Cray C90 or a NEC SX-3), then manufacturers of highly parallel
systems will be reluctant to quote the speedup of their system in the
above way.  For example, if the speedup of a 100 processor parallel
system over a single node of the same system is a respectable factor
of 80, it is likely that the speedup computed from the ``standard''
$T_1$ would be reduced to 10 or less.  This is because a fast vector
processor is typically at least ten times faster than the RISC
processors used in many highly parallel systems of a comparable
generation.

Thus it appears that if one sharpens the definition of speedup to make
it an acceptable metric for comparing the performance of different
computers, one has to throw away the main properties that have made
the concept of speedup useful in the past.  

Accordingly, the ParkBench committee has decided the following [DHB's
suggestion]:

\begin{enumerate}

\item No speedup statistic will be kept in the ParkBench database.

\item Speedup statistics based on ParkBench benchmarks must never be
used as figures of merit when comparing the performance of different
systems.  We further recommend that speedup figures based on other
benchmarks not be used as figures of merit in such comparisons.

\item Speedup statistics may be used in a study of the performance
characteristics of an individual parallel system.  But the basis for
the determination of $T_1$ must be clearly and explicitly stated.

\item The value of $T_1$ should be based on an efficient uniprocessor
implementation.  Code for message passing, synchronization, etc.
should not be present.  The author should also make a reasonable
effort to insure that the algorithm used in the uniprocessor
implementation is the best practical serial algorithm for this
purpose.

\item Given that a large problem frequently does not fit on a single
node, it is permissible to cite speedup statistics based on the timing
of a smaller number of nodes.  In other words, it is permissible to
compute speedup as $T_p / T_m$, for some $m, \; 1 < m < p$.  If this
is done, however, this usage must be clearly stated, and full details
of the basis of this calculation must be presented.  As above, care
must be taken to insure that the unit timing $T_m$ is based on an
efficient implementation of appropriate algorithms.

\end{enumerate}

\section{Performance Database}

\begin{verbatim}
It was agreed that this subsection be redrafted by Jack Dongarra.

- --------------------------  START OLD TEXT  ---------------------------------
\end{verbatim}

\input{pds.tex}

\begin{verbatim}
- ---------------------------  END OLD TEXT  ---------------------------------

- ----------------------  SOME PROPOSED NEW TEXT  ----------------------------
\end{verbatim}

At present each benchmark measurement for a particular problem size $N$ and
processor number $p$, is represented by one line in the database with
variable length fields chosen by the benchmark writer as suitable and 
comprehensive to describe the conditions of the benchmark run. The fields
separated by a marker include, benchmarkers name and e-mail, computer 
location and date, hardware specification, compiler date and optimisation 
level, $N$, $p$, $T(N,p)$, $R_B(N,P)$ and other metrics as deemed appropriate
by the benchmark writer. Ideally, the line for the database would be 
produced automatically as output by the benchmark program itself.

\begin{verbatim}
- ----------------------  END PROPOSED NEW TEXT  ----------------------------
\end{verbatim}


\section{Interactive Graphical Interface}

The Southampton Group has agreed to provide an interactive graphical front 
end to the ParkBench~ database of performance results. To achieve this,
the basic data held in the Performance Data Base should be values of
$T(N;p)$ for at least 4 values of problem size $N$, each for sufficient
$p$-values (say 5 to 10) to determine the trend of variation of performance
with number of processors for constant problem size. It is important that
there be enough $p$-values to see Amdahl saturation, if present, or any 
peak in performance followed by degradation. A graphical interface is
really essential to allow this multidimensional data to be viewed in any
of the metrics defined above, as chosen interactively by the user.
The user could also be offered (by suitable interpolation) a display of 
the results in various scaled metrics, in which the problem size is 
expanded with the number of processors.

In order to encompass as wide a range of performance and number of 
processors as possible, a log-scale on both axes is unavoidable, and
the format and scale range should be kept fixed as long as possible
to enable easy comparison between graphs. A three-cycle by three-cycle
log-log graph with range 1 to 1000 in both $p$ and Mflop/s would cover
most needs in the immediate future. Examples of such graphs are to be
found in \cite{Hoc92,Add93}. 

A log/log graph is also desirable because the size and shape of the Amdahl 
saturation curve is the same wherever it is plotted on such a graph. 
That is to say there is a universal Amdahl curve that is invariant to 
its position on any log/log graph. Amdahl saturation is a two-parameter 
description of any of the performance metrics, $R$, as a function of $p$ 
for fixed $N$, which can be expressed by
\begin{equation}
                   R = \frac{R_\infty}{(1 + \phalf/p)}
\end{equation}
where $R_\infty$ is the saturation performance approached as $p \rightarrow 
\infty$ and \phalf is the number of processors required to reach half
the saturation performance. The graphical interface should allow this
universal Amdahl curve to be moved around the graphical display, and
be matched against the performance curves. The changing values of the two 
parameters \Rphalf should be displayed as the Amdahl curve is moved.

As more experience is gained with performance analysis, that is to say
the fitting of performance data to parametrised formulae, it is to be
expected that the graphical interface will allow more complicated formulae
to be compared with the experimental data, perhaps allowing 3 to 5
parameters in the theoretical formula. But, as yet, we do not know what
these for parametrised formula should be.


\section{Benchmarking Procedure and Code Optimisation}

Manufacturers will always feel that any benchmark not tuned specifically
by themselves, is an unfair test of their hardware and software. This is
inevitable and from their viewpoint it is true. NASA have overcome this 
problem by only specifying the problems (the NAS paper-and-pencil 
benchmarks \cite{naspar2}) and leaving the manufacturers to write the 
code, but in many circumstances this would require unjustifiable effort
and take too long. It is also a perfectly valid question to ask how a
particular parallel computer will perform on existing parallel code, and
that is the viewpoint of ParkBench. 

The benchmarking procedure is to run the distributed ParkBench~ suite on
an 'as-is' basis, making only such non-substantive changes that are required 
to make the code run (e.g. changing the names of header files to a local
variant). The as-is run may use the highest level of automatic compiler
optimisation that works, but the level used and compiler date should be
noted in the appropriate section of the performance database entry.      

After completing the as-is run, which gives a base-line result, any form of 
optimisation may be applied to show the particular computer to its best 
advantage, up to completely rethinking the algorithm, and rewriting
the code. The only requirement on the benchmarker is to state what has been
done. However, remember that, even if the algorithm is changed, the official
flop-count, $F_B(N)$ that is used in the calculation of nominal benchmark
Mflop/s, $R_B(N,p)$, does not. In this way a better algorithm will show up
with a higher $R_B$, as we would want it to, even though the hardware 
Mflop/s is likely to be little changed.

Typical steps in optimisation might be:
\begin{enumerate} 
\item explore the effect of different compiler optimisations on a single 
      processor, and choose the best for the as-is run.
\item perform the as-is run on multiple processors, using enough values 
      of $p$ to determine any peak in performance or saturation.
\item return to single processor and optimise code for vectorisation,
      if a vector processor is being used. This means restructuring loops 
      to permit vectorisation.
\item continue by replacement of selected loops with optimal assembly coded 
      library routines (e.g. BLAS where appropriate).
\item replacement of whole benchmark by a tuned library routine with the
      same functionality.
\item replace whole benchmark with locally written version with the same 
      functionality but using possibly an entirely different algorithm 
      that is more suited to the architecture.
\end{enumerate} 
% ----------------------------------------------------------------------------



------- End of Forwarded Message
From owner-parkbench-method@CS.UTK.EDU Fri Sep  8 16:37:32 1995
Return-Path: <owner-parkbench-method@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id QAA14461; Fri, 8 Sep 1995 16:37:32 -0400
Received: from localhost by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id QAA04711; Fri, 8 Sep 1995 16:37:50 -0400
X-Resent-To: parkbench-method@CS.UTK.EDU ; Fri, 8 Sep 1995 16:37:48 EDT
Errors-to: owner-parkbench-method@CS.UTK.EDU
Received: from franklin.seas.gwu.edu by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id QAA04695; Fri, 8 Sep 1995 16:37:47 -0400
Received: from felix.seas.gwu.edu (abdullah@felix.seas.gwu.edu [128.164.9.3]) by franklin.seas.gwu.edu (v8) with ESMTP id QAA10248 for <parkbench-method@cs.utk.edu>; Fri, 8 Sep 1995 16:37:45 -0400
Received: (from abdullah@localhost) by felix.seas.gwu.edu (8.6.12/8.6.12) id QAA07221 for parkbench-method@cs.utk.edu; Fri, 8 Sep 1995 16:37:41 -0400
Date: Fri, 8 Sep 1995 16:37:41 -0400
From: Abdullah Meajil <abdullah@seas.gwu.edu>
Message-Id: <199509082037.QAA07221@felix.seas.gwu.edu>
To: parkbench-method@CS.UTK.EDU
Subject: subscribe

subscribe

From owner-parkbench-method@CS.UTK.EDU Fri Sep  8 16:38:32 1995
Return-Path: <owner-parkbench-method@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id QAA14473; Fri, 8 Sep 1995 16:38:32 -0400
Received: from localhost by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id QAA04768; Fri, 8 Sep 1995 16:38:51 -0400
X-Resent-To: parkbench-method@CS.UTK.EDU ; Fri, 8 Sep 1995 16:38:50 EDT
Errors-to: owner-parkbench-method@CS.UTK.EDU
Received: from franklin.seas.gwu.edu by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id QAA04759; Fri, 8 Sep 1995 16:38:48 -0400
Received: from felix.seas.gwu.edu (abdullah@felix.seas.gwu.edu [128.164.9.3]) by franklin.seas.gwu.edu (v8) with ESMTP id QAA10342 for <parkbench-method@cs.utk.edu>; Fri, 8 Sep 1995 16:38:43 -0400
Received: (from abdullah@localhost) by felix.seas.gwu.edu (8.6.12/8.6.12) id QAA07361 for parkbench-method@cs.utk.edu; Fri, 8 Sep 1995 16:38:40 -0400
Date: Fri, 8 Sep 1995 16:38:40 -0400
From: Abdullah Meajil <abdullah@seas.gwu.edu>
Message-Id: <199509082038.QAA07361@felix.seas.gwu.edu>
To: parkbench-method@CS.UTK.EDU
Subject: subscribe

subscribe

From sgreen@cs.utk.edu Mon Sep 11 13:37:57 1995
Return-Path: <sgreen@cs.utk.edu>
Received: from dancer.cs.utk.edu by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id NAA27033; Mon, 11 Sep 1995 13:37:56 -0400
From: <sgreen@cs.utk.edu>
Received:  by dancer.cs.utk.edu (cf v2.11c-UTK)
          id NAA04860; Mon, 11 Sep 1995 13:38:27 -0400
Date: Mon, 11 Sep 1995 13:38:27 -0400
Message-Id: <199509111738.NAA04860@dancer.cs.utk.edu>
To: parkbench-method-archive@netlib2.cs.utk.edu

Mon Sep 11 13:38:26 EDT 1995

From owner-parkbench-comm@CS.UTK.EDU Fri May  2 15:53:02 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id PAA00358; Fri, 2 May 1997 15:53:02 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id PAA13341; Fri, 2 May 1997 15:44:43 -0400
Received: from blueberry.cs.utk.edu (BLUEBERRY.CS.UTK.EDU [128.169.92.34]) 
        by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id PAA13327; Fri, 2 May 1997 15:44:36 -0400
Received:  by blueberry.cs.utk.edu (cf v2.11c-UTK)
          id TAA08348; Fri, 2 May 1997 19:44:04 GMT
From: "Erich Strohmaier" <erich@CS.UTK.EDU>
Message-Id: <9705021544.ZM8346@blueberry.cs.utk.edu>
Date: Fri, 2 May 1997 15:44:03 -0400
X-Face: ,v?vp%=2zU8m.23T00H*9+qjCVLwK{V3T{?1^Bua(Ud:|%?@D!~^v^hoA@Z5/*TU[RFq_n'n"}z{qhQ^Q3'Mexsxg0XW>+CbEOca91voac=<YfvQ8HrQFkH>P/w]>n_nS]V_ZL>XRSYWi:{MzalK9Hb^=B}Y*[x*MOX7R=*V}PI.HG~2
X-Mailer: Z-Mail (3.2.0 26oct94 MediaMail)
To: parkbench-comm@CS.UTK.EDU
Subject: ParkBench Committee Meeting
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Dear Colleague,

Here is the revised agenda.
Please send me ASAP a short email if you come
so that we can arrange for a meeting room.

-------------------
The ParkBench (Parallel Benchmark Working Group)
will meet in Knoxville, Tennessee on
May 9th, 1997.

The meeting site will be the Knoxville Downtown Hilton Hotel.
We have made arrangements with the Hilton Hotel in Knoxville.

  Hilton Hotel
  501 W. Church Street
  Knoxville, TN
  Phone:  423-523-2300

When making arrangements tell the hotel you are associated with
the 'ParkBench'. The rate about $79.00/night.
You can download a postscript map of the area by looking at
http://www.netlib.org/utk/people/JackDongarra.html.

----------------
The tentative agenda for the meeting is:

  1. Minutes of last meeting (MBe)

     Changes to Current release:
  2. Low Level (ES, VG, RS)
     comms1, comms2, comms3, poly2
  3. Linear Algebra (ES)
  4. Compact Applications - NPBs (SS, ES)

     New benchmarks:
  5. HPF Low Level benchmarks (MBa)
  6. Java Low-Level Benchmarks (VG)
  7. New I/O benchmark benchmarks (MBa)
  8. New performance database design and new benchmark output format
     Update of GBIS with new Web front-end (MBa,TH)

     Report from other benchmark activities
  9. ASCI Benchmark Codes (AH)
 10. SPEC-HPG (RE, JD)

     ParkBench:
 11. ParkBench Bibliography
 12. ParkBench Report 2

     Other Activities:
 13. Discussion of the ParkBench Workshop 11/12 September, UK (TH, MBa)
 14. PEMCS - "Electronic Benchmarking Journal" - status report - (TH, MBa)
 15. Status of Funding proposals (JD, TH)

 15. Miscellaneous -

 16. Date and venue for next meeting -


  (MBa) Mark Baker          Univ. of Portsmouth
  (MBe) Michael Berry       Univ. of Tennessee
  (JD)  Jack Dongarra       Univ. of Tenn./ORNL
  (RE)  Rudi Eigenmann      SPEC
  (VG)  Vladimir Getov      Univ. of Westminister
  (TH)  Tony Hey            Univ. of Southampton
  (AH)  Adolfy Hoisie       LLNL
  (SS)  Subhash Saini       NASA Ames
  (RS)  Ron Sercely         HP/CXTC
  (ES)  Erich Strohmaier    Univ. of Tennessee


Jack Dongarra
Erich Strohmaier


From owner-parkbench-comm@CS.UTK.EDU Tue May  6 14:46:45 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id OAA04480; Tue, 6 May 1997 14:46:45 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id OAA25737; Tue, 6 May 1997 14:34:05 -0400
Received: from punt-2.mail.demon.net (relay-11.mail.demon.net [194.217.242.137]) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id OAA25715; Tue, 6 May 1997 14:33:58 -0400
Received: from minnow.demon.co.uk ([158.152.73.63]) by punt-2.mail.demon.net
           id aa1000641; 6 May 97 19:07 BST
Message-ID: <UOrwADAXM3bzEwfw@minnow.demon.co.uk>
Date: Tue, 6 May 1997 19:06:15 +0100
To: parkbench-comm@CS.UTK.EDU
From: Roger Hockney <roger@minnow.demon.co.uk>
Subject: Parkbench Meeting Documents
In-Reply-To: <9705021544.ZM8346@blueberry.cs.utk.edu>
MIME-Version: 1.0
X-Mailer: Turnpike Version 3.01 <kRL7V2isFfDmnKSZb08I5Tyfx$>

AGENDA ITEM:

>     Changes to Current release:
>  2. Low Level (VG)
>     comms1, comms2,

Two documents will be submitted to the committee on this item by Roger
Hockney and Vladimir Getov (Westminster University, UK). They can be
downloaded as postscript files from:

"New COMMS1 Benchmark: Results and Recommendations"
http://www.minow.demon.co.uk/Pbench/comms1/PBPAPER2.PS
 
"New COMMS1 Benchmark: The Details"
http://www.minow.demon.co.uk/Pbench/comms1/PBPAPER3.PS

The papers will be presented by Vladimir who will bring some paper
copies with him.

Best wishes
Roger and Vladimir
-- 
Roger Hockney.  Checkout my new Web page at URL   http://www.minnow.demon.co.uk
University of   and link to my new book: "The Science of Computer Benchmarking"
Westminster UK  suggestions welcome. Know any fish movies or suitable links?

From owner-parkbench-comm@CS.UTK.EDU Tue May  6 17:54:47 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id RAA07526; Tue, 6 May 1997 17:54:46 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id RAA17012; Tue, 6 May 1997 17:48:50 -0400
Received: from punt-1.mail.demon.net (relay-7.mail.demon.net [194.217.242.9]) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id RAA17003; Tue, 6 May 1997 17:48:47 -0400
Received: from minnow.demon.co.uk ([158.152.73.63]) by punt-1.mail.demon.net
           id aa0623986; 6 May 97 21:37 BST
Message-ID: <IQsX3CAKQ5bzEw9M@minnow.demon.co.uk>
Date: Tue, 6 May 1997 21:26:50 +0100
To: parkbench-comm@CS.UTK.EDU
From: Roger Hockney <roger@minnow.demon.co.uk>
Subject: Parkbench Meeting Documents (Correction)
MIME-Version: 1.0
X-Mailer: Turnpike Version 3.01 <kRL7V2isFfDmnKSZb08I5Tyfx$>

I am resending this because there was a typo in the URLs:
There are two MM in "minnow". 

Also if you took PBPAPER2.PS before receiving this repeat message,
please take it again as I have corrected two errors in the graphs.

SORRY 
Roger
************************
AGENDA ITEM:

>     Changes to Current release:
>  2. Low Level (VG)
>     comms1, comms2,

Two documents will be submitted to the committee on this item by Roger
Hockney and Vladimir Getov (Westminster University, UK). They can be
downloaded as postscript files from:

CORRECTED URLs:

"New COMMS1 Benchmark: Results and Recommendations"
http://www.minnow.demon.co.uk/Pbench/comms1/PBPAPER2.PS
              
"New COMMS1 Benchmark: The Details"
http://www.minnow.demon.co.uk/Pbench/comms1/PBPAPER3.PS

The papers will be presented by Vladimir who will bring some paper
copies with him.

Best wishes
Roger and Vladimir
-- 
-- 
Roger Hockney.  Checkout my new Web page at URL   http://www.minnow.demon.co.uk
University of   and link to my new book: "The Science of Computer Benchmarking"
Westminster UK  suggestions welcome. Know any fish movies or suitable links?

From owner-parkbench-comm@CS.UTK.EDU Mon May 12 05:36:41 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id FAA24086; Mon, 12 May 1997 05:36:41 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id FAA10068; Mon, 12 May 1997 05:18:21 -0400
Received: from haven.EPM.ORNL.GOV (haven.epm.ornl.gov [134.167.12.69]) 
        by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id FAA10051; Mon, 12 May 1997 05:18:18 -0400
Received: (from worley@localhost) by haven.EPM.ORNL.GOV (8.8.3/8.8.3) id FAA29262; Mon, 12 May 1997 05:18:16 -0400 (EDT)
Date: Mon, 12 May 1997 05:18:16 -0400 (EDT)
From: Pat Worley <worley@haven.EPM.ORNL.GOV>
Message-Id: <199705120918.FAA29262@haven.EPM.ORNL.GOV>
To: parkbench-comm@CS.UTK.EDU
Subject: Gordon Conference on HPC and NII 
Forwarding: Mail from 'Tony Skjellum <tony@aurora.cs.msstate.edu>'
     dated: Sat, 10 May 1997 16:32:12 -0500 (CDT)
Cc: worley@haven.EPM.ORNL.GOV

Just in case you haven't received information on this already, here is a
blurb on the 1997 Gordon conference in high performance computing. 
Unlike previous years, there is not an explicit emphasis on performance
evaluation in this year's stated themes, but you can't (shouldn't) discuss
future architectures and their impacts without discussing how to
evaluate performance, and I am hoping that some benchmarking-minded people
will show up and keep the discussion honest.

---------- Begin Forwarded Message ----------

The deadline for applying to attend the 1997 Gordon conference in high
performance computing is June 1. If you are interested in attending,
please apply as soon as possible. The simplest way to apply is to download
the application form from the web site indicated below, or to use the online
registration option. If you have any problems with either of these,
please contact the organizers at tony@cs.msstate.edu and worleyph@ornl.gov.

-------------------------------------------------------------------------------
The 1997 Gordon Conference on High Performance Computing and
Information Infrastructure: "Practical Revolutions in HPC and NII"

Chair, Anthony Skjellum, Mississippi State University, tony@cs.msstate.edu,
       601-325-8435
Co-Chair, Pat Worley, Oak Ridge National Laboratory, worleyph@ornl.gov,
       615-574-3128

Conference web page: http://www.erc.msstate.edu/conferences/gordon97

July 13-17, 1997
Plymouth State College
Plymouth NH

The now bi-annual Gordon conference series in HPC and NII commenced in 1992
and has had its second meeting in 1995.  The Gordon conferences are an
elite series of conferences designed to advance the state-of-the-art in
covered disciplines. Speakers are assured of anonymity and
referencing presentations done at Gordon conferences is prohibited by
conference rules in order to promote science, rather than publication
lists.  Previous meetings have had good international participation,
and this is always encouraged. Experts, novices, and technically
interested parties from other fields interested in HPC and NII are
encouraged to apply to attend.

All attendees, including speakers, poster presenters, and session chairs
must apply to attend. We *strongly* encourage all poster presenters to have
their poster proposals in by May 13, 1997, though we will consider poster
presentations up to six weeks prior to the conference.  Application to
attend the conference is also six weeks in advance.

More information on the conference can be found at the web page
listed above, including the list of speakers and poster presenters
and information on applying for attendance.


----------- End Forwarded Message -----------


From owner-parkbench-comm@CS.UTK.EDU Tue May 13 13:58:00 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id NAA20879; Tue, 13 May 1997 13:57:59 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id NAA11997; Tue, 13 May 1997 13:33:14 -0400
Received: from timbuk.cray.com (timbuk-fddi.cray.com [128.162.8.102]) 
        by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id NAA11983; Tue, 13 May 1997 13:33:10 -0400
Received: from ironwood.cray.com (root@ironwood-fddi.cray.com [128.162.21.36]) by timbuk.cray.com (8.8.5/CRI-gate-news-1.3) with ESMTP id MAA20939 for <parkbench-comm@CS.UTK.EDU>; Tue, 13 May 1997 12:33:07 -0500 (CDT)
Received: from magnet.cray.com (magnet [128.162.173.162]) by ironwood.cray.com (8.8.4/CRI-ironwood-news-1.0) with ESMTP id MAA16428 for <parkbench-comm@CS.UTK.EDU>; Tue, 13 May 1997 12:33:06 -0500 (CDT)
From: Charles Grassl <cmg@cray.com>
Received: by magnet.cray.com (8.8.0/btd-b3)
          id RAA20181; Tue, 13 May 1997 17:33:04 GMT
Message-Id: <199705131733.RAA20181@magnet.cray.com>
Subject: Parkbench directions
To: parkbench-comm@CS.UTK.EDU
Date: Tue, 13 May 1997 12:33:04 -0500 (CDT)
X-Mailer: ELM [version 2.4 PL24-CRI-d]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit


To:   ParkBench Group
From: Charles Grassl

Date: May 13, 1997

(Long)

I appreciated the meeting this past week and wish to thank Eric and Jack 
for hosting it.  I am aware of the great effort of many individuals
have contributed to developing and implementing the ParkBench suite.
In spite of this, I feel that we need to evaluate and correct our course.

ParkBench should not merge with or use benchmarks from the SPEC/HPG
(High Performance Group) group.  SGI/Cray and IBM have already
withdrawn from the SPEC/HPG group and Fujitsu and NEC are no longer
participating.  The reasons for these companies and other institutions
no longer participating should indicate to us (ParkBench) that
something is amiss with the SPEC/HPG benchmarks and paradigm.

Several of the reasons for the supercomputer manufacturers not
supporting the SPEC/HPG effort are listed below.  I list these reasons
so that the ParkBench group can learn from them and avoid the same
problems.

- Relevance.  The particular benchmark programs being used by SPEC/HPG
  are not relevant or appropriate for supercomputing.  The programs in
  the current SPEC/HPG suite do not represent any leading edge software
  which is more typical of usage for high performance systems.

- Redundancy.  The programs being developed by SPEC/HPG
  are not qualitatively or quantitatively different from the SPEC/OSG
  programs and as such, it is viewed as redundant and expensive.

- Methodology.  The methodology being used by SPEC/HPG to
  procure, develop and run benchmarks lacks scientific and technical
  basis and hence results have a vague and arbitrary interpretation.

- Programming model.  Designing benchmarks for portability across
  systems is a convenient idea but does not reflect actual constraints
  or usage.  More often than not, compatibility with a PREVIOUS model
  of computer is more important than compatibility ACROSS computers.

- Expense.  Some of the large data cases for the SPEC/HPG programs
  will requires hours or days to run with little new data or
  information gained by the exercise.  These exercises are extremely
  expensive both in time and capital equipment and in logistics.

- Ergonomics.  The cumbersome design of SPEC/HPG Makefiles and build
  procedures make the programs difficult and expensive to test,
  maintain and analyze.

We in the ParkBench group must acknowledge the above items if we are to
maintain interest and participation from computer vendors.  I believe
that reorganizing and refocusing the group could revitalize high
performance computer benchmarking and and re-invigorate the ParkBench
group.

As the ParkBench suite now stands, there are too many programs and they
are difficult to build, test and maintain.  This situation impedes
usage and participation.  Here are a few suggestions for our future
practices and directions:

- Design and write benchmarks programs.  Don't borrow or solicit old
  code.  The borrowed or solicited code is never quite appropriate and
  usually obsolete.  Our greatest asset is that we have scientist who
  are capable of designing experiments (benchmarks).  (Build value.)

- Monitor and evaluate accuracy.  Though we mention accuracy in
  ParkBench Report 1, we haven't applied it to the current programs
  (Scientifically validate, or invalidate, our experiments.)

- Make it simple.  Write and develop simple programs which do not need
  elaborate build procedures and which easier to test and to maintain.
  (Keep It Simple, Stupid.)

- Build a better user interface.  The belabored "run rules" and the
  interface with layers of Makefiles, includes and embedded relative
  file paths is unacceptable.  An acceptable interface might require
  binary distribution and hence a desirable emphasis on designing and
  running rather than building and porting the benchmarks.  (Make the
  product more attractive to more users.)

- Make the suite truly modular.  The current structure makes the
  simplest one CPU program as difficult to build and run as the most
  complicated program with Makefile includes, special compilers, source
  file includes, special libraries, suite libraries, etc. (Make it
  manageable.)

- Drop the connection with SPEC/HPG and with NPB.  This "grand
  unifying" scheme make redundant code.  It has had the opposite effect
  of focusing benchmarking attention on ParkBench because it is yet
  another collection of benchmarks used by other organizations.  (Be
  distinguishable and identifiable.)

- Emphasis what ParkBench is associated with:  benchmarking distributed
  memory parallel computers.  We should write and develop benchmark
  programs which measure and instrument the parallel processing aspect
  of MPP systems.  (Keep our focus.)


I volunteer to develop and write a suite of message passing test
programs which measure the performance and variance of message passing
communication schemes.  I have much experience with writing such a
programs and believe that such suite would be useful for others and for
the computer industry in general.

I hesitate to contribute such programs to the present structure for
several reasons:

- The network test suite does not logically fit into the current
  "hierarchy" and hence might further clutter the ParkBench suite and
  make it further unfocused.

- The current ParkBench structure is not manageable.  Testing and
  maintenance would be extremely expensive in the current structure.

- My company's effort may be interpreted as an endorsement of the
  current structure and model.  The suite is not popular with vendors
  for reasons outlined above.  Participation is currently discouraged.


Discussion?  


Regards,
Charles Grassl
SGI/Cray
Eagan, Minnesota  USA

From owner-parkbench-comm@CS.UTK.EDU Wed May 21 17:25:15 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id RAA27513; Wed, 21 May 1997 17:25:15 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id RAA07579; Wed, 21 May 1997 17:18:07 -0400
Received: from rastaman.rmt.utk.edu (root@TCHM11A6.RMT.UTK.EDU [128.169.27.188]) 
        by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id RAA07571; Wed, 21 May 1997 17:18:02 -0400
Received: from rastaman.rmt.utk.edu (localhost [127.0.0.1]) by rastaman.rmt.utk.edu (8.7.6/8.7.3) with SMTP id RAA01108; Wed, 21 May 1997 17:24:43 -0400
Sender: mucci@CS.UTK.EDU
Message-ID: <3383681A.D98C5FB@cs.utk.edu>
Date: Wed, 21 May 1997 17:24:42 -0400
From: "Philip J. Mucci" <mucci@CS.UTK.EDU>
Organization: University of Tennessee, Knoxville
X-Mailer: Mozilla 3.01 (X11; I; Linux 2.0.28 i586)
MIME-Version: 1.0
To: parkbench-comm@CS.UTK.EDU
CC: "PVM Developer's Mailing List" <pvmspankers@msr.epm.ornl.gov>
Subject: Mesg Passing Benchmarks
References: <199705131733.RAA20181@magnet.cray.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi all,

Charles Grassl in his last message to this committee volunteered
to write a suite of message passing benchmarks to replace the Low
Levels...Before any action on his or this committee's part, I would
recommend that you all have a look at version 3 of my pvmbench
package. It now does MPI as well and can easily support other
message passing primitives with a few #defines. 

Version 3 along with some sample results can be found at
http://www.cs.utk.edu/~mucci/pvmbench.

Note that this has not been tested on any MPP's with UTK PVM.

This benchmark will generate and graph the following:

bandwidth
gap time (to buffer an outgoing message)
roundtrip (latency /2)
barrier/sec
broadcast
summation reduction

Other tests can easily be added...I would highly recommend before any 
action done that this code be examined. It is less than a year old, 
version 3 available on that page is in beta, i.e. it has not been
released to the general public. Let me know what you think...

-Phil

-- 
/%*\ Philip J. Mucci | GRA in CS under Dr. JJ Dongarra /*%\
\*%/ http://www.cs.utk.edu/~mucci  PVM/Active Messages \%*/

From owner-parkbench-comm@CS.UTK.EDU Fri May 23 12:03:04 1997
Return-Path: <owner-parkbench-comm@CS.UTK.EDU>
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with ESMTP (cf v2.9t-netlib)
	id MAA06549; Fri, 23 May 1997 12:03:03 -0400
Received: from localhost (root@localhost) 
        by CS.UTK.EDU with SMTP (cf v2.9s-UTK)
	id LAA15901; Fri, 23 May 1997 11:05:32 -0400
Received: from berry.cs.utk.edu (BERRY.CS.UTK.EDU [128.169.94.70]) 
        by CS.UTK.EDU with ESMTP (cf v2.9s-UTK)
	id LAA15895; Fri, 23 May 1997 11:05:30 -0400
Received: from cs.utk.edu by berry.cs.utk.edu with ESMTP (cf v2.11c-UTK)
          id LAA01370; Fri, 23 May 1997 11:05:31 -0400
Message-Id: <199705231505.LAA01370@berry.cs.utk.edu>
to: parkbench-comm@CS.UTK.EDU
Subject: Minutes of May ParkBench Meeting
Date: Fri, 23 May 1997 11:05:31 -0400
From: "Michael W. Berry" <berry@CS.UTK.EDU>

Here are the minutes from the recent ParkBench meeting in Knoxville.
Best regards,
Mike

-----------------------------------------------------------------
Minutes of ParkBench Meeting - Knoxville Hilton, May 9, 1997
-----------------------------------------------------------------

ParkBench Attendee List:

     (MBa) Mark Baker          Univ. of Portsmouth   mab@sis.port.ac.uk
     (MBe) Michael Berry       Univ. of Tennessee    berry@cs.utk.edu
           Shirley Browne      Univ. of Tennessee    browne@cs.utk.edu
     (JD)  Jack Dongarra       Univ. of Tenn./ORNL   dongarra@cs.utk.edu
           Jeff Durachta       Army Res. Lab MSRC    durachta@arl.mil
     (VG)  Vladimir Getov      Univ. of Westminister getovv@wmin.ac.uk
     (CG)  Charles Grassl      SGI/Cray              cmg@cray.com
     (TH)  Tony Hey            Univ. of Southampton  ajgh@ecs.soton.ac.uk
     (AH)  Adolfy Hoisie       Los Alamos Nat'l Lab  hoisie@lanl.gov
     (CK)  Charles Koelbel     Rice University       chk@cs.rice.edu
     (PM)  Phil Mucci          Univ. of Tennessee    mucci@cs.utk.edu
           Erik Riedel         GENIAS Software GmbH  erik@genias.de
     (SS)  Subhash Saini       NASA Ames             saini@nas.nasa.gov
     (RS)  Ron Sercely         HP-Convex             sercely@convex.hp.com
           Alan Stagg          CEWES                 stagga@wes.army.mil
     (ES)  Erich Strohmaier    Univ. of Tennessee    erich@cs.utk.edu
     (PW)  Pat Worley          Oak Ridge Nat'l Lab   worleyph@ornl.gov

SPEC-HPG Visitors:

           Don Dossa           DEC                   dossa@eng.pko.dec.com
     (RE)  Rudi Eigenmann      Purdue University     eigenman@ecn.purdue.edu
           Greg Gaertner       DEC                   ggg@zko.dec.com
           Jean Suplick        HP                    suplick@rsn.hp.com
           Joe Throp           Kuck & Associates     throp@kai.com

At 9:05am EST, TH opened the meeting and ask that all the attendees
introduce themselves.  After a brief overview of the proposed agenda,
MBe reviewed the minutes from the last ParkBench meeting in October
of '96.  The minutes were unanimously accepted and TH asked VG to
present the proposed changes to the low-level benchmarks (9:20am).

VG reviewed the original COMMS1 (ping-pong or simplex communication) and
the COMMS2 (duplex communication) low-level benchmarks.  He discussed
some of the problems with the previous versions.  These included the
omission of calculated bandwidth, large message length problems, and
large errors in the asymptotic fit.   In collaboration with RS and CG,
a number of improvements have been made to these benchmarks:

	1. Measured bandwidth is provided in output.
	2. Time for shortest message is provided.
	3. Maximum measured bandwidth and the corresponding message
	   length is now provided.
	4. The accuracy of the least-squares 2-parameter fit has been
	   improved (sum of squares of the "relative" and not absolute
	   error is now used).
	5. New 3-parameter variable-power fit for certain cases added.
	6. Can report parametric fits if the error is less than some
   	   user-specified tolerance.
	7. Introduce KDIAG parameter to invoke diagnostic outputs.
	8. Modifications fo ESTCOM.f (as suggested by RS).
    
CG pointed out that it may not always be possible to interpret zero-length 
messages for these codes.  On the Cray machines, such messages force an 
immediate return (i.e., no synchronization).  He proposed that allowing zero-
length messages be removed for the COMMS benchmarks.  RS showed an actual
COMMS1 performance graph demonstrating the difficulty of data extrapolation
(if used to get latency for zero-length message-passing).  RS pointed out,
however, that zero-length message are defined w/in MPI, and suggested that
a simple return (as in the case of Cray machines) is not standard.

VG displayed some of the observed COMMS1/2 performance obtained on the
Cray T3E.  The 3-parameter fit yielded a 7% relative error for messages
ranging from 8 to 1.E+7 bytes.  CG questioned how the breakpoints were
determined?  He indicated the input parameters to the program required
previous knowledge of where breakpoints occur (although implementations
could change constantly).  TH suggested that the parametric fitting should
not be the default for these benchmarks, i.e., separate the analysis from
the actual benchmarking (this concept was seconded by CG).  RS suggested
that the fitting routines could be placed on the WWW/Internet and the
COMMS1/2 codes simply produce data.  CK, however, stressed that the codes
should maintain some minimal parametric fitting for clarity and
consistency of output interpretations.  

The minimal message length shown for the T3E results shown by VG was 8 and
the corresponding minimal message length for a Convex CXD set of
COMMS benchmarks was 1.  The lack of similar ranges of messages could
pose problems for comparisons.  JD strongly felt that users will return
to the notion of "latency" and want zero-length message overheads.  Users
may be primarily interested in start-up time for message-passing.  RS pointed
out that MPI does process zero-length messages.  JD suggested that
the minimal message length for the COMMS benchmarks be 8 bytes and RS proposed
that the minimal message-passing time and corresp. message length be
an output.  After more discussion, the following COMMS changes/outputs were 
unanimously agreed upon:

	1.  Maximum bandwidth with corresp. message size.
	2.  Minimum message-passing time with corresp. message size.
	3.  Time for minimum message length (could be 0, 1, 8, or 32 bytes
            but must be specified).
	4.  The software will be split into two program: one to report
	    the spot measurements and the other for the analysis.


At 10:00 am, SPEC-HPG members joined the ParkBench meeting for a joint
session.  CK reviewed the DoD Modernization Program.  He indicated that
the program is based on 3 primary components:

	1. CHSSI (Commonly Highly Scalable Software Initiative)
	2. DREN (Defense Research & Engineering Network)
	3. Shared Resource Centers (4 Major Shared Resource Centers or
           MSRC's and 20 Distributed Centers or DC's)

Benchmarking is part of the mission of the MSRC's, especially for
system integration and the Programming Environment & Training (PET)
team.  CK mentioned that the resources available at the MSRC's include:

256-proc. Cray T3E, SGI Power Challenge (CEWES), 256 proc. IBM SP/2 and
SGI Origin 2000 at ASC, SGI 790 at NAVO, and a collection of {SGI Origin,
Cray Titan, J90} at the Army Research Lab.

The benchmarking needs of the DoD program can be categorized as either
contractual or training.  The contractual needs are specified as PL1
(evaluation of initial machines), PL2 (upgrade to gain 3 times the
performance of PL1), and PL3 (upgrade to gain 10 times the performance
of PL1).  CK mentioned that the MSRC's are planning for the PL2 phase
later this year with PL3 scheduled in approx. 3 years.
The training needs include: the evaluation of programming paradigms,
the evaluation of performance trade-offs, templates for designing new
codes, and benchmarks for training examples.

The contractual benchmarks comprise 30 benchmarks (22 programs) some
of which are export-controlled or proprietary (data may not be used
in the public domain in some cases).  The run rules specify the number
of iterations for each benchmark in the suite.  Each MSRC uses a different
number of iterations per benchmark.  Code modifications are allowed (parallel
directives and message-passing can be used but no assembler) and algorithm
substitutions are permitted provided the problem does not become specialized.
The only performance metric reported for these benchmarks is the elapsed
time for the entire suite.  Benchmarks can be upgraded to reflect current
workloads of the MSRCs but they must be compared head-to-head with 
previous systems.

Example codes included in the DoD benchmark suite include: CTH (finite
volume shock simulation), X3D (explicit finite element code), OCEAN-O2 (an
ocean modeling code), NIKE3D (implicit nonlinear 3D FEM), and Aggregate
I/O benchmark.

Planned benchmarking activites for the DoD Modernization Program include:

	1. benchmarks for evaluating programming techniques (determine what
           works; develop decision trees)
	2. benchmarks for teaching (classes on "worked" examples; template
           modification)

This effort currently has 1 FTE and over 50 University personnel (in PET
program) involved (although they are not primarily responsible for
benchmarking work).

At 10:35am, TH asked AH from Los Alamos Nat'l Lab to overview their ASCI
benchmark suite.  He began by pointing out that these codes formulate the
"Los Alamos set of" ASCI Benchmarks.  Before presenting the list of codes,
AH noted that the philosophy of this activity was to achieve 
"experiment ahead" capability especially with immature computing platforms.
Los Alamos is also interested in developing performance modes as well as
kernels.  The list of active/research codes and compact applications 
comprising this suite are:

Code		Language(s)	Parallelism 	Description           
    
*HEAT(RAGE)	f77, f90	MPI(f90)	Eulerian adaptive mesh
				MPIfSM(f77)	refinement based on
						Riemann solvers; coupled
						physics-CFD; particle &
						radiative transport

EULER		f90		MPI		Admissable fluid (for SIMD);
				SIMD(SP		unstructured mesh, explicit
				vector)		solution; high-speed fluids;
						SP=single processor

NEUT		f77		MPI,SM,		Monte-Carlo, particle
				SHMEM

SWEEP3D		f90		MPI, SHMEM	Inner/outer iteration (kernel)
                                                (compact application)

HYDRO(T)	f77		Serial          (compact application)

TBON		f77		MPI		Material science; quantum
						mechanics; polymer age    
						simulation

*TECOLOTE	C++		MPI		Mixed call hydro. with regular
						structured grid

*TELURIDE	f90		MPI		Casting simulation; irregular
						structured grid; Krylov solution
						methods

*DANTE		HPF		MPI

* = export controlled

The codes and compact apps above vary in size from 2,000 to 35,000 lines.
AK noted that LANL could provide support for future ASCI-based ParkBench codes. 
The ASCI benchmark suite presented might include in the future tri-lab
(Livermore, Sandia, Los Alamos) contributions.  The ASCI application suite can
be set up with data sets leading to varying run-times.  AH mentioned that Los 
Alamos' ASCI benchmarking efforts are focused on high performance computing,
leading edge architectures, algorithms, and applications.  They are 
particularly concentrating in developing expertise in distributed shared-memory
performance evaluation and modeling.  AH expressed the hope that the efforts of
ParkBench will follow similar directions.

At 11:05am, SS reviewed some of the most recent NAS Parallel Ben