The Gordon Bell Awards for 1987

Jack Dongarra
Alan Karp (chair)
Ken Kennedy

The Gordon Bell Awards recognize outstanding achievement in the application of supercomputers to scientific and engineering problems. In 1987 these awards were for the largest speed-up on MIMD computers. Six entries were judged to have met the rules for the competition.

The winning entry in the general purpose computer category was submitted by Robert Benner, John Gustafson, and Gary Montry of the Parallel Processing Division of Sandia National Laboratory, Albuquerque, New Mexico. They used a 1,024 processor NCUBE to run three production applications with speed-ups ranging from 400 to over 600 times. The applications were a beam stress analysis, a surface wave simulation, and an unstable fluid flow model. A check for $1,000 was presented to the winners at CompCon 88.

There were no acceptable entries in the special purpose computer category. Gordon Bell decided to divide the prize money among several groups, some of which did not meet the criteria specified in the rules. As he put it, "It's my money, and I can do whatever I want with it."

A special award of $500 was presented to Robert Chervin of the National Center for Atmospheric Research in Boulder, Colorado, for a production global ocean model running at 450 Mflops (million floating point operations per second) on all four processors of a Cray X-MP/48. This work is impressive in two respects: it is a production code running in parallel, and it achieves a significant fraction of the peak performance of the machine.

Second place in the Bell Award competition went to Marina Chen, Yale University; Erik Benedictus, Bell Labs; Geoffrey Fox, Caltech; Jingke Li, Yale University; and David Walker, Caltech, for work done on the Caltech and NCUBE hypercubes. Results of three programs were submitted - a computational kernel with a speed-up of 98, a QCD calculation with a speed-up of 458, and a circuit simulation with a speed-up of 39. A check for $300 will be sent to this group.

Although the rules for the competition limited consideration to MIMD machines, the Committee and Gordon Bell thought the submission by Stavros Zenios of The Wharton School of the University of Pennsylvania worthy of recognition. Zenios used a 16,384 processor Connection Machine to solve nonlinear network optimization problems. Although it is impossible to measure parallel speed-up on such a machine, Zenios was able to solve a problem in 1.5 seconds that ran over 92 seconds on one processor of an IBM 3090-600. A check for $200 will be sent to him.

In the following sections, we have summarized the entries in more detail.

First Place

The Sandia National Laboratory entry was submitted by Robert Benner, John Gustafson, and Gary Montry (BGM). They did their work on an NCUBE hypercube multicomputer with 1,024 processors. Each processor has 512 KBytes of private memory and is directly connected to 10 other processors. Data is shared among processors by explicitly sending messages from one processor to another. Each processor is capable of about 80 Kflops (thousand floating point operations per second).
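The 10 links per node follow directly from the hypercube topology: a machine with 2^d processors is a d-dimensional cube, and each node is wired to the d nodes whose binary addresses differ from its own in exactly one bit. The short Python sketch below is purely illustrative; the NCUBE's actual channel numbering is not described in the entry.

    # Neighbors of a node in a d-dimensional hypercube: flip one address bit.
    # A 1,024-node machine is a 10-cube, so every node has 10 direct links.
    # (Illustrative sketch only; not the NCUBE's own channel numbering.)

    def hypercube_neighbors(node, dim):
        """Return the nodes directly connected to `node` in a `dim`-cube."""
        return [node ^ (1 << k) for k in range(dim)]

    print(len(hypercube_neighbors(0, 10)))   # 10 neighbors per node
    print(hypercube_neighbors(5, 4))         # [4, 7, 1, 13] in a 16-node cube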
One problem addressed by BGM is that of measuring speed-up. Speed-up is the time it takes to solve the problem on one processor divided by the time it takes to solve the problem on all the processors. There are two difficulties with this definition. First, to get a reliable measurement, the program must run for at least a few minutes. However, a problem that runs one minute on 1,024 processors would take tens of hours on one processor. BGM solve this problem by running a program that takes at least 15 seconds on 1,024 processors and about 5 hours on one processor.

The second problem is more severe. Although the total memory of a multicomputer is quite large, each processor has access to only a modest amount of the total. Thus, the largest problem that fits on one processor uses only a small part of the full configuration. Such a problem is likely to be compute bound on a small number of processors but communication bound on the full system. BGM solve this problem by following two curves in the problem-size versus number-of-processors plane. The "research line" runs the largest problem that fits on one processor on different numbers of processors. The numbers submitted for the competition follow this research line. These speed-up numbers represent a lower bound on what is achievable because the problem is quite small relative to the full system. BGM also present timings for the "application line," in which problem size increases as more processors are used. The application line is more representative of how multicomputers are used in production work. By measuring the idle time of each processor, they are able to estimate the speed-up they would get if the large problem fit on one processor. They have validated this model and find the predictions are correct to within a few percent.

BGM submitted three different programs - a beam stress analysis using finite elements, a baffled surface wave simulation using explicit finite differences, and an unstable fluid flow model using flux corrected transport. All three show a speed-up of over 400. While one such result might be considered a special case, the weight of evidence is significant. The judges were particularly impressed with these results since we expected the winning speed-up to be 50 or less.

The beam strain analysis problem looks at the two-dimensional deflection of a beam fixed at one end. (See figure 3 of the entry.) The program uses a finite element formulation with a matrix-free preconditioned conjugate gradient algorithm. This algorithm is frequently used on supercomputers when memory constraints prevent the use of somewhat more efficient methods. The most time consuming part of this approach is the accumulation of dot products. The beam is divided into 2,048 bilinear finite elements, which is the largest problem that will fit on one node. Each processor is assigned a rectangular subdomain. Synchronization occurs at two points during each iteration. First, each processor must exchange boundary information with the four processors sharing common edges with it. Second, five dot products must be computed and the results distributed to each processor.

Following the research line, BGM find a speed-up of over 452 for this code. Equally interesting is the application line. Running a problem with a 64 by 32 grid on one processor takes 1,614 seconds. A problem with 64 by 32 finite elements per processor on 1,024 processors (about 2,000,000 elements) takes 1,619 seconds. The nearly constant execution time as the problem size increases indicates that their scaled speed-up of 1,021 is reasonably accurate.
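The scaled figure can be reproduced from the two timings just quoted. If the full-size problem could be run on one node, it would take roughly 1,024 times the single-subdomain time; dividing that by the measured 1,024-processor time gives the scaled speed-up. The Python sketch below is a back-of-the-envelope check rather than BGM's own procedure, which is based on measured idle time.

    # Back-of-the-envelope check of the scaled ("application line") speed-up
    # from the published beam timings. Assumes a hypothetical one-node run of
    # the full problem would cost 1,024 times the one-subdomain serial time.

    P = 1024
    t_one_node  = 1614.0   # seconds: 64 by 32 problem on one processor
    t_full_cube = 1619.0   # seconds: 64 by 32 per processor on 1,024 processors

    scaled_speedup    = P * t_one_node / t_full_cube
    parallel_overhead = 1.0 - t_one_node / t_full_cube

    print(round(scaled_speedup))        # 1021, the figure BGM report
    print(f"{parallel_overhead:.2%}")   # about 0.31% of the run is overhead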
The second application submitted by BGM tracks two-dimensional waves as they move past a set of barriers. The program can handle any configuration of barriers and wave velocities that depend nonlinearly on position and wave state. The wave equation is discretized using a 5-point stencil in space and an explicit time step scheme. The domain is divided into rectangular blocks, and boundary information is passed to adjoining processors on each time step. In order to reduce communications overhead, a special quadruple buffering technique is used.

Since this program does relatively few computations per communication, some assembly language was used to improve the speed-up. First, a special routine to transfer data to neighbors was written. This routine helps most when non-contiguous data is to be transferred. Second, the inner loop of the update was written in assembler to reduce the relative importance of loop start-up. Following the research line, BGM report a speed-up of 637, while the scaled results along the application line indicate a speed-up of 1,020. These scaled results appear to be reliable because a 4,096 point problem run on one node takes 12,780 seconds, while a 4,096 point per node problem (about 4,000,000 points) takes only 12,824 seconds.

The third problem submitted by BGM is a computational fluid dynamics program that uses a flux corrected transport algorithm. This approach uses second or fourth order schemes where the flow is smooth and a low order spatial discretization in regions of large gradients. The time behavior is followed explicitly. The test problem follows the development of a Kelvin-Helmholtz instability in a shear flow.

This code is nearly perfectly parallel. As in the previous problems, each processor must share boundary information with its neighbors on each time step. In addition, each processor determines the allowed time step based on the grid points in its subdomain. These local time steps must be combined to produce a single step size, and this value broadcast to all processors. Speed-up along the research line reached 519, while the scaled result along the application line is 1,016. A 32 by 32 grid run on one processor takes 28,039 seconds; a 32 by 32 grid for each of 1,024 processors takes 28,258 seconds. As before, the overhead of running in parallel is small.

Honorable Mentions

Robert Chervin of NCAR won an honorable mention award for the global ocean model he parallelized with the help of Albert Semtner of the Naval Postgraduate School in Monterey, California. This program is used to study such large scale ocean phenomena as El Niño episodes in the Pacific Basin. It is also used to produce input data for the climate models run to predict such things as the long term impact of the greenhouse effect.

The ocean model program uses a standard, second order, finite difference discretization in three space dimensions and a leap-frog explicit time step procedure. The grid has a point every 1/2 degree in latitude and longitude and 20 points in height, for a total of 4,000,000 grid points. Four variables are computed at each grid point. Since the time stepping procedure requires data from three consecutive times, the problem is far too large to be contained in the main memory of the Cray. In particular, the program uses 6 Mwords of main memory and 60 Mwords on the Cray SSD, a large, relatively slow memory used as a fast I/O device. On each time step, over 100 Mwords are transferred to and from the SSD. In spite of the large amount of I/O, this program shows a speed-up of 3.74 on a four processor Cray X-MP/48 using Cray microtasking. For the runs cited in the award, Chervin used 280 tasks. The sustained processing rate of 450 Mflops is over half the theoretical peak of 880 Mflops.
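The leap-frog time stepping mentioned above is the reason three consecutive time levels must be kept: the new field depends on both the current and the previous one. The minimal one-dimensional Python sketch below is not the NCAR code, just the general scheme; it shows the rotation of the three levels that, at the ocean model's 4,000,000-point resolution, pushes the working set onto the SSD.

    # Minimal leap-frog update for a 1D wave equation, showing why three
    # consecutive time levels (u_old, u_now, u_new) must all be resident.
    # Illustrative only; the ocean model is 3D with four variables per point.
    import numpy as np

    nx, nt = 100, 50
    c, dx, dt = 1.0, 1.0, 0.5            # wave speed, grid spacing, time step
    r2 = (c * dt / dx) ** 2

    u_old = np.zeros(nx)
    u_now = np.exp(-0.1 * (np.arange(nx) - nx / 2) ** 2)   # initial pulse
    for _ in range(nt):
        u_new = np.empty_like(u_now)
        u_new[1:-1] = (2 * u_now[1:-1] - u_old[1:-1]
                       + r2 * (u_now[2:] - 2 * u_now[1:-1] + u_now[:-2]))
        u_new[0] = u_new[-1] = 0.0       # fixed boundaries
        u_old, u_now = u_now, u_new      # rotate the three time levels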
Chervin's speed-up of 3.74 is considerably higher than others have obtained. One of the limiting factors for parallel performance on the Cray is normally memory bank conflicts, which occur when one task accesses a memory bank needed by another task. It is normally thought that nothing can be done about this problem, but Chervin was able to nearly eliminate it. The trick is to lay out the data so that each task uses a distinct set of memory banks. It may be necessary to do some extra data movement, as Chervin does, but the gain in memory access rates clearly exceeds the cost.

Chervin's success with this program is not an isolated event. He has also achieved a speed-up of 3.7 on a climate model that uses a spectral method for the horizontal spatial discretization. Here, too, he was able to eliminate most of the inter-task memory bank conflicts by carefully distributing his data. This code is largely scalar, so the raw speed is not as impressive as the ocean model's, but the program is 99.5% parallel.

Second place went to the group from Yale, Bell Labs, and Caltech (YBC). This group submitted three programs running on different hypercubes. One of these programs is a computational kernel, LU decomposition, and will not be described here.

One of the calculations presented is the computation of the potential energy of a quark and an antiquark using quantum chromodynamics (QCD) to model the behavior of quarks. The problem is discretized on a four dimensional lattice representing space-time. (This grid is the "lattice" of lattice gauge theories.) Each lattice point has a set of variables representing the quarks associated with it; the links between points contain information on the gluons which bind quarks together. A step of the Monte Carlo procedure involves randomly changing the variables on one of the links. This change induces changes in the corresponding lattice variables. After a large number of changes have been made, it is possible to determine a number of physical quantities from the statistical fluctuations of these variables.

Each processor is assigned a part of the lattice. At each step all links that are not connected to a common lattice point can be updated simultaneously. Since each processor runs at its own rate, the only way to assure that this condition is met is to synchronize at each step. In addition, if the link data is held by one processor and the lattice point variables by another, a communications step is needed. When run on a large number of processors, data on most of the links is sent between processors at least once on each sweep over the lattice.

A further synchronization is needed due to an artifact of the implementation. Each update involves multiplying 3 by 3 complex matrices. In order to save memory and time, these calculations are carried out using 16-bit integers instead of floating point numbers. The round-off errors associated with the multiplications are large enough to destroy the accuracy of the calculation unless a correction is applied. This correction is applied following every second sweep through the lattice.

YBC submitted a problem discretized on a 24 by 24 by 24 by 48 point lattice. They calculated how the potential energy of a quark-antiquark pair varies with their separation. In particular, they show that, when the quarks are close, the potential energy follows a Coulomb law, as it would for a proton and an electron. At larger separations, the potential energy rises logarithmically with separation. Such an energy law implies that it would take an infinite amount of energy to separate these particles. A speed-up of 458 was obtained on a 512 processor Caltech hypercube.
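The 16-bit integer arithmetic described above can be pictured as fixed-point arithmetic with a power-of-two scale factor: each multiply truncates the product back to 16 bits, so a little accuracy is lost at every update, which is why the code must apply its correction every second sweep. The Python sketch below illustrates fixed-point round-off in a 3 by 3 complex product; it is not YBC's actual kernel, and the scale factor is an assumption for illustration.

    # Fixed-point 3x3 complex matrix product with 16-bit storage. The scale
    # factor 2**12 is an assumption; the entry gives no details of YBC's
    # representation or of the correction they apply.
    import numpy as np

    SCALE = 1 << 12

    def to_fixed(z):
        """Encode a complex array as 16-bit integer real and imaginary parts."""
        return (np.round(z.real * SCALE).astype(np.int16),
                np.round(z.imag * SCALE).astype(np.int16))

    def fixed_matmul(a, b):
        """3x3 complex product in integers, truncated back to 16 bits."""
        ar, ai = a
        br, bi = b
        cr = ar.astype(np.int64) @ br - ai.astype(np.int64) @ bi
        ci = ar.astype(np.int64) @ bi + ai.astype(np.int64) @ br
        return ((cr // SCALE).astype(np.int16), (ci // SCALE).astype(np.int16))

    a = np.exp(1j * np.random.rand(3, 3))        # unit-magnitude test matrices
    b = np.exp(1j * np.random.rand(3, 3))
    cr, ci = fixed_matmul(to_fixed(a), to_fixed(b))
    err = np.abs((cr + 1j * ci) / SCALE - a @ b).max()
    print(f"round-off from one product: {err:.1e}")   # accumulates over sweeps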
The second application is a circuit simulation used to predict the time behavior of a circuit in response to a given set of inputs. At each time step, each circuit element takes its input voltages and computes a set of output voltages. The problem is parallelized by dividing the circuit elements among the processors. The difficult part - making sure that each circuit element uses the correct input voltages - is handled with queues. The calculation cannot proceed completely asynchronously, since some circuit elements require more computation than others to update their output voltages. During an arbitrarily long run, the input queues of the slowest elements would certainly overflow. One way to handle this problem is to run the simulation in segments of some fixed number of time steps. At the end of each segment, the processors synchronize (without passing any data) and then continue. Using such a programming style, YBC report a speed-up of 39 on a 127 processor NCUBE hypercube.

The third honorable mention was given to a program that did not meet the eligibility criteria. However, the Committee and Gordon Bell were so impressed with the results that he awarded a special prize to Stavros Zenios of The Wharton School of the University of Pennsylvania. Zenios solved a number of problems in nonlinear network optimization on a SIMD Connection Machine CM-1. Nonlinear network optimization involves minimizing a strictly convex, continuously differentiable, real-valued function subject to a set of linear constraints. These problems are difficult because the spatial dimension is high, making the resulting matrices extremely large. Fortunately, these matrices are also extremely sparse. This means that if one variable is changed, only a small subset of the equations must be reevaluated to fully account for the change. A standard approach is to change one variable at a time (coordinate-wise minimization) until a global minimum is achieved.

We can think of the nonlinear system as a graph. Each node in the graph represents an unknown. An arc joins two nodes, say i and j, if variable j appears in equation i. When variable j is changed, only equations explicitly containing it must be updated. Zenios used this fact to map the problem onto the CM-1. Each arc in the graph is associated with two processors. On each iteration all variables are updated simultaneously. Next, each equation is evaluated by combining the results of all processors associated with each unknown, using a special function on the CM-1 called a "segmented-plus-scan," which forms a set of running sums. Finally, the results of this scan are distributed to all the nodes.

This procedure appears to be inefficient in that it uses two processors per unknown and does some computations twice. However, the resulting reduction in communication overhead makes it run quite fast. About 35% of the elapsed time is spent communicating. On three different problems, the CM-1 needed 1, 8, and 29 seconds, while an IBM 3090 uniprocessor needed 93, 109, and 83 seconds for the same jobs. The CM-2 with floating point should be even faster.
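The segmented-plus-scan at the heart of Zenios's mapping is easy to state: it is an ordinary running sum that restarts at the beginning of each segment, so one scan over all the processors yields every equation's total at once. A serial Python stand-in for the data-parallel primitive is sketched below; the flag convention and data are made up for illustration.

    # Serial stand-in for the CM-1 segmented-plus-scan: a running sum that
    # restarts wherever a segment-start flag is set. On the CM-1 one scan
    # produces the per-equation totals for every segment simultaneously.

    def segmented_plus_scan(values, start_flags):
        """Inclusive plus-scan restarted at each position where start_flags is True."""
        out, running = [], 0
        for v, start in zip(values, start_flags):
            running = v if start else running + v
            out.append(running)
        return out

    # Contributions from processors, grouped into three equation segments.
    vals  = [2, 5, 1, 4, 4, 7, 1, 1]
    flags = [True, False, False, True, False, True, False, False]
    print(segmented_plus_scan(vals, flags))   # [2, 7, 8, 4, 8, 7, 8, 9]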
Other Entries

Four other entries were judged eligible for consideration. Although they did not win, they represent interesting work worthy of commendation.

Paul Fischer of MIT submitted measurements of NEKTON, a commercial fluid dynamics and heat flow package, running on an Intel iPSC-VX hypercube. The package uses the so-called "p" finite element method, sometimes called the spectral element method. Convergence is obtained by increasing the order of the approximation on each finite element instead of increasing the number of finite elements, as is done in "h" finite element methods. The resulting linear systems are solved with a diagonally preconditioned conjugate gradient algorithm. Fischer's results point out one of the problems with the contest rules. He got a speed-up of 7.2 running with 32 vector processors. When he turned off vector processing, he achieved a speed-up of 16. However, the execution time of the non-vector job was almost 100 times longer than that of the vector job.

Richard Roloff of the Center for Supercomputing Research and Development at the University of Illinois at Urbana-Champaign parallelized a program heavily used by the Chemical Engineering Department. It models turbulent incompressible flow using a pseudo-spectral algorithm. The compiler did most of the parallelization by vectorizing inner loops and parallelizing outer loops. In addition, compiler directives were used to allow subroutine calls to be executed in parallel. Roloff achieved a speed-up of 6.5 on an 8 processor Alliant FX/8.

Richard Hessel of Alliant Computer Systems submitted measurements of ANSYS, a general purpose finite element package, on an Alliant FX/8. He ran a standard benchmark data set called S4. This linear, static analysis problem uses 4,000 elements and has over 24,000 unknowns. Hessel achieved a speed-up of 5 on the 8 processor Alliant. Most of the parallelism was extracted by the compiler, with the exception of one routine: the rank-n update routine was recoded to use a block algorithm better suited to the Alliant architecture. As is often the case, optimizing for parallelism leads to other improvements. Hessel reports that the modified program runs 1.5 times faster on one processor than the best previous code.

David George and Frederica Darema-Rogers of IBM Research in Hawthorne, NY, submitted a parallelized version of AIR3D, a 3D fluid code from NASA. This program models flow past a complex surface, including boundary layers. It is heavily used in modelling the space shuttle. The computationally intensive part is the solution of large linear systems, which are solved using an alternating direction implicit (ADI) scheme on planes. On each time step, the system is decoupled into a set of two dimensional problems by assuming the unknowns in the x direction are known and solving the resulting set of 2D problems. Next, the unknowns in y are fixed, and another set of 2D problems is solved. Finally, the z unknowns are fixed. This solution is repeated for each iteration of the nonlinear solution step. The IBM group used EPEX, an experimental system for use within IBM, to parallelize at the loop level. In each of the three steps of the iteration, the solutions of the 2D problems are independent and were run in parallel. They ran on an IBM 3090 with six vector processors and achieved a speed-up of 4.9.
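The parallelism the IBM group exploits comes from the fact that, once one direction's unknowns are frozen, each plane perpendicular to that direction is an independent 2D problem. The loop structure is sketched below in Python, with a thread pool standing in for EPEX loop-level parallelism; solve_plane is a hypothetical placeholder, not the AIR3D routine.

    # Loop structure of one ADI step on planes: sweep the x, y, and z
    # directions in turn, solving the independent 2D plane problems in
    # parallel. solve_plane is a hypothetical stand-in for the real solver.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def solve_plane(plane):
        """Placeholder for the implicit 2D solve applied to one plane."""
        return 0.5 * plane                           # illustrative only

    def adi_step(u):
        """One ADI time step: x-, y-, then z-direction plane sweeps."""
        for axis in (0, 1, 2):                       # fix x, then y, then z unknowns
            planes = list(np.moveaxis(u, axis, 0))   # independent 2D problems
            with ThreadPoolExecutor() as pool:       # planes solved in parallel
                solved = list(pool.map(solve_plane, planes))
            u = np.moveaxis(np.stack(solved), 0, axis)
        return u

    u = adi_step(np.random.rand(8, 8, 8))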
-----------------------------------------------------------------------

Sidebar on Karp Challenge

Below is the text of the challenge I issued in 1985. The Sandia Labs group satisfied all the requirements. I presented them with a check for $100 made out to the Association of Retarded Citizens of Albuquerque and a plaque recognizing their achievement.

-----------------------------------------------------------------------

I have just returned from the Second SIAM Conference on Parallel Processing for Scientific Computing in Norfolk, Virginia. There I heard about 1,000 processor systems, 4,000 processor systems, and even a proposed 1,000,000 processor system. Since I wonder if such systems are the best way to do general purpose, scientific computing, I am making the following offer.

I will pay $100 to the first person to demonstrate a speed-up of at least 200 on a general purpose, MIMD computer used for scientific computing. This offer will be withdrawn at 11:59 PM on 31 December 1995.

Some definitions are in order.

Speed-up: The time taken to run an application on one processor using the best sequential algorithm divided by the time to run the same application on N processors using the best parallel algorithm.

General purpose: The parallel system should be able to run a wide variety of applications. For the purposes of this test, the machine must run 3 different programs that demonstrate its ability to solve different problems. I suggest a large, dynamic structures calculation, a transonic fluid flow past a complex barrier, and a large problem in econometric modelling. These are only suggestions; I will be quite flexible in the choice of the problems. Several people have volunteered to adjudicate disputes in the selection of the applications.

Application: The problems run for this test must be complete applications; no computational kernels allowed. They must contain all input, data transfers from host to parallel processors, and all output. The problems chosen should be the kind of job that a working scientist or engineer would submit as a batch job to a large supercomputer. In addition, I am arbitrarily disqualifying all problems that Cleve Moler calls "embarrassingly parallel". These include signal processing with multiple independent data streams, the computation of the Mandelbrot set, etc.

There are some rather obvious ground rules. Simulations of the hardware are not permitted; I am looking for a demonstration on a running piece of hardware. The same problem should be run on the sequential and parallel processors.

It is not fair to cripple the sequential processor. For example, if your operating system uses 99% of one processor, the single processor run will spend all its time in the operating system. Super-linear speed-up as the number of processors is increased is evidence of this problem.

It is not fair to memory starve the single processor run. If you have 100 Kwords of memory on each of 1,000 processors, and you run a 10 Mword problem, it is not fair for the sequential run to be made on a 100 Kword processor. After all, we are not interested in seeing the impact of additional memory; we want to see how much the extra CPUs help.

It may not be possible to follow all these additional rules. For example, you can't be expected to build a single processor with lots of extra memory or write a new operating system just for this test. However, you should do your best to make the demonstration fair. A third party has agreed to adjudicate any scaling disputes.

Anyone claiming this prize will be expected to present his or her results at a suitable conference. At the end of the presentation I will have the audience vote on whether the demonstration was successful. Remember, the purpose of the bet is to force developers to demonstrate the utility of massively parallel systems, not to show they can find holes in the rules.