The Gordon Bell Awards for 1987

Jack Dongarra
Alan Karp (chair)
Ken Kennedy

The Gordon Bell Awards recognize outstanding achievement in the application of supercomputers to scientific and engineering problems. In 1987 these awards were for the largest speed-up on MIMD computers. Six entries were judged to have met the rules for the competition.

The winning entry in the general purpose computer category was submitted by Robert Benner, John Gustafson, and Gary Montry of the Parallel Processing Division of Sandia National Laboratory, Albuquerque, New Mexico. They used a 1,024 processor NCUBE to run three production applications with speed-ups ranging from 400 to over 600 times. The applications were a beam stress analysis, a surface wave simulation, and an unstable fluid flow model. A check for $1,000 was presented to the winners at CompCon 88.

There were no acceptable entries in the special purpose computer category. Gordon Bell decided to divide the prize money among several groups, some of which did not meet the criteria specified in the rules. As he put it, "It's my money, and I can do whatever I want with it."

A special award of $500 was presented to Robert Chervin of the National Center for Atmospheric Research in Boulder, Colorado, for a production global ocean model running at 450 Mflops (million floating point operations per second) on all four processors of a Cray X-MP/48. This work is impressive in two respects: it is a production code running in parallel, and it achieves a significant fraction of the peak performance of the machine.

Second place in the Bell Award competition went to Marina Chen, Yale University; Erik Benedictus, Bell Labs; Geoffrey Fox, Caltech; Jingke Li, Yale University; and David Walker, Caltech, for work done on the Caltech and NCUBE hypercubes. Results of three programs were submitted - a computational kernel with a speed-up of 98, a QCD calculation with a speed-up of 458, and a circuit simulation with a speed-up of 39. A check for $300 will be sent to this group.

Although the rules for the competition limited consideration to MIMD machines, the Committee and Gordon Bell thought the submission by Stavros Zenios of The Wharton School of the University of Pennsylvania worthy of recognition. Zenios used a 16,384 processor Connection Machine to solve nonlinear network optimization problems. Although it is impossible to measure parallel speed-up on such a machine, Zenios was able to solve a problem in 1.5 seconds that ran over 92 seconds on one processor of an IBM 3090-600. A check for $200 will be sent to him.

In the following sections, we have summarized the entries in more detail.

First Place

The Sandia National Laboratory entry was submitted by Robert Benner, John Gustafson, and Gary Montry (BGM). They did their work on an NCUBE hypercube multicomputer with 1,024 processors. Each processor has 512 KBytes of private memory and is directly connected to 10 other processors. Data is shared among processors by explicitly sending messages from one processor to another. Each processor is capable of about 80 Kflops (thousand floating point operations per second).
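The 10 links per node follow directly from the hypercube topology: a machine with 2^d processors is a d-dimensional cube, and each node is wired to the d nodes whose binary addresses differ from its own in exactly one bit. The short Python sketch below is purely illustrative; the NCUBE's actual channel numbering is not described in the entry.

    # Neighbors of a node in a d-dimensional hypercube: flip one address bit.
    # A 1,024-node machine is a 10-cube, so every node has 10 direct links.
    # (Illustrative sketch only; not the NCUBE's own channel numbering.)

    def hypercube_neighbors(node, dim):
        """Return the nodes directly connected to `node` in a `dim`-cube."""
        return [node ^ (1 << k) for k in range(dim)]

    print(len(hypercube_neighbors(0, 10)))   # 10 neighbors per node
    print(hypercube_neighbors(5, 4))         # [4, 7, 1, 13] in a 16-node cube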
One problem addressed by BGM is that of measuring speed-up. Speed-up is the time it takes to solve the problem on one processor divided by the time it takes to solve the problem on all the processors. There are two difficulties with this definition. First, to get a reliable measurement, the program must run for at least a few minutes. However, a problem that runs one minute on 1,024 processors would take tens of hours on one processor. BGM solve this problem by running a program that takes at least 15 seconds on 1,024 processors and about 5 hours on one processor.

The second problem is more severe. Although the total memory of a multicomputer is quite large, each processor has access to only a modest amount of the total. Thus, the largest problem that fits on one processor uses only a small part of the full configuration. Such a problem is likely to be compute bound on a small number of processors but communication bound on the full system. BGM solve this problem by following two curves in the problem-size versus number-of-processors plane. The "research line" runs the largest problem that fits on one processor on different numbers of processors. The numbers submitted for the competition follow this research line. These speed-up numbers represent a lower bound on what is achievable because the problem is quite small relative to the full system. BGM also present timings for the "application line," in which problem size increases as more processors are used. The application line is more representative of how multicomputers are used in production work. By measuring the idle time of each processor, they are able to estimate the speed-up they would get if the large problem fit on one processor. They have validated this model and find the predictions are correct to within a few percent.

BGM submitted three different programs - a beam stress analysis using finite elements, a baffled surface wave simulation using explicit finite differences, and an unstable fluid flow model using flux corrected transport. All three show a speed-up of over 400. While one such result might be considered a special case, the weight of evidence is significant. The judges were particularly impressed with these results since we expected the winning speed-up to be 50 or less.

The beam strain analysis problem looks at the two-dimensional deflection of a beam fixed at one end. (See figure 3 of the entry.) The program uses a finite element formulation with a matrix-free preconditioned conjugate gradient algorithm. This algorithm is frequently used on supercomputers when memory constraints prevent the use of somewhat more efficient methods. The most time consuming part of this approach is the accumulation of dot products. The beam is divided into 2,048 bilinear finite elements, which is the largest problem that will fit on one node. Each processor is assigned a rectangular subdomain. Synchronization occurs at two points during each iteration. First, each processor must exchange boundary information with the four processors sharing common edges with it. Second, five dot products must be computed and the results distributed to each processor.

Following the research line, BGM find a speed-up of over 452 for this code. Equally interesting is the application line. Running a problem with a 64 by 32 grid on one processor takes 1,614 seconds. A problem with 64 by 32 finite elements per processor on 1,024 processors (about 2,000,000 elements) takes 1,619 seconds. The nearly constant execution time as the problem size increases indicates that their scaled speed-up of 1,021 is reasonably accurate.
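The scaled figure can be reproduced from the two timings just quoted. If the full-size problem could be run on one node, it would take roughly 1,024 times the single-subdomain time; dividing that by the measured 1,024-processor time gives the scaled speed-up. The Python sketch below is a back-of-the-envelope check rather than BGM's own procedure, which is based on measured idle time.

    # Back-of-the-envelope check of the scaled ("application line") speed-up
    # from the published beam timings. Assumes a hypothetical one-node run of
    # the full problem would cost 1,024 times the one-subdomain serial time.

    P = 1024
    t_one_node  = 1614.0   # seconds: 64 by 32 problem on one processor
    t_full_cube = 1619.0   # seconds: 64 by 32 per processor on 1,024 processors

    scaled_speedup    = P * t_one_node / t_full_cube
    parallel_overhead = 1.0 - t_one_node / t_full_cube

    print(round(scaled_speedup))        # 1021, the figure BGM report
    print(f"{parallel_overhead:.2%}")   # about 0.31% of the run is overhead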
The second application submitted by BGM tracks two-dimensional waves as they move past a set of barriers. The program can handle any configuration of barriers and wave velocities that depend nonlinearly on position and wave state. The wave equation is discretized using a 5-point stencil in space and an explicit time step scheme. The domain is divided into rectangular blocks, and boundary information is passed to adjoining processors on each time step. In order to reduce communications overhead, a special quadruple buffering technique is used.

Since this program does relatively few computations per communication, some assembly language was used to improve the speed-up. First, a special routine to transfer data to neighbors was written. This routine helps most when non-contiguous data is to be transferred. Second, the inner loop of the update was written in assembler to reduce the relative importance of loop start-up. Following the research line, BGM report a speed-up of 637, while the scaled results along the application line indicate a speed-up of 1,020. These scaled results appear to be reliable because a 4,096 point problem run on one node takes 12,780 seconds, while a 4,096 point per node problem (about 4,000,000 points) takes only 12,824 seconds.

The third problem submitted by BGM is a computational fluid dynamics program that uses a flux corrected transport algorithm. This approach uses second or fourth order schemes where the flow is smooth and a low order spatial discretization in regions of large gradients. The time behavior is followed explicitly. The test problem follows the development of a Kelvin-Helmholtz instability in a shear flow.

This code is nearly perfectly parallel. As in the previous problems, each processor must share boundary information with its neighbors on each time step. In addition, each processor determines the allowed time step based on the grid points in its subdomain. These local time steps must be combined to produce a single step size, and this value broadcast to all processors. Speed-up along the research line reached 519, while the scaled result along the application line is 1,016. A 32 by 32 grid run on one processor takes 28,039 seconds; a 32 by 32 grid for each of 1,024 processors takes 28,258 seconds. As before, the overhead of running in parallel is small.

Honorable Mentions

Robert Chervin of NCAR won an honorable mention award for the global ocean model he parallelized with the help of Albert Semtner of the Naval Postgraduate School in Monterey, California. This program is used to study such large scale ocean phenomena as El Niño episodes in the Pacific Basin. It is also used to produce input data for the climate models run to predict such things as the long term impact of the greenhouse effect.

The ocean model program uses a standard, second order, finite difference discretization in three space dimensions and a leap-frog explicit time step procedure. The grid has a point every 1/2 degree in latitude and longitude and 20 points in height, for a total of 4,000,000 grid points. Four variables are computed at each grid point. Since the time stepping procedure requires data from three consecutive times, the problem is far too large to be contained in the main memory of the Cray. In particular, the program uses 6 Mwords of main memory and 60 Mwords on the Cray SSD, a large, relatively slow memory used as a fast I/O device. On each time step, over 100 Mwords are transferred to and from the SSD. In spite of the large amount of I/O, this program shows a speed-up of 3.74 on a four processor Cray X-MP/48 using Cray microtasking. For the runs cited in the award, Chervin used 280 tasks. The sustained processing rate of 450 Mflops is over half the theoretical peak of 880 Mflops.
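The leap-frog time stepping mentioned above is the reason three consecutive time levels must be kept: the new field depends on both the current and the previous one. The minimal one-dimensional Python sketch below is not the NCAR code, just the general scheme; it shows the rotation of the three levels that, at the ocean model's 4,000,000-point resolution, pushes the working set onto the SSD.

    # Minimal leap-frog update for a 1D wave equation, showing why three
    # consecutive time levels (u_old, u_now, u_new) must all be resident.
    # Illustrative only; the ocean model is 3D with four variables per point.
    import numpy as np

    nx, nt = 100, 50
    c, dx, dt = 1.0, 1.0, 0.5            # wave speed, grid spacing, time step
    r2 = (c * dt / dx) ** 2

    u_old = np.zeros(nx)
    u_now = np.exp(-0.1 * (np.arange(nx) - nx / 2) ** 2)   # initial pulse
    for _ in range(nt):
        u_new = np.empty_like(u_now)
        u_new[1:-1] = (2 * u_now[1:-1] - u_old[1:-1]
                       + r2 * (u_now[2:] - 2 * u_now[1:-1] + u_now[:-2]))
        u_new[0] = u_new[-1] = 0.0       # fixed boundaries
        u_old, u_now = u_now, u_new      # rotate the three time levels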
Chervin's speed-up of 3.74 is considerably higher than others have obtained. One of the limiting factors for parallel performance on the Cray is normally memory bank conflicts, which occur when one task accesses a memory bank needed by another task. It is normally thought that nothing can be done about this problem, but Chervin was able to nearly eliminate it. The trick is to lay out the data so that each task uses a distinct set of memory banks. It may be necessary to do some extra data movement, as Chervin does, but the gain in memory access rates clearly exceeds the cost.

Chervin's success with this program is not an isolated event. He has also achieved a speed-up of 3.7 on a climate model that uses a spectral method for the horizontal spatial discretization. Here, too, he was able to eliminate most of the inter-task memory bank conflicts by carefully distributing his data. This code is largely scalar, so the raw speed is not as impressive as the ocean model's, but the program is 99.5% parallel.

Second place went to the group from Yale, Bell Labs, and Caltech (YBC). This group submitted three programs running on different hypercubes. One of these programs is a computational kernel, LU decomposition, and will not be described here.

One of the calculations presented is the computation of the potential energy of a quark and an antiquark using quantum chromodynamics (QCD) to model the behavior of quarks. The problem is discretized on a four dimensional lattice representing space-time. (This grid is the "lattice" of lattice gauge theories.) Each lattice point has a set of variables representing the quarks associated with it; the links between points contain information on the gluons which bind quarks together. A step of the Monte Carlo procedure involves randomly changing the variables on one of the links. This change induces changes in the corresponding lattice variables. After a large number of changes have been made, it is possible to determine a number of physical quantities from the statistical fluctuations of these variables.

Each processor is assigned a part of the lattice. At each step all links that are not connected to a common lattice point can be updated simultaneously. Since each processor runs at its own rate, the only way to assure that this condition is met is to synchronize at each step. In addition, if the link data is held by one processor and the lattice point variables by another, a communications step is needed. When run on a large number of processors, data on most of the links is sent between processors at least once on each sweep over the lattice.

A further synchronization is needed due to an artifact of the implementation. Each update involves multiplying 3 by 3 complex matrices. In order to save memory and time, these calculations are carried out using 16-bit integers instead of floating point numbers. The round-off errors associated with the multiplications are large enough to destroy the accuracy of the calculation unless a correction is applied. This correction is applied following every second sweep through the lattice.

YBC submitted a problem discretized on a 24 by 24 by 24 by 48 point lattice. They calculated how the potential energy of a quark-antiquark pair varies with their separation. In particular, they show that, when the quarks are close, the potential energy follows a Coulomb law, as it would for a proton and an electron. At larger separations, the potential energy rises logarithmically with separation. Such an energy law implies that it would take an infinite amount of energy to separate these particles. A speed-up of 458 was obtained on a 512 processor Caltech hypercube.
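The 16-bit integer arithmetic described above can be pictured as fixed-point arithmetic with a power-of-two scale factor: each multiply truncates the product back to 16 bits, so a little accuracy is lost at every update, which is why the code must apply its correction every second sweep. The Python sketch below illustrates fixed-point round-off in a 3 by 3 complex product; it is not YBC's actual kernel, and the scale factor is an assumption for illustration.

    # Fixed-point 3x3 complex matrix product with 16-bit storage. The scale
    # factor 2**12 is an assumption; the entry gives no details of YBC's
    # representation or of the correction they apply.
    import numpy as np

    SCALE = 1 << 12

    def to_fixed(z):
        """Encode a complex array as 16-bit integer real and imaginary parts."""
        return (np.round(z.real * SCALE).astype(np.int16),
                np.round(z.imag * SCALE).astype(np.int16))

    def fixed_matmul(a, b):
        """3x3 complex product in integers, truncated back to 16 bits."""
        ar, ai = a
        br, bi = b
        cr = ar.astype(np.int64) @ br - ai.astype(np.int64) @ bi
        ci = ar.astype(np.int64) @ bi + ai.astype(np.int64) @ br
        return ((cr // SCALE).astype(np.int16), (ci // SCALE).astype(np.int16))

    a = np.exp(1j * np.random.rand(3, 3))        # unit-magnitude test matrices
    b = np.exp(1j * np.random.rand(3, 3))
    cr, ci = fixed_matmul(to_fixed(a), to_fixed(b))
    err = np.abs((cr + 1j * ci) / SCALE - a @ b).max()
    print(f"round-off from one product: {err:.1e}")   # accumulates over sweeps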
The second application is a circuit simulation used to predict the time behavior of a circuit in response to a given set of inputs. At each time step, each circuit element takes its input voltages and computes a set of output voltages. The problem is parallelized by dividing the circuit elements among the processors. The difficult part - making sure that each circuit element uses the correct input voltages - is handled with queues. The calculation cannot proceed completely asynchronously, since some circuit elements require more computation than others to update their output voltages. During an arbitrarily long run, the input queues of the slowest elements would certainly overflow. One way to handle this problem is to run the simulation in segments of some fixed number of time steps. At the end of each segment, the processors synchronize (without passing any data) and then continue. Using such a programming style, YBC report a speed-up of 39 on a 127 processor NCUBE hypercube.

The third honorable mention was given to a program that did not meet the eligibility criteria. However, the Committee and Gordon Bell were so impressed with the results that he awarded a special prize to Stavros Zenios of The Wharton School of the University of Pennsylvania. Zenios solved a number of problems in nonlinear network optimization on a SIMD Connection Machine CM-1. Nonlinear network optimization involves minimizing a strictly convex, continuously differentiable, real-valued function subject to a set of linear constraints. These problems are difficult because the spatial dimension is high, making the resulting matrices extremely large. Fortunately, these matrices are also extremely sparse. This means that if one variable is changed, only a small subset of the equations must be reevaluated to fully account for the change. A standard approach is to change one variable at a time (coordinate-wise minimization) until a global minimum is achieved.

We can think of the nonlinear system as a graph. Each node in the graph represents an unknown. An arc joins two nodes, say i and j, if variable j appears in equation i. When variable j is changed, only equations explicitly containing it must be updated. Zenios used this fact to map the problem onto the CM-1. Each arc in the graph is associated with two processors. On each iteration all variables are updated simultaneously. Next, each equation is evaluated by combining the results of all processors associated with each unknown, using a special function on the CM-1 called a "segmented-plus-scan," which forms a set of running sums. Finally, the results of this scan are distributed to all the nodes.

This procedure appears to be inefficient in that it uses two processors per unknown and does some computations twice. However, the resulting reduction in communication overhead makes it run quite fast. About 35% of the elapsed time is spent communicating. On three different problems, the CM-1 needed 1, 8, and 29 seconds, while an IBM 3090 uniprocessor needed 93, 109, and 83 seconds for the same jobs. The CM-2 with floating point should be even faster.
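The segmented-plus-scan at the heart of Zenios's mapping is easy to state: it is an ordinary running sum that restarts at the beginning of each segment, so one scan over all the processors yields every equation's total at once. A serial Python stand-in for the data-parallel primitive is sketched below; the flag convention and data are made up for illustration.

    # Serial stand-in for the CM-1 segmented-plus-scan: a running sum that
    # restarts wherever a segment-start flag is set. On the CM-1 one scan
    # produces the per-equation totals for every segment simultaneously.

    def segmented_plus_scan(values, start_flags):
        """Inclusive plus-scan restarted at each position where start_flags is True."""
        out, running = [], 0
        for v, start in zip(values, start_flags):
            running = v if start else running + v
            out.append(running)
        return out

    # Contributions from processors, grouped into three equation segments.
    vals  = [2, 5, 1, 4, 4, 7, 1, 1]
    flags = [True, False, False, True, False, True, False, False]
    print(segmented_plus_scan(vals, flags))   # [2, 7, 8, 4, 8, 7, 8, 9]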
Other Entries

Four other entries were judged eligible for consideration. Although they did not win, they represent interesting work worthy of commendation.

Paul Fischer of MIT submitted measurements of NEKTON, a commercial fluid dynamics and heat flow package, running on an Intel iPSC-VX hypercube. The package uses the so-called "p" finite element method, sometimes called the spectral element method. Convergence is obtained by increasing the order of the approximation on each finite element instead of increasing the number of finite elements, as is done in "h" finite element methods. The resulting linear systems are solved with a diagonally preconditioned conjugate gradient algorithm. Fischer's results point out one of the problems with the contest rules. He got a speed-up of 7.2 running with 32 vector processors. When he turned off vector processing, he achieved a speed-up of 16. However, the execution time of the non-vector job was almost 100 times longer than that of the vector job.

Richard Roloff of the Center for Supercomputing Research and Development at the University of Illinois at Urbana-Champaign parallelized a program heavily used by the Chemical Engineering Department. It models turbulent incompressible flow using a pseudo-spectral algorithm. The compiler did most of the parallelization by vectorizing inner loops and parallelizing outer loops. In addition, compiler directives were used to allow subroutine calls to be executed in parallel. Roloff achieved a speed-up of 6.5 on an 8 processor Alliant FX/8.

Richard Hessel of Alliant Computer Systems submitted measurements of ANSYS, a general purpose finite element package, on an Alliant FX/8. He ran a standard benchmark data set called S4. This linear, static analysis problem uses 4,000 elements and has over 24,000 unknowns. Hessel achieved a speed-up of 5 on the 8 processor Alliant. Most of the parallelism was extracted by the compiler, with the exception of one routine: the rank-n update routine was recoded to use a block algorithm better suited to the Alliant architecture. As is often the case, optimizing for parallelism leads to other improvements. Hessel reports that the modified program runs 1.5 times faster on one processor than the best previous code.

David George and Frederica Darema-Rogers of IBM Research in Hawthorne, NY, submitted a parallelized version of AIR3D, a 3D fluid code from NASA. This program models flow past a complex surface, including boundary layers. It is heavily used in modelling the space shuttle. The computationally intensive part is the solution of large linear systems, which are solved using an alternating direction implicit (ADI) scheme on planes. On each time step, the system is decoupled into a set of two dimensional problems by assuming the unknowns in the x direction are known and solving the resulting set of 2D problems. Next, the unknowns in y are fixed, and another set of 2D problems is solved. Finally, the z unknowns are fixed. This solution is repeated for each iteration of the nonlinear solution step. The IBM group used EPEX, an experimental system for use within IBM, to parallelize at the loop level. In each of the three steps of the iteration, the solutions of the 2D problems are independent and were run in parallel. They ran on an IBM 3090 with six vector processors and achieved a speed-up of 4.9.
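The parallelism the IBM group exploits comes from the fact that, once one direction's unknowns are frozen, each plane perpendicular to that direction is an independent 2D problem. The loop structure is sketched below in Python, with a thread pool standing in for EPEX loop-level parallelism; solve_plane is a hypothetical placeholder, not the AIR3D routine.

    # Loop structure of one ADI step on planes: sweep the x, y, and z
    # directions in turn, solving the independent 2D plane problems in
    # parallel. solve_plane is a hypothetical stand-in for the real solver.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def solve_plane(plane):
        """Placeholder for the implicit 2D solve applied to one plane."""
        return 0.5 * plane                           # illustrative only

    def adi_step(u):
        """One ADI time step: x-, y-, then z-direction plane sweeps."""
        for axis in (0, 1, 2):                       # fix x, then y, then z unknowns
            planes = list(np.moveaxis(u, axis, 0))   # independent 2D problems
            with ThreadPoolExecutor() as pool:       # planes solved in parallel
                solved = list(pool.map(solve_plane, planes))
            u = np.moveaxis(np.stack(solved), 0, axis)
        return u

    u = adi_step(np.random.rand(8, 8, 8))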
-----------------------------------------------------------------------

Sidebar on Karp Challenge

Below is the text of the challenge I issued in 1985. The Sandia Labs group satisfied all the requirements. I presented them with a check for $100 made out to the Association of Retarded Citizens of Albuquerque and a plaque recognizing their achievement.

-----------------------------------------------------------------------

I have just returned from the Second SIAM Conference on Parallel Processing for Scientific Computing in Norfolk, Virginia. There I heard about 1,000 processor systems, 4,000 processor systems, and even a proposed 1,000,000 processor system. Since I wonder if such systems are the best way to do general purpose, scientific computing, I am making the following offer.

I will pay $100 to the first person to demonstrate a speed-up of at least 200 on a general purpose, MIMD computer used for scientific computing. This offer will be withdrawn at 11:59 PM on 31 December 1995.

Some definitions are in order.

Speed-up: The time taken to run an application on one processor using the best sequential algorithm divided by the time to run the same application on N processors using the best parallel algorithm.

General purpose: The parallel system should be able to run a wide variety of applications. For the purposes of this test, the machine must run 3 different programs that demonstrate its ability to solve different problems. I suggest a large, dynamic structures calculation, a transonic fluid flow past a complex barrier, and a large problem in econometric modelling. These are only suggestions; I will be quite flexible in the choice of the problems. Several people have volunteered to adjudicate disputes in the selection of the applications.

Application: The problems run for this test must be complete applications; no computational kernels allowed. They must contain all input, data transfers from host to parallel processors, and all output. The problems chosen should be the kind of job that a working scientist or engineer would submit as a batch job to a large supercomputer. In addition, I am arbitrarily disqualifying all problems that Cleve Moler calls "embarrassingly parallel". These include signal processing with multiple independent data streams, the computation of the Mandelbrot set, etc.

There are some rather obvious ground rules. Simulations of the hardware are not permitted; I am looking for a demonstration on a running piece of hardware. The same problem should be run on the sequential and parallel processors.

It is not fair to cripple the sequential processor. For example, if your operating system uses 99% of one processor, the single processor run will spend all its time in the operating system. Super-linear speed-up as the number of processors is increased is evidence of this problem.

It is not fair to memory starve the single processor run. If you have 100 Kwords of memory on each of 1,000 processors, and you run a 10 Mword problem, it is not fair for the sequential run to be made on a 100 Kword processor. After all, we are not interested in seeing the impact of additional memory; we want to see how much the extra CPUs help.

It may not be possible to follow all these additional rules. For example, you can't be expected to build a single processor with lots of extra memory or write a new operating system just for this test. However, you should do your best to make the demonstration fair. A third party has agreed to adjudicate any scaling disputes.

Anyone claiming this prize will be expected to present his or her results at a suitable conference. At the end of the presentation I will have the audience vote on whether the demonstration was successful. Remember, the purpose of the bet is to force developers to demonstrate the utility of massively parallel systems, not to show they can find holes in the rules.