Lattice QCD is truly a ``grand challenge'' computing problem. It has been estimated that it will take on the order of a TeraFLOP-year of dedicated computing to obtain believable results for the hadron mass spectrum in the quenched approximation, and adding dynamical fermions will require many orders of magnitude more operations. Where is the computer power needed for QCD going to come from? Today, the largest resources of computer time for research are the conventional supercomputers at the NSF and DOE centers. These centers are continually expanding their support for lattice gauge theory, but it may not be long before they are overtaken by several dedicated efforts involving concurrent computers. It is a revealing fact that the development of most high-performance parallel computers (the Caltech Cosmic Cube, the Columbia Machine, IBM's GF11, APE in Rome, the Fermilab Machine, and the PAX machines in Japan) was actually motivated by the desire to simulate lattice QCD [Christ:91a], [Weingarten:92a].
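To give a sense of the scale, a TeraFLOP-year is easy to translate into a raw operation count; taking a year as roughly $3.15\times 10^{7}$ seconds,

$$
10^{12}\ \mathrm{operations/s} \;\times\; 3.15\times 10^{7}\ \mathrm{s} \;\approx\; 3\times 10^{19}\ \mathrm{floating\ point\ operations},
$$

before any allowance is made for the extra cost of dynamical fermions.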

As described already, Caltech built the first hypercube computer, the Cosmic Cube or Mark I, in 1983. It had 64 nodes, each of which was an Intel 8086/87 microprocessor with of memory, giving a total of about (measured for QCD). This was quickly upgraded to the Mark II hypercube with faster chips, twice the memory per node, and twice the number of nodes in 1984. Then, QCD was run on the last internal Caltech hypercube, the 128-node Mark IIIfp (built by JPL), at sustained [Ding:90b]. Each node of the Mark IIIfp hypercube contains two Motorola 68020 microprocessors, one for communication and the other for calculation, with the latter supplemented by one 68881 co-processor and a 32-bit Weitek floating point processor.

Norman Christ and Anthony Terrano built their first parallel computer for doing lattice QCD calculations at Columbia in 1984 [Christ:84a]. It had 16 nodes, each of which was an Intel 80286/87 microprocessor, plus a TRW 22-bit floating point processor with of memory, giving a total peak performance of . This was improved in 1987 using Weitek rather than TRW chips so that 64 nodes gave peak. In 1989, the Columbia group finished building their third machine: a 256-node, , lattice QCD computer [Christ:90a].

QCDPAX is the latest in the line of PAX (Parallel Array eXperiment) machines developed at the University of Tsukuba in Japan. The architecture is very similar to that of the Columbia machine. It is a MIMD machine configured as a two-dimensional periodic array of nodes, and each node includes a Motorola 68020 microprocessor and a 32-bit vector floating-point unit. Its peak performance is similar to that of the Columbia machine; however, it achieves only half the floating-point utilization for QCD code [Iwasaki:91a].

Don Weingarten initiated the GF11 project in 1984 at IBM. The GF11 is a SIMD machine comprising 576 Weitek floating-point processors, each performing at 20 MFLOPS to give the roughly 11.5 GFLOPS total peak implied by the name. Preliminary results for this project are given in [Weingarten:90a], [Weingarten:92a].

The APE (Array Processor with Emulator) computer is basically a collection of 3081/E processors (which were developed by CERN and SLAC for use in high energy experimental physics) with Weitek floating-point processors attached. However, these floating-point processors are attached in a special way: each node has four multipliers and four adders, in order to optimize the complex-arithmetic calculations which form the major component of all lattice QCD programs. This means that each node has a peak performance of . The first small machine, ``Apetto,'' was completed in 1986 and had four nodes yielding a peak performance of . Currently, they have a second generation of this machine with peak from 16 nodes. By 1993, the APE collaboration hopes to have completed the 2048-node ``Apecento,'' or APE-100, based on specialized VLSI chips that are software compatible with the original APE [Avico:89a], [Battista:92a]. The APE-100 is a SIMD machine with the architecture based on a three-dimensional cubic mesh of nodes. Currently, a 128-node machine is running with a peak performance of .
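To see why this arrangement suits QCD so well, note that the elementary complex multiply-accumulate, $d = a\times b + c$, which dominates SU(3) matrix arithmetic, decomposes into exactly four real multiplications and four real additions, one for each functional unit on an APE node. The following C fragment is purely an illustration of that arithmetic, not APE code:

```c
#include <stdio.h>

/* The complex multiply-accumulate d = a*b + c that dominates SU(3)
 * matrix arithmetic splits into four real multiplies and four real
 * adds -- one for each multiplier and adder on an APE node.
 * Illustrative C only, not APE code. */
typedef struct { double re, im; } cmplx;

static cmplx cmadd(cmplx a, cmplx b, cmplx c)
{
    cmplx d;
    d.re = a.re * b.re - a.im * b.im + c.re;  /* 2 multiplies, 2 adds */
    d.im = a.re * b.im + a.im * b.re + c.im;  /* 2 multiplies, 2 adds */
    return d;
}

int main(void)
{
    cmplx a = {1.0, 2.0}, b = {3.0, -1.0}, c = {0.5, 0.5};
    cmplx d = cmadd(a, b, c);              /* (1+2i)(3-i) + (0.5+0.5i) */
    printf("d = %g + %gi\n", d.re, d.im);  /* prints d = 5.5 + 5.5i    */
    return 0;
}
```

A full 3x3 SU(3) matrix multiply is essentially 27 such operations, so a node that keeps all eight arithmetic units busy on them runs close to its peak on QCD kernels.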

**Table 4.3:** Peak and Real Performances in MFLOPS of ``Homebrew'' QCD Machines

Not to be outdone, Fermilab has also used its high energy experimental physics emulators to construct a lattice QCD machine called ACPMAPS. This is a MIMD machine, using a Weitek floating-point chip set on each node. A 16-node machine, with a peak rate of , was finished in 1989. A 256-node machine, arranged as a hypercube of crates, with eight nodes communicating through a crossbar in each crate, was completed in 1991 [Fischler:92a]. It has a peak rate of , and a sustained rate of about for QCD. An upgrade of ACPMAPS is planned, with the number of nodes being increased and the present processors being replaced with two Intel i860 chips per node, giving a peak performance of per node. These performance figures are summarized in Table 4.3. (The ``real'' performances are the actual performances obtained on QCD codes.)
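The distinction between peak and ``real'' rates is worth making explicit: one counts the useful floating-point operations in the production kernel by hand, times the kernel, and divides. The C program below is a minimal, self-contained sketch of that bookkeeping; the lattice volume, sweep count, and the stand-in eight-flop kernel are illustrative placeholders, not figures taken from any of the machines above.

```c
/* Minimal sketch of how a "real" (sustained) MFLOPS figure is obtained:
 * count the useful floating-point operations by hand, time the kernel,
 * and divide.  The lattice volume, sweep count, and 8-flop complex
 * multiply-add below stand in for a genuine QCD update. */
#include <stdio.h>
#include <time.h>

#define SITES  (16 * 16 * 16 * 32)   /* example lattice volume            */
#define SWEEPS 1000                  /* repeat to get a measurable time   */
#define FLOPS_PER_SITE 8.0           /* one complex multiply-add per site */

int main(void)
{
    static double re[SITES], im[SITES];   /* zero-initialized site data */
    const double ar = 0.6, ai = 0.8;      /* fixed "link" value          */

    clock_t t0 = clock();
    for (int s = 0; s < SWEEPS; s++) {
        for (int i = 0; i < SITES; i++) {              /* stand-in sweep */
            double r = ar * re[i] - ai * im[i] + 1.0;  /* 2 mul, 2 add   */
            double m = ar * im[i] + ai * re[i] + 1.0;  /* 2 mul, 2 add   */
            re[i] = r;
            im[i] = m;
        }
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("sustained rate: %.1f MFLOPS\n",
           (double)SWEEPS * SITES * FLOPS_PER_SITE / secs / 1.0e6);
    printf("checksum: %g\n", re[0] + im[SITES - 1]);  /* keep work live  */
    return 0;
}
```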

Major calculations have also been performed on commercial SIMD machines, first on the ICL Distributed Array Processor (DAP) at Edinburgh University during the period from 1982 to 1987 [Wallace:84a], and now on the TMC Connection Machine (CM-2), as well as on commercial distributed-memory MIMD machines such as the nCUBE hypercube and the Intel Touchstone Delta at Caltech. Currently, the Connection Machine is the most powerful commercial QCD machine available, running full QCD at a sustained rate of approximately on a CM-2 [Baillie:89e], [Brickner:91b]. However, simulations have recently been performed at a rate of on the experimental Intel Touchstone Delta at Caltech. This is a MIMD machine made up of 528 Intel i860 processors connected in a two-dimensional mesh, with a peak performance of for 32-bit arithmetic. These results compare favorably with performances on traditional (vector) supercomputers. Highly optimized QCD code runs at about per processor on a CRAY Y-MP, or on a fully configured eight-processor machine.

The latest generation of commercial parallel supercomputers, represented by the CM-5 and the Intel Paragon, has a peak performance of over . There was a proposal for the development of a TeraFLOPS parallel supercomputer for QCD and other numerically intensive simulations [Christ:91a], [Aoki:91a]. The goal was to build a machine based on the CM-5 architecture in collaboration with Thinking Machines Corporation, which would be ready by 1995 at a cost of around $40 million.

It is interesting to note that when the various groups began building their ``home-brew'' QCD machines, it was clear that they would outperform all commercial (traditional) supercomputers; however, now that commercial parallel supercomputers have come of age [Fox:89n], the situation is not so obvious. To emphasize this, we describe QCD calculations on both the home-grown Caltech hypercube and the commercially available Connection Machine.
