

# Survey of "Present and Future Supercomputer Architectures and their Interconnects"

Jack Dongarra
University of Tennessee
and
Oak Ridge National Laboratory


# Overview

- Processors
- Interconnects
- A few machines
- Examine the Top242



# Vibrant Field for High Performance Computers

- Cray X1
- SGI Altix
- IBM Regatta
- Sun
- HP
- Bull NovaScale
- Fujitsu PrimePower
- Hitachi SR11000
- NEC SX-7
- Apple

- Coming soon ...
  - Cray Red Storm
  - Cray BlackWidow
  - NEC SX-8
  - IBM Blue Gene/L


# Architecture/Systems Continuum

#### Loosely Coupled

- Commodity processor with commodity interconnect
  - Clusters
    - Pentium, Itanium, Opteron, Alpha
    - GigE, Infiniband, Myrinet, Quadrics, SCI
  - NEC TX7
  - HP Alpha
  - Bull NovaScale 5160
- Commodity processor with custom interconnect
  - SGI Altix
    - Intel Itanium 2
  - Cray Red Storm
    - AMD Opteron
- Custom processor with custom interconnect
  - Cray X1
  - NEC SX-7
  - IBM Regatta
  - IBM Blue Gene/L

Tightly Coupled



# **Commodity Processors**

- Intel Pentium Xeon
  - 3.2 GHz, peak = 6.4 Gflop/s
  - Linpack 100 = 1.7 Gflop/s
  - Linpack 1000 = 3.1 Gflop/s
- AMD Opteron
  - 2.2 GHz, peak = 4.4 Gflop/s
  - Linpack 100 = 1.3 Gflop/s
  - Linpack 1000 = 3.1 Gflop/s
- Intel Itanium 2
  - 1.5 GHz, peak = 6 Gflop/s
  - Linpack 100 = 1.7 Gflop/s
  - Linpack 1000 = 5.4 Gflop/s
- HP PA RISC
- Sun UltraSPARC IV
- HP Alpha EV68
  - 1.25 GHz, peak = 2.5 Gflop/s
- MIPS R16000
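
The Linpack figures above are easier to compare as a fraction of peak. The following is a minimal sketch that only restates the three fully specified processors from this slide; the helper name `fraction_of_peak` is made up for illustration.

```python
# (name, peak Gflop/s, Linpack n=100 Gflop/s, Linpack n=1000 Gflop/s),
# copied from the slide above.
processors = [
    ("Intel Pentium Xeon 3.2 GHz", 6.4, 1.7, 3.1),
    ("AMD Opteron 2.2 GHz", 4.4, 1.3, 3.1),
    ("Intel Itanium 2 1.5 GHz", 6.0, 1.7, 5.4),
]


def fraction_of_peak(measured_gflops, peak_gflops):
    """Return the achieved rate as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


for name, peak, lp100, lp1000 in processors:
    print(
        f"{name}: n=100 at {fraction_of_peak(lp100, peak):.0%} of peak, "
        f"n=1000 at {fraction_of_peak(lp1000, peak):.0%} of peak"
    )
```

On these numbers the small n=100 problem stays below a third of peak on all three chips, while the n=1000 problem ranges from roughly half of peak (Xeon) to 90% (Itanium 2).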




### High Bandwidth vs Commodity Systems

- High bandwidth systems have traditionally been vector computers
  - Designed for scientific problems
  - Capability computing
- Commodity processors are designed for web servers and the home PC market (we should be thankful that the manufacturers keep 64-bit floating point)
  - Used for cluster-based computers, leveraging their price point
- Scientific computing needs are different
  - They require a better balance between data movement and floating point operations, which results in greater efficiency.

|                             | Earth Simulator (NEC) | Cray X1 (Cray) | ASCI Q (HP EV68) | MCR (Xeon)  | Apple Xserve (IBM PowerPC) |
|-----------------------------|-----------------------|----------------|------------------|-------------|----------------------------|
| Year of Introduction        | 2002                  | 2003           | 2002             | 2002        | 2003                       |
| Node Architecture           | Vector                | Vector         | Alpha            | Pentium     | PowerPC                    |
| Processor Clock Rate        | 500 MHz               | 800 MHz        | 1.25 GHz         | 2.4 GHz     | 2 GHz                      |
| Peak Speed per Processor    | 8 Gflop/s             | 12.8 Gflop/s   | 2.5 Gflop/s      | 4.8 Gflop/s | 8 Gflop/s                  |
| Operands/Flop (main memory) | 0.5                   | 0.33           | 0.1              | 0.055       | 0.063                      |
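
To make the balance argument concrete, the operands-per-flop row can be turned into an implied main-memory bandwidth per processor. This is a rough sketch, assuming each operand is a 64-bit (8-byte) floating point word; the function name and the resulting bandwidth figures are derived here for illustration and are not quoted from the talk.

```python
BYTES_PER_OPERAND = 8  # assumption: 64-bit floating point operands

# name -> (peak Gflop/s per processor, operands per flop from main memory),
# both taken from the table above.
systems = {
    "Earth Simulator": (8.0, 0.5),
    "Cray X1": (12.8, 0.33),
    "ASCI Q": (2.5, 0.1),
    "MCR": (4.8, 0.055),
    "Apple Xserve": (8.0, 0.063),
}


def implied_bandwidth_gb_per_s(peak_gflops, operands_per_flop):
    """Memory bandwidth (GB/s) needed to deliver that many operands per flop at peak."""
    return peak_gflops * operands_per_flop * BYTES_PER_OPERAND


for name, (peak, ratio) in systems.items():
    print(f"{name}: about {implied_bandwidth_gb_per_s(peak, ratio):.1f} GB/s per processor")
```

Under that assumption the two vector machines come out above 30 GB/s per processor, while the commodity-based systems land in the low single digits, which is the gap the slide is pointing at.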









# BlueGene/L Interconnection Networks



#### 3 Dimensional Torus

- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- 1 µs latency between nearest neighbors, 5 µs to the farthest
- 4 µs latency for one hop with MPI, 10 µs to the farthest
- Communications backbone for computations
- 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
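
The per-node figure and the "farthest node" distance on a torus can be checked with a few lines of arithmetic. This is a sketch only: the link rate, link count, and node count come from the slide, while the 64 x 32 x 32 torus shape is an assumption chosen because it multiplies out to 65,536 nodes, and `torus_hops` is a hypothetical helper.

```python
import math

LINK_GBIT_PER_S = 1.4      # per-link rate from the slide
LINKS_PER_NODE = 12        # link count from the slide
TORUS_DIMS = (64, 32, 32)  # assumption: one shape that yields 65,536 nodes

node_count = math.prod(TORUS_DIMS)

# Aggregate link bandwidth per node, converted from gigabits to gigabytes.
per_node_gbyte_per_s = LINK_GBIT_PER_S * LINKS_PER_NODE / 8


def torus_hops(src, dst, dims):
    """Minimum hop count between two nodes when every dimension wraps around."""
    return sum(min(abs(a - b), d - abs(a - b)) for a, b, d in zip(src, dst, dims))


# On a torus the farthest node is half-way around each dimension.
max_hops = sum(d // 2 for d in TORUS_DIMS)

print(node_count, round(per_node_gbyte_per_s, 2), max_hops)
```

The wraparound links are what keep the worst case at half of each dimension; on a plain mesh of the same assumed shape the corner-to-corner distance would be nearly twice as long.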

#### Global Tree

- Interconnects all compute and I/O nodes (1024)
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link
- Latency of one way tree traversal 2.5 µs
- ~23 TB/s total binary tree bandwidth (64k machine)

#### Ethernet

- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external comm. (file I/O, control, user interaction, etc.)

#### Low Latency Global Barrier and Interrupt

- Latency of round trip 1.3 µs

#### Control Network









- Four multistream processors (MSPs), each 12.8 Gflop/s
- High bandwidth local shared memory (128 Direct Rambus channels)
- 32 network links and four I/O links per node





### A Tour de Force in Engineering

- Homogeneous, Centralized, Proprietary, Expensive!
- Target Application: CFD-Weather, Climate, Earthquakes
- 640 NEC SX/6 Nodes (mod)
  - 5120 CPUs which have vector ops
  - Each CPU 8 Gflop/s Peak
- 40 TFlop/s (peak)
- A record 5 times #1 on Top500
- H. Miyoshi; architect
  - NAL, RIST, ES
  - Fujitsu AP, VP400, NWT, ES
- Footprint of 4 tennis courts
- Expect to be on top of Top500 for another 6 months to a year.
- From the Top500 (June 2004)
  - Performance of ESC > Σ Next Top 2 Computers







# The Top242

- Focus on machines that are at least 1 TFlop/s on the Linpack benchmark
- Linpack Based
  - Pros
    - One number
    - Simple to define and rank
    - Allows problem size to change with machine and over time
  - Cons
    - Emphasizes only "peak" CPU speed and number of CPUs
    - Does not stress local bandwidth
    - Does not stress the network
    - Does not test gather/scatter
    - Ignores Amdahl's Law (only does weak scaling)



- 1993:
  - #1 = 59.7 GFlop/s
  - #500 = 422 MFlop/s
- 2004:
  - #1 = 35.8 TFlop/s
  - #500 = 813 *G*Flop/s
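
The two snapshots above imply a compound annual growth rate, which can be worked out directly. This is a small sketch using only the four numbers on the slide; the function name `annual_growth` is illustrative, and a constant year-over-year factor is a simplifying assumption.

```python
def annual_growth(start, end, years):
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years)


YEARS = 2004 - 1993

# All values in GFlop/s, converted from the units on the slide.
top1_1993, top1_2004 = 59.7, 35_800.0      # 35.8 TFlop/s
last_1993, last_2004 = 0.422, 813.0        # 422 MFlop/s

top1_factor = annual_growth(top1_1993, top1_2004, YEARS)
last_factor = annual_growth(last_1993, last_2004, YEARS)

print(f"#1 grew about {top1_factor:.2f}x per year")
print(f"#500 grew about {last_factor:.2f}x per year")
```

On these figures the entry level of the list (#500) grew faster than the top spot, close to doubling every year.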











# What About Efficiency?

- Talking about Linpack
- What should the efficiency of a machine on the Top242 be?
  - Percent of peak for Linpack
  - 90% ?
  - 80% ?
  - 70% ?
  - 60% ?

...

- Remember this is O(n³) ops and O(n²) data
  - Mostly matrix multiply
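
As a worked instance of "percent of peak", the Earth Simulator numbers quoted earlier in this talk can be plugged in. The sketch below assumes that the 35.8 TFlop/s #1 entry for 2004 is the Earth Simulator, and it uses only the leading-order operation count (2/3)n³ for the factorization; the function names are illustrative.

```python
def linpack_efficiency(rmax_gflops, cpus, peak_per_cpu_gflops):
    """Measured Linpack rate divided by theoretical peak."""
    return rmax_gflops / (cpus * peak_per_cpu_gflops)


def flops_per_matrix_element(n):
    """Leading-order work (2/3 n^3) per stored matrix entry (n^2)."""
    return (2.0 / 3.0) * n**3 / n**2


# Earth Simulator: 5120 CPUs at 8 Gflop/s each, 35.8 TFlop/s on Linpack
# (assumption: the 2004 #1 figure on the earlier slide refers to this machine).
es_efficiency = linpack_efficiency(35_800.0, 5120, 8.0)
print(f"Earth Simulator: {es_efficiency:.1%} of peak")

# The O(n^3) work over O(n^2) data means reuse grows linearly with n.
for n in (100, 1000, 10_000):
    print(n, round(flops_per_matrix_element(n)))
```

The linear growth in flops per matrix element is why large Linpack runs can approach peak even on machines whose memory systems are weak relative to their arithmetic.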

















## Real Crisis With HPC Is With The Software

- Programming is stuck
  - Arguably hasn't changed since the 70's
- It's time for a change
  - Complexity is rising dramatically
    - Highly parallel and distributed systems
      - From 10 to 100 to 1,000 to 10,000 to 100,000 processors!
    - Multidisciplinary applications
- Supercomputer applications and software are usually much longer-lived than the hardware
  - Hardware life is typically five years at most.
  - Fortran and C are the main programming models
- Software is a major cost component of modern technologies.
  - The tradition in HPC system procurement is to assume that the software is free.




### Some Current Unmet Needs

- Performance / Portability
- Fault tolerance
- Better programming models
  - Global shared address space
  - Visible locality
- Maybe coming soon (since incremental, yet offering real benefits):
  - Global Address Space (GAS) languages: UPC, Co-Array Fortran, Titanium
    - "Minor" extensions to existing languages
    - More convenient than MPI
    - Have performance transparency via explicit remote memory references
- The critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one time, crash program.



# Collaborators / Support

- Top500 Team
  - Erich Strohmaier, NERSC
  - Hans Meuer, Mannheim
  - Horst Simon, NERSC







#### For more information:

- Google "dongarra"
- Click on "talks"


