An Overview of High Performance Computing and Challenges for the Future

Jack Dongarra
INNOVATIVE COMPUTING LABORATORY

University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Overview

- **Quick look at High Performance Computing**
  - Top 500
- **Challenges for Math Software**
  - Linear Algebra Software for Multicore and Beyond
- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from LINPACK MPP
  \[ Ax = b, \text{ dense problem} \]
- Updated twice a year
  SC‘xy in the States in November Meeting in Germany in June
- All data available from www.top500.org
Performance Development & Projections

- 10 Eflop/s (~1000 years)
- 1 Eflop/s (~1 year)
- 10 Pflop/s (~8 hours)
- 1 Pflop/s (~1 min)

- 100 Tflop/s
- 10 Tflop/s
- 1 Tflop/s
- 100 Gflop/s
- 10 Gflop/s
- 1 Gflop/s
- 100 Mflop/s
- 10 Mflop/s
- 1 Mflop/s

- N=1
- N=500

- SUM

- Cray 2: 1 Gflop/s, O(1) Thread
- ASCI Red: 1 Tflop/s, O(10^3) Threads
- Roadrunner: 1 Pflop/s, O(10^6) Threads
- 1 Eflop/s, O(10^9) Threads
LANL Roadrunner
A Petascale System in 2008

“Connected Unit” cluster
192 Opteron nodes
(180 w/ 2 dual-Cell blades
connected w/ 4 PCIe x8 links)

≈ 13,000 Cell HPC chips
• ≈ 1.33 PetaFlop/s (from Cell)
≈ 7,000 dual-core Opterons
≈ 122,000 cores

17 clusters

2nd stage InfiniBand 4x DDR interconnect
(18 sets of 12 links to 8 switches)

Based on the 100 Gflop/s (DP) Cell chip

Hybrid Design (2 kinds of chips & 3 kinds of cores)
Programming required at 3 levels.
# Top 10 of the June 2008 List

<table>
<thead>
<tr>
<th>Rank</th>
<th>Computer</th>
<th>Rmax [TF/s]</th>
<th>Rmax / Rpeak</th>
<th>Installation Site</th>
<th>Country</th>
<th>#Cores</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>IBM / Roadrunner BladeCenter QS22/LS21</td>
<td>1,026</td>
<td>75%</td>
<td>DOE/NNSA/LANL</td>
<td>USA</td>
<td>122,400</td>
</tr>
<tr>
<td>2</td>
<td>IBM / BlueGene/L eServer Blue Gene Solution</td>
<td>478</td>
<td>80%</td>
<td>DOE/NNSA/LLNL</td>
<td>USA</td>
<td>212,992</td>
</tr>
<tr>
<td>3</td>
<td>IBM / Intrepid Blue Gene/P Solution</td>
<td>450</td>
<td>81%</td>
<td>DOE/OS/ANL</td>
<td>USA</td>
<td>163,840</td>
</tr>
<tr>
<td>4</td>
<td>SUN / Ranger SunBlade x6420</td>
<td>326</td>
<td>65%</td>
<td>NSF/TACC</td>
<td>USA</td>
<td>62,976</td>
</tr>
<tr>
<td>5</td>
<td>CRAY / Jaguar Cray XT4 QuadCore</td>
<td>205</td>
<td>79%</td>
<td>DOE/OS/ORNL</td>
<td>USA</td>
<td>30,976</td>
</tr>
<tr>
<td>6</td>
<td>IBM / JUGENE Blue Gene/P Solution</td>
<td>180</td>
<td>81%</td>
<td>Forschungszentrum Juelich (FZJ)</td>
<td>Germany</td>
<td>65,536</td>
</tr>
<tr>
<td>7</td>
<td>SGI / Encanto SGI Altix ICE 8200</td>
<td>133.2</td>
<td>77%</td>
<td>New Mexico Computing Applications Center</td>
<td>USA</td>
<td>14,336</td>
</tr>
<tr>
<td>8</td>
<td>HP / EKA Cluster Platform 3000 BL460c</td>
<td>132.8</td>
<td>77%</td>
<td>Computational Research Lab, TATA SONS</td>
<td>India</td>
<td>14,384</td>
</tr>
<tr>
<td>9</td>
<td>IBM / Blue Gene/P Solution</td>
<td>112</td>
<td>81%</td>
<td>IDRIS</td>
<td>France</td>
<td>40,960</td>
</tr>
<tr>
<td>10</td>
<td>SGI / Altix ICE 8200EX</td>
<td>106</td>
<td>86%</td>
<td>Total Exploration Production</td>
<td>France</td>
<td>10,240</td>
</tr>
</tbody>
</table>
## Top10 of the June 2008 List

<table>
<thead>
<tr>
<th>Rank</th>
<th>Computer</th>
<th>Rmax [TF/s]</th>
<th>Rmax / Rpeak</th>
<th>Installation Site</th>
<th>Country</th>
<th>#Cores</th>
<th>Power [MW]</th>
<th>MFlots/Watt</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>IBM / Roadrunner BladeCenter QS22/LS21</td>
<td>1,026</td>
<td>75%</td>
<td>DOE/NNSA/LANL</td>
<td>USA</td>
<td>122,400</td>
<td>2.35</td>
<td>437</td>
</tr>
<tr>
<td>2</td>
<td>IBM / BlueGene/L eServer Blue Gene Solution</td>
<td>478</td>
<td>80%</td>
<td>DOE/NNSA/LLNL</td>
<td>USA</td>
<td>212,992</td>
<td>2.33</td>
<td>205</td>
</tr>
<tr>
<td>3</td>
<td>IBM / Intrepid Blue Gene/P Solution</td>
<td>450</td>
<td>81%</td>
<td>DOE/OS/ANL</td>
<td>USA</td>
<td>163,840</td>
<td>1.26</td>
<td>357</td>
</tr>
<tr>
<td>4</td>
<td>SUN / Ranger SunBlade x6420</td>
<td>326</td>
<td>65%</td>
<td>NSF/TACC</td>
<td>USA</td>
<td>62,976</td>
<td>2.00</td>
<td>163</td>
</tr>
<tr>
<td>5</td>
<td>CRAY / Jaguar Cray XT4 QuadCore</td>
<td>205</td>
<td>79%</td>
<td>DOE/OS/ORNL</td>
<td>USA</td>
<td>30,976</td>
<td>1.58</td>
<td>130</td>
</tr>
<tr>
<td>6</td>
<td>IBM / JUGENE Blue Gene/P Solution</td>
<td>180</td>
<td>81%</td>
<td>Forschungszentrum Juelich (FZJ)</td>
<td>Germany</td>
<td>65,536</td>
<td>0.50</td>
<td>357</td>
</tr>
<tr>
<td>7</td>
<td>SGI / Encanto SGI Altix ICE 8200</td>
<td>133.2</td>
<td>77%</td>
<td>New Mexico Computing Applications Center</td>
<td>USA</td>
<td>14,336</td>
<td>0.86</td>
<td>155</td>
</tr>
<tr>
<td>8</td>
<td>HP / EKA Cluster Platform 3000 BL460c</td>
<td>132.8</td>
<td>77%</td>
<td>Computational Research Lab, TATA SONS</td>
<td>India</td>
<td>14,384</td>
<td>0.79</td>
<td>169</td>
</tr>
<tr>
<td>9</td>
<td>IBM / Blue Gene/P Solution</td>
<td>112</td>
<td>81%</td>
<td>IDRIS</td>
<td>France</td>
<td>40,960</td>
<td>0.32</td>
<td>357</td>
</tr>
<tr>
<td>10</td>
<td>SGI / Altix ICE 8200EX</td>
<td>106</td>
<td>86%</td>
<td>Total Exploration Production</td>
<td>France</td>
<td>10,240</td>
<td>0.44</td>
<td>240</td>
</tr>
</tbody>
</table>

• Over the next 5 years ORNL/UTK will deploy 2 large Petascale systems
• Using 4 MW today, going to 15MW before year end
• By 2012 could be using more than 50MW!!
• Cost estimates based on $0.07 per KwH

Power becomes the architectural driver for future large systems

Cost Per Year
Includes both DOE and NSF systems.
Something’s Happening Here...

From K. Olukotun, L. Hammond, H. Sutter, and B. Smith

A hardware issue just became a software problem

- In the “old days” it was: each year processors would become faster
- Today the clock speed is fixed or getting slower
- Things are still doubling every 18-24 months
- Moore’s Law reinterpreted.
  - Number of cores double every 18-24 months
Multicore

• What is multicore?
  ▪ A multicore chip is a single chip (socket) that combines two or more independent processing units that provide independent threads of control

• Why multicore?
  ▪ The race for ever higher clock speeds is over.
    • In the old days, new the chips where faster
      • Applications ran faster on the new chips
    • Today new chips are not faster, just have more processors per chip
      • Applications and software must use those extra processors to become faster
Power Cost of Frequency

- Power $\propto$ Voltage$^2 \times$ Frequency \( (V^2F) \)
- Frequency $\propto$ Voltage
- Power $\propto$ Frequency$^3$

<table>
<thead>
<tr>
<th></th>
<th>Cores</th>
<th>V</th>
<th>Freq</th>
<th>Perf</th>
<th>Power</th>
<th>PE (Bops/watt)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Superscalar</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>“New” Superscalar</td>
<td>1X</td>
<td>1.5X</td>
<td>1.5X</td>
<td>1.5X</td>
<td>3.3X</td>
<td>0.45X</td>
</tr>
</tbody>
</table>
# Power Cost of Frequency

- **Power ∝ Voltage\(^2\) x Frequency (V\(^2\)F)**
- **Frequency ∝ Voltage**
- **Power ∝ Frequency\(^3\)**

<table>
<thead>
<tr>
<th></th>
<th>Cores</th>
<th>V</th>
<th>Freq</th>
<th>Perf</th>
<th>Power</th>
<th>PE (Bops/watt)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Superscalar</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>&quot;New&quot; Superscalar</td>
<td>1X</td>
<td>1.5X</td>
<td>1.5X</td>
<td>1.5X</td>
<td>3.3X</td>
<td>0.45X</td>
</tr>
<tr>
<td>Multicore</td>
<td>2X</td>
<td>0.75X</td>
<td>0.75X</td>
<td>1.5X</td>
<td>0.8X</td>
<td>1.88X</td>
</tr>
</tbody>
</table>

(Bigger # is better)

50% more performance with 20% less power

Preferable to use multiple slower devices, than one superfast device
Today’s Multicores
98% of Top500 Systems Are Based on Multicore

- 282 use Quad-Core
- 204 use Dual-Core
- 3 use Nona-core

- Sun Niagara2 (8 cores)
- SciCortex (6 cores)
- AMD Opteron (4 cores)
- IBM Cell (9 cores)
- Intel Clovertown (4 cores)
- Intel Polaris (80 cores)
- IBM BG/P (4 cores)
And then there’s the GPU’s NVIDIA’s Tesla T10P

- **T10P chip**
  - 240 cores; 1.5 GHz
  - Tpeak 1 Tflop/s - 32 bit floating point
  - Tpeak 100 Gflop/s - 64 bit floating point

- **S1070 board**
  - 4 - T10P devices;
  - 700 Watts

- **C1060 card**
  - 1 - T10P; 1.33 GHz
  - 160 Watts
  - Tpeak 887 Gflop/s - 32 bit floating point
  - Tpeak 88.7 Gflop/s - 64 bit floating point
What’s Next? Multicore to Manycore

All Large Core

Mixed Large and Small Core

Many Small Cores

All Small Core

Many Floating-Point Cores

+ 3D Stacked Memory

SRAM

Different Classes of Chips
Home
Games / Graphics
Business
Scientific

The question is not whether this will happen but whether we are ready
Coding for an Abstract Multicore

Parallel software for multicores should have two characteristics:

- **Fine granularity:**
  - High level of parallelism is needed
  - Cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.

- **Asynchronicity:**
  - As the degree of thread level parallelism grows and granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.
ManyCore - Parallelism for the Masses

• We are looking at the following concepts in designing the next numerical library implementation:
  ▪ Dynamic Data Driven Execution
  ▪ Self Adapting
  ▪ Block Data Layout
  ▪ Mixed Precision in the Algorithm
  ▪ Exploit Hybrid Architectures
  ▪ Fault Tolerant Methods
Major Changes to Software

- **Must rethink the design of our software**
  - Another disruptive technology
    - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software

- **Numerical libraries for example will change**
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software:

<table>
<thead>
<tr>
<th>Software/Algorithms follow hardware evolution in time</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>LINPACK (70’s)</strong></td>
</tr>
<tr>
<td>(Vector operations)</td>
</tr>
<tr>
<td>Rely on</td>
</tr>
<tr>
<td>- Level-1 BLAS operations</td>
</tr>
</tbody>
</table>

Those new algorithms need new kernels and rely on efficient scheduling algorithms.
A New Generation of Software:

<table>
<thead>
<tr>
<th>Software/Algorithms follow hardware evolution in time</th>
<th>LINPACK (70’s) (Vector operations)</th>
<th>Rely on - Level-1 BLAS operations</th>
</tr>
</thead>
<tbody>
<tr>
<td>LAPACK (80’s) (Blocking, cache friendly)</td>
<td></td>
<td>Rely on - Level-3 BLAS operations</td>
</tr>
</tbody>
</table>

Those new algorithms have a very low granularity, they scale very well (multicore, distributed computing, …)

- removes a lot of dependencies among the tasks
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels

Those new algorithms need new kernels and rely on efficient scheduling algorithms.
## A New Generation of Software:

| Software/Algorithms follow hardware evolution in time | LINPACK (70’s)  
(Vector operations) | LAPACK (80’s)  
(Blocking, cache friendly) | ScaLAPACK (90’s)  
(Distributed Memory) |
|-----------------------------------------------------|-------------------------------------------------|-------------------------------------------------|-------------------------------------------------|
| Rely on  
- Level-1 BLAS operations | Rely on  
- Level-3 BLAS operations | Rely on  
- PBLAS Mess Passing |

Those new algorithms have a very low granularity, they scale very well (multicore, distributed computing, …) - removes a lots of dependencies among the tasks, (multicore, distributed computing) - avoid latency (distributed computing, out-of-core) - rely on fast kernels

Those new algorithms need new kernels and rely on efficient scheduling algorithms.
## A New Generation of Software:
Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Software/Algorithms follow hardware evolution in time

<table>
<thead>
<tr>
<th>Software/Algorithms</th>
<th>LINPACK (70’s)</th>
<th>LAPACK (80’s)</th>
<th>ScaLAPACK (90’s)</th>
<th>PLASMA (00’s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>(Vector operations)</td>
<td>Rely on Level-1 BLAS operations</td>
<td>Rely on Level-3 BLAS operations</td>
<td>Rely on PBLAS Mess Passing</td>
<td>Rely on a DAG/scheduler, block data layout, some extra kernels</td>
</tr>
<tr>
<td>Blocking, cache friendly</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>(Distributed Memory)</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Those new algorithms
- have a very **low granularity**, they scale very well (multicore, petascale computing, … )
- **removes a lots of dependencies** among the tasks, (multicore, distributed computing)
- **avoid latency** (distributed computing, out-of-core)
- **rely on fast kernels**

Those new algorithms need new kernels and rely on efficient scheduling algorithms.
Steps in the LAPACK LU

- DGETF2 (Factor a panel)
- DLSWP (Backward swap)
- DLSWP (Forward swap)
- DTRSM (Triangular solve)
- DGEMM (Matrix multiply)

LAPACK

BLAS
LU Timing Profile (4 processor system)

Threads – no lookahead

Time for each component

Bulk Sync Phases
Adaptive Lookahead - Dynamic

Event Driven Multithreading

Ideas not new.

Many papers use the DAG approach.

while(1)
    fetch_task();
    switch(task.type) {
    case PANEL:
        dgetf2();
        update_progress();
    case COLUMN:
        dlaswp();
        dtrsm();
        dgemm();
        update_progress();
    case END:
        for()
            dlaswp();
        return;
    }

Reorganizing algorithms to use this approach
Achieving Fine Granularity

Fine granularity may require novel data formats to overcome the limitations of BLAS on small chunks of data.

Column-Major
Achieving Fine Granularity

Fine granularity may require novel data formats to overcome the limitations of BLAS on small chunks of data.

Column-Major

Block data layout
LU – 16 Core
(8 Socket - Dual Core Opteron 2.2 GHz)

1. LAPACK (BLAS Fork-Join Parallelism)
2. ScaLAPACK (Mess Pass using mem copy)
3. DAG Based (Dynamic Scheduling)
4. Intel MKL Library

Problem Size

GFlop/s
Cholesky on the CELL

Single precision results on the Cell

- **1 CELL (8 SPEs)**
  - 186 Gflop/s
  - 91 % peak
  - 97 % SGEMM peak

- **2 CELLs (16 SPEs)**
  - 365 Gflop/s
  - 89 % peak
  - 95 % SGEMM peak
If We Had A Small Matrix Problem

- We would generate the DAG, find the critical path and execute it.
- DAG too large to generate ahead of time
  - Not explicitly generate
  - Dynamically generate the DAG as we go
- Machines will have large number of cores in a distributed fashion
  - Will have to engage in message passing
  - Distributed management
  - Locally have a run time system
The DAGs are Large

- Here is the DAG for the QR factorization on a 20 x 20 matrix

- For a large matrix say $O(10^6)$ the DAG is huge
- Many challenges for the software
Each Node or Core Will Have A Run Time System

- BIN 1: some dependencies satisfied, waiting for all dependencies
- BIN 2: all dependencies satisfied, some data delivered, waiting for all data
- BIN 3: all data delivered, waiting for execution
# Performance of Single Precision on Conventional Processors

- Realized have the similar situation on our commodity processors.
  - That is, SP is 2X as fast as DP on many systems

- The Intel Pentium and AMD Opteron have SSE2
  - 2 flops/cycle DP
  - 4 flops/cycle SP

- IBM PowerPC has AltiVec
  - 8 flops/cycle SP
  - 4 flops/cycle DP
  - No DP on AltiVec

<table>
<thead>
<tr>
<th>Processor</th>
<th>Size</th>
<th>SGEMM/ DGEMM</th>
<th>Size</th>
<th>SGEMV/ DGEMV</th>
</tr>
</thead>
<tbody>
<tr>
<td>AMD Opteron 246</td>
<td>3000</td>
<td>2.00</td>
<td>5000</td>
<td>1.70</td>
</tr>
<tr>
<td>UltraSparc-Ile</td>
<td>3000</td>
<td>1.64</td>
<td>5000</td>
<td>1.66</td>
</tr>
<tr>
<td>Intel PIII Coppermine</td>
<td>3000</td>
<td>2.03</td>
<td>5000</td>
<td>2.09</td>
</tr>
<tr>
<td>PowerPC 970</td>
<td>3000</td>
<td>2.04</td>
<td>5000</td>
<td>1.44</td>
</tr>
<tr>
<td>Intel Woodcrest</td>
<td>3000</td>
<td>1.81</td>
<td>5000</td>
<td>2.18</td>
</tr>
<tr>
<td>Intel XEON</td>
<td>3000</td>
<td>2.04</td>
<td>5000</td>
<td>1.82</td>
</tr>
<tr>
<td>Intel Centrino Duo</td>
<td>3000</td>
<td>2.71</td>
<td>5000</td>
<td>2.21</td>
</tr>
</tbody>
</table>

Single precision is faster because:
- Higher parallelism in SSE/vector units
- Reduced data motion
- Higher locality in cache
32 or 64 bit Floating Point Precision?

- A long time ago 32 bit floating point was used
  - Still used in scientific apps but limited
- Most apps use 64 bit floating point
  - Accumulation of round off error
    - A 10 TFlop/s computer running for 4 hours performs > 1 Exaflop \((10^{18})\) ops.
  - Ill conditioned problems
  - IEEE SP exponent bits too few (8 bits, \(10^{\pm38}\))
  - Critical sections need higher precision
    - Sometimes need extended precision (128 bit fl pt)
  - However some can get by with 32 bit fl pt in some parts
- Mixed precision a possibility
  - Approximate in lower precision and then refine or improve solution to high precision.
Idea Goes Something Like This...

- **Exploit 32 bit floating point as much as possible.**
  - Especially for the bulk of the computation
- **Correct or update the solution with selective use of 64 bit floating point to provide a refined results**
- **Intuitively:**
  - Compute a 32 bit result,
  - Calculate a correction to 32 bit result using selected higher precision and,
  - Perform the update of the 32 bit results with the correction using high precision.
Mixed-Precision Iterative Refinement

- Iterative refinement for dense systems, \( Ax = b \), can work this way.

\[
\begin{align*}
L U &= lu(A) \\
x &= L \backslash (U \backslash b) \\
r &= b - Ax
\end{align*}
\]

\[
\text{WHILE } \| r \| \text{ not small enough}
\]

\[
\begin{align*}
z &= L \backslash (U \backslash r) \\
x &= x + z \\
r &= b - Ax
\end{align*}
\]

- Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
Mixed-Precision Iterative Refinement

- Iterative refinement for dense systems, \( Ax = b \), can work this way.

\[
\begin{align*}
L \ U &= lu(A) \\
x &= L \backslash (U \backslash b) \\
r &= b - Ax
\end{align*}
\]

\[
\text{WHILE } ||r|| \text{ not small enough}
\]

\[
\begin{align*}
z &= L \backslash (U \backslash r) \\
x &= x + z \\
r &= b - Ax
\end{align*}
\]

\[
\text{END}
\]

- Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.

- Requires extra storage, total is 1.5 times normal;
- \( O(n^3) \) work is done in lower precision
- \( O(n^2) \) work is done in high precision
- Problems if the matrix is ill-conditioned in sp; \( O(10^8) \)
Results for Mixed Precision Iterative Refinement for Dense $Ax = b$

- Single precision is faster than DP because:
  - Higher parallelism within vector units
    - 4 ops/cycle (usually) instead of 2 ops/cycle
  - Reduced data motion
    - 32 bit data instead of 64 bit data
  - Higher locality in cache
    - More data items in cache

<table>
<thead>
<tr>
<th>Architecture (BLAS)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 Intel Pentium III Coppermine (Goto)</td>
</tr>
<tr>
<td>2 Intel Pentium III Katmai (Goto)</td>
</tr>
<tr>
<td>3 Sun UltraSPARC IIE (Sunperf)</td>
</tr>
<tr>
<td>4 Intel Pentium IV Prescott (Goto)</td>
</tr>
<tr>
<td>5 Intel Pentium IV-M Northwood (Goto)</td>
</tr>
<tr>
<td>6 AMD Opteron (Goto)</td>
</tr>
<tr>
<td>7 Cray X1 (libsci)</td>
</tr>
<tr>
<td>8 IBM Power PC G5 (2.7 GHz) (VecLib)</td>
</tr>
<tr>
<td>9 Compaq Alpha EV6 (CXML)</td>
</tr>
<tr>
<td>10 IBM SP Power3 (ESSL)</td>
</tr>
<tr>
<td>11 SGI Octane (ATLAS)</td>
</tr>
</tbody>
</table>
Results for Mixed Precision Iterative Refinement for Dense $Ax = b$

### Architecture (BLAS)

<table>
<thead>
<tr>
<th>Architecture (BLAS-MPI)</th>
<th># procs</th>
<th>$n$</th>
<th>DP Solve/SP Solve</th>
<th>DP Solve/Iter Ref</th>
<th># iter</th>
</tr>
</thead>
<tbody>
<tr>
<td>AMD Opteron (Goto – OpenMPI MX)</td>
<td>32</td>
<td>22627</td>
<td>1.85</td>
<td>1.79</td>
<td>6</td>
</tr>
<tr>
<td>AMD Opteron (Goto – OpenMPI MX)</td>
<td>64</td>
<td>32000</td>
<td>1.90</td>
<td>1.83</td>
<td>6</td>
</tr>
</tbody>
</table>

- Single precision is faster than DP because:
  - **Higher parallelism within vector units**
    - 4 ops/cycle (usually) instead of 2 ops/cycle
  - **Reduced data motion**
    - 32 bit data instead of 64 bit data
  - **Higher locality in cache**
    - More data items in cache
Sparse Direct Solver and Iterative Refinement

MUMPS package based on multifrontal approach which generates small dense matrix multiplies

**Opteron w/Intel compiler**

<table>
<thead>
<tr>
<th></th>
<th>Iterative Refinement</th>
<th>Single Precision</th>
</tr>
</thead>
<tbody>
<tr>
<td>G84</td>
<td></td>
<td></td>
</tr>
<tr>
<td>S1/H16</td>
<td></td>
<td></td>
</tr>
<tr>
<td>blockq29</td>
<td></td>
<td></td>
</tr>
<tr>
<td>c-71</td>
<td></td>
<td></td>
</tr>
<tr>
<td>cavity26</td>
<td></td>
<td></td>
</tr>
<tr>
<td>dawson15</td>
<td></td>
<td></td>
</tr>
<tr>
<td>epb3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>finan512</td>
<td></td>
<td></td>
</tr>
<tr>
<td>heart1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>kivap004</td>
<td></td>
<td></td>
</tr>
<tr>
<td>kivap006</td>
<td></td>
<td></td>
</tr>
<tr>
<td>mult_decop_01</td>
<td></td>
<td></td>
</tr>
<tr>
<td>nassrb</td>
<td></td>
<td></td>
</tr>
<tr>
<td>nemeth26</td>
<td></td>
<td></td>
</tr>
<tr>
<td>qa0k</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ma10</td>
<td></td>
<td></td>
</tr>
<tr>
<td>tors2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>venkat01</td>
<td></td>
<td></td>
</tr>
<tr>
<td>wethen1120</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Tim Davis’s Collection, n=100K - 3M
Sparse Iterative Methods (PCG)

- **Outer/Inner Iteration**
  
  Outer iterations using 64 bit floating point

  ```
  Compute \( r^{(0)} = b - Ax^{(0)} \) for some initial guess \( x^{(0)} \)
  for \( i = 1, 2, \ldots \)
  solve \( Mz^{(i-1)} = r^{(i-1)} \)
  if \( i = 1 \)
  \[ p^{(1)} = z^{(0)} \]
  else
  \[ \beta_{i-1} = \rho_{i-1} / \rho_{i-2} \]
  \[ p^{(i)} = z^{(i-1)} + \beta_{i-1} p^{(i-1)} \]
  endif
  \[ q^{(i)} = Ap^{(i)} \]
  \[ \alpha_i = \rho_{i-1} / p^{(i)^T} q^{(i)} \]
  \[ x^{(i)} = x^{(i-1)} + \alpha_i p^{(i)} \]
  \[ r^{(i)} = r^{(i-1)} - \alpha_i q^{(i)} \]
  check convergence; continue if necessary
  ```

- **Inner iteration:**
  
  In 32 bit floating point

  ```
  Compute \( r^{(0)} = b - Ax^{(0)} \) for some initial guess \( x^{(0)} \)
  for \( i = 1, 2, \ldots \)
  solve \( Mz^{(i-1)} = r^{(i-1)} \)
  \[ \rho_{i-1} = r^{(i-1)^T} z^{(i-1)} \]
  if \( i = 1 \)
  \[ p^{(1)} = z^{(0)} \]
  else
  \[ \beta_{i-1} = \rho_{i-1} / \rho_{i-2} \]
  \[ p^{(i)} = z^{(i-1)} + \beta_{i-1} p^{(i-1)} \]
  endif
  \[ q^{(i)} = Ap^{(i)} \]
  \[ \alpha_i = \rho_{i-1} / p^{(i)^T} q^{(i)} \]
  \[ x^{(i)} = x^{(i-1)} + \alpha_i p^{(i)} \]
  \[ r^{(i)} = r^{(i-1)} - \alpha_i q^{(i)} \]
  check convergence; continue if necessary
  ```

- **Outer iteration in 64 bit floating point and inner iteration in 32 bit floating point**
Mixed Precision Computations for Sparse Inner/Outer-type Iterative Solvers

**Speedups** for mixed precision
Inner SP/Outer DP (SP/DP) iter. methods vs DP/DP (CG^2, GMRES^2, PCG^2, and PGMRES^2 with diagonal prec.)
*(Higher is better)*

**Iterations** for mixed precision
SP/DP iterative methods vs DP/DP
*(Lower is better)*

**Machine:**
Intel Woodcrest (3GHz, 1333MHz bus)

**Stopping criteria:**
Relative to $r_0$ residual reduction ($10^{-12}$)
Cray XD-1 (OctigaBay Systems)

Experiments with Field Programmable Gate Array (FPGA)
Specify arithmetic precision

Six Xilinx Virtex-4 Field Programmable Gate Arrays (FPGAs) per chassis
### Characteristics of multiplier on an FPGA* (using DSP48)

<table>
<thead>
<tr>
<th>Data Formats</th>
<th>DSP48s</th>
<th>Frequency (MHz)</th>
<th>GFLOPs</th>
</tr>
</thead>
<tbody>
<tr>
<td>s52e11 (double)</td>
<td>16/96</td>
<td>237</td>
<td>1.42</td>
</tr>
<tr>
<td>s51e11</td>
<td>16/96</td>
<td>238</td>
<td>1.43</td>
</tr>
<tr>
<td>s50e11</td>
<td>9/96</td>
<td>245</td>
<td>2.61</td>
</tr>
<tr>
<td>s34e8</td>
<td>9/96</td>
<td>289</td>
<td>3.08</td>
</tr>
<tr>
<td>s33e8</td>
<td>4/96</td>
<td>292</td>
<td>7.01</td>
</tr>
<tr>
<td>s23e8 (single)</td>
<td>4/96</td>
<td>339</td>
<td>8.14</td>
</tr>
<tr>
<td>s17e8</td>
<td>4/96</td>
<td>370</td>
<td>8.88</td>
</tr>
<tr>
<td>s16e8</td>
<td>1/96</td>
<td>331</td>
<td>31.78</td>
</tr>
<tr>
<td>s16e7</td>
<td>1/96</td>
<td>352</td>
<td>33.79</td>
</tr>
<tr>
<td>s13e7</td>
<td>1/96</td>
<td>336</td>
<td>32.26</td>
</tr>
</tbody>
</table>

* XC4LX160-10

TENNESSEE ADVANCED COMPUTING LABORATORY
Refinement iterations for customized formats (sXXe11). Random matrices (average number of iterations/random matrices)

<table>
<thead>
<tr>
<th>Problem Size</th>
<th>Mantissa Bits</th>
<th>12</th>
<th>16</th>
<th>23</th>
<th>31</th>
<th>48</th>
<th>52</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td></td>
<td>8.9</td>
<td>4</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>256</td>
<td></td>
<td>11.1</td>
<td>5.1</td>
<td>2.1</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>512</td>
<td></td>
<td>19.7</td>
<td>6.1</td>
<td>2.5</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1024</td>
<td></td>
<td>28</td>
<td>6.3</td>
<td>2.6</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>2048</td>
<td></td>
<td>-</td>
<td>9.3</td>
<td>3</td>
<td>1.3</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>4096</td>
<td></td>
<td>-</td>
<td>13.3</td>
<td>3.1</td>
<td>1.43</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>

Mixed Precision Iterative Refinement
- Random Matrix Test - Junqing Sun et al

TENNESSEE ADVANCED COMPUTING LABORATORY
Mixed Precision Hybrid Direct Solver

- Profiled Time* on Cray-XD1 - Junqing Sun et al

LU w Partial Pivoting using variable precision on an FPGA

Data types for LU on FPGAs

- double
- s31e8
- s16e7

LU  communication  triangular solvers  Refinement

* For a 128x128 matrix

High Performance Mixed-Precision Linear Solver for FPGAs,
Junqing Sun, Gregory D. Peterson, Olaf Storaasli, To appear IEEE TPDC
Intriguing Potential

- Exploit lower precision as much as possible
  - Payoff in performance
    - Faster floating point
    - Less data to move
- Automatically switch between SP and DP to match the desired accuracy
  - Compute solution in SP and then a correction to the solution in DP
- Potential for GPU, FPGA, special purpose processors
  - What about 16 bit floating point?
    - Use as little you can get away with and improve the accuracy
- Applies to sparse direct and iterative linear systems and Eigenvalue, optimization problems, where Newton’s method is used.

\[ x_{i+1} - x_i = \frac{f(x_i)}{f'(x_i)} \]

Correction = - A\(b - Ax)
Conclusions

• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.

• This strategy needs to be rebalanced - barriers to progress are increasingly on the software side.

• Moreover, the return on investment is more favorable to software.
  ▪ Hardware has a half-life measured in years, while software has a half-life measured in decades.

• High Performance Ecosystem out of balance
  ▪ Hardware, OS, Compilers, Software, Algorithms, Applications
    • No Moore’s Law for software, algorithms and applications
Collaborators / Support

Marc Baboulin, UCoimbra
Alfredo Buttari, ENS/INRIA
Bilel Hadri, UTK
Julien Langou, UColorado
Julie Langou, UTK
Hatem Ltaief, UTK
Piotr Luszczek, MathWorks
Jakub Kurzak, UTK
Stan Tomov, UTK
PLASMA Collaborators

- **U Tennessee, Knoxville**
  - Jack Dongarra, Julie Langou, Stan Tomov, Jakub Kurzak, Hatem Ltaief, Alfredo Buttari, Julien Langou, Piotr Luszczek, Marc Baboulin

- **UC Berkeley**
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel

- **Other Academic Institutions**
  - UC Davis, CU Denver, Florida IT, Georgia Tech, U Maryland, North Carolina SU, UC Santa Barbara, UT Austin, LBNL
  - TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb, UPC Barcelona, ENS Lyon, INRIA

- **Industrial Partners**
  - Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA, Microsoft