HPC-ASIA 2000: The Fourth International Conference/Exhibition on High Performance Computing in Asia-Pacific Region

May 14-17, 2000, Friendship Hotel, Beijing, China

**Tools for High Performance Numerical Kernels, and Performance Measurement**

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

http://www.cs.utk.edu/~dongarra/





## Optimizing Computation and Memory Use

- Computational optimizations



## How To Get Performance From Commodity Processors?

- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Hardware and software have a large design space with many parameters:
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- Until recently, there was no tuned BLAS for Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines.
- ATLAS: Automatically Tuned Linear Algebra Software
- PHiPAC from Berkeley
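To make one of these parameters concrete, here is a minimal sketch of a matrix multiply with a tunable blocking size. The function name `matmul_blocked` and the parameter `nb` are illustrative assumptions, not code from any tuned BLAS. In pure Python the cache benefit does not show up; the point is only the shape of the loop nest that a tuned kernel parameterizes.

```python
def matmul_blocked(A, B, nb):
    """Return C = A * B for square lists-of-lists, computed in nb x nb tiles.

    nb is the tunable blocking size (hypothetical name); any nb >= 1
    produces the same result, only the traversal order changes.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):              # tile row of C
        i_end = min(ii + nb, n)
        for kk in range(0, n, nb):          # tile along the shared dimension
            k_end = min(kk + nb, n)
            for jj in range(0, n, nb):      # tile column of C
                j_end = min(jj + nb, n)
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, i_end):
                    row_c, row_a = C[i], A[i]
                    for k in range(kk, k_end):
                        a_ik, row_b = row_a[k], B[k]
                        for j in range(jj, j_end):
                            row_c[j] += a_ik * row_b[j]
    return C
```

On a real machine the best `nb` depends on cache and register sizes, which is why it is treated as a parameter to be searched rather than a constant.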





## Why Is ATLAS Fast?

- ATLAS does not implement a single fixed algorithm.
- The code is generated by a program that tests, probes, and runs hundreds of experiments on the target software/hardware architecture.
- During installation, the program generator determines an efficient implementation:
  - Probes the system for critical parameters.
  - Measures the speed of different code strategies and chooses the best using an adaptive procedure.
- This leads to a new model of high performance programming in which performance-critical code is machine generated using parameter optimization.
- Done once to build the library, then used on that machine.
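The "measure and choose the best" step can be sketched as a small empirical search. This is a simplified illustration, not ATLAS's actual generator: it tries every candidate exhaustively, whereas the slide describes an adaptive procedure. The name `autotune` and its arguments are assumptions made for this example.

```python
import time


def autotune(kernel, candidates, repeats=3, timer=time.perf_counter):
    """Time kernel(p) for each candidate p; return (fastest p, all timings).

    kernel     -- any callable taking one tuning parameter (hypothetical)
    candidates -- parameter values to try, e.g. block sizes
    repeats    -- runs per candidate; the minimum is kept
    timer      -- clock function, injectable so the search can be tested
    """
    timings = {}
    for p in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = timer()
            kernel(p)
            best = min(best, timer() - t0)  # keep the least-disturbed run
        timings[p] = best
    best_param = min(timings, key=timings.get)
    return best_param, timings
```

Here `kernel` could be, for instance, a blocked loop nest that takes its block size as the argument. The winning value would then be recorded once at install time and reused, matching the "done once to build the library" point above.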










# 500x500 Level 2 BLAS DGEMV









- Timing and performance evaluation has been an art
  - Resolution of the clock
  - Issues about cache effects
  - Different systems
- Situation about to change
  - Today's processors have internal counters
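Two of the timing pitfalls listed above, clock resolution and cache effects, can be illustrated with a short sketch. The helper names `clock_resolution` and `time_per_call` are assumptions made for this example, and the approach (spin until the clock ticks, warm up before timing, repeat until the run spans many ticks) is one common practice rather than a prescribed method.

```python
import time


def clock_resolution(timer=time.perf_counter, samples=1000):
    """Smallest positive step observed between successive timer readings."""
    best = float("inf")
    for _ in range(samples):
        t0 = timer()
        t1 = timer()
        while t1 == t0:          # spin until the clock actually ticks
            t1 = timer()
        best = min(best, t1 - t0)
    return best


def time_per_call(fn, min_ticks=1000, warmup=1, timer=time.perf_counter):
    """Average seconds per call of fn, timed over at least min_ticks ticks."""
    for _ in range(warmup):      # warm caches so first-touch cost is excluded
        fn()
    res = clock_resolution(timer, samples=100)
    reps = 1
    while True:
        t0 = timer()
        for _ in range(reps):
            fn()
        elapsed = timer() - t0
        if elapsed >= min_ticks * res:   # long enough to trust the clock
            return elapsed / reps
        reps *= 2                        # too short: double the work, retry
```

Whether to time with warm or cold caches depends on what the measurement is meant to represent; setting `warmup=0` here gives the cold-start variant.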

## Performance Data That May Be Available

- Cycle count
- Floating point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions
- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations
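Raw counts like these are usually combined into derived rates. The sketch below shows the arithmetic only; it does not call any counter library. The dictionary keys, the function name `derived_metrics`, and the numbers in the usage example are all hypothetical.

```python
def derived_metrics(counts, clock_hz):
    """Turn raw event counts into a few commonly quoted rates.

    counts   -- dict with hypothetical keys: "cycles", "fp_ins",
                "total_ins", "load_store", "l1d_misses"
    clock_hz -- processor clock frequency in Hz
    """
    cycles = counts["cycles"]
    return {
        # Elapsed time implied by the cycle count.
        "seconds": cycles / clock_hz,
        # Floating point instructions per second, in millions.
        "mflops": counts["fp_ins"] * clock_hz / cycles / 1e6,
        # Instructions completed per cycle.
        "ipc": counts["total_ins"] / cycles,
        # Level-1 data cache misses per load/store instruction.
        "l1d_miss_rate": counts["l1d_misses"] / counts["load_store"],
    }


# Made-up counts for a 500 MHz processor, for illustration only.
example = derived_metrics(
    {"cycles": 1_000_000, "fp_ins": 500_000, "total_ins": 1_500_000,
     "load_store": 500_000, "l1d_misses": 25_000},
    clock_hz=500e6,
)
```

With these made-up inputs the rates come out to roughly 250 Mflop/s, 1.5 instructions per cycle, and a 5% miss rate. Which events are actually available, and what exactly each one counts, varies by processor.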



















# Contributors to These Ideas

### ATLAS

- Clint Whaley, UTK
- Antoine Petitet, UTK
- Tatebe Osamu, ETL/UTK
- Sathish Vadhiyar, UTK

### PAPI

- Shirley Browne, UTK
- Nathan Garner, UTK
- Kevin London, UTK
- Phil Mucci, UTK





For additional information see:

- http://www.netlib.org/atlas/
- http://icl.cs.utk.edu/projects/papi/
- http://www.cs.utk.edu/~dongarra/

