How To Get Performance From Commodity Processors?
 
 
- Today’s processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Routines have a large design space with many parameters:
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  - Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.
 
- A few months ago there was no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines.
- ATLAS: Automatically Tuned Linear Algebra Software
  - PHiPAC from Berkeley
  - FFTW from MIT
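
To make the idea of automatic tuning concrete, here is a minimal sketch of an empirical search over a single parameter, the blocking size of a matrix multiply. It is not taken from ATLAS, PHiPAC, or FFTW: the names `blocked_matmul` and `autotune_block_size`, the candidate list, and the matrix size are all assumptions chosen for illustration. In pure Python, interpreter overhead dominates the timings, so this shows only the shape of the search (generate variants, time them on the target machine, keep the fastest), not realistic speedups.

```python
import random
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Process one tile at a time; in compiled code the goal is
                # to keep the tile's operands resident in cache.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    row_a = a[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune_block_size(n, candidates, repeats=3):
    """Time each candidate block size on this machine and return the fastest.

    Hypothetical helper: returns (best_block, {block: best_seconds}).
    """
    rng = random.Random(0)  # fixed seed so every candidate sees the same input
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = autotune_block_size(n=64, candidates=[4, 8, 16, 32, 64])
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  best time={seconds:.4f}s")
    print("selected block size:", best_block)
```

A real tuner would search several parameters at once (unrolling depth, loop order, and so on) and would generate and compile C or assembly kernels for each candidate, but the select-by-measurement loop has the same structure.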