This paper describes how users can speed up ATLAS (Automatically Tuned Linear Algebra Software) themselves, and how they can contribute any such speedup to the ATLAS project.
ATLAS is an implementation of a new style of high performance software production/maintenance called Automated Empirical Optimization of Software (AEOS). In an AEOS-enabled library, many different ways of performing a given kernel operation are supplied, and timers are used to empirically determine which implementation is best for a given architectural platform. ATLAS uses two techniques for supplying different implementations of kernel operations: multiple implementation and code generation.
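The empirical selection step described above can be illustrated with a minimal sketch. It is not ATLAS's actual timer interface: the `dot_kernel` type, the two candidate routines, and `select_best` are all hypothetical names chosen for illustration, and a dot product stands in for a real ATLAS kernel. The sketch times each candidate with the standard C `clock()` function and returns the index of the fastest.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: dot product of two length-n vectors. */
typedef double (*dot_kernel)(const double *x, const double *y, size_t n);

/* Candidate 1: straightforward loop. */
double dot_simple(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Candidate 2: unrolled by 4 with independent accumulators. */
double dot_unroll4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

/* CPU seconds taken by `reps` calls of one candidate. */
static double time_kernel(dot_kernel k, const double *x, const double *y,
                          size_t n, int reps)
{
    volatile double sink = 0.0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += k(x, y, n);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time every candidate and return the index of the fastest one. */
size_t select_best(const dot_kernel *cands, size_t ncands,
                   const double *x, const double *y, size_t n, int reps)
{
    assert(ncands > 0);
    size_t best = 0;
    double best_time = time_kernel(cands[0], x, y, n, reps);
    for (size_t c = 1; c < ncands; c++) {
        double t = time_kernel(cands[c], x, y, n, reps);
        if (t < best_time) {
            best_time = t;
            best = c;
        }
    }
    return best;
}
```

Which candidate wins depends on the compiler, the problem size, and the machine, which is the reason the choice is made by timing rather than by a fixed rule.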
In code generation, a highly parameterized program is written that can generate many different kernel implementations; the matrix multiply code generator is an example of this. Multiple implementation, as its name suggests, simply means supplying various hand-written implementations of the same kernel.
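To make the idea of a parameterized generator concrete, here is a toy sketch. It is far simpler than ATLAS's matrix multiply generator and is not derived from it: the function name `gen_dot_kernel` and the single `unroll` parameter are assumptions made for illustration. The program writes the C source of a dot-product kernel with the requested unrolling factor into a caller-supplied buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical generator: writes C source for a dot-product kernel whose
 * main loop is unrolled `unroll` times. Returns the number of characters
 * written, or -1 if the buffer is too small.
 */
int gen_dot_kernel(char *buf, size_t bufsize, int unroll)
{
    assert(unroll >= 1);
    size_t used = 0;
    int n;

    /* Function header and start of the unrolled loop. */
    n = snprintf(buf, bufsize,
                 "double dot_u%d(const double *x, const double *y, size_t n)\n"
                 "{\n"
                 "    double sum = 0.0;\n"
                 "    size_t i = 0;\n"
                 "    for (; i + %d <= n; i += %d) {\n",
                 unroll, unroll, unroll);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    used = (size_t)n;

    /* One multiply-add statement per unrolled iteration. */
    for (int u = 0; u < unroll; u++) {
        n = snprintf(buf + used, bufsize - used,
                     "        sum += x[i + %d] * y[i + %d];\n", u, u);
        if (n < 0 || (size_t)n >= bufsize - used)
            return -1;
        used += (size_t)n;
    }

    /* Close the unrolled loop, then emit the remainder loop. */
    n = snprintf(buf + used, bufsize - used,
                 "    }\n"
                 "    for (; i < n; i++)\n"
                 "        sum += x[i] * y[i];\n"
                 "    return sum;\n"
                 "}\n");
    if (n < 0 || (size_t)n >= bufsize - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

Calling the generator with several values of `unroll` yields several candidate kernels, each of which could then be compiled and timed. A real generator would expose many more parameters than this one.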
ATLAS provides a standard way for users to help with multiple implementation. ATLAS is designed so that a few kernel routines determine the performance of the entire library. The entire Level 3 BLAS may be sped up by improving the L1 matmul, and the Level 2 routines may be sped up by improving the GER and GEMV kernels. ATLAS has standard timers that can call user-programmed versions of these kernels and automatically use them throughout the library when they are superior to the ATLAS-produced versions.
Although users may contribute any code improvements they like, the most useful contributions will probably be machine-specific optimizations. Most general optimizations will be handled by ATLAS's code generators (the Level 2 kernels do not have code generators at present, but eventually will). There are, however, no plans for the code generators to produce machine-specific code. Thus, adding machine-specific prefetch instructions to a kernel can provide optimizations that ATLAS will get in no other way. Some clear targets for such machine-specific intervention are:
A good way to get started is to take the best kernel ATLAS has found, add, for instance, some prefetch instructions, and see whether you get a noticeable improvement.
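As a rough illustration of what adding prefetch to a kernel might look like, the sketch below is a plain matrix-vector multiply (y += A*x) with software prefetch hints. It is not an ATLAS kernel and does not follow ATLAS's kernel interface: the function name `gemv_prefetch`, the row-major layout, the prefetch distance `PF_DIST`, and the assumed 64-byte cache line are all choices made for the example. The hint uses the GCC/Clang builtin `__builtin_prefetch` and compiles to nothing on other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Read prefetch hint; GCC and Clang provide __builtin_prefetch. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

/* Assumed tuning parameter: how many elements ahead to prefetch. */
#define PF_DIST 64

/*
 * y += A * x for a row-major m-by-n matrix A with leading dimension lda.
 * One prefetch is issued per 8 doubles, i.e. once per assumed 64-byte
 * cache line, and only for addresses that lie inside the current row.
 */
void gemv_prefetch(size_t m, size_t n, const double *A, size_t lda,
                   const double *x, double *y)
{
    assert(lda >= n);
    for (size_t i = 0; i < m; i++) {
        const double *row = A + i * lda;
        double sum = 0.0;
        for (size_t j = 0; j < n; j++) {
            if ((j % 8) == 0 && j + PF_DIST < n)
                PREFETCH_READ(row + j + PF_DIST);
            sum += row[j] * x[j];
        }
        y[i] += sum;
    }
}
```

The prefetch distance and the choice of which operand to prefetch are exactly the kind of machine-specific detail that has to be tuned by timing on the target platform; the values here are placeholders, and on some machines the hints may give no benefit at all.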