The performance kernel for the entire Level 3 BLAS is matrix multiply. Matrix multiply is in turn written in terms of a lower-level building block that we call the L1 matmul. The L1 matmul is a special matrix multiply in which all input dimensions are fixed at the blocking factor NB, which is chosen in order to maximize L1 cache reuse.
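To make the fixed-dimension idea concrete, here is a minimal sketch of a cache-contained multiply in which every dimension equals the blocking factor. The value NB = 48 is purely illustrative (ATLAS tunes the blocking factor per platform), and the real kernel's signature, scaling conventions, and loop structure differ from this naive triple loop:

```c
#include <assert.h>

#define NB 48   /* illustrative blocking factor; ATLAS chooses this empirically */

/* C += A * B, where A, B, and C are all NB x NB column-major matrices.
 * A real L1 matmul kernel would be heavily unrolled and scheduled for
 * the target machine; this shows only the computation performed. */
static void l1matmul(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)            /* columns of C */
        for (int i = 0; i < NB; i++) {      /* rows of C */
            double sum = C[i + j*NB];
            for (int k = 0; k < NB; k++)
                sum += A[i + k*NB] * B[k + j*NB];
            C[i + j*NB] = sum;
        }
}
```

Because every dimension is the compile-time constant NB, the compiler (or a hand tuner) can fully unroll and block these loops, which is what makes the copy kernel so amenable to optimization.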

ATLAS actually has two different L1 matmul kernels: one for copied matrices,
and one that operates directly on the user's matrices. For matrices of
sufficient size, ATLAS copies the input matrix into *block-major* storage.
In block-major storage, the
blocks operated on by the
L1 matmul are actually contiguous. This copy avoids unnecessary cache
misses, cache conflicts, and TLB problems. For sufficiently small matrices,
however, the cost of the data copy is prohibitive, so ATLAS also has kernels
that operate directly on non-copied data. Without the copy to simplify the
problem, multiple non-copy kernels are required (for instance, differing
kernels for differing transpose settings). Since the non-copy kernels are
typically used only for very small problems, and are much more complex,
ATLAS presently accepts contributed code only for the copy L1 matmul. For
most problems, well over 98% of ATLAS's time is spent in the copy L1 matmul,
so this restriction should not matter much.
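The block-major copy itself can be sketched as follows. This is not ATLAS's actual copy routine; it is a simplified illustration that packs a column-major matrix so that each NB x NB block seen by the L1 matmul is contiguous, and it assumes for brevity that the matrix dimensions are multiples of NB (real code must handle the remainders, which is part of what the cleanup kernels address):

```c
#include <assert.h>
#include <string.h>

#define NB 48   /* illustrative blocking factor */

/* Pack a column-major M x N matrix A (leading dimension lda) into
 * block-major storage Ab: blocks are written one after another, and
 * within each block the NB columns are contiguous. Assumes M and N
 * are multiples of NB. */
static void to_block_major(int M, int N, const double *A, int lda, double *Ab)
{
    for (int jb = 0; jb < N; jb += NB)        /* walk block columns */
        for (int ib = 0; ib < M; ib += NB)    /* walk block rows */
            for (int j = 0; j < NB; j++) {    /* copy one column of one block */
                memcpy(Ab, &A[ib + (jb + j)*lda], NB * sizeof(double));
                Ab += NB;
            }
}
```

After packing, the L1 matmul streams through each block with unit stride, which is why the copied kernel avoids the cache-conflict and TLB problems described above.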

- Building the General Matrix Multiply From the L1 Cache-contained Multiply
- The L1 matmul
  - Putting it together with some examples
  - More timing info
  - Complex L1 matmul
- Providing ATLAS with kernel cleanup code
  - ATLAS and cleanup
  - User supplied cleanup
  - Indicating cleanup in the index file
  - Testing and timing cleanup
  - Importance of cleanup
- L1 matmul usage notes
- Getting ATLAS to use your kernel
- Contributing a complete GEMM implementation