Building the General Matrix Multiply From the L1 Cache-contained Multiply

This section describes the code necessary to build the BLAS's general matrix-matrix multiply using an L1 cache-contained matmul (hereafter referred to as the L1 matmul).

For our present discussion, it is enough to know that ATLAS has at its disposal highly optimized routines for doing matrix multiplies whose dimensions are chosen such that cache blocking is not required (i.e., the hand-written code discussed in this section deals with cache blocking; the generated/user supplied kernel assumes things fit into cache).

When the user calls GEMM, ATLAS must decide whether the problem is large enough to tolerate copying the input matrices

and

. If the matrices are large enough to support this

overhead, ATLAS will copy

and

into block-major format. ATLAS's block-major format breaks up the input matrices into contiguous blocks of a fixed size

, where

is chosen in order to maximize L1 cache reuse. Once in block-major format, the blocks are contiguous, which eliminates TLB problems, minimizes cache thrashing and maximizes cache line use. It also allows ATLAS to apply alpha (if alpha is not already one) to the smaller of

, thus minimizing this cost as well. Finally, the package can use the copy to transform the problem to a particular transpose setting, which for load and indexing optimization, is set so A is copied to transposed form, and B is in normal (non-transposed) form. This means our L1-cache contained code is of the form $C \leftarrow A^T B$ , $C \leftarrow A^T B + C$ , and $C \leftarrow A^T B + \beta C$ , where all dimensions, including the non-contiguous stride, are known to be . Knowing all of the dimensions of the loops allows for arbitrary unrollings (i.e., if the instruction cache could support it, ATLAS could unroll all loops completely, so that the L1 cache-contained multiply had no loops at all). Further, when the code generator knows leading dimension of the matrices (i.e., the row stride), all indexing can be done up front, without the need for expensive integer or pointer computations.

If the matrices are too small, the

data copy cost can actually dominate the algorithm cost, even though the computation cost is

. For these matrices, ATLAS will call an L1 matmul which operates on non-copied matrices (i.e. directly on the user's operands). The non-copy L1 matmul will generally not be as efficient as the copy L1 matmul; at this problem size the main drawback is the additional pointer arithmetic required in order to support the user-supplied leading dimension.

The choice of when a copy is dictated and when it is prohibitively expensive is an AEOS parameter; it turns out that this crossover point depends strongly both on the particular architecture, and the shape of the operands (matrix shape effectively sets limits on which matrix dimensions can enjoy cache reuse). To handle this problem, ATLAS simply compares the speed of the copy and non-copy L1 matmul for variously shaped matrices, varying the problem size until the copying provides a speedup (on some platforms, and with some shapes, this point is never reached). These crossover points are determined at install time, and then used to make this decision at runtime. Because it is the dominant case, this paper describes only the copied matmul algorithm in detail.

Figure 1 shows the necessary steps for computing a $N_B \times N_B$ section of

using the L1 matmul.

**Figure 1:** One step of matrix-matrix multiply
$\begin{figure}\begin{picture}(430,100)(-20,0) \put(20,15){\framebox (20,20){$C_{... ...$B_{2,2}$}} \put(250,10){\framebox (20,10){$B_{3,2}$}} \end{picture}\end{figure}$

More formally, the following actions are performed in order to calculate the $N_B \times N_B$ block $C_{i,j}$ , where

and

are in the range $0 \leq i < \lceil M/N_B \rceil$ , $0 \leq j < \lceil N/N_B \rceil$ :

As this example demonstrates, if a given dimension is not a multiple of the L1 blocking factor

, partial blocks results. ATLAS has special routines that handle cases where one or more dimension is less than

; these routines are referred to as cleanup codes.