The first line of each file is a comment line, and is ignored. The next
line indicates the number of user-contributed codes to search, and
each subsequent line supplies information about a given user-supplied
L1 matmul. The form of these lines is:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<author>"
<rout> and "<author>" are strings, and the rest of the
parameters are signed integers.
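As an illustration of the line format, here is a minimal sketch of how one descriptor line could be read with sscanf. The struct MMDesc and the function parse_desc_line are hypothetical names made up for this example; they are not part of ATLAS, and the search code may well parse these lines differently.

```c
#include <stdio.h>

/* Hypothetical container for one descriptor line. The field names mirror
 * the placeholders in the format line; they are not taken from ATLAS. */
typedef struct {
    int id, flag, mb, nb, kb, muladd, lat, mu, nu, ku;
    char rout[65];    /* filename: at most 64 chars plus terminator */
    char author[65];  /* author(s): at most 64 chars plus terminator */
} MMDesc;

/* Parse one descriptor line. Returns 1 on success, 0 if the line is
 * malformed or the ID is not strictly positive. */
static int parse_desc_line(const char *line, MMDesc *d)
{
    /* Ten integers, one whitespace-free filename, one quoted string. */
    int n = sscanf(line, "%d %d %d %d %d %d %d %d %d %d %64s \"%64[^\"]\"",
                   &d->id, &d->flag, &d->mb, &d->nb, &d->kb,
                   &d->muladd, &d->lat, &d->mu, &d->nu, &d->ku,
                   d->rout, d->author);
    return n == 12 && d->id > 0;
}
```

The field widths in the format string enforce the 64-character limits on the filename and the author string. A filename containing spaces would not parse with this sketch.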
The meanings of these parameters are:
<ID>: Strictly positive integer which uniquely identifies this descriptor line. ID must be unique only within a precision.
<flag>: flag indicating special conditions. See table below.
<mb>, <nb>, <kb>: Used to indicate a restriction on the input parameter M (N, K, resp.), and its associated blocking factor MB (NB, KB, resp.). If the value is zero, the internal routine handles any M; i.e., the loop limit is a runtime variable. If the value is negative, then M = MB = -<mb> (i.e., the blocking factor cannot be varied using a macro). If the value is positive, the blocking factor can be varied by setting the appropriate macro (MB, NB, KB, resp.), but the blocking factor must be a multiple of the value. Therefore, setting <mb> = 4 indicates that MB must be a multiple of 4, while setting it to 1 indicates that MB is an arbitrary compile-time constant.
<muladd>: Set to zero if you are using separate multiply and add instructions, 1 otherwise. If you don't know the answer, put 1.
<lat>: Set to the latency you use between floating point instructions. If you don't know the answer, put 1.
<mu>: Unrolling you are using for the M loop.
<nu>: Unrolling you are using for the N loop.
<ku>: Unrolling you are using for the K loop.
<rout>: The filename of the user-contributed routine, relative to the path ATLAS/tune/blas/gemm/CASES. Maximum length 64 chars.
<author>: The name of the author or authors, enclosed in quotes. Maximum length 64 chars.
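The three cases for the <mb>, <nb>, and <kb> fields (zero, negative, positive) can be summarized in a few lines of C. This is only a sketch of the rule as described above; blocking_ok is a hypothetical helper, not a routine found in ATLAS.

```c
/* Returns 1 if blocking factor blk is compatible with the restriction r
 * taken from the <mb>, <nb>, or <kb> field, and 0 otherwise.
 * Hypothetical helper for illustration only. */
static int blocking_ok(int r, int blk)
{
    if (blk <= 0)
        return 0;            /* a blocking factor must be positive */
    if (r == 0)
        return 1;            /* loop limit is a runtime variable */
    if (r < 0)
        return blk == -r;    /* fixed size: only -r is accepted */
    return blk % r == 0;     /* macro-settable, multiple of r */
}
```

For instance, a restriction of 4 accepts a blocking factor of 44 but rejects 42, a restriction of -40 accepts only 40, and a restriction of 1 accepts any positive value.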
Table 1 summarizes the presently defined flag values.
Here's an example:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>"
3
1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley"
2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley"
3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"
So, we have 3 user-supplied routines, all written by me. The first loops over the runtime parameters M, N, and K, but the following two routines loop over the cpp macros MB, NB, and KB. The third routine insists that KB be a multiple of 8. The first two routines don't unroll any of the loops, while the third unrolls the K loop to a depth of 4. They all use a combined muladd style of programming, and don't worry about latency.
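To make the difference between runtime loop limits and macro loop limits concrete, here is a rough sketch of the two styles. Everything in it is an assumption made for illustration: the function names, the argument lists, the column-major C += A*B formulation, and the default macro values are hypothetical, and none of it is the actual ATLAS kernel interface or the contents of the files listed above.

```c
/* Stand-in defaults for the cpp macros named in the text. In a real
 * build they would be set on the compiler command line (e.g. -DKB=8). */
#ifndef MB
#define MB 4
#endif
#ifndef NB
#define NB 4
#endif
#ifndef KB
#define KB 8
#endif

/* Style of the first routine (<mb> = <nb> = <kb> = 0): the loop limits
 * M, N, K are runtime arguments, and no loop is unrolled. */
static void mm_runtime(int M, int N, int K,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++) {
            double c = C[i + j*ldc];
            for (int k = 0; k < K; k++)
                c += A[i + k*lda] * B[k + j*ldb];  /* combined muladd */
            C[i + j*ldc] = c;
        }
}

/* Style of the third routine (<kb> = 8, <ku> = 4): the loop limits are
 * compile-time macros, and the K loop is unrolled to a depth of 4.
 * No cleanup loop is needed because KB is a multiple of 8. */
static void mm_macro_ku4(const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc)
{
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < MB; i++) {
            double c = C[i + j*ldc];
            for (int k = 0; k < KB; k += 4) {
                c += A[i + (k  )*lda] * B[(k  ) + j*ldb];
                c += A[i + (k+1)*lda] * B[(k+1) + j*ldb];
                c += A[i + (k+2)*lda] * B[(k+2) + j*ldb];
                c += A[i + (k+3)*lda] * B[(k+3) + j*ldb];
            }
            C[i + j*ldc] = c;
        }
}
```

Since the second version has fixed loop bounds, the compiler knows the trip counts at compile time, which is the point of letting the blocking factor be set by a macro.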