    Next: Providing ATLAS with kernel Up: Speeding up the Level Previous: More timing info   Contents

## Complex L1 matmul

Vainly hoping we were done, eh? Nope, complex codes are special. ATLAS actually uses the real matrix multiply generator in order to do complex multiplication. It needs some tricks to do this, obviously. The first thing to note is that if denotes the real elements of , and indicates the imaginary components, then complex matrix-matrix multiply of the form may be accomplished by the following four real matrix multiplies:
1. 2. 3. 4. This works fine assuming is real, otherwise must be applied explicitly as a complex scalar, and then set in the above outline.

Therefore, in order to use this trick, upon copying and to block-major storage, ATLAS also splits the arrays into real and imaginary components. The only matrix not expressed as two real matrices is then , and to fix this problem, ATLAS demands that the complex L1 matmul stride by 2. An example will solidify the confusion.

A simple 3-loop implementation of an ATLAS complex L1 matmul is:

void ATL_USERMM
(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00;

for (j=0; j < N; j++)
{
for (i=0; i < M; i++)
{
#ifdef BETA0
c00 = 0.0;
#else
c00 = C[2*(i+j*ldc)];
#ifdef BETAX
c00 *= beta;
#endif
#endif
for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
C[2*(i+j*ldc)] = c00;
}
}
}


First we test that it produces the right answer:

make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=0
make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=1
make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=8


Then we scope the awesome performance:

make cmmucase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=1
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251


Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:

speedy. cat res/zMMRES
MULADD  LAT  NB  MU  NU  KU  FFTCH  IFTCH  NFTCH    MFLOP
0    5  36   2   1  36      0      2      1   186.42
10

speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420    Next: Providing ATLAS with kernel Up: Speeding up the Level Previous: More timing info   Contents
R. Clint Whaley 2001-08-04