Re: SSE Level 3 drop in gemm



Camm,

No knowledge/understanding of the register reservation, unfortunately . . .

>Otherwise, the kernel is working fine.  Performance fluctuates on the
>short timer runs, but is somewhere between 670 and 700 MFLOPS for the
>beta=0 case, and about 670 for arbitrary beta.

Great, that represents something like a 1.9x speedup over ATLAS's kernel,
doesn't it?

>On another front -- Do you have any word on the complex compilation
>procedure, Clint?  The deal is that all beta cases seem to be
>referenced by the same timer (fc.c) program, regardless of beta= flag.

Yep, ATLAS/doc/atlas_contrib.ps explains this in the section on complex
matmul: it's done with 4 calls to what is essentially a real matmul.  Even
the complex beta=1 case requires a real beta=X kernel, because you need
beta=-1.0: the product of the two imaginary elements contributes negatively
to the real component (notice that steps 1 and 3 on page 14 use a negative
beta).  The timer compiles your complex code 3 times to get the b1, b0, and
bX cases.  What exactly is the problem you are having with it?
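
To make the 4-call idea concrete, here is a minimal sketch in C for complex
beta=1.  It is an illustration under assumptions, not ATLAS code: the names
real_mm and complex_mm_beta1 are hypothetical, the real and imaginary parts
are assumed to be stored in separate n x n row-major arrays, and the call
order is just one ordering consistent with the description above (the
authoritative steps are the ones in atlas_contrib.ps).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical real kernel (not an ATLAS routine):
 * C = beta*C + A*B for n x n row-major matrices. */
static void real_mm(size_t n, const double *A, const double *B,
                    double beta, double *C)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = beta * C[i * n + j] + acc;
        }
    }
}

/* Complex C = C + A*B (complex beta = 1) built from 4 real calls.
 * Assumed layout: rX holds real parts, iX holds imaginary parts.
 * Two of the calls need real beta = -1.0, i.e. a beta=X kernel. */
static void complex_mm_beta1(size_t n,
                             const double *rA, const double *iA,
                             const double *rB, const double *iB,
                             double *rC, double *iC)
{
    real_mm(n, iA, iB, -1.0, rC); /* rC = -rC + iA*iB                 */
    real_mm(n, iA, rB,  1.0, iC); /* iC =  iC + iA*rB                 */
    real_mm(n, rA, rB, -1.0, rC); /* rC =  rC(orig) - iA*iB + rA*rB   */
    real_mm(n, rA, iB,  1.0, iC); /* iC =  iC(orig) + iA*rB + rA*iB   */
}
```

Negating rC twice restores its original sign while leaving the iA*iB term
subtracted, which is why a kernel fixed at beta=1 is not enough for the
real component even when the complex beta is 1.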

Cheers,
Clint