
Re: SSE Level 3 drop in gemm


R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> No knowledge/understanding of the register reservation, unfortunately . . .

I've uncovered a bit on this.  gcc 2.95.2 globally fixes ebx when the
-fPIC flag is set.  I see no explicit mention of this in the
documentation, but there is a hint in the explanation of the
-mregparm option, which can specify up to three registers (EAX, ECX
and EDX) to pass integer arguments to functions.  Presumably EBX is
omitted due to its special use in the compiler.

The kernel I have works around this, but I'm a bit unsettled.

1) Could other registers be thus used in place of ebx in other
   versions of gcc?  If so, the kernel as written will simply fail to
   compile, unless this register is edi, in which case it may
   produce erratically incorrect code.
2) What about other compilers?  Anyone know if ATLAS is extensively
   used with other compilers, and whether those even accept __asm__?
3) Most users building a static lib will never notice this, but in the
   Debian package, we build a shared lib as well, so need to be able
   to compile with -fPIC.
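
One way to reduce exposure to the first two concerns, sketched under
assumptions: let the compiler pick registers through operand
constraints instead of naming them in the asm text, and guard the asm
so that non-GNU compilers or non-x86 targets fall back to plain C.
The function madd below is a hypothetical illustration, not code from
the kernel.

```c
/* Hypothetical helper: returns a*b + c.  Not taken from the ATLAS kernel. */
static int madd(int a, int b, int c)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result = a;
    /* No register is named in the asm text.  The "r" constraints let
       the compiler choose, so a register it has reserved (such as ebx
       under -fPIC on 32-bit x86) is never handed to us.
       "+&r" marks result as read-write and early-clobbered, so it
       cannot share a register with the inputs b or c. */
    __asm__ ("imull %1, %0\n\t"   /* result *= b */
             "addl  %2, %0"       /* result += c */
             : "+&r" (result)
             : "r" (b), "r" (c)
             : "cc");
    return result;
#else
    /* Fallback for compilers without GNU-style __asm__ or other targets. */
    return a * b + c;
#endif
}
```

This only helps where the kernel can spare the choice: a loop that
needs every general register at once still has to save and restore
the reserved one by hand.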

On a separate front, I've taken the optimally generated double
precision matmul kernel and tried adding prefetch.  Thus far, the
result is an inexplicable (to me at least) dramatic drop in
performance, approaching a factor of 2!  I thought some on this list
might know what is going on.

Take care,

> >Otherwise, the kernel is working fine.  Performance fluctuates on the
> >short timer runs, but is somewhere between 670 and 700 MFLOPS for the
> >beta=0 case, and about 670 for arbitrary beta.
> Great, that represents something like a 1.9 speedup over ATLAS's kernel,
> doesn't it?
> >On another front -- Do you have any word on the complex compilation
> >procedure, Clint?  The deal is that all beta cases seem to be
> >referenced by the same timer (fc.c) program, regardless of beta= flag.
> Yep, ATLAS/doc/atlas_contrib.ps explains this in the section on complex
> matmul: it's done with 4 calls to essentially a real matmul.  Even the
> case of beta=1 requires a real beta=X, 'cause you need the -1.0 case
> because of the two imaginary elements that contribute to the real
> component (notice steps 1 and 3 on page 14 use negative).  The timer compiles your
> complex code 3 times to get the b1, b0, and bX cases.  What exactly is
> the problem you are having with it?
> Cheers,
> Clint
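
For reference, the four-call scheme described in the quoted text can
be sketched roughly as follows.  This is a minimal illustration under
assumptions, not the procedure from atlas_contrib.ps: real_gemm is a
hypothetical stand-in for the real kernel (C = A*B + beta*C), the real
and imaginary parts are kept in separate n x n row-major arrays for
simplicity, beta is taken to be real, and the order of the calls is a
guess.  It shows why a -1.0 beta is needed even when the caller passes
beta=1: the kernel has no way to subtract a product, so the sign is
carried through C instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for a real matmul kernel: C = A*B + beta*C. */
static void real_gemm(size_t n, const double *A, const double *B,
                      double beta, double *C)
{
    size_t i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum + beta * C[i*n + j];
        }
}

/* Complex C = A*B + beta*C built from four real calls. */
static void complex_gemm(size_t n,
                         const double *Ar, const double *Ai,
                         const double *Br, const double *Bi,
                         double beta, double *Cr, double *Ci)
{
    real_gemm(n, Ai, Bi, -beta, Cr); /* Cr = Ai*Bi - beta*Cr           */
    real_gemm(n, Ar, Bi,  beta, Ci); /* Ci = Ar*Bi + beta*Ci           */
    real_gemm(n, Ar, Br, -1.0,  Cr); /* Cr = Ar*Br - Ai*Bi + beta*Cr   */
    real_gemm(n, Ai, Br,  1.0,  Ci); /* Ci = Ai*Br + Ar*Bi + beta*Ci   */
}
```

With beta=1 the calls use betas of -1, 1, -1 and 1, so both a b1 and a
bX build of the real kernel are exercised; with beta=0 the first two
calls become the b0 case.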

Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah