
No Subject


This is a note on some of the stuff I've been doing, along with some questions
I haven't figured out yet.  Odds are someone out there knows some of the
answers and can save me some work.  So, if you know something about the
topics mentioned below, by all means send it in . . .

   Does anyone know how to pass an address to inline assembler using
   C compilers other than:
      * gcc
      * Dec's cc
   (I've already figured these out)?

   Does anyone know why gcc produces *much* slower code for single precision
   than double precision on UltraSparcs (see below for more detail)?

   Do you know how to do prefetch from C on some system where I haven't
   figured it out yet?

The main thing I've been doing is beefing up prefetch support.  So far, I
have prefetch supported on x86 (SSE/3DNow!) using gcc, ev6 using gcc OR 
Dec's cc, and UltraSparcs using gcc.

Prefetch is not very well documented by most of the vendors, and inline
assembler gets the same documentation treatment.  You know it's there
somewhere, but it takes some endurance to find it . . .

It appears to me that PowerPC (non-altivec) and SGI RXX000 prefetches are
not very usable from C source.  In particular, both seem to require you to
preload the addresses to registers, and I don't know of any way to ensure
that the C compiler isn't using the registers I'd choose to use in my prefetch
macro.  I could perhaps save and restore them, but something tells me that
would raise the cost of the prefetch pretty high . . .

With both of these prefetches, it may be possible to support using the
native compiler's #pragmas, but I have not taken the time to scope this
out yet, since I like to have solutions that will work on any compiler/OS . . .

I've seen some code in my net searches that indicates there's usable prefetch
on IA64 and HP machines, but I have not yet scoped them out.  I don't have
access to any modern HP architectures, so I'm not sure I'll be able to do
that anyway.

For AltiVec, the prefetch model is not a good match with the one I'm
presently using in ATLAS, so I think we'll need AltiVec-specific codes
to really use it well . . .

Initial results show prefetch making huge differences in some Level 1
routines (as much as an order of magnitude).  From Camm's earlier work, I
think it should rock the Level 2 world as well, but I haven't had time to
scope that out myself.

The puny 8 registers of the x86 world don't leave much slack in gemm
when you aren't beating up the L1, and I blame this for my failure to get
x86 (Athlon/PIII) gemm kernel speedups using prefetch.  So far, prefetch
actually slows down the UltraSparc II gemm kernel.  Makes me
wonder if the USII prefetch goes to L2, not L1 . . .

I managed to use prefetch to get a very nice speedup on the ev6.  This gave
me an 860Mflop kernel (810Mflop full GEMM) on our 500MHz ev6.  The generated
full gemm clocks in around 730Mflop.  However, since Goto gets a 910Mflop
DGEMM, this work is pretty much of academic interest only (though the
kernel approach, unsurprisingly, seems to do better as a building block
for the gemm-based BLAS on small problems; for instance, in timing LU,
ATLAS built with the new kernel beat ATLAS with Goto's GEMM until around
N=600) . . .

During all of this work, I played with the UltraSparc, and using register
prefetch, was finally able to write a kernel that runs at the same speed
as the Nguyen & Strazdins UltraSparc kernel.  Finally understanding what
was going on here, I hit again a question I've had before: why is our
single precision performance so much worse than our double?

We have to use gcc on these kernels because they rely on precise instruction
ordering, and gcc can be made to leave our ordering alone (Sun's cc, like
Dec's cc, will mess up perfectly ordered code in an attempt to "optimize"
it, no matter what flags you throw).

We use the same C code for both single and double precision.  I would expect
performance to be similar (though you might need to vary NB), but what I
see is that double precision runs roughly 25% faster.  Obviously, if one
is to be faster, it should be single.  The only thing I can think of is that
it has something to do with the load instruction used; I know double precision
performance takes a beating if you don't assume it is 8-byte aligned.  Perhaps
this is the problem with single precision?  Does anyone know of any reason
for single to be slower than double on UltraSparcs, in particular with gcc?

Any info appreciated,