ASIDE FOR GROUP:
I've been working on the AltiVec kernels. On my 533 MHz G4, I'm presently
getting around 710 Mflop for the DGEMM kernel, ~2.2 Gflop for a newly
written SGEMM kernel in non-IEEE mode, and ~2.0 Gflop in IEEE (java) mode,
and Peter is discussing getting his code generator cranking on the
problem . . .
>I am not impressed by the 2 GFlop results. It might be all we get, but
>the AltiVec should have a peak of 8 flops per clock cycle (4 muladds per
>cycle), so it should be doing better.
The main thing I wonder about is why java mode slows down the computation.
All it should do from my reading is add an extra stage to the pipeline.
Since we are using enough registers to handle the longer pipeline, it is
a real mystery to me why adding the extra stage drops us from 2.2
Gflop to 2. Makes me wonder if that extra stage is not pipelined . . .
I'm far more underwhelmed by the 710 Mflop DGEMM than I am by the 2 Gflop
SGEMM. At 2 Gflop, maybe I start to think there are memory fetch limits
hitting us (though java mode slowing us down says this is probably not
the case), but since SGEMM can go that fast, there's no reason why DGEMM
cannot hit close to FPU peak, unless the AltiVec can suck memory faster
than the normal FPU . . .
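
To put the numbers above side by side, here is a small sketch of the
peak arithmetic. It uses the 533 MHz clock and the 8 flops per cycle
AltiVec peak quoted in this thread. The 2 flops per cycle figure for the
scalar FPU is an assumption (one fused multiply-add per cycle, which is
what "4 times faster" would imply), and the function names are made up
for illustration.

```c
#include <assert.h>

/* Peak rate in Mflop: clock in MHz times flops retired per cycle. */
double peak_mflop(double clock_mhz, double flops_per_cycle)
{
    assert(clock_mhz > 0.0 && flops_per_cycle > 0.0);
    return clock_mhz * flops_per_cycle;
}

/* Fraction of peak reached by a measured rate (both in Mflop). */
double fraction_of_peak(double measured_mflop, double peak)
{
    assert(peak > 0.0);
    return measured_mflop / peak;
}
```

Under those assumptions, peak_mflop(533.0, 8.0) gives an AltiVec peak of
4264 Mflop, so the 2.2 Gflop SGEMM sits at roughly 52% of peak, and
peak_mflop(533.0, 2.0) gives an assumed scalar peak of 1066 Mflop, which
would put the 710 Mflop DGEMM at roughly 67%.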
If we weren't memory bound, I thought I'd be able to get some speedup by
using the normal FPU along with the AltiVec. Essentially, since the AltiVec
is 4 times faster than the regular FPU, you can imagine doing your loop
such that 4/5 of the loop is done by the AltiVec, and 1/5 is done by the
FPU. I implemented this, and got a very nice slowdown.
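
As a rough sketch of the 4/5 and 1/5 split described above (not the
kernel itself, which is not shown in this message), the iteration count
could be divided in proportion to the relative speed of the two units.
The helper name split_work, its parameters, and the rounding of the
vector share down to a multiple of the vector width are all assumptions
made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: divide n loop iterations between a vector unit
 * and a scalar unit in proportion to their relative throughput
 * (vec_rate : scalar_rate, e.g. 4 : 1). The vector share is rounded
 * down to a multiple of vec_width so it holds only whole vectors; the
 * scalar unit takes whatever is left.
 */
void split_work(int n, int vec_rate, int scalar_rate, int vec_width,
                int *n_vec, int *n_scalar)
{
    assert(n >= 0 && vec_rate > 0 && scalar_rate > 0 && vec_width > 0);
    int share = (int)((long)n * vec_rate / (vec_rate + scalar_rate));
    *n_vec = share - share % vec_width;   /* whole vectors only */
    *n_scalar = n - *n_vec;               /* remainder goes to the FPU */
}
```

With n = 100, a 4 : 1 rate and a vector width of 4, this hands 80
iterations to the vector unit and 20 to the scalar FPU. It only decides
how much work each unit gets; it does nothing about interleaving the two
instruction streams, and a split like this pays off only if both units
can actually run concurrently.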
Part of the problem is that cc totally craps out on scheduling when you
try to run both units, so you have to do the scheduling yourself instead
of letting cc do it for you. I played with this a bit, but no luck. My
2.2/2.0 kernel uses the normal FPU for C/beta computations, and the AltiVec
for everything else . . .
>Have you gotten any results on the UltraSparcIII machine?
Nothing spectacular. I think I made a kernel that was faster than the stock
US kernel, but I don't think it was any great shakes. I'd have to dig around
to recover the stuff . . .