Re: sgemm questions


>I've looked at the assembly produced by compiling Peter's generated C,
>and it looks very good!  It's giving me a few ideas, but raises above
>all one important question: Why is SSE so much worse than 3dNow!?  It
>makes me think that we're missing something on the SSE front.  In
>fact, I'm a little surprised that Peter's SSE code shouldn't have done
>better than what I submitted, as the pipelining certainly seems
>better.  My guess is that the Athlon really wants a mul followed
>immediately by a different add (which reportedly can be done in one
>cycle), whereas SSE prefers some non-fpu instruction(s) between these

I'll make the completely unwarranted assumption that knowing something about
the normal FPU allows me to say something about the SSE/3DNow! stuff.  I think
one of the key differences may be that the Athlon has two floating-point
units, so you have a strong need to mix muls and adds, while the PIII has only
one unit for both ops.  I am pretty sure the current bottleneck in SSE
performance is keeping the FPU adequately busy (since not flushing the cache
did not make a big difference in performance), so pipelining and related
issues certainly seem promising.
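
To make "mixing muls and adds" concrete, here is a rough sketch in plain
scalar C (not SSE or 3DNow! code, and not taken from either submitted kernel;
the name dot_pipelined is made up for illustration).  The multiply for element
k sits next to the add for element k-1, so the two are independent and could
in principle go to separate mul and add units.  A compiler is free to reorder
this, so treat it as a picture of the schedule rather than a guarantee of it.

```c
/* Hypothetical sketch: software-pipelined dot product.
 * The mul for element k is written next to the add for element k-1,
 * so neither has to wait on the other's result.
 * Assumes n >= 1. */
float dot_pipelined(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    float prod = x[0] * y[0];      /* prologue: first multiply */

    for (int k = 1; k < n; k++) {
        float next = x[k] * y[k];  /* mul for element k */
        sum += prod;               /* add for element k-1, independent of the mul */
        prod = next;
    }

    sum += prod;                   /* epilogue: last add */
    return sum;
}
```

For x = {1, 2, 3, 4} and y = {5, 6, 7, 8}, dot_pipelined(x, y, 4) adds up
5 + 12 + 21 + 32.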

The Athlon has a couple of advantages.  First, with four times the L1 cache,
you can run a much longer K-loop, which I think is important when pipelining
vectors.  Also, assuming I'm correct in guessing that you both loaded the
vectors along the K-dimension, the 4-length SSE vectors versus the 2-length
3DNow! vectors mean that the Athlon gets essentially twice as many K-loop
iterations even with equal K . . .
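
Here is the iteration-count arithmetic as a small sketch in plain C, assuming
the vectors really are loaded along K.  It emulates a vector register with a
short array instead of using real SSE or 3DNow! instructions, and the name
dot_along_k and the iteration counter are invented for illustration.

```c
/* Hypothetical helper: dot product along K with an emulated vector
 * register of veclen floats (4 for SSE, 2 for 3DNow!).
 * Assumes K is a multiple of veclen and 1 <= veclen <= 4.
 * The number of vector iterations is stored in *iters. */
float dot_along_k(const float *a, const float *b, int K, int veclen,
                  int *iters)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};  /* emulated vector accumulator */
    int count = 0;

    for (int k = 0; k < K; k += veclen) {
        /* one "vector" multiply-accumulate covers veclen elements of K */
        for (int lane = 0; lane < veclen; lane++)
            acc[lane] += a[k + lane] * b[k + lane];
        count++;
    }

    /* horizontal sum of the lanes after the loop */
    float sum = 0.0f;
    for (int lane = 0; lane < veclen; lane++)
        sum += acc[lane];

    *iters = count;
    return sum;
}
```

With K = 64, veclen = 4 gives 16 trips through the loop and veclen = 2 gives
32, so the shorter vectors leave twice as many iterations over which to
amortize the loop prologue and epilogue.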

>Otherwise, it looks like we made the wrong CPU choice for our upcoming
>Beowulf upgrade.

If you chose PIIIs over Athlons for floating point work, you did anyway :)
The Athlon has twice the theoretical floating-point peak of a PIII
(normal FPU), actually _achieves_ more than a similarly clocked PIII's
theoretical peak, and comes in faster MHz versions as well . . .