
Re: UltraSparc kernel results



Peter,

>with zgemm(), I feel that it will always be hard for a kernel to do better
>than a hand-written implementation (you can get close, but second-order
>things like stride-2 accesses may take that few % off that makes the
>difference). 

I agree; when you are dealing with such high percentages of peak, even
a low-order term like the access of C can be an insurmountable hurdle;
I increased the speed of your kernel by a few percentage points by fixing
the loops and so on, and in doing so was amazed at the 10% drops you could get
in performance by moving single instructions . . .

>I'd be interested to hear how the full user-supplied US zgemm()
>implementations compare to SunPerf. 

A good point, and I'll be interested as well.  However, it will be a bit
before I get to them; I've been doing the work on your kernel as part of
the debugging of the new GEMM kernel install (which now allows user-supplied
cleanup; very important if you are using nb=80); I will not be looking at
user-supplied full gemm's (I have Doug's SSE sgemm as well as your stuff)
until the tarfile for the release is practically complete . . .

>  >>  That's the good news.  The bad news is I got access to an Ultra-5/10,
>  >>  sun's PCI-based low-end ultrasparc, and the submitted kernels don't
>  >>  seem to do very well on those machines; ATLAS's generated code is
>  >>  as good as the kernel there, and both get *completely* waxed by
>  >>  sunperf.  My guess is the motherboard can have such an effect
>  >>  because the UltraSparc II has an off-chip cache, and the PCI-based
>  >>  one makes the code really different . . .  Anyway, I'll have to 
>  >>  investigate this further, maybe I just messed up the build . . .
>  >>  
>Hmm, this must be the one based on the Ultra IIi chip.  I ran a
>benchmark on one of these some time ago, and was so disgusted with the
>performance (relative to clock speed), I vowed never to run numeric
>codes on that processor again :). 
>
>I read an article on the IIi, but there was nothing to suggest that it
>should be significantly different from the II for floating point. 
>Possibly you need to use an explicit prefetch instruction (which SunPerf
>uses) to get good performance? 

As I said, I suspect the difference is in the L2 caches.  The chip is pretty
much the same.  With the suspicion on the L2 cache, I would say the prefetch
instruction is probably the culprit.  It's worth noticing that the gap between
Sunperf and ATLAS is only 1/2 as wide on an UltraSparc I (which does not
implement the prefetch instruction) as it is on an UltraSparc II.  After the
release, we are planning to have an atlas_prefetch.h with some macros that
use the various computer-specific prefetches (SSE/3DNow/MMX/UltraSparc/Power3).
My hope is this might allow us to play with this kind of thing ourselves . . .

Here are some numbers supporting the idea that the Ultra5/Ultra2 difference
comes from the L2 or memory system:
                    out-of-cache    in-cache
Ultra5-269MHz       285.7 (53%)     435.1 (81%)
Ultra2-200MHz       283.9 (71%)     346.2 (87%)
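
The percentages in the table above can be reproduced with a little arithmetic.
The sketch below is only an illustration, and it rests on two assumptions that
are not stated in this mail: that the raw figures are MFLOPS, and that peak is
2 floating point operations per clock cycle.  The function name
percent_of_peak is made up for this example.

```python
# Assumption (not stated in the mail): figures are MFLOPS and the chip
# peaks at 2 floating point operations per clock cycle.
FLOPS_PER_CYCLE = 2


def percent_of_peak(mflops, clock_mhz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Return measured speed as a percentage of theoretical peak."""
    peak_mflops = clock_mhz * flops_per_cycle
    return 100.0 * mflops / peak_mflops


# (clock in MHz, out-of-cache figure, in-cache figure) from the table
results = {
    "Ultra5-269MHz": (269, 285.7, 435.1),
    "Ultra2-200MHz": (200, 283.9, 346.2),
}

for name, (mhz, out_of_cache, in_cache) in results.items():
    print("%-14s out-of-cache %3.0f%%   in-cache %3.0f%%" % (
        name,
        percent_of_peak(out_of_cache, mhz),
        percent_of_peak(in_cache, mhz),
    ))
```

Under those assumptions the rounded results agree with the percentages in the
table, and they make the point visible: the raw out-of-cache figures are nearly
identical, so the faster-clocked Ultra5 loses much more of its peak once the
data no longer fits in cache.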

So you see that the in-L1 performance is comparable, but when you exceed it,
the non-PCI solution pulls ahead.  I did a little more work after the last
mail, and what I have found is that the best blocking factor (NB) for your
kernel depends on the system:

Ultra5:   40
Ultra2:   80
Ultra4:  120

And, again, my guess is that the growth in block factor corresponds to better
L2s; it's too bad the UltraSparc L2 is not on-die to avoid this problem . . .
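
To give a rough feel for why a larger block factor leans harder on the L2,
here is a small sketch of how the block footprint grows with NB.  It is only
an illustration: it assumes 8-byte double precision elements and counts three
NB x NB blocks (one each of A, B and C), which is a simplification and not a
description of what the kernel actually keeps resident.  The function names
are made up for this example.

```python
# Assumptions (illustrative only): 8-byte elements, and a working set of
# three NB x NB blocks (one each of A, B and C).
ELEMENT_BYTES = 8
BLOCKS_IN_FLIGHT = 3


def block_bytes(nb, element_bytes=ELEMENT_BYTES):
    """Bytes occupied by a single NB x NB block."""
    return nb * nb * element_bytes


def working_set_bytes(nb, element_bytes=ELEMENT_BYTES, blocks=BLOCKS_IN_FLIGHT):
    """Bytes occupied by `blocks` NB x NB blocks."""
    return blocks * block_bytes(nb, element_bytes)


# Best block factors reported above for each system
best_nb = {"Ultra5": 40, "Ultra2": 80, "Ultra4": 120}

for system, nb in best_nb.items():
    print("%-7s NB=%3d  one block: %6.1f KB  three blocks: %6.1f KB" % (
        system,
        nb,
        block_bytes(nb) / 1024.0,
        working_set_bytes(nb) / 1024.0,
    ))
```

The footprint grows with the square of NB, so going from 40 to 120 multiplies
it by nine; for 16-byte complex elements, pass element_bytes=16 and every
figure doubles.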

Also, as long as you adjust NB, your kernel is better than the ATLAS kernel
even on the Ultra 5, though its lead in percentage terms is much smaller . . .

Cheers,
Clint