
Re: UltraSPARC dgemm user contribution

Hi Clint,

Thanks for looking at my codes.

 >> From rwhaley@cs.utk.edu  Thu Aug 24 11:38:55 2000
 >> >and were primarily written by a Viet Nguyen, who worked with me last year
 >> >on (mainly complex) UltraSPARC BLAS (and did an excellent job too).
 >> >The kernels use `lookahead over the level 1 cache' (equivalent to prefetching)
 >> >so they can perform well for large blocksizes (eg 60-90).
 >> I just hastily scoped out your kernel, and found the best performance
 >> at NB=80.  More surprisingly, ATLAS itself does slightly better for
 >> NB > 44 (the last size that completely fits in cache); my guess is
 >> this is due to the L1 being 2-way associative; some stuff gets
 >> knocked out anyway due to conflicts, etc., so overflowing the cache
 >> is not a big deal when you have associativity and its corollary departure
 >> from true LRU.  Still, does not explain why your kernel likes 80 so
 >> much.  Any ideas?

OK, the lookahead over the level 1 cache, equivalent to prefetching,
effectively hides L1 cache misses (except direct-mapped cache conflicts
between 2 pieces of data being used at the same time). So the kernel does
not suffer many more L1 misses at NB=80 than at NB=40.

The other part of the question is why the kernel doesn't do better at NB=40:
1) The lookahead deepens the pipelining, so you get a higher startup cost.
   This is one reason why it is slower for small NB.
2) L2 cache usage will be better at larger NB, e.g. NB=80.
You may remember that a few years ago, when Ken Stanley & I were advocating
large blocking factors, I proposed a formula for an optimal K:
K ~= sqrt(CacheSize/2).

The idea was that this would allow square blocks of A & B to fit in the
cache; their being square should minimize the misses due to memory
references to A, B and C.

Here CacheSize would be the effective Level 2 cache size, i.e. the part of
the level 2 cache that can be comfortably spanned by the TLB.
On a SunOS Ultra, this works out to be 256KB, corresponding to K = 128
(empirically K=80-90 works out a bit better).
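
As a rough check of the arithmetic above, here is a minimal sketch in
Python. It assumes CacheSize is counted in 8-byte double elements (the
reading under which 256KB gives K = 128) and that the formula comes from
fitting one K x K block of A and one K x K block of B, i.e. 2*K^2 =
CacheSize. The function names are made up for illustration and are not
part of ATLAS or the kernels.

```python
import math

def blocking_factor(cache_bytes, elem_bytes=8):
    """K ~= sqrt(CacheSize/2), with CacheSize counted in matrix elements.

    Assumption: elements are 8-byte doubles unless elem_bytes says otherwise.
    """
    cache_elems = cache_bytes / elem_bytes
    return math.sqrt(cache_elems / 2)

def two_block_footprint(k, elem_bytes=8):
    """Bytes occupied by one K x K block of A plus one K x K block of B."""
    return 2 * k * k * elem_bytes

if __name__ == "__main__":
    # Effective L2 size quoted above: 256KB.
    print(blocking_factor(256 * 1024))  # 128.0
    # Footprint of the two blocks at the empirically better sizes and at K=128.
    for nb in (80, 90, 128):
        print(nb, two_block_footprint(nb))
```

At K = 128 the two blocks fill the 256KB exactly, while K = 80-90 leaves
roughly half of it free. That is at least consistent with the smaller
values doing a bit better in practice, although the sketch says nothing
about why.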

 >> This is just the kernel, but a 20% speedup looks pretty sweet to me . . .
 >> You know, as a child, no matter what the size of the piece of pie given
 >> to me, I always checked the pan for any remainder; do you have a kernel
 >> for other precisions as well? :)

Well, since you are such a charming guy, I've had a go at double complex
and updated:


to reflect this, and also made some minor fixes (e.g. made static a few
symbols accidentally left global) to the double kernels.

The complex L1 kernel was easily derived from the double L1 kernel,
so its relative performance should be the same.

The full kernel (Viet Nguyen's masterpiece!) is a huge file, as his
zgemm() called zgemv() and lots of level 1 routines as well... 

I was unable to run the usual user-supplied tests (see below for the
problem), but I was able to re-install ATLAS with -DUSERMM and the
kernel ended up in libatlas.a so I assume it worked properly :). 

 >> >The full kernel generally shows  `speed ups' of > 1.0 in the
 >> >except for small matrices. eg.
 >> Did you find this faster than simply building an ATLAS with your kernel?

I hadn't tried the latter then. 

Anyway, I hope the complex codes will be useful.  The full kernel showed
quite a convincing improvement over Sun PerfLib 1.2, so I'd be interested
to hear how you find it.

Regards, Peter


peter@kaffa.anu.edu.au cd $ATLAS/src/blas/gemm
peter@kaffa.anu.edu.au touch ATL_AgemmXX.c ATL_gemmXX.c
peter@kaffa.anu.edu.au cd SunOS_SunUS1
peter@kaffa.anu.edu.au make zlib
make: Fatal error: Don't know how to make target `ATL_zNBmm_b1.o'