
PIII kni l2 kernels, all precisions

Greetings!  I've put the latest PIII level2 stuff at 


Here are the results:

0 16 2 0.96 199.81 ATL_cger1_SSE.c "Camm Maguire"
0 16 1 0.56 54.16 ATL_dger1_SSE.c "Camm Maguire"
32 32 2 0.90 98.30 ATL_sger1_SSE.c "Camm Maguire"
4 8 1 0.50 103.97 ATL_zger1_SSE.c "CM"

20 16 3 1.00 392.54 ATL_gemvN_SSE.c "Camm Maguire"
4 4 16 1.00 413.50 ATL_gemvT_SSE.c "Camm Maguire"
20 16 4 0.98 104.62 ATL_gemvN_SSE.c "Camm Maguire"
20 2 16 0.98 136.86 ATL_gemvT_SSE.c "Camm Maguire"
16 32 3 1.00 246.49 ATL_gemvN_SSE.c "Camm Maguire"
32 4 32 1.00 272.74 ATL_gemvT_SSE.c "Camm Maguire"
20 8 2 1.00 196.64 ATL_gemvN_SSE.c "Camm Maguire"
4 2 8 1.00 244.71 ATL_gemvT_SSE.c "Camm Maguire"

A few notes:

0)  This represents ~ +100% gain over standard gcc atlas for s and c,
    and ~ +50% for d and z, except dger, which is about +25%.
1)  You should probably comment out the NO_INLINE macro in camm_util.h
    in the final build.  This macro should only be necessary when compiling
    with -g and linking with -O3 libs (?)
2)  The gemv have a unified source file for all precisions, and
    transpose/no transpose.
	a) It might be a good idea to have a -DTRANSPOSE on the
	    compile line, i.e. like -DSREAL, etc.
	b) This was done to reduce code duplication and facilitate
	    updates.  For reasons I don't fully understand yet, the
	    separate precision files do a little better in a few
	    circumstances, and a lot worse in most others.  As I'm
	    leaving today for about a week, and don't have time to sort
	    this out before going, I've included the separate
	    precision files as well (in atlas.l2.20000830.sp.tgz),
	    even though they're not listed in my ?cases.dsc files.
	c)  Haven't had time yet to merge ger into the unified source
	    file.  All ger routines are separate precision files.
	d)  I originally thought the merge might speed development,
	    but am not so sure now.  Opinions as to the best way to
	    manage this stuff are appreciated.
3)   I've included a scale file as well to handle the beta
    multiplication under certain circumstances.  Couldn't think of a
    good way to handle beta in the kernel itself in the no-transpose
    case, so now I write over y with beta*y before entering the kernel
    in all cases except beta=0.0/transpose, when I just omit the
    final add of the original value of *y.  Any suggestions most
    welcome. This would seem to incur a cache pollution hit, but it
    seems pretty small, although I did notice some improvement when
    omitting the memset(y,0,m*sizeof(*y)) in the beta0/transpose case.
	a) I don't suppose kni level1 routines are of any value?
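
To make the beta handling in note 3 concrete, here is a minimal plain-C
sketch of the control flow only.  It is not the SSE code: the names
scale_y, gemvN_accum, gemvT and gemv_sketch are hypothetical, and it
assumes double precision and column-major storage with leading
dimension lda.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scale step: overwrite y with beta*y before the kernel runs. */
static void scale_y(size_t len, double beta, double *y)
{
    size_t i;
    if (beta == 1.0)
        return;                         /* nothing to do */
    for (i = 0; i < len; i++)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
}

/* No-transpose kernel: y += alpha*A*x, A is m x n.
 * y is updated once per column, so it must already hold beta*y. */
static void gemvN_accum(size_t m, size_t n, double alpha, const double *A,
                        size_t lda, const double *x, double *y)
{
    size_t i, j;
    for (j = 0; j < n; j++) {
        const double t = alpha * x[j];
        for (i = 0; i < m; i++)
            y[i] += t * A[i + j * lda];
    }
}

/* Transpose kernel: each y[j] is one complete dot product, so the
 * final add of the old y[j] can simply be skipped when beta == 0. */
static void gemvT(size_t m, size_t n, double alpha, const double *A,
                  size_t lda, const double *x, double *y, int add_y)
{
    size_t i, j;
    for (j = 0; j < n; j++) {
        double dot = 0.0;
        for (i = 0; i < m; i++)
            dot += A[i + j * lda] * x[i];
        y[j] = add_y ? y[j] + alpha * dot : alpha * dot;
    }
}

/* Driver: y = alpha*op(A)*x + beta*y, with op(A) = A or A^T. */
void gemv_sketch(int trans, size_t m, size_t n, double alpha,
                 const double *A, size_t lda, const double *x,
                 double beta, double *y)
{
    assert(A && x && y && lda >= m);
    if (!trans) {
        scale_y(m, beta, y);            /* write beta*y over y first */
        gemvN_accum(m, n, alpha, A, lda, x, y);
    } else if (beta == 0.0) {
        gemvT(m, n, alpha, A, lda, x, y, 0);   /* no scale, no final add */
    } else {
        scale_y(n, beta, y);
        gemvT(m, n, alpha, A, lda, x, y, 1);
    }
}
```

The extra pass over y in scale_y is the possible cache pollution
mentioned above; only the beta=0.0/transpose path avoids touching y
before the kernel.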

In general, I'd appreciate ideas from others on how to produce these
routines more quickly.  I'd like to take a shot at a gemm kernel when
I get back.

PS.  Has anyone seen the following performance comparison with atlas,
and/or have any comments?

Take care,

Peter Strazdins <Peter.Strazdins@cs.anu.edu.au> writes:

> Hi Clint,
> thanks for looking at my codes.
>  >> From rwhaley@cs.utk.edu  Thu Aug 24 11:38:55 2000
> ...
>  >> >and were primarily written by a Viet Nguyen, who worked with me last year
>  >> >on (mainly complex) UltraSPARC BLAS (and did an excellent job too).
>  >> >The kernels use `lookahead over the level 1 cache' (equivalent to prefetching)
>  >> >so they can perform well for large blocksizes (eg 60-90).
>  >> 
>  >> I just hastily scoped out your kernel, and found the best performance
>  >> at NB=80.  More surprisingly, ATLAS itself does slightly better for
>  >> NB > 44 (the last size that completely fits in cache); my guess is
>  >> this is due to the L1 being 2-way associative; some stuff gets
>  >> knocked out anyway due to conflicts, etc., so overflowing the cache
>  >> is not a big deal when you have associativity and its corollary departure
>  >> from true LRU.  Still, does not explain why your kernel likes 80 so
>  >> much.  Any ideas?
>  >> 
> OK, the lookahead over the level 1 cache, equivalent to prefetching,
> effectively hides L1 cache misses (except direct-mapped cache conflicts
> between 2 pieces of data being used at the same time). So it does not suffer
> much greater L1 misses at NB=80 than at 40. 
> The other part of the question: why doesn't the kernel do better at NB=40:
> 1) the lookahead deepens the pipelining so you get a higher startup cost.
>    This is one reason why it is slower for small NB.
> 2) L2 cache usage will be better at larger NB, eg NB=80.
> You may remember a few years ago when Ken Stanley & I were advocating
> large blocking factors, a formula I had proposed for an optimal K 
> was K ~= sqrt(CacheSize/2). 
> The idea was that this would allow square blocks of A & B fitting in the
> cache; their being square should mean that misses due to memory
> references to A, B and C would be minimized. 
> Here CacheSize would be the effective Level 2 cache size, ie. the part of
> the level 2 cache that can be comfortably spanned by the TLB. 
> On a SunOS Ultra, this works out to be 256KB, corresponding to K = 128
> (empirically K=80-90 works out a bit better).
>  >> This is just the kernel, but a 20% speedup looks pretty sweet to me . . .
>  >> You know, as a child, no matter what the size of the piece of pie given
>  >> to me, I always checked the pan for any remainder; do you have a kernel
>  >> for other precisions as well? :)
>  >> 
> well, since you are such a charming guy, I've had a go at double complex
> and updated:
>    http://cs.anu.edu.au/~Peter.Strazdins/projects/SparcBLAS/UserUSATLAS.tar.gz
> to reflect this and also made some minor fixes (eg. made static a few symbols
> accidentally left global) to the double kernels.
> The complex L1 kernel was easily derived from the double L1 kernel,
> so its relative performance should be the same.
> The full kernel (Viet Nguyen's masterpiece!) is a huge file, as his
> zgemm() called zgemv() and lots of level 1 routines as well... 
> I was unable to run the usual user-supplied tests (see below for the
> problem), but I was able to re-install ATLAS with -DUSERMM and the
> kernel ended up in libatlas.a so I assume it worked properly :). 
>  >> >The full kernel generally shows  `speed ups' of > 1.0 in the
>  >> >except for small matrices. eg.
>  >> 
>  >> Did you find this faster than simply building an ATLAS with your kernel?
> I hadn't tried the latter then. 
> Anyway, I hope the complex codes will be useful.  The full kernel got a
> quite convincing improvement over Sun PerfLib 1.2, so I'd be interested
> to hear how you find it. 
> Regards, Peter
> --------------------------------------------------------------------------
> peter@kaffa.anu.edu.au cd $ATLAS/src/blas/gemm
> peter@kaffa.anu.edu.au touch ATL_AgemmXX.c ATL_gemmXX.c
> peter@kaffa.anu.edu.au cd SunOS_SunUS1
> peter@kaffa.anu.edu.au make zlib
> make: Fatal error: Don't know how to make target `ATL_zNBmm_b1.o'
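
The blocking heuristic quoted above, K ~= sqrt(CacheSize/2), can be
checked with a few lines of C.  This is only a sketch: the function
name block_size_estimate is hypothetical, and it assumes CacheSize is
counted in matrix elements (8-byte doubles here), which is the reading
that reproduces the quoted K = 128 for a 256KB effective cache.

```c
#include <assert.h>
#include <stddef.h>

/* Largest K such that two K x K blocks (one of A, one of B) fit in
 * cache_bytes, i.e. 2*K*K*elem_size <= cache_bytes. */
size_t block_size_estimate(size_t cache_bytes, size_t elem_size)
{
    size_t elems, k = 0;
    assert(elem_size > 0);
    elems = cache_bytes / elem_size / 2;   /* elements available per block */
    while ((k + 1) * (k + 1) <= elems)     /* integer square root */
        k++;
    return k;
}
```

With these assumptions, block_size_estimate(256 * 1024, 8) gives 128,
matching the figure in the quoted text.  It is an upper estimate only;
the quoted message itself reports K=80-90 working a bit better in
practice.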

Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah