
latest blas contribs




Greetings!  OK, my last stuff for the release is at:

http://people.debian.org/~camm/blas_20001204.tgz

A few notes:

1) Beware: the ?cases.dsc files are in here, and will overwrite what
   you have if you simply unpack the tarball in the root directory.

2) No changes to the l3 stuff from my previous version.

3) All l2 code is integrated into two files, include/contrib/camm_dpa.h
   and include/contrib/ATL_gemv_ger_SSE.c.  These are included, with
   appropriate macro parameter settings, in
   ATL_{ger1,gemvT,gemvN}_SSE.c.

4) Current parameters that are settable with macros:
	a) NO_TRANSPOSE (indicates an axpy strategy)
	b) GER (self-explanatory; invokes NO_TRANSPOSE automatically)
	c) PREFETCH (how far ahead to prefetch, in bytes)
	d) LUNROLL (how many TYPE elements to unroll in the inner
	loop)
	e) NDPM (how many rows to process at a time; most routines can
	do up to 4, DCPLX only 2, SCPLX NO_TRANSPOSE only 3)
	f) STRIDE (how many rows to skip when processing multiple rows
	at once)
	g) ALIGN (SREAL only, and requires STRIDE%4==0 || NDPM==1;
	aligns the inner loop to 16 bytes and uses aligned assembler
	instructions thereafter)
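
To make the axpy vs. dot-product distinction behind NO_TRANSPOSE
concrete, here is a rough sketch in plain C.  It is not the contents
of ATL_gemv_ger_SSE.c: the function names, the GEMV_KERNEL macro, and
the use of scalar C instead of SSE assembler are all assumptions made
for illustration.  It assumes a column-major matrix, so both
strategies walk down contiguous columns.

```c
#include <stddef.h>

/* y += alpha * A * x, with A an M x N column-major matrix (leading
   dimension lda).  Axpy strategy: each column of A is scaled by
   alpha * x[j] and accumulated into y.  Hypothetical name. */
static void gemv_n_axpy(size_t M, size_t N, double alpha, const double *A,
                        size_t lda, const double *x, double *y)
{
    for (size_t j = 0; j < N; j++) {
        const double t = alpha * x[j];
        const double *col = A + j * lda;
        for (size_t i = 0; i < M; i++)
            y[i] += t * col[i];
    }
}

/* y += alpha * A^T * x.  Dot-product strategy: each column of A is
   dotted with x to produce one element of y.  Hypothetical name. */
static void gemv_t_dot(size_t M, size_t N, double alpha, const double *A,
                       size_t lda, const double *x, double *y)
{
    for (size_t j = 0; j < N; j++) {
        const double *col = A + j * lda;
        double sum = 0.0;
        for (size_t i = 0; i < M; i++)
            sum += col[i] * x[i];
        y[j] += alpha * sum;
    }
}

/* Compile-time selection in the spirit of the macro settings above;
   GEMV_KERNEL is an illustrative name, not one from the tarball. */
#ifdef NO_TRANSPOSE
#define GEMV_KERNEL gemv_n_axpy
#else
#define GEMV_KERNEL gemv_t_dot
#endif
```

A rank-1 update (GER) adds a scaled copy of one vector to every
column, which is again an axpy per column, so it is natural for GER
to imply NO_TRANSPOSE.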

5) Performance:  This code is selected over the default atlas code in
   all cases, but in some, the margin is not much:

	Key: 
	     850n -- Coppermine 850, new code
             850o -- atlas 3.0 lib compiled on PII 350 run on
	             Coppermine 850

	     450n -- Katmai 450, new code
             450o -- atlas 3.0 lib compiled on PII 350 run on
	             Katmai 450

------------------------------- GEMV --------------------------------
TST# TR    M    N ALPHA  LDA INCX  BETA INCY   TIME MFLOP  SpUp  TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
s850n N 1000 1000   1.0 1000    1   1.0    1   0.01 319.5  3.26 PASS 
s850n T 1000 1000   1.0 1000    1   1.0    1   0.01 330.2  3.13 PASS 
d850n N 1000 1000   1.0 1000    1   1.0    1   0.01 151.2  2.69 PASS 
d850n T 1000 1000   1.0 1000    1   1.0    1   0.01 157.2  1.64 PASS 
c850n N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.02 436.7  2.50 PASS 
c850n T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.02 480.4  2.65 PASS 
z850n N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.03 286.8  2.55 PASS 
z850n T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.03 305.0  2.73 PASS 

s850o N 1000 1000   1.0 1000    1   1.0    1   0.01 176.9  1.00 PASS 
s850o T 1000 1000   1.0 1000    1   1.0    1   0.01 145.7  1.00 PASS 
d850o N 1000 1000   1.0 1000    1   1.0    1   0.02  82.9  1.00 PASS 
d850o T 1000 1000   1.0 1000    1   1.0    1   0.02  91.7  1.00 PASS 
c850o N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.03 237.2  0.99 PASS 
c850o T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.04 200.2  1.00 PASS 
z850o N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.06 133.4  1.01 PASS 
z850o T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.06 142.3  1.01 PASS 

s450n  N 1000 1000   1.0 1000    1   1.0    1   0.01 279.0  4.38 PASS 
s450n  T 1000 1000   1.0 1000    1   1.0    1   0.01 300.1  4.17 PASS 
d450n  N 1000 1000   1.0 1000    1   1.0    1   0.02 117.2  3.25 PASS 
d450n  T 1000 1000   1.0 1000    1   1.0    1   0.01 143.6  2.12 PASS 
c450n  N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.02 392.1  3.45 PASS 
c450n  T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.02 417.7  3.43 PASS 
z450n  N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.04 218.3  3.10 PASS 
z450n  T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.03 249.5  3.48 PASS 

s450o  N 1000 1000   1.0 1000    1   1.0    1   0.01 159.8  0.95 PASS 
s450o  T 1000 1000   1.0 1000    1   1.0    1   0.02 131.2  1.00 PASS 
d450o  N 1000 1000   1.0 1000    1   1.0    1   0.02  95.2  1.00 PASS 
d450o  T 1000 1000   1.0 1000    1   1.0    1   0.02  93.9  1.00 PASS 
c450o  N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.04 192.1  1.00 PASS 
c450o  T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.05 174.7  1.09 PASS 
z450o  N 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.06 134.4  1.15 PASS 
z450o  T 1000 1000  1.0  0.0 1000    1  1.0  0.0    1   0.06 126.4  1.02 PASS 


------------------------------ GER -----------------------------
TST#     M     N ALPHA INCX INCY   LDA   TIME  MFLOP  SpUp  TEST
==== ===== ===== ===== ==== ==== ===== ====== ====== ===== =====
s850n 1000  1000   1.0    1    1  1000   0.02  100.1  1.18 PASS 
d850n 1000  1000   1.0    1    1  1000   0.04   46.5  1.03 PASS 
c850n 1000  1000   1.0   0.0    1    1  1000   0.04  202.3  1.28 PASS 
c850n 1000  1000   1.0   0.0    1    1  1000   0.04  202.3  1.29 PASS 
z850n 1000  1000   1.0   0.0    1    1  1000   0.08   99.6  1.19 PASS 
z850n 1000  1000   1.0   0.0    1    1  1000   0.09   85.0  1.08 PASS 

s850o 1000  1000   1.0    1    1  1000   0.02   83.6  1.00 PASS 
d850o 1000  1000   1.0    1    1  1000   0.05   42.1  0.96 PASS 
c850o 1000  1000   1.0   0.0    1    1  1000   0.05  155.0  1.00 PASS 
c850o 1000  1000   1.0   0.0    1    1  1000   0.05  155.0  1.00 PASS 
z850o 1000  1000   1.0   0.0    1    1  1000   0.10   83.2  1.00 PASS 
z850o 1000  1000   1.0   0.0    1    1  1000   0.10   83.5  1.00 PASS 

s450n 1000  1000   1.0    1    1  1000   0.02  105.9  1.98 PASS 
d450n 1000  1000   1.0    1    1  1000   0.04   51.9  1.35 PASS 
c450n 1000  1000   1.0   0.0    1    1  1000   0.04  196.1  1.88 PASS 
c450n 1000  1000   1.0   0.0    1    1  1000   0.04  196.1  1.87 PASS 
z450n 1000  1000   1.0   0.0    1    1  1000   0.08  101.7  1.92 PASS 
z450n 1000  1000   1.0   0.0    1    1  1000   0.08  102.2  1.92 PASS 

s450o 1000  1000   1.0    1    1  1000   0.03   75.0  1.00 PASS 
d450o 1000  1000   1.0    1    1  1000   0.04   49.0  1.00 PASS 
c450o 1000  1000   1.0   0.0    1    1  1000   0.08  104.4  0.99 PASS 
c450o 1000  1000   1.0   0.0    1    1  1000   0.08  105.0  1.00 PASS 
z450o 1000  1000   1.0   0.0    1    1  1000   0.15   54.4  1.00 PASS 
z450o 1000  1000   1.0   0.0    1    1  1000   0.15   54.7  1.00 PASS 


6) Issues:
	a) The prefetch distance seems to be a function of the cpu/bus
	speed ratio, and may also be different for Coppermine
	vs. Katmai.  I therefore left this as a settable parameter,
	even though I could not find significant repeatable gains over
	the default of 2 cache-line lengths ahead for any case on the
	850 MHz Coppermine I used for testing.  This may also interact
	with b) below.
	b) STRIDE:  Double precision loves a stride around the
	regrettably large value of 20 (10 for complex).  This is
	apparently getting around the blocking in some way I don't
	really understand.  I leave the stride out of the
	{d,z}cases.dsc unroll value, and seem to get good results.
	This doesn't seem satisfactory, but it's what works best here
	so far.
	c) Inlining: the routine in camm_dpa.h cannot currently be
	inlined.  I have included an effective workaround for gcc by
	defining a NO_INLINE macro in camm_util.h and invoking it in
	this function.  I don't know about other compilers.  I believe
	I can fix this quickly, but I didn't want to hold up releasing
	for this.
	d) L3 arbitrary KB cleanup: With a few extra macros, this code
	should make a nice K cleanup when looped externally over B.  I
	also have one that doesn't worry about alignment which gets
	~1100 MFLOPS on an 850 (if memory serves), but this is not
	included here.  I thought the loop over l2 would be better,
	but again didn't want to hold up a release.
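
For anyone wanting to experiment with a) and c) outside the assembler,
the two ideas can be sketched in plain C roughly as follows.  This is
only an illustration under stated assumptions: the NO_INLINE
definition shown uses the noinline attribute of current gcc and may
well differ from what camm_util.h does, dot_prefetch is a made-up
name, and the default of 64 bytes assumes 32-byte cache lines.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch distance in bytes.  64 assumes two 32-byte cache lines;
   override with -DPREFETCH=... to try other distances. */
#ifndef PREFETCH
#define PREFETCH 64
#endif

/* Hypothetical NO_INLINE: asks gcc-compatible compilers not to inline
   the function; expands to nothing elsewhere. */
#if defined(__GNUC__)
#define NO_INLINE __attribute__((noinline))
#else
#define NO_INLINE
#endif

/* Single-precision dot product that issues a prefetch hint PREFETCH
   bytes ahead of the element currently being read from a. */
NO_INLINE static float dot_prefetch(const float *a, const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* The address is formed with integer arithmetic, and the
           prefetch is only a hint, so running past the end of the
           array near the tail of the loop is harmless. */
        __builtin_prefetch((const void *)((uintptr_t)(a + i) + PREFETCH));
#endif
        sum += a[i] * x[i];
    }
    return sum;
}
```

Issuing one hint per element is wasteful; a tuned loop would issue
one per cache line, which is where the interaction with the unroll
setting comes in.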


Take care,

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah