
- *To*: Julian Ruhe <ruheejih@calvados.zrz.tu-berlin.de>
- *Subject*: Re: Need help on Athlon optimized gemm kernel!
- *From*: Peter Soendergaard <soender@cs.utk.edu>
- *Date*: Thu, 11 Oct 2001 17:49:16 -0400 (EDT)
- *cc*: R Clint Whaley <rwhaley@cs.utk.edu>, <atlas-comm@cs.utk.edu>, <math-atlas-devel@lists.sourceforge.net>
- *In-Reply-To*: <3BC59FF4.5040700@linux.zrz.tu-berlin.de>

Hi Julian,

I would love to look at this. Unfortunately, I am going on a week's
vacation, but if it is still relevant when I get back I will try to look
at it, and maybe port it to gasm if that makes any sense.

Cheers,
Peter.

On Thu, 11 Oct 2001, Julian Ruhe wrote:

> Hello all,
>
> I am currently working on a new Athlon optimized gemm kernel, but I ran
> into a problem:
> The kernel crunches 6 dot products simultaneously. Of course I must store
> the six produced elements of matrix (block) C and load the next elements
> onto the stack after that operation. And this is exactly the problem. As
> long as I leave out the exchange of elements of C (meaning that the
> results of all dot products in the matrix multiplication are accumulated
> in only 6 stack registers), the matrix multiplication runs with a stellar
> speed of 1.93 FLOPS/cycle on my Athlon 600 classic/Win2000. When I insert
> the exchange part (I have tried some dozens of variations of this)
> performance drops enormously, which I cannot explain.
> Currently I am trying to modify the routine for MSVC++ in order to run
> AMD Code Analyzer, but I do not think that this will shed light on the
> problem.
> So I ask everybody who feels able to help me to program the register
> exchange part of my kernel. I have prepared a NASM .asm file (and C test
> program) that is ready for modifications. This code is the one that runs
> with 1.93 FLOPS/cycle, so a direct comparison is possible.
> Requirements:
> - Cygwin installed
> - NASM installed
> - Skills in Assembly
>
> The person that finds a fast solution will win a golden cake and much
> honor!
> If anybody from AMD reads this posting, please help us. Frank S., what
> about you?
>
> Regards
>
> Julian
>
> R Clint Whaley wrote:
>
> >Guys,
> >
> >I include below some timings on a 733 Mhz G4e (access courtesy of SourceForge
> >compile farm).
> >For quite a while now, Apple's "half the Mhz, half again the
> >price" strategy has eluded me, but this machine ought to at least reduce the
> >screaming fits of its laugh-test failure to at most a few furtive chuckles.
> >
> >Essentially, it is still not going heads up against either the Athlon or
> >P4 (and if anyone hits me with the clock-for-clock crap, I will point out that
> >clock for clock the original Power chip is still the champ), but I think
> >it is cleaning the floor with the PIII, for instance (let's not mention
> >price, though, eh?).
> >
> >In single precision, its results are roughly 75% of a P4 clocked at twice
> >its speed (before you sneer with the "easy to be fast at low Mhz", I'll remind
> >you it is doing this with good ol' SDRAM, so that's pretty impressive), and it
> >almost doubles the performance of a 933Mhz PIII . . .
> >
> >These results are much crappier on an original G4. Obviously, the extra level
> >of cache can't be hurting, but perhaps the greater instruction bandwidth,
> >etc., are helping as well.
> >
> >I found it interesting to compare these timings to the ones I have previously
> >posted for the P4 and PIII. Note that gemm timings can be compared pretty
> >directly (no real change from 3.3.0 till 3.3.7), but the LU timings cannot
> >(3.3.7 has some speedups over 3.3.0) . . .
> >
> >Cheers,
> >Clint
> >
> >ATLAS 3.3.7 on 733Mhz G4e, 256K L2, 1MB L3
> >
> >            100     200     300     400     500     600     700     800     900    1000
> >         ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
> >ATL dLU   386.8   480.7   513.0   580.7   594.3   684.9   671.8   668.7   703.8   724.1
> >ATL dMM   416.7   687.7   771.4   914.3   757.6   919.1   879.5   922.5   928.7   943.4
> >
> >ATL sLU   437.3   631.0   897.8   982.8  1109.4  1307.5  1343.7  1482.7  1566.4  1586.1
> >ATL sMM  1428.6  1600.0  1800.0  2560.0  2500.0  2400.0  2450.0  3011.8  2803.8  2631.6
> >
> >           1200    1400    1600    1800    2000    2200    2400    2600    2800    3000
> >         ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
> >ATL dLU   733.3   758.7   786.6   799.7   809.0   819.4   833.0   838.5   840.4   837.0
> >
> >ATL sLU  1744.4  1846.8  1922.1  1993.0  2058.4  2118.3  2167.8  2206.0  2261.3  2275.0
> >ATL sMM  2953.8  2814.4  3022.9  2858.8  3053.4  2937.4  3061.8  2936.7  3081.0  2995.0
> >
> >_______________________________________________
> >Math-atlas-results mailing list
> >Math-atlas-results@lists.sourceforge.net
> >http://lists.sourceforge.net/lists/listinfo/math-atlas-results
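
For readers following the thread, the kernel structure Julian describes (six dot products accumulated at once, followed by the exchange of C elements) can be sketched in plain C roughly as below. This is only an illustration and not Julian's NASM code: the names `dot6_update` and `gemm_1x6`, the storage layout (A row-major, B stored with each column contiguous, C row-major), and the use of `double` are all assumptions. The six local accumulators stand in for the six x87 stack registers, and the final stores are the "exchange" step that the posting reports as the slow part.

```c
#include <stddef.h>

/* Hypothetical helper: update one 1x6 strip of C.
 * a   : one row of A, k contiguous elements
 * b   : six columns of B, each k contiguous elements, ldb apart
 * c   : six adjacent elements of C, updated in place (C += A*B) */
static void dot6_update(size_t k, const double *a,
                        const double *b, size_t ldb, double *c)
{
    /* Load the six C elements into accumulators
     * (the assembly version would keep these on the FPU stack). */
    double c0 = c[0], c1 = c[1], c2 = c[2];
    double c3 = c[3], c4 = c[4], c5 = c[5];

    /* Inner loop: six independent dot products sharing one A element. */
    for (size_t p = 0; p < k; ++p) {
        double ap = a[p];
        c0 += ap * b[0 * ldb + p];
        c1 += ap * b[1 * ldb + p];
        c2 += ap * b[2 * ldb + p];
        c3 += ap * b[3 * ldb + p];
        c4 += ap * b[4 * ldb + p];
        c5 += ap * b[5 * ldb + p];
    }

    /* Exchange step: store the six results back to C. */
    c[0] = c0; c[1] = c1; c[2] = c2;
    c[3] = c3; c[4] = c4; c[5] = c5;
}

/* Hypothetical driver: C (m x n, row-major) += A (m x k, row-major) * B,
 * where Bt holds B column by column (n columns of k elements each).
 * Only full groups of six columns are processed; leftover columns
 * (n not a multiple of 6) are ignored in this sketch. */
void gemm_1x6(size_t m, size_t n, size_t k,
              const double *A, const double *Bt, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j + 6 <= n; j += 6)
            dot6_update(k, A + i * k, Bt + j * k, k, C + i * n + j);
}
```

In this form the loads and stores of C happen once per strip, outside the inner loop, so their cost is amortized over k multiply-adds; whether that maps well onto the x87 stack on the Athlon is exactly the open question in the posting.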

**References**:

- **Need help on Athlon optimized gemm kernel!**
  *From:* Julian Ruhe <ruheejih@calvados.zrz.tu-berlin.de>
