Re: Altivec and ATLAS
My machine is a dual 533 MHz G4 with 133 MHz SDRAM, 64K of L1 cache, and 
1 MB of L2 cache running at 233 MHz.
The scalar kernel ATL_mm4x4x2_1_pref.c gives a 670 Mflops SGEMM.
Using Altivec, I get a 1280 Mflops SGEMM, and 2 Gflops with both processors 
using pthreads.
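As a rough illustration of the two-processor case, here is a minimal sketch in C that splits the rows of C between two threads. It is not the code behind the numbers above: the names gemm_task, gemm_rows, and sgemm_two_threads are made up for this example, the inner loop is a plain scalar triple loop rather than an Altivec kernel, and row-major storage is assumed.

```c
#include <pthread.h>

/* Hypothetical work descriptor: one thread updates rows [row0, row1) of C.
 * A is M x K, B is K x N, C is M x N, all row-major (an assumption). */
typedef struct {
    const float *A;
    const float *B;
    float *C;
    int N, K;
    int row0, row1;
} gemm_task;

/* Thread body: C[i][j] += sum over k of A[i][k] * B[k][j] for the assigned rows. */
static void *gemm_rows(void *arg)
{
    gemm_task *t = (gemm_task *)arg;
    for (int i = t->row0; i < t->row1; i++) {
        for (int j = 0; j < t->N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < t->K; k++)
                sum += t->A[i * t->K + k] * t->B[k * t->N + j];
            t->C[i * t->N + j] += sum;
        }
    }
    return NULL;
}

/* C += A * B with the rows of C split in half: the first half goes to a new
 * thread, the second half runs on the calling thread. Returns 0 on success. */
static int sgemm_two_threads(int M, int N, int K,
                             const float *A, const float *B, float *C)
{
    gemm_task t[2];
    pthread_t tid;

    for (int p = 0; p < 2; p++) {
        t[p].A = A;
        t[p].B = B;
        t[p].C = C;
        t[p].N = N;
        t[p].K = K;
        t[p].row0 = p * M / 2;
        t[p].row1 = (p + 1) * M / 2;
    }

    if (pthread_create(&tid, NULL, gemm_rows, &t[0]) != 0)
        return -1;
    gemm_rows(&t[1]);            /* second half on the calling thread */
    return pthread_join(tid, NULL) == 0 ? 0 : -1;
}
```

The two threads write to disjoint rows of C, so no locking is needed in this sketch.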
The NB is 80.  I do think that I can make this better; after all, the 
Altivec unit can do 4 single-precision muladds per cycle!
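To put the muladd figure in perspective, here is a small sketch in C of the peak arithmetic. It assumes each muladd counts as two flops and that the unit can issue 4 of them every cycle; the function names are made up for illustration.

```c
/* Theoretical peak in Mflops: clock rate in MHz times muladds per cycle,
 * counting each muladd as 2 flops (one multiply plus one add). */
static double peak_mflops(double clock_mhz, int muladds_per_cycle)
{
    return clock_mhz * muladds_per_cycle * 2.0;
}

/* Fraction of the theoretical peak reached by a measured rate. */
static double fraction_of_peak(double measured_mflops, double peak)
{
    return measured_mflops / peak;
}
```

Under those assumptions, peak_mflops(533.0, 4) is 4264 Mflops per processor, so a 1280 Mflops SGEMM is about 30% of peak.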
I'm actually not much of an Altivec programmer. This is one of my first 
efforts.
-Nick
On Friday, June 8, 2001, at 08:07 PM, R Clint Whaley wrote:
> Nick,
>
>> Unaligned C is okay - I've written unaligned load and store code for C,
>> and it results in about a 5 or 10% performance penalty.  My
>> Altivec-based single-precision L1 matmul is getting in the neighborhood
>> of 1.2 - 1.3 Gflops on my 533 MHz G4.  I can probably make it better
>> than that (scalar code gets about 670 Mflops).
>
> What scalar code gets 670 Mflops?  On the G4 I have intermittent access to,
> ATLAS's gemm peaks out around the MHz . . .  That system probably uses
> 66 MHz SDRAM (maybe 100 MHz at most) . . .  I assume you have L2SIZE set to
> a very large value (for a G4, twice the actual L2 size) . . .
>
> Absolutely great results to already have roughly twice normal peak.  Have
> you tried to build the full GEMM yet?  What NB are you using to get that
> performance?
>
--
Nicholas Coult, Ph.D.,  web: http://melby.augsburg.edu/~coult
Assistant Professor, Department of Mathematics, Augsburg College
coult@augsburg.edu, phone:  (612) 330-1064 office: Science Hall 137B