Re: Altivec and ATLAS


>Unaligned C is okay - I've written unaligned load and store code for C, 
>and it results in about a 5 or 10% performance penalty.  My 
>Altivec-based single-precision L1 matmul is getting in the neighborhood 
>of 1.2 - 1.3 Gflops on my 533 MHz G4.  I can probably make it better 
>than that (scalar code gets about 670 Mflops).

What scalar code gets 670 Mflops?  On the G4 I have intermittent access to,
ATLAS's gemm peaks at roughly the clock rate (in Mflops) . . .  That system
probably uses 66MHz SDRAM (maybe 100MHz at most) . . .  I assume you have L2SIZE set to
a very large value (for a G4, twice the actual L2 size) . . .

Absolutely great results to already have roughly twice normal peak.  Have
you tried to build the full GEMM yet?  What NB are you using to get that
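
As a rough sanity check on the figures quoted above, the sketch below converts
them to flops per clock cycle.  The Mflops and clock numbers come from this
thread; the helper name flops_per_cycle is made up for illustration, and
reading "peaks around the MHz" as about one flop per cycle is my
interpretation, not a measured value.

```python
def flops_per_cycle(mflops, clock_mhz):
    """Average floating-point operations completed per clock cycle."""
    return mflops / clock_mhz


CLOCK_MHZ = 533.0  # G4 clock rate quoted in the thread

# Rates quoted in the thread: scalar code, and the Altivec L1 matmul range.
quoted = [
    ("scalar", 670.0),
    ("Altivec (low)", 1200.0),
    ("Altivec (high)", 1300.0),
]

for label, mflops in quoted:
    rate = flops_per_cycle(mflops, CLOCK_MHZ)
    print(f"{label:15s} {mflops:7.1f} Mflops = {rate:.2f} flops/cycle")
```

Under that reading, the Altivec kernel sits a bit above two flops per cycle,
which is where "roughly twice normal peak" comes from.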

>The G4 has prefetch instructions as well, which may improve the copy 
>performance - right now I have no idea where in ATLAS these instructions 
>should go though!

There is presently no way for a user to speed up the data copy.  However,
the data copy is a low-order term; I agree that on the Altivec, where a lot
of users are doing embedded work and are therefore doing small problems,
it will be worth pursuing at some point in order to push down the threshold
at which ATLAS switches from its non-copy code to the faster Altivec stuff . . .
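
To put a number on "low-order term", here is a crude model, not ATLAS's
actual crossover logic.  It assumes square N x N operands, 2*N^3 flops for
the multiply, and a copy that touches the two input matrices once each
(2*N^2 elements).  The function name copy_to_flop_ratio and the element
counts are assumptions made for illustration only.

```python
def copy_to_flop_ratio(n):
    """Elements copied per floating-point operation for an n x n matmul.

    Assumed model: copy touches 2*n^2 elements, multiply does 2*n^3 flops,
    so the ratio works out to 1/n.
    """
    copied_elements = 2 * n * n
    flops = 2 * n ** 3
    return copied_elements / flops


# The copy share shrinks as 1/n, so it only matters for small problems.
for n in (8, 32, 128, 512):
    print(f"N = {n:4d}: {copy_to_flop_ratio(n):.4f} copied elements per flop")
```

The ratio falls off as 1/N, which is why the copy only hurts at the small
sizes the embedded users care about.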