Re: newbie Athlon optimization question...



Hi Jeff,

You have found the main ATLAS kernel, so it is no wonder that the program
spends 73% of its time there.
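
For readers who have not looked inside the generated file, here is a rough
reference sketch of what a kernel with that name computes. It rests on my
reading of the name, which is an assumption rather than something taken from
the ATLAS sources: JIK is the loop order, 60x60x60 the block size, TN means A
is used transposed and B is not, the trailing 60x60x0 are lda, ldb and a
run-time ldc, and a1/b1 mean alpha = 1 and beta = 1. The function name
nb_kernel_tn and its parameters are made up for illustration, and the real
kernel is unrolled C, not Python.

```python
def nb_kernel_tn(nb, a, b, c, ldc, lda=None, ldb=None):
    """Reference semantics only: C += A^T * B on one nb x nb x nb block.

    All matrices are flat lists in column-major order. A is stored as a
    K x M array, so A^T(i, k) lives at a[k + i*lda]. Hypothetical helper,
    not ATLAS code.
    """
    lda = nb if lda is None else lda  # assumed fixed at NB in the real kernel
    ldb = nb if ldb is None else ldb
    for j in range(nb):               # J loop: columns of C
        for i in range(nb):           # I loop: rows of C
            acc = c[i + j * ldc]      # beta = 1: start from the existing C(i, j)
            for k in range(nb):       # K loop: dot product of two contiguous columns
                acc += a[k + i * lda] * b[k + j * ldb]  # alpha = 1
            c[i + j * ldc] = acc
    return c
```

With A transposed, both operands of the inner dot product are read with
stride 1, which is the property a hand-written replacement would want to keep.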

I don't think modifying ATLAS's own code would do much good. It's hard to
optimize something to be fast on all platforms, because you would have to
test the code on all supported platforms to make sure that it was faster
than the code already contained in ATLAS.

A better thing would be to optimize the kernel you mentioned for the
Athlon. You can do that by submitting your own kernel, and ATLAS will then
choose your kernel if it is faster than the ones generated by ATLAS. Check
out atlas_contrib.ps in the ATLAS/doc/ directory. Since the work done by
the kernel takes up 73% of the run time, any speedup there would be
worthwhile.
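
To get a feel for how much a faster kernel is worth, here is a minimal
Amdahl's-law sketch. It assumes the 73% gprof figure is the fraction of run
time spent in the kernel and that the rest of HPL is unaffected; the function
name overall_speedup and the sample kernel speedups are illustrative, not
measurements.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only the kernel gets faster."""
    untouched = 1.0 - kernel_fraction           # time outside the kernel
    improved = kernel_fraction / kernel_speedup  # kernel time after tuning
    return 1.0 / (untouched + improved)


if __name__ == "__main__":
    # Hypothetical kernel speedups, with 73% of the run time in the kernel.
    for s in (1.1, 1.25, 1.5, 2.0):
        print(f"kernel {s:.2f}x faster -> HPL about {overall_speedup(0.73, s):.2f}x faster")
```

Even an infinitely fast kernel would cap the gain at 1 / 0.27, about 3.7x, so
the remaining 27% starts to matter once the kernel is tuned.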

Theoretical peak performance for the Athlon is 2 flops per clock cycle, and
ATLAS currently gets around 1.2, so there is room for optimization if you
like x87 assembly. There has been some discussion previously on the list
about Athlon optimizations.
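
If you want to check where your own kernel stands against those numbers, the
arithmetic is simple. This is a small sketch that assumes the usual 2*M*N*K
flop count for a matrix multiply; the helper names, the 1 GHz clock and the
timing in the example are hypothetical, so plug in your own measurements.

```python
def flops_per_cycle(m, n, k, seconds, clock_hz):
    """Achieved flops per clock cycle for one M x N x K matrix multiply."""
    flops = 2.0 * m * n * k           # one multiply and one add per inner step
    return flops / (seconds * clock_hz)


def peak_fraction(achieved, peak):
    """Fraction of the theoretical peak that is actually reached."""
    return achieved / peak


if __name__ == "__main__":
    # Figures from the text: about 1.2 achieved against a peak of 2.
    print(f"fraction of peak: {peak_fraction(1.2, 2.0):.0%}")
    print(f"largest possible kernel speedup: {2.0 / 1.2:.2f}x")
    # Hypothetical timing of one 60x60x60 block on an assumed 1 GHz clock.
    print(f"flops/cycle: {flops_per_cycle(60, 60, 60, 3.6e-4, 1.0e9):.2f}")
```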

For some general techniques you can also look at
http://www.cs.utk.edu/~soender/atlas/doc/atl_report.ps

Cheers,

Peter 

On Wed, 11 Jul 2001, Jeff W. wrote:

> Hi, I'm interested in trying to do some optimization for the AMD Athlon
> single processor configuration, geared more towards HPL though
> as opposed to ATLAS in its entirety.  Through a bunch of testing over
> the past few weeks, it appeared that I found a function that contributed
> heavily to the performance of HPL: ATL_dJIK60x60x60TN60x60x0_a1_b1.  My
> guess is that this function was dynamically generated by ATLAS, which I
> assumed since I could only find the above function in source code after
> compiling ATLAS and not when I just untarred it.  My reasoning went as
> follows:
> 
> I first of all compiled both ATLAS and HPL with gprof profiling
> support.  I ran HPL with a single processor configuration, and observed
> the resulting call tree generated by gprof.  The tree appeared to call
> functions in this order:
> 
> ...
> HPL_pdupdateTT (in HPL_pdupdateTT.c)
> cblas_dgemm (in cblas_dgemm.c)
> ATL_dgemm (in ATL_gemm.c)
> ATL_dGEMM2NN (in ATL_gemmXX.c)
> ATL_dmmJIK (in ATL_mmJIK.c)
> ATL_dmmJIK2 (in ATL_mmJIK.c)
> ATL_dJIK60x60x60TN60x60x0_a1_b1 (in file ATL_dNBmm_b1.c)
> 
> gprof reports that the last function consumes 73% of the total program CPU
> time.  As a test, not for accurate data just for speed questions, I
> commented out the call to cblas_dgemm.  The program ran significantly
> faster (albeit with totally incorrect results).  So I reasoned that, for
> the sake of HPL on a single-processor Athlon system with my specific
> configuration, that cryptic ATLAS function was the culprit.  However, my real
> question is more general.  I'd rather not spend my time optimizing that
> function, as it seems like ATLAS will just generate a more appropriate one
> depending on system configuration.  If I were to want to concentrate my
> efforts in one particular location, what is the lowest level of statically
> created ATLAS code?  I.e., when I download the tarball and extract the
> source files, of the files that aren't going to be changed, which one
> would be most pertinent to look at?  I'd like to be able to make some
> optimizations (again, solely for the sake of improving HPL scores, but
> since they depend on ATLAS, I figured this was the proper place for
> questioning) that aren't specific to my exact machine.  I hope I'm
> describing this all in a way you can understand, I have no clue how
> tech/math/physics oriented everyone is on this list =)  Let me know if I'm
> being unclear...
> Thanks for your help, I appreciate it.
> 
>