
Re: Testing ATLAS with user contributed code.

Hi Clint,

I have gotten the more rigorous testing of blocksizes to work, so I should
have some results soon. I haven't implemented the forced selection of a
less-than-optimal blocksize; I'm just relying on the vector code being
faster than the scalar code.
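
To make concrete what the blocksize sweep varies, here is a minimal C sketch of a blocked matrix multiply. It is not ATLAS code: gemm_ref, gemm_blocked and gemm_mflops are hypothetical names, the matrices are assumed square and row-major, and the partial tiles at the edges merely stand in for the role the cleanup code plays when the problem size is not a multiple of nb.

```c
#include <stddef.h>

/* Reference: C += A*B for row-major n x n matrices, no blocking. */
static void gemm_ref(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked variant: same arithmetic, traversed in nb x nb x nb tiles.
 * nb must be > 0.  When n is not a multiple of nb, the edge tiles are
 * smaller than nb; that is the work a cleanup routine would handle. */
static void gemm_blocked(size_t n, size_t nb,
                         const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += nb) {
        size_t ie = min_sz(ii + nb, n);
        for (size_t kk = 0; kk < n; kk += nb) {
            size_t ke = min_sz(kk + nb, n);
            for (size_t jj = 0; jj < n; jj += nb) {
                size_t je = min_sz(jj + nb, n);
                /* One tile: the part a tuned kernel would replace. */
                for (size_t i = ii; i < ie; i++)
                    for (size_t k = kk; k < ke; k++)
                        for (size_t j = jj; j < je; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
            }
        }
    }
}

/* MFLOPS of an n x n x n multiply (2*n^3 flops) that took `seconds`. */
static double gemm_mflops(size_t n, double seconds)
{
    return 2.0 * (double)n * (double)n * (double)n / (seconds * 1.0e6);
}
```

A sweep would call gemm_blocked for each nb of interest over problem sizes such as 100 to 1000, time each call, and compare the gemm_mflops figures. A very small nb spends most of its time in loop overhead, while a large nb that does not divide n leaves more work in the edge tiles.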

One of the reasons I wanted to get this testing working is to experiment
with the non-temporal move instructions that are present in both AMD and
Intel chips. They can move data without polluting the caches, but can this
be of any benefit with ATLAS' way of copying matrices? Does it matter how the
kernel accesses memory, or is it only relevant in the copying code? Is there
any difference in the way that the A and B matrices are handled by the
copy code and the code that calls the kernels?
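
For reference, a copy loop built on non-temporal stores might look like the C sketch below. This is an illustration only, not ATLAS code. It assumes an x86 target with SSE2 and the intrinsics from <emmintrin.h> (_mm_stream_pd maps to MOVNTPD), a 16-byte-aligned destination, and an even element count; copy_stream is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_stream_pd, _mm_sfence */

/* Copy n doubles from src to dst using non-temporal (streaming) stores.
 * Assumptions: dst is 16-byte aligned and n is even. */
static void copy_stream(double *dst, const double *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i); /* ordinary (cached) load   */
        _mm_stream_pd(dst + i, v);         /* store bypassing the cache */
    }
    _mm_sfence(); /* order the streamed stores before later stores */
}
```

Stores written this way do not leave the destination in cache, so a copy like this would presumably only pay off when the copied block is not read back immediately. That is the open question about where, if anywhere, such stores fit.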

I will do some experimenting and see what turns up.



On Tue, 5 Jun 2001, R Clint Whaley wrote:

> Peter, 
> >I am trying to benchmark ATLAS using my generated kernel and cleanups for
> >a varying number of blocksizes. I would like to build ATLAS with a
> >blocksize ranging from e.g. 2 to 100 and to test each blocksize on
> >problems from size 100 to 1000.
> >
> >So, the questions:
> >
> >1: Is it at all possible to force ATLAS to use my code for a blocksize of
> >say 2, or will it choose its own generated code with a more sane
> >blocksize?
> I gotta ask: why on god's green earth do you want to try NB<16?  I understand
> not trusting the kernel timer to give you best NB, but I usually only try
> a couple (usually smaller, since it has less cleanup overhead) . . .
> Unfortunately, forcing a particular NB is not that simple, and is very
> time consuming.  I usually figure out I've screwed it up myself when my
> tester seg faults.  In general, changing tune/blas/gemm/<arch>/res/<pre>NB
> to your blocking factor is the first step, then you rerun the search.
> This may or may not be all you need to do . . .
> I include below a half complete, and undoubtedly incorrect documentation file
> I started to produce describing the generation process and the output
> files the search produces.  Understanding and playing with these guys,
> and the intermediate output files produced in res/, is the key to success.
> Examining how the various searches work (ummsearch and mmsearch are the big
> ticket items), and the intermediate outputs they produce, will be revealing . . .
> As for forcing ATLAS to use a suboptimal case, this is also not always
> easy.  The way I do it is to run the search, and if my case is not chosen, I
> go look at its timer output file in res/, increase the mflop numbers there
> so that it is the fastest, and then rerun the search, letting it choose my
> artificially inflated case.  For instance, if you want the third user index
> entry to win, with nb=32, you first cause the search to call it, then you edit
> res/duser003_32x32x32, and pump up the mflop numbers written there so that they
> are the best, and the next time the search is run, it will be taken as the
> best case . . .
> >2: How do I run the tests for varying problem sizes? It is a standard test
> >that I am thinking of, I have just forgotten how to call it.
> ATLAS/doc/TestTime.txt, Section 4.  You can also build the testers, and
> for instance, type "./xdl3blastst -help" . . .
> >3: How do I do this with the least amount of overhead since I have to
> >search for kernels, build ATLAS and run tests some 50 times for each
> >architecture.
> The basics are page 22 of ATLAS/doc/atlas_contrib.ps.  I think for all
> the stuff you are planning to do, you are going to have to do quite a bit
> of learning if you want to automate this whole process.  It would take me
> several days to produce a document with the canned answers for you, and I
> don't have that kind of time at the moment . . .
> Cheers,
> Clint
>
> This file documents the order in which files are generated in ATLAS.  If you
> are crazy enough, it can be used as a starting point for building ATLAS
> by hand, rather than letting install do it.
> Stage 1 : System discovery/aux compile
>   (1) cd ATLAS/src/auxil/<arch> ; make lib
>       HEADERS                       RESULTS
>       atlas_type.h                  res/[s,d]MULADD
>       atlas_[s,d,c,z]sysinfo.h      res/[s,d]nreg
>                                     res/L1CacheSize
> Stage 2 : Type-dependent tuning (pre = d, s, z, c)
>   NOTE: right now, the Level 3 are tuned first, followed by Level 2
>         because the Level 2 can call the Level 3 for gemv.  It should be
>         the other way around, but it ain't :)
>   (1) Run ATLAS/tune/blas/gemm/<arch>/xmmsearch -p <pre>, creating
>       ATLAS/include/<arch>/<pre>mm.h & ATL_<pre>NCmm.h, and 
>       res/:
>          dgMMRES : generated NBmm kernel results
>          dMMRES  : generated & user NBmm kernel results
>          dClean[M,N,K] : generated cleanup results
>          duMMRES : User-supplied kernel NBmm results
>          duClean[M,N,K]: Best user-supplied cleanups
>          duClean[M,N,K]F : User supplied cleanups that beat generated cases
>          dbest[N,T][N,T]_0x0x0: best no-copy case with no fixed loop dimension
>          dbest[N,T][N,T]_0x0x<nb>: best no-copy case with M and N loop 
>             parameters variable, but K-loop fixed at <nb>
>          dbest[N,T][N,T]_<nb>x<nb>x<nb>: best no-copy case with all loop
>             dimensions fixed to <nb>
>   (2) if first precision, run ATLAS/tune/blas/gemm/<arch>/x<pre>findCE,
>       creating ATLAS/include/<arch>/atlas_cacheedge.h
>   (3) Run ATLAS/tune/blas/gemm/<arch>/x<pre>Run_tfc, creating
>       ATLAS/include/<arch>/<pre>Xover.h
>   (4) GEMV tune, creating ATLAS/include/<arch>/atlas_<pre>mv.h,
>       atlas_<pre>mv[N,T].h
>   (5) GER tune, creating ATLAS/include/<arch>/atlas_<pre>r1.h
> Stage 3: General library build
>   (1) Finish all compilation