ATLAS developer release 3.3.1 is out
There have been a lot more changes than there has been testing, so I figure
its chances of flying are roughly the same as a NASA Mars mission,
but ATLAS developer release 3.3.1 has hit the shelves instants ahead
of my nervous breakdown.
The main new thing is tuning for, and allowance for outside contribution to,
the Level 1 BLAS. If you are hoping ATLAS will already support very good
Level 1, hope again. This release is mainly about getting the infrastructure
out there so interested parties can do some optimization. For most
platforms, performance will be the exact same as with last release.
Despair not, however; I am sure very good coverage will be found quickly.
The reason is that Level 1 routines are easy, quick, and, for us "special" individuals,
fun to play with. After finishing the infrastructure, I tuned all of the
Z & D routines on my 600MHz Athlon classic in one day. I'm not saying
I've got the best routines possible, but they are not too bad. Fortunately,
the routines I optimized for the Athlon don't do too badly for the Pentium
either. So you can get an idea of how much performance there is to get,
here's my one-day Athlon effort compared to the Fortran77 routines on my
600MHz Athlon Classic and 500MHz PIII, for vectors of length 1,000,000
(results in MFLOPS):
                  axpy   copy   scal   nrm2   asum   amax   dotc
                 =====  =====  =====  =====  =====  =====  =====
F77 d PIII500     21.3   15.1   29.8   14.6   82.7   30.5   38.9
ATL d PIII500     40.7   21.1   29.4  178.0  177.7   90.4   71.1
F77 z PIII500     34.6   12.9   39.1   14.7   30.5   28.7   24.9
ATL z PIII500     83.4   20.9   86.8  178.0  177.4   98.2  134.8
F77 d Athlon600   16.6   12.1   21.1   17.2   47.4   25.6   24.5
ATL d Athlon600   26.4   13.2   23.0   61.3  159.6  175.7   44.9
F77 z Athlon600   22.2    8.3   25.6   17.3   47.4   25.6   24.5
ATL z Athlon600   50.6   13.9   66.6   88.5   72.3   92.1   69.4
NOTE1: I'm using million-length vectors, 'cause none of the ATLAS timers
(even the ones I just released) flush the cache adequately for the
Level 1, so you gotta use vectors long enough to flush it themselves.
Also, of course, once you have flushed caches, the longer the vector
the better you can do performance wise (amortizing all that loop startup
and so on) :->
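For anyone wanting to reproduce this kind of number outside the ATLAS timers, here is a minimal sketch of the idea in NOTE1. It is not the ATLAS timer code: mflops and time_daxpy are hypothetical names made up for illustration, and it assumes the vectors are much larger than cache, so that by the time the timed loop starts, the front of each vector has been pushed back out to memory. On a machine with a big cache, a million elements may not be long enough.

```c
#include <stdlib.h>
#include <time.h>

/* Convert a flop count and elapsed seconds to MFLOPS (0 if the clock
 * did not advance, e.g. because the vector was too short to time). */
double mflops(double flops, double seconds)
{
   return (seconds > 0.0) ? flops / seconds / 1.0e6 : 0.0;
}

/* Hypothetical helper: time one pass of y += alpha*x over n-element vectors.
 * Returns MFLOPS, or -1.0 if allocation fails.  The sum of y is stored in
 * *chk so the compiler cannot throw the timed loop away. */
double time_daxpy(size_t n, double alpha, double *chk)
{
   double *x = malloc(n * sizeof *x);
   double *y = malloc(n * sizeof *y);
   double sum = 0.0;
   clock_t t0, t1;
   size_t i;

   if (!x || !y) { free(x); free(y); return -1.0; }
   /* Touching every element of long vectors is what evicts the start
    * of x and y from cache before the timed loop (assumption: n is
    * much bigger than the cache). */
   for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 1.0; }

   t0 = clock();
   for (i = 0; i < n; i++)
      y[i] += alpha * x[i];
   t1 = clock();

   for (i = 0; i < n; i++) sum += y[i];
   *chk = sum;
   free(x); free(y);
   /* axpy does one multiply and one add per element: 2*n flops */
   return mflops(2.0 * (double)n, (double)(t1 - t0) / CLOCKS_PER_SEC);
}
```

Calling time_daxpy(1000000, 2.0, &chk) gives a single-pass figure; clock() is coarse, so treat one run as a rough estimate rather than a measurement.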
NOTE2: the complex NRM2 MFLOPS as reported by
x[z,c]l1blastst are 2* what they should be. The above numbers are the
reported results divided by 2 for this routine.
Why does my Athlon blow so bad? My guess is because of memory speed.
I bought one of the very early Athlons, and it has 100MHz SDRAM at best.
Later I'll try the T-bird, and someday, maybe, a DDR system. The P4,
with its Rambus memory (almost as much bandwidth as lawsuits), should
be interesting as well . . .
In order to get the above speedups, prefetch was needed. This release
has the first stab at atlas_prefetch.h, a header file meant to allow
C coders access to prefetch instructions in a portable fashion. See
ATLAS/doc/atlas_contrib.ps for details. Right now it's got prefetch
instructions for 3DNow! and SSE only; if you know how to do it on other
platforms, send it in . . .
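To give a flavor of what prefetching in a Level 1 loop looks like, here is a minimal sketch. It does not use the macros from atlas_prefetch.h (see atlas_contrib.ps for those); as a stand-in it assumes a GCC-compatible compiler and uses the __builtin_prefetch builtin, compiling to a no-op elsewhere. The names PREFETCH_R, PREFETCH_W, PF_DIST and my_daxpy are made up for illustration, and the prefetch distance and 64-byte line size are guesses that would need tuning per machine.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; a tuning knob, not an ATLAS value */
#define PF_DIST 64

#if defined(__GNUC__) || defined(__clang__)
   /* second arg: 0 = read, 1 = write; third arg 0 = no temporal locality */
   #define PREFETCH_R(p) __builtin_prefetch((p), 0, 0)
   #define PREFETCH_W(p) __builtin_prefetch((p), 1, 0)
#else
   /* no known prefetch on this compiler: do nothing */
   #define PREFETCH_R(p) ((void)0)
   #define PREFETCH_W(p) ((void)0)
#endif

/* y <- alpha*x + y for unit-stride vectors, fetching PF_DIST elements ahead */
void my_daxpy(size_t n, double alpha, const double *x, double *y)
{
   size_t i;

   for (i = 0; i < n; i++)
   {
      /* Issue one prefetch per 8 doubles (one 64-byte line, assumed),
       * and only while the target is still inside the vectors. */
      if ((i & 7) == 0 && i + PF_DIST < n)
      {
         PREFETCH_R(x + i + PF_DIST);
         PREFETCH_W(y + i + PF_DIST);
      }
      y[i] += alpha * x[i];
   }
}
```

A tuned kernel would unroll the loop so the prefetch test disappears from the inner body; the branch is kept here only to keep the sketch short.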
I haven't updated the html version of atlas_contrib yet, so you'll need
the one out of the tarfile for all the new goodies . . .
I put in a fix for the memory access errors in TRSM that Shi reported,
and I finally got Carl's parallel make stuff in. Neither of these
features has been stringently tested (perhaps at all, but why say that?),
so your mileage may vary. On my uniprocessor laptop, the parallel make
gave no speedup, so maybe I screwed something up, but it could also be
due to the fact it was using a local file system (not as much IO to block
on) . . .
Does anyone out there have an application that cares at all about
the performance of any level 1 routine?
Along the same lines, I'm already considering adding support for
atlas_set (set a vector to a constant) and atlas_axpby (the mother
of all level 1 operations: y = alpha*x + beta*y; almost every vector
operation you can think of is a subcase of this :). Does anyone have
operations needing these routines, or, for that matter, other level 1-like
operations?
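To make the subcase claim concrete, here is a minimal sketch of y = alpha*x + beta*y and a few familiar operations written in terms of it. This is not the planned atlas_axpby interface, which does not exist yet; the names axpby, my_copy, my_scal and my_axpy, and the unit-stride real double signature, are assumptions made for illustration.

```c
#include <stddef.h>

/* y <- alpha*x + beta*y, unit stride (illustrative signature only) */
void axpby(size_t n, double alpha, const double *x, double beta, double *y)
{
   size_t i;

   for (i = 0; i < n; i++)
      y[i] = alpha * x[i] + beta * y[i];
}

/* copy: y <- x          (alpha = 1, beta = 0) */
void my_copy(size_t n, const double *x, double *y)
{
   axpby(n, 1.0, x, 0.0, y);
}

/* scal: y <- s*y        (alpha = 0, beta = s; x is unused, so pass y) */
void my_scal(size_t n, double s, double *y)
{
   axpby(n, 0.0, y, s, y);
}

/* axpy: y <- alpha*x + y   (beta = 1) */
void my_axpy(size_t n, double alpha, const double *x, double *y)
{
   axpby(n, alpha, x, 1.0, y);
}
```

The wrappers show the algebra, not how you would want to implement them: with a zero coefficient the general loop still reads the ignored vector, and 0*NaN or 0*Inf in it would poison the result, so a real routine would special-case alpha and beta values of 0 and 1.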
Keep in mind that no force on earth short of a revolution in computer
memory will help you for short loops (say < 100 for real), assuming
your operands aren't kept in cache. You are going to wait hundreds
of cycles for the first data item, so that even if your loop were perfect,
it'd still be butt-slow . . .