IA64 explained, 3.3.5



Guys,

OK, in an attempt to set a record for the number of releases, I've just
posted 3.3.5.  This release removes trtri from lapack, improves IA64 complex
performance, and fixes a bug in the complex Cholesky tester.

I have figured out why I got no speedup with my new kernel on the IA64.
If you recall, 3.3.3 (which started all this quick release madness) was
supposed to be an IA64-improving release, due to IA64 prefetch, but when I
timed it on machines I wasn't NDAd on, I got no performance improvement.
Even though it used the same compiler as my NDAd machine, I got strange
compiler problems as well.

It turns out that the TestDrive machine has two different compilers
installed, and my 3.3.3 build was using a mixture of RedHat's baaaad gcc and
the much better gcc 3.0.

So, this is the first performance hint for IA64: make sure you use gcc 3.0
everywhere in your ATLAS install: change all C compilers defined in your
Make.<arch> to explicitly reference it, and change all gcc refs in
ATLAS/tune/blas/gemm/CASES/?cases.flg as well.
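
One way to make that change mechanically is sketched below in Python.  This
is only an illustration, not part of ATLAS: the helper names pin_gcc and
pin_gcc_in_files are made up here, the files are treated as plain text (no
knowledge of their format is assumed), and the path to gcc 3.0 is something
you must supply for your own machine.  Only a bare "gcc" token is rewritten;
references that already carry an explicit path are left alone, so check those
by hand in case they point at the wrong compiler.

```python
import re

# A bare "gcc" token: not part of a longer word and not already
# the tail of an explicit path such as /usr/bin/gcc or gcc-3.0.
BARE_GCC = re.compile(r"(?<![\w/.\-])gcc(?![\w.\-])")


def pin_gcc(text, gcc_path):
    """Return (new_text, count) with every bare gcc replaced by gcc_path."""
    # A function is used as the replacement so gcc_path is inserted literally.
    return BARE_GCC.subn(lambda match: gcc_path, text)


def pin_gcc_in_files(paths, gcc_path):
    """Rewrite each file in place; return {path: replacements made}."""
    changed = {}
    for path in paths:
        with open(path) as handle:
            old = handle.read()
        new, count = pin_gcc(old, gcc_path)
        if count:
            with open(path, "w") as handle:
                handle.write(new)
            changed[path] = count
    return changed
```

For the flag files, glob.glob("ATLAS/tune/blas/gemm/CASES/?cases.flg") should
produce a suitable list of paths, since "?" matches a single character in
Python's glob just as in the shell; add your Make.<arch> to that list by
name.  Keep a copy of the originals before running anything like this.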

Once this was done, I got the performance shown below.  What we see is that
prefetch does not make a big performance improvement (3.3.2 and 3.3.4 are
almost the same speed asymptotically), but that the improved cleanup code
I wrote definitely helps small problems.

Prefetch definitely helps the Level 1 and 2 BLAS performance; the bad news
is that even the new performance is signally poor.  This is because we have
no IA64-specific kernels for Level 1/2; the improvement comes simply from
using the best general kernel with prefetching enabled . . .

The timings on an 800 MHz IA64 are included below, all for double precision.
I do not have access to non-NDAd MKL; if anyone does, I'd love to see some
comparisons . . .

Cheers,
Clint

Timings for double precision, comparing ATLAS 3.3.2 vs. 3.3.4, all on an
800 MHz IA64.  The performance of 3.3.4 is the same as 3.3.5 for double
precision (3.3.5 is faster for complex; complex timings are not shown).

             100    200    300    400    500    600    700    800    900   1000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 1024.0 1512.4 1783.7 1846.1 1896.3 2076.8 1973.2 2084.6 2102.8 2104.8
3.3.4 dMM 1061.1 1524.1 1803.1 1927.5 1969.2 2029.2 2081.6 2072.3 2126.8 2135.6

            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 2112.8 2129.5 2192.0 2222.1 2180.5 2136.3 2189.1 2159.2 2236.1 2218.9
3.3.4 dMM 2155.3 2144.9 2171.5 2206.1 2205.7 2194.5 2220.9 2218.9 2223.6 2229.9

                           GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
                         ====== ====== ====== ====== ====== ======
3.3.2  d100               967.9  962.4  627.4  862.9  677.2  490.4
3.3.4  d100              1019.9 1153.2  710.1  891.6  732.5  636.8

3.3.2  d500              1889.3 1723.9 1452.0 1777.8 1514.8 1245.7
3.3.4  d500              1939.4 1729.7 1590.0 1718.1 1501.5 1402.7

3.3.2  d1000             2117.9 1917.6 1653.3 1935.7 1790.2 1526.1
3.3.4  d1000             2155.8 1823.7 1677.6 1932.1 1701.0 1528.4

                                 GEMV   SYMV   TRMV   TRSV    GER    SYR   SYR2
                               ====== ====== ====== ====== ====== ====== ======
3.3.2 d500                      122.4  225.6  113.4  109.4   39.2   47.3   61.0
3.3.4 d500                      130.1  245.2  170.1  151.5  160.1  107.1  156.9

3.3.2 d1000                     166.0  231.3  101.0   97.3   37.3   37.4   52.1
3.3.4 d1000                     214.1  208.7  194.3  180.0  172.2  115.3  165.7

                   ROTM   SWAP   SCAL   COPY   AXPY    DOT   NRM2   ASUM   AMAX
                 ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 d500000      72.5   28.6   18.6   51.3   29.5   35.1   18.3   47.5   49.3
3.3.4 d500000      77.8   33.6   39.4   50.8  133.0   82.3   96.1  180.9  120.8
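
To put rough numbers on the claims above, here is a small Python sketch that
computes the 3.3.4 / 3.3.2 ratio for a handful of entries copied from the
tables.  It assumes the figures are rates where higher is better, so that a
ratio above 1 is a speedup; the selection of entries and the helper name
speedup are just for illustration.

```python
# (3.3.2, 3.3.4) pairs copied from the tables above, double precision.
timings = {
    "dMM N=100": (1024.0, 1061.1),
    "dMM N=3000": (2218.9, 2229.9),
    "TRSM d100": (490.4, 636.8),
    "GER d1000": (37.3, 172.2),
    "AXPY d500000": (29.5, 133.0),
}


def speedup(old, new):
    """Ratio of new to old; above 1.0 means the new release is faster."""
    return new / old


for name, (old, new) in timings.items():
    print(f"{name:14s} {speedup(old, new):5.2f}x")
```

The large matrix multiply ratio stays very close to 1, matching the remark
that prefetch changes little asymptotically, while the Level 1 and 2 entries
improve by a much larger factor.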