

         *** Superscalar GEMM-based level 3 BLAS ***
                       Beta version 0.1

               Bo Kagstrom and Per Ling
               Department of Computing Science
               Umea University
               S-901 87 Umea, Sweden
               bokg@cs.umu.se and pol@cs.umu.se

               November 30, 1997


Introduction and Summary
========================

The Superscalar GEMM-based Level 3 BLAS library is a further
development of the GEMM-based Level 3 BLAS (see [1], [2])
targeted towards superscalar processors.

The main differences from the GEMM-based Level 3 BLAS are:

- GEMM is the only level 1-3 BLAS called.

- Level 1-2 operations are replaced by in-line code.

- 4x2 unrolling of innermost loops.

- All memory references have stride 1, achieved through work
arrays and copying.

- We still handle "critical" leading dimensions.

- We have also developed a Superscalar DGEMM that is currently
used with the library (here we use 4x4 unrolling).

- GEMM-based performance benchmark results on an IBM PowerPC 604
processor in an IBM SMP node:

Improvements:

DSYMM   DSYRK   DSYR2K   DTRMM   DTRSM

 +3%    +28%     +2%     +23%    +25%

We observe substantial improvements for the routines where
the GEMM-based library used Level 1-2 routines (DSYRK, DTRMM,
and DTRSM), with up to 80% improvement for small matrices.
The practical peak for all routines is 100-106 Mflops/s (500 x 500).

- In the future, we can concentrate on making the best possible
GEMM routine!

- Next step is to parallelize the library using threads.

- We will also provide a Superscalar GEMM-based Level 3 BLAS
Performance Benchmark.
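To illustrate the unrolling and stride-1 points above, here is our own
sketch in C (the library itself is written in Fortran, and this is not
its code): a hypothetical 4x2 micro-kernel that reads packed, contiguous
panels of A and B and keeps all eight partial sums in registers.

```c
#include <stddef.h>

/* Hypothetical 4x2 micro-kernel: C(0:3, 0:1) += A_panel * B_panel.
 * a[] holds a 4 x K panel of A (column-major, stride 1); b[] holds a
 * K x 2 panel of B with each column stored contiguously.  All eight
 * accumulators stay in registers, and the inner loop touches memory
 * only with stride-1 accesses, mirroring the 4x2 unrolling above.  */
static void micro_kernel_4x2(size_t K, const double *a, const double *b,
                             double *c, size_t ldc)
{
    double c00 = 0.0, c10 = 0.0, c20 = 0.0, c30 = 0.0;
    double c01 = 0.0, c11 = 0.0, c21 = 0.0, c31 = 0.0;

    for (size_t k = 0; k < K; ++k) {
        double a0 = a[4*k + 0], a1 = a[4*k + 1];
        double a2 = a[4*k + 2], a3 = a[4*k + 3];
        double b0 = b[k];        /* column 0 of the B panel */
        double b1 = b[K + k];    /* column 1 of the B panel */

        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
        c20 += a2 * b0;  c21 += a2 * b1;
        c30 += a3 * b0;  c31 += a3 * b1;
    }
    /* Write the 4x2 block back into C (column-major, leading
     * dimension ldc). */
    c[0 + 0*ldc] += c00;  c[0 + 1*ldc] += c01;
    c[1 + 0*ldc] += c10;  c[1 + 1*ldc] += c11;
    c[2 + 0*ldc] += c20;  c[2 + 1*ldc] += c21;
    c[3 + 0*ldc] += c30;  c[3 + 1*ldc] += c31;
}
```

The copying into packed panels is what guarantees the stride-1 accesses
regardless of the leading dimensions of the original matrices.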


References
==========

[1] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS:
    High-Performance Model Implementations and Performance Evaluation
    Benchmark. Accepted for publication in ACM Trans. Math. Software,
    1997.

[2] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS:
    Portability and Optimization Issues. Accepted for publication in
    ACM Trans. Math. Software, 1997.


Availability
============

The Superscalar GEMM-based Level 3 BLAS library for double
precision real data can now be obtained via anonymous ftp from
ftp.cs.umu.se (get pub/ssgemmbased.tar.Z). The file can be unpacked
as follows:
   % uncompress ssgemmbased.tar.Z
   % tar xvf ssgemmbased.tar

This will give you the desired directory of routines.


Installation
============

The Superscalar GEMM-based Level 3 BLAS library is installed by
following Steps 1-5 below:

   1.  Set compiler flags in the Makefile. It is important to
       choose optimization flags carefully to achieve the best
       performance. The most aggressive levels of compiler
       optimization do not always give the best performance
       with these level 3 BLAS implementations.

   2.  Create the program 'dsgpm', which is used to set the
       blocking parameters:

          % make dsgpm

   3.  The level 3 BLAS routines can be tuned to the memory hierarchy
       characteristics of different machines (e.g., cache size) by
       specifying blocking parameters in a file similar to 'dgpm.in'.
       Otherwise, the supplied 'dgpm.in' can be used as-is.

   4.  Set the blocking parameters in each of the level 3 BLAS
       routines according to the specifications in 'dgpm.in' or
       a similar file with tuned blocking parameters.

          % dsgpm < dgpm.in

   5.  Create the library 'libgbl3b.a' containing the level 3 BLAS
       routines.

          % make


In the following we list some guidelines for tuning the blocking
parameters in Step 3 above. The values should be set so that the
requirements below are fulfilled.

An easy way to make the sizes of blocks match the level of
unrolling in the routines is to choose all blocking parameters
to be a multiple of four.

dgemm.f (MB, NB, NBT, KB):
     Set MB and KB so that a local array of size (KB x MB) fits
     in the L1 cache, or, say, in 75% of the cache.

     KB should be a multiple of the number of double precision words
     that fit in a single cache line.

     MB and NBT should not be larger than the number of entries in
     the TLB.

     NB is used to avoid excessive paging if the main memory
     is small. If performance decreases for large problems that
     do not fit in main memory, reduce NB.
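As a concrete sketch, the rules above can be turned into a small
calculation. The following C fragment is our own illustration, not the
library's procedure: the cache and TLB figures are assumptions for a
hypothetical machine, and the heuristic (pick MB first, then fit KB)
is one of several reasonable orderings.

```c
/* Round n down to the nearest multiple of m (m > 0). */
static int round_down(int n, int m) { return (n / m) * m; }

/* Hypothetical sketch: derive MB and KB for dgemm.f from the machine
 * parameters named in the guidelines above.  l1_bytes is the L1 data
 * cache size, line_bytes the cache-line size, tlb_entries the number
 * of data-TLB entries -- all assumed, not measured.                 */
static void choose_dgemm_blocking(int l1_bytes, int line_bytes,
                                  int tlb_entries, int *MB, int *KB)
{
    int line_words = line_bytes / 8;       /* doubles per cache line */
    int budget = (3 * l1_bytes / 4) / 8;   /* doubles in 75% of L1   */

    /* MB: at most the number of TLB entries, and a multiple of four
     * to match the level of unrolling. */
    *MB = round_down(tlb_entries, 4);

    /* KB: the KB x MB work array must fit in the budget, and KB must
     * be a multiple of the doubles per cache line. */
    *KB = round_down(budget / *MB, line_words);
}
```

For example, with an assumed 32 KB L1 cache, 32-byte cache lines, and
64 data-TLB entries, this picks MB = 64 and KB = 48, and the KB x MB
work array exactly fills 75% of the cache.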


dsymm.f and dsyr2k.f (RCB):
     A local array of size (RCB x RCB) is created. It is not
     necessary that it fits in cache for these routines.

     RCB should be a multiple of the number of double precision words
     that fit in a single cache line.

     RCB should not be larger than the number of entries in
     the TLB.


dsyrk.f (RB, CB):
     Set RB and CB so that a local array of size (RB x CB) fits
     in the L1 cache, or, say, in 75% of the cache.

     RB should be a multiple of the number of double precision words
     that fit in a single cache line.

     CB should not be larger than the number of entries in the
     TLB.


dtrmm.f and dtrsm.f (RCB, RB, CB):
     Set RCB so that a local array of size (RCB x RCB) fits
     in the L1 cache, or, say, in 75% of the cache.

     Set RB and CB so that a local array of size (RB x CB) fits
     in the L1 cache, or, say, in 75% of the cache.

     RCB and RB should be multiples of the number of double
     precision words that fit in a single cache line.

     RB and CB should not be larger than the number of entries in the
     TLB.

We encourage practical tests that measure the performance of the
routines in order to fine-tune the parameters within these limits.
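One way to run such practical tests is a small timing harness. The
sketch below is ours: it only times a plain triple-loop multiply as a
stand-in, whereas in practice you would call the library's DGEMM (or
DSYRK, DTRMM, ...) at the marked spot and rerun the measurement after
each change to the blocking parameters.

```c
#include <stdlib.h>
#include <time.h>

/* Time an n x n matrix multiply and return its speed in Mflop/s,
 * or -1.0 if the computed result is wrong.  The triple loop here is
 * only a placeholder for the routine being tuned.                  */
static double time_matmul_mflops(int n)
{
    double *a = malloc(sizeof(double) * (size_t)n * n);
    double *b = malloc(sizeof(double) * (size_t)n * n);
    double *c = calloc((size_t)n * n, sizeof(double));
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 1.0; }

    clock_t t0 = clock();
    /* Placeholder kernel: column-major C = A * B.  Replace with a
     * call to the tuned library routine in real measurements. */
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            for (int i = 0; i < n; ++i)
                c[i + j*n] += a[i + k*n] * b[k + j*n];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)            /* guard against clock() resolution */
        secs = 1e-9;

    /* 2*n^3 floating-point operations for a matrix multiply. */
    double mflops = 2.0 * n * n * (double)n / (secs * 1e6);
    if (c[0] != (double)n)      /* sanity check: each entry equals n */
        mflops = -1.0;

    free(a); free(b); free(c);
    return mflops;
}
```

Comparing the reported Mflop/s across a few matrix sizes (e.g., 100,
250, 500) for different parameter settings gives the fine-tuning
signal described above.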


Performance Results and Experiences
====================================

Since this code is still under development, we would appreciate your
comments on its design and performance. We look forward to receiving
comments and performance results via email (see the addresses above).

**************************************************************************
