                             GEMMW Release Notes
                              December 23, 1992


                                 0. Inventory
                                 ------------

gemmw.shar          This includes all of the rest of these files.  Unpack
                    them either using unshar (the preferred method) or the
                    command "sh gemmw.shar" from your favorite shell program.

Makefile            How to make everything of possible value.
README              This file.
clock.c             A clock interface for UNIX versions without one.
gemmw.c             GEMMW's source code.
gemmw.h             GEMMW's header file for real C compilers.
gemmwh.yuk          GEMMW's header file for turkey C compilers.
gemul3.f            Fortran-77 translation of GEMUL3.
gemul3.rat          Ratfor-77 source code for GEMUL3.
gemul390.f          Fortran-90 translation of GEMUL390.
gemul390.rat        Ratfor-90 source code for GEMUL390.
gemul3n.f           Fortran-77 translation of GEMUL3 using NAG.
gemul3n90.f         Fortran-90 translation of GEMUL390 using NAG.
heroux-1.f          Michael Heroux' test problem for a Cray Y-MP (6ns clock).
pfblas.f            Fortran-77 translation of pfblas.rat.
pfblas.rat          Ratfor-77 source for extra Level 3 BLAS (YAX, GEADD, GESUB).
pfblas90.f          Fortran-90 translation of pfblas90.rat.
pfblas90.rat        Ratfor-90 source for extra Level 3 BLAS (YAX, GEADD, GESUB).
second.c            Another clock interface -- THIS MAY NEED TO BE MODIFIED.
testc.c             C version of testc.f
testc.f             The single precision complex data test program.
testd.c             C version of testd.f
testd.f             The double precision real data test program.
tests.c             C version of tests.f
tests.f             The single precision real data test program.
testz.c             C version of testz.f
testz.f             The double precision complex data test program.
yale904.tex         LaTeX source for the Yale University Department of Computer
                    Science Report YALEU/DCS/904.


                               1. Introduction
                               ---------------

These notes are a quick supplement to the research report in yale904.tex.
Reading that report may be more useful than reading these notes.

In this set of informal release notes, an underscore (_) in a routine name
stands for one of the letters c, d, s, or z (sometimes all four, sometimes
just two of them), depending on the type of the data.  Using _gemmw with fewer
than 64 bits is crazy, so just don't do it.  The single precision versions are
provided for Crays, not for fast (or even slow) 32 bit workstations (never
mind 16, 24, or 38 bit disasters).

There are two main entry points:

    _gemmw      The actual routine with all sorts of memory allocation
                options depending on the size of an auxiliary area.
    _gemmb      A drop in replacement for calls to _gemm (only the names
                have been changed to protect the innocent).  This calls
                _gemmw and lets that allocate auxiliary space dynamically.
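
The split between a caller-supplied auxiliary area and dynamic allocation can
be pictured with the following C sketch.  It is not taken from gemmw.c: the
names acquire_workspace, release_workspace, and needed are hypothetical, and
these notes do not say how much auxiliary space _gemmw actually requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of GEMMW.  Returns a work area holding at
   least `needed` doubles.  The caller's area `aux` (of `naux` doubles) is
   used when it is large enough; otherwise a block is obtained from malloc
   and *owned is set to 1 so the caller knows it must be freed later. */
double *acquire_workspace(double *aux, size_t naux, size_t needed, int *owned)
{
    assert(owned != NULL);
    if (aux != NULL && naux >= needed) {
        *owned = 0;
        return aux;
    }
    *owned = 1;
    /* May return NULL if the allocation fails; callers must check. */
    return malloc((needed ? needed : 1) * sizeof(double));
}

/* Frees the work area only if acquire_workspace allocated it. */
void release_workspace(double *work, int owned)
{
    if (owned)
        free(work);
}
```

In this sketch, passing naux = 1 whenever more than one element is needed
always takes the malloc path.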

There is no real installation procedure yet.  Using the Makefile will produce
some .a files that other programs can link against, however.  As noted in the
Makefile, do not delete the .f files unless you have Craig Douglas' .rat to .f
translator.  If you have a problem, send an e-mail message to one of the
authors listed at the end of this file.

You can hand tune the crossover point at which the Strassen-Winograd routine
_winos switches to the classical algorithm by changing the value of the macro
"mindim" in either the Makefile or gemmw.h.  A quick, but very unscientific,
method is to build the test programs for your machine several times with
different values of mindim, run them, and see where the minimum run times
occur.  You only have to get close to see a marked improvement over the
classical algorithm on many machines.  Start small (mindim = 32) and work up
in increments of 16 or 32.
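
To make the role of the crossover point concrete, here is a small C sketch of
a recursive multiply that falls back to the classical loop at or below a
cutoff.  It is an illustration under stated assumptions, not the code in
gemmw.c: it uses Strassen's original seven-product recursion rather than the
Winograd variant, handles square row-major matrices only, and the names
strassen_sketch, classical, combine, and cutoff (standing in for mindim) are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Classical product C = A*B for n-by-n row-major blocks. */
static void classical(int n, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

/* T = X + sign*Y for h-by-h blocks; T is stored contiguously. */
static void combine(int h, const double *X, int ldx,
                    const double *Y, int ldy, double sign, double *T)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            T[i * h + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

/* C = A*B by Strassen's recursion.  Falls back to the classical loop when
   n <= cutoff or n is odd.  C must not overlap A or B.
   Returns 0 on success, -1 if malloc fails. */
int strassen_sketch(int n, int cutoff, const double *A, int lda,
                    const double *B, int ldb, double *C, int ldc)
{
    assert(n > 0 && cutoff >= 1);
    if (n <= cutoff || n % 2 != 0) {
        classical(n, A, lda, B, ldb, C, ldc);
        return 0;
    }
    int h = n / 2, rc = 0;
    size_t hh = (size_t)h * h;
    const double *A11 = A, *A12 = A + h, *A21 = A + (size_t)h * lda;
    const double *B11 = B, *B12 = B + h, *B21 = B + (size_t)h * ldb;
    const double *A22 = A21 + h, *B22 = B21 + h;
    double *buf = malloc(9 * hh * sizeof *buf);
    if (buf == NULL)
        return -1;
    double *S = buf, *T = buf + hh, *M[7];
    for (int i = 0; i < 7; i++)
        M[i] = buf + (2 + i) * hh;

    /* The seven half-size products. */
    combine(h, A11, lda, A22, lda, 1.0, S);
    combine(h, B11, ldb, B22, ldb, 1.0, T);
    rc |= strassen_sketch(h, cutoff, S, h, T, h, M[0], h);
    combine(h, A21, lda, A22, lda, 1.0, S);
    rc |= strassen_sketch(h, cutoff, S, h, B11, ldb, M[1], h);
    combine(h, B12, ldb, B22, ldb, -1.0, T);
    rc |= strassen_sketch(h, cutoff, A11, lda, T, h, M[2], h);
    combine(h, B21, ldb, B11, ldb, -1.0, T);
    rc |= strassen_sketch(h, cutoff, A22, lda, T, h, M[3], h);
    combine(h, A11, lda, A12, lda, 1.0, S);
    rc |= strassen_sketch(h, cutoff, S, h, B22, ldb, M[4], h);
    combine(h, A21, lda, A11, lda, -1.0, S);
    combine(h, B11, ldb, B12, ldb, 1.0, T);
    rc |= strassen_sketch(h, cutoff, S, h, T, h, M[5], h);
    combine(h, A12, lda, A22, lda, -1.0, S);
    combine(h, B21, ldb, B22, ldb, 1.0, T);
    rc |= strassen_sketch(h, cutoff, S, h, T, h, M[6], h);

    /* Assemble the four quadrants of C from the seven products. */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            size_t k = (size_t)i * h + j;
            C[i * ldc + j]           = M[0][k] + M[3][k] - M[4][k] + M[6][k];
            C[i * ldc + j + h]       = M[2][k] + M[4][k];
            C[(i + h) * ldc + j]     = M[1][k] + M[3][k];
            C[(i + h) * ldc + j + h] = M[0][k] - M[1][k] + M[2][k] + M[5][k];
        }
    free(buf);
    return rc ? -1 : 0;
}
```

A small cutoff spends more time in recursion overhead and temporary storage,
while a large one gives up the savings of the seven-product step, which is why
the best value has to be found by experiment on each machine.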


                           2. Making the Libraries
                           -----------------------

On some machines you will need a file named libblas.a.  This is an object code
archive of the BLAS.  We used the set that is part of LAPACK.  LAPACK is
distributed through netlib.  Send the message "send index from lapack" to the
nearest netlib server (e.g., research.att.com or netlib.ornl.gov) to find out
how to get the BLAS.  Once you have them, you can create the file libblas.a
(which should be renamed and probably moved) using the command "make all" in
the directory where you unpacked the BLAS.

First, go through the Makefile and find the section related to your computer.
The sections currently are for the following machines:

        Cray-2, Cray YMP, Cray C90, DEC 5000 (Ultrix), IBM AIX/370,
        IBM RISC System/6000, Sequent Symmetry, SUN Sparcs

We hope that people will send us more configurations.

You will need to uncomment (by removing the leading '# ' characters) the
parameter definitions.  You should also modify the definitions of ALL and LIBS
at the top of the Makefile.  If your machine uses 64 bit words as single
precision, you will need to modify the files rs.f and cs.f using the
instructions at the top of each file.

To make the simplest of test programs (do not read it, please), just type

        make                            <--- standard UNIX make

or

        gmake MAKE=gmake                <--- GNU make

at the prompt.  This will produce a collection of executables from the set of
names listed near the top of the Makefile.  Note that the Makefile seems to
confuse the make program on the Sequent I use.  I got a copy of the GNU make
sources by anonymous ftp from ftp.uu.net, in the directory packages/gnu.

Some machines do not understand the Fortran construct "double complex" that is
in some of the pfblas.f files.  Try "complex*16" instead on these machines.

On all of these machines the routine _gemul3 will automatically be used in the
complex data cases.  This routine does classical matrix multiplication using a
reduced real matrix multiplication scheme described in yale904.tex.
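
One well-known way to reduce the real work in a complex product is to form it
from three real matrix products instead of four.  The C sketch below
illustrates that idea only.  It is an assumption that this resembles the
arrangement used in _gemul3; yale904.tex remains the reference for what the
routine actually does.  The names real_mm and complex_mm_3m are hypothetical,
and the matrices are square, row-major, and split into separate real and
imaginary parts for simplicity.

```c
#include <assert.h>
#include <stdlib.h>

/* Classical real product C = A*B for contiguous n-by-n row-major matrices. */
static void real_mm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

/* (Cr + i Ci) = (Ar + i Ai)(Br + i Bi) using three real products:
       T1 = Ar*Br,  T2 = Ai*Bi,  T3 = (Ar + Ai)*(Br + Bi),
       Cr = T1 - T2,  Ci = T3 - T1 - T2.
   Returns 0 on success, -1 if malloc fails. */
int complex_mm_3m(int n, const double *Ar, const double *Ai,
                  const double *Br, const double *Bi,
                  double *Cr, double *Ci)
{
    assert(n > 0);
    size_t nn = (size_t)n * n;
    double *buf = malloc(5 * nn * sizeof *buf);
    if (buf == NULL)
        return -1;
    double *T1 = buf, *T2 = buf + nn, *T3 = buf + 2 * nn;
    double *SA = buf + 3 * nn, *SB = buf + 4 * nn;

    for (size_t k = 0; k < nn; k++) {
        SA[k] = Ar[k] + Ai[k];
        SB[k] = Br[k] + Bi[k];
    }
    real_mm(n, Ar, Br, T1);
    real_mm(n, Ai, Bi, T2);
    real_mm(n, SA, SB, T3);
    for (size_t k = 0; k < nn; k++) {
        Cr[k] = T1[k] - T2[k];
        Ci[k] = T3[k] - T1[k] - T2[k];
    }
    free(buf);
    return 0;
}
```

The trade is one fewer real matrix product in exchange for a few extra matrix
additions, which pays off because the products dominate the cost for large
matrices.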

Note that we used NAG Mark 14, which runs like molasses on vector machines
since NAG does not seem to have released a vector version yet.  Hence, it is
unfair to compare scalar NAG against vector anything, since this is an apples
and oranges comparison (so do not do it).

Finally, not all manufacturers have seen fit to produce an ANSI C compiler.
Some C compilers are no longer compatible with the Berkeley C compiler for BSD
4.3, which has a marvelous bug in its preprocessor that allows macro
parameters to be appended to each other without a blank being put in between
them.  There is an old version of the header file, gemmwh.yuk, which can be
used instead of gemmw.h.  Do not use the old file unless you absolutely have
to do so.  There are no guarantees that it will be kept up to date or even
distributed in the future.  Instead, please send in a bug report to your
computer manufacturer about their C compiler.  This will probably decrease the
pay raises to the compiler group until they reduce the number of bug reports
per year.
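
For readers unfamiliar with the issue: ANSI C joins macro parameters with the
## operator, whereas some pre-ANSI preprocessors could be coaxed into the same
effect by placing an empty comment between the parameters.  ANSI C replaces
each comment with a single space, so that older trick no longer works.  The
sketch below shows only the ANSI form.  The macro names PASTE and SCALE and
the two routines are hypothetical and are not copied from gemmw.h or
gemmwh.yuk.

```c
#include <assert.h>

/* ANSI C token pasting: PASTE(d, scale) becomes the identifier dscale. */
#define PASTE(prefix, name) prefix##name

/* Two hypothetical routines that differ only in their type prefix. */
static float  sscale(float x)  { return 2.0f * x; }
static double dscale(double x) { return 2.0 * x; }

/* Select a routine by its prefix at compile time. */
#define SCALE(prefix, x) PASTE(prefix, scale)(x)

double paste_demo(void)
{
    /* The s prefix selects the float routine, the d prefix the double one. */
    assert(sizeof SCALE(s, 1.0f) == sizeof(float));
    return (double)SCALE(s, 1.5f) + SCALE(d, 2.25);
}
```

Here SCALE(s, 1.5f) expands to sscale(1.5f) and SCALE(d, 2.25) expands to
dscale(2.25), so one macro body can serve every data type prefix.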

                                  3. Testing
                                  ----------

To execute all of the programs, try

        make runem2

and sit back and watch the numbers go by (possibly slowly).  Better yet, go do
some other work and come back in a while.  If the times seem quite weird, the
scaling factor in clock.c or second.c is probably wrong.  Fix the constant and
run make again.  Please send us the fix.
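
To show where such a scaling factor enters, here is a minimal C sketch of a
timer built on the standard clock() function.  It is not the contents of
clock.c or second.c, and the name cpu_seconds is hypothetical.  In ANSI C the
factor is CLOCKS_PER_SEC; code that hard-codes a tick rate instead will report
every time off by a constant ratio if the rate is wrong for the machine.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical timer: processor time used so far, in seconds. */
double cpu_seconds(void)
{
    clock_t ticks = clock();
    assert(ticks != (clock_t)-1);   /* clock() reports failure this way */
    /* The divisor is the scaling factor from ticks to seconds. */
    return (double)ticks / (double)CLOCKS_PER_SEC;
}
```

Timing a region then amounts to calling cpu_seconds() before and after it and
subtracting the two values.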

To execute a particular one of the test programs (e.g., testsb), try

        testsb

This will start executing with square matrices of size N = 100 and keep
increasing N by 20% until N passes 1826.  Since this uses a lot of memory, you
may want (or have) to change the definition of nmax at the top of both testd.f
and tests.f (setting naux to 1 forces _gemmw to do a malloc always, which you
may find beneficial).  Alternatively, you may want to change the definition of
nn.  If all else fails, there are C versions of the important Fortran files.
Contact us for more information.
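
The sequence of sizes and a rough memory estimate can be reproduced with a few
lines of C.  This sketch assumes that the 20% step is taken in integer
arithmetic with truncation and that three N-by-N double precision matrices are
held at once.  Neither detail is confirmed by these notes, and the function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fills sizes[] with start, then each size plus 20% (truncated), while the
   size does not exceed limit.  Returns the number of sizes written, which
   is at most max. */
int test_sizes(int start, int limit, int *sizes, int max)
{
    assert(start >= 5 && sizes != NULL);   /* start >= 5 ensures n grows */
    int count = 0;
    for (int n = start; n <= limit && count < max; n += n / 5)
        sizes[count++] = n;
    return count;
}

/* Bytes needed for three n-by-n matrices of doubles (A, B, and C). */
double three_matrix_bytes(int n)
{
    return 3.0 * (double)n * (double)n * (double)sizeof(double);
}
```

Under these assumptions, test_sizes(100, 1826, ...) yields 17 sizes (100, 120,
144, 172, ..., 1522, 1826), and with 8-byte doubles the largest case needs
about 80 million bytes for the three matrices alone.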

Finally, the calls to _gemms and _gemm are commented out in testd.f and
tests.f.  You may want to uncomment these calls if your machine has them.

We strongly urge you to generate your own test programs.  The file heroux-1.f
is another test program.  It is set up specifically for a particular Cray Y-MP
(6ns).  We really do not want to release our own test programs since we know
that they already work on a number of machines.

If you find a problem, please let us know.  If possible, send us a copy of a
simple example that fails.  Hopefully, this will never occur...

Good luck,

Craig Douglas               na.cdouglas@na-net.ornl.gov
Michael Heroux              mamh@cray.com
Gordon Slishman             slishmn@watson.ibm.com
Roger Smith                 smith-roger@cs.yale.edu
