# PDS: The Performance Database Server

## Mm_1

****************************************
*  Matrix Multiply Algorithm Results   *
*  Results file: mm_1.tbl              *
*  Source file: mm.c                   *
*  RAM usage: Need at LEAST 10 MBytes  *
*  Al Aburto, aburto@nosc.mil          *
*  01 Oct 1997                         *
****************************************
The Matrix Multiply program mm.c is by Mark Smotherman. His email
address is: mark@cs.clemson.edu. Please contact Mark with questions,
comments, results showing wide variations, or anything else regarding
the mm.c code. Any results I receive (Al Aburto, aburto@nosc.mil)
I'll pass along to Mark too.

This table of results is kept at 'ftp.nosc.mil' (128.49.192.51) in
directory 'pub/aburto'. You can access this and other programs and
results via anonymous ftp. I try to keep things regularly updated.

mm.c is a collection of nine matrix multiply algorithms. Five of those
algorithms were selected for this database. The algorithms and options
are shown below. Compile mm.c as: cc -O -DN=500 mm.c -o mm (or use
whatever other compile options you prefer) and then run mm with the
options shown below. NOTE: You must use '-DN=500', or else the matrix
size will be undefined.

The results are very interesting as they reveal the enormous effect
that cache thrashing can have on the results with different machines,
algorithms, compilers, and compiler options.

There are even more efficient algorithms tuned for specific machines.
Toshinori Maeno (tmaeno@cc.titech.ac.jp) of the Tokyo Institute of
Technology has sent me a few examples for HP, IBM, DEC, and Sun.

The MFLOPS rating (for FADD and FMUL) can be obtained from the results.
For example, for the T. Maeno algorithm (mm -m 20), the number of FADD
and FMUL instructions (weighted equally) is N * N * ( 2 * N + 25 )
= 256250000 (for N = 500). Therefore MFLOPS = N*N*(2*N+25) /
(1.0e6 * Runtime), where Runtime is in seconds (see table below). Thus
the IBM RS/6000 Model 560 is working at 256250000 / (1.0e6 * 3.44) =
74.5 MFLOPS relative to equally weighted FADD and FMUL instructions.
With a properly 'tuned' algorithm this could be improved further.

mm -n    :option n - normal matrix multiply
mm -u 8  :option u - innermost loop unrolled by factor of 8
mm -t    :option t - matrix multiply using transpose of b matrix
mm -b 32 :option b - matrix multiply using blocking (size 32) without
                     unrolling.
mm -m 20 :option m - matrix multiply using Maeno method of blocking
                     (size 20) and unrolling.
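Assuming mm.c is in the current directory and 'cc' is your system compiler (substitute your own compiler and flags as the text suggests), the compile and run steps might look like this command sketch:

```shell
# -DN=500 is required, or the matrix size will be undefined.
cc -O -DN=500 mm.c -o mm

# Run the five algorithms used in this table:
./mm -n      # normal matrix multiply
./mm -u 8    # innermost loop unrolled by factor of 8
./mm -t      # transpose of b matrix
./mm -b 32   # blocking (size 32) without unrolling
./mm -m 20   # Maeno blocking (size 20) with unrolling
```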