# PDS: The Performance Database Server

## Mm_2

****************************************
* Matrix Multiply Algorithm Results *
* Results file: mm_2.tbl *
* Source file: mm.c *
* RAM usage: Need at LEAST 10 MBytes *
* Al Aburto, aburto@nosc.mil *
* 01 Oct 1997 *
****************************************
The Matrix Multiply program mm.c is by Mark Smotherman
(mark@cs.clemson.edu). Please contact Mark with questions or comments
about the mm.c code, or with results that show wide variations. I
(Al Aburto, aburto@nosc.mil) will also pass along to Mark any results
I receive.

This table of results is kept at 'ftp.nosc.mil' (128.49.192.51) in
directory 'pub/aburto'. You can access this and other programs and
results via anonymous ftp. I try to update them regularly.

mm.c is a collection of nine matrix multiply algorithms. Four of those
algorithms were selected for this database. The algorithms and options
are shown below. Compile mm.c with 'cc -O -DN=500 mm.c -o mm' (or use
whatever other compile options you prefer) and then run mm with the
options shown below. NOTE: You must use '-DN=500'; otherwise the matrix
size will be undefined.

The results are interesting because they reveal how strongly cache
thrashing can affect run time across different machines, algorithms,
compilers, and compiler options.

There are even more efficient algorithms tuned for specific machines.
Toshinori Maeno (tmaeno@cc.titech.ac.jp) of the Tokyo Institute of
Technology has sent me a few examples for HP, IBM, DEC, and Sun.

The MFLOPS rating (for FADD and FMUL) can be obtained from the results.
For example, for the D. Warner algorithm (mm -w 50), the number of FADD
and FMUL instructions (weighted equally) is 2*N*N*N = 250,000,000
(for N = 500). Therefore MFLOPS = 2*N*N*N / (1,000,000 * Runtime), where
Runtime is in seconds (see table below). Thus the IBM RS/6000 Model 950
is working at 250/3.65 = 68.5 MFLOPS relative to equally weighted FADD
and FMUL instructions with the D. Warner algorithm with blocking of size
50. With a properly 'tuned' algorithm this could be improved further.
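
As a worked check of the arithmetic above, the following C sketch computes
the operation count and the MFLOPS rating from a run time. The function
names mm_flop_count and mm_mflops are illustrative only and are not part
of mm.c.

```c
/* Floating-point operations in an N x N matrix multiply:
 * N*N*N multiplies plus N*N*N adds, weighted equally. */
double mm_flop_count(int n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* MFLOPS = operations / (run time in seconds * 1,000,000). */
double mm_mflops(int n, double runtime_sec)
{
    return mm_flop_count(n) / (runtime_sec * 1.0e6);
}
```

With N = 500 and the 3.65 second run time quoted above,
mm_mflops(500, 3.65) evaluates to roughly 68.5.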

mm -p    : option p - matrix multiply using pointers
mm -v    : option v - normal matrix multiply using temp variable
mm -i    : option i - matrix multiply with interchanged loops
mm -w 50 : option w - matrix multiply using D. Warner method of blocking (size 50) and unrolling
mm -w 20 : option w - matrix multiply using D. Warner method of blocking (size 20) and unrolling
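
To illustrate how the loop orderings listed above differ, here is a
minimal C sketch of three variants: a normal multiply that accumulates
into a temp variable, a multiply with the j and k loops interchanged, and
a simple blocked multiply. This is not the code from mm.c. The function
names, the IDX macro, and the flat row-major layout are assumptions made
for illustration, and the blocked version shows plain blocking of the j
and k loops only, without the unrolling that the D. Warner method adds.

```c
#include <stddef.h>

/* Hypothetical helper: index of element (i,j) in a row-major n x n matrix. */
#define IDX(i, j, n) ((size_t)(i) * (size_t)(n) + (size_t)(j))

/* Normal i-j-k multiply; the dot product is accumulated in a temp variable. */
void mm_temp(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[IDX(i, k, n)] * b[IDX(k, j, n)];
            c[IDX(i, j, n)] = sum;
        }
    }
}

/* Interchanged loops (i-k-j): the inner loop walks rows of b and c
 * with unit stride instead of striding down a column of b. */
void mm_interchanged(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double aik = a[IDX(i, k, n)];
            for (int j = 0; j < n; j++)
                c[IDX(i, j, n)] += aik * b[IDX(k, j, n)];
        }
    }
}

/* Simple blocking: work on bs x bs tiles of b so that a tile can stay
 * in cache while it is reused for every row i. */
void mm_blocked(int n, int bs, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int jj = 0; jj < n; jj += bs) {
        int jmax = (jj + bs < n) ? jj + bs : n;
        for (int kk = 0; kk < n; kk += bs) {
            int kmax = (kk + bs < n) ? kk + bs : n;
            for (int i = 0; i < n; i++) {
                for (int k = kk; k < kmax; k++) {
                    double aik = a[IDX(i, k, n)];
                    for (int j = jj; j < jmax; j++)
                        c[IDX(i, j, n)] += aik * b[IDX(k, j, n)];
                }
            }
        }
    }
}
```

All three functions perform the same 2*N*N*N floating-point operations,
so any difference in run time comes from how each ordering uses the cache.
The block size bs plays the role of the 50 or 20 passed to 'mm -w'.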