PDS: The Performance Database Server

Mm_1

****************************************
* Matrix Multiply Algorithm Results    *
* Results file: mm_1.tbl               *
* Source file: mm.c                    *
* RAM usage: Need at LEAST 10 MBytes   *
* Al Aburto, aburto@nosc.mil           *
* 01 Oct 1997                          *
****************************************

The Matrix Multiply program mm.c is by Mark Smotherman. His email address is: mark@cs.clemson.edu. Please contact Mark with questions or comments about the mm.c code, or with results that show wide variations. Any results I receive (Al Aburto, aburto@nosc.mil) I'll pass along to Mark too.

This table of results is kept at 'ftp.nosc.mil' (128.49.192.51) in directory 'pub/aburto'. You can access this and other programs and results via anonymous ftp. I try to keep things regularly updated.

mm.c is a collection of nine matrix multiply algorithms. Five of those algorithms were selected for this database. The algorithms and options are shown below. Compile mm.c as:

    cc -O -DN=500 mm.c -o mm

(or use whatever other compile options you prefer) and then run mm with the options shown below. NOTE: You must use '-DN=500', otherwise the matrix size will be undefined.

The results are very interesting as they reveal the enormous effect that cache thrashing can have on the results with different machines, algorithms, compilers, and compiler options.

There are even more efficient algorithms tuned for specific machines. Toshinori Maeno (tmaeno@cc.titech.ac.jp) of the Tokyo Institute of Technology has sent me a few examples for HP, IBM, DEC, and Sun.

The MFLOPS rating (for FADD and FMUL) can be obtained from the results. For example, for the T. Maeno algorithm (mm -m 20), the number of FADD and FMUL instructions (weighted equally) is N * N * ( 2 * N + 25 ) = 256250000 (for N = 500). Therefore MFLOPS = N*N*(2*N+25) / (Runtime * 1000000), where Runtime is in seconds (see table below). Thus the IBM RS/6000 Model 560 is working at 256250000 / (3.44 * 1000000) = 74.5 MFLOPS relative to equally weighted FADD and FMUL instructions. With a properly 'tuned' algorithm this could be further improved.
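The arithmetic above can be checked with a few lines of C. This is only a sketch and not part of mm.c: the function names maeno_flop_count and mflops are made up for illustration, and the operation count is simply the N * N * ( 2 * N + 25 ) formula quoted above for the Maeno variant.

```c
/* Hypothetical helpers, not taken from mm.c. */

/* FADD + FMUL count (weighted equally) quoted in the text
   for the Maeno variant (mm -m 20) with matrix size n. */
static double maeno_flop_count(double n)
{
    return n * n * (2.0 * n + 25.0);
}

/* MFLOPS = operation count / (runtime in seconds * 1.0e6). */
static double mflops(double n, double runtime_seconds)
{
    return maeno_flop_count(n) / (runtime_seconds * 1.0e6);
}
```

With N = 500 and a runtime of 3.44 seconds, mflops(500, 3.44) comes out at roughly 74.5, which matches the RS/6000 Model 560 figure given above.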

mm -n     :option n - normal matrix multiply
mm -u 8   :option u - innermost loop unrolled by factor of 8
mm -t     :option t - matrix multiply using transpose of b matrix
mm -b 32  :option b - matrix multiply using blocking (size 32) without unrolling.
mm -m 20  :option m - matrix multiply using Maeno method of blocking (size 20) and unrolling. <<<
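
For readers who have not looked at mm.c, the sketch below illustrates three of the ideas named in the option list: the normal triple loop, multiplying against a transposed copy of b, and blocking without unrolling. It is not Mark Smotherman's code. The function names, the flat row-major array layout, and the loop order are all assumptions made for illustration, and mm.c itself may be organized differently.

```c
#include <stdlib.h>

/* Rough analogue of option n: plain i-j-k triple loop.
   In the innermost loop b is read down a column (stride n). */
static void mm_normal(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Rough analogue of option t: transpose b first so that both
   operands are read with unit stride in the innermost loop.
   Returns 0 on success, -1 if the temporary cannot be allocated. */
static int mm_transpose(int n, const double *a, const double *b, double *c)
{
    double *bt = malloc((size_t)n * (size_t)n * sizeof *bt);
    if (bt == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    free(bt);
    return 0;
}

/* Rough analogue of option b: blocking with block size bs and no
   unrolling. Each (jj, kk) block of b is reused for every row of a. */
static void mm_blocked(int n, int bs, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int jj = 0; jj < n; jj += bs)
        for (int kk = 0; kk < n; kk += bs) {
            int jmax = (jj + bs < n) ? jj + bs : n;
            int kmax = (kk + bs < n) ? kk + bs : n;
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jmax; j++) {
                    double sum = c[i * n + j];
                    for (int k = kk; k < kmax; k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
        }
}
```

All three routines compute the same product; they differ only in the order in which memory is touched. In the normal loop b is read with a stride of n elements, which is the kind of access pattern that can thrash a cache, while the transpose and blocked versions keep more of the working data close together. How much this helps depends on the machine, the compiler, and the compiler options.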