CS 594 - Applications of Parallel Computing
Due February 7th, 2001
Implement, in Fortran or C, the six different ways to perform matrix multiplication obtained by interchanging the three nested loops. (Use 64-bit floating-point arithmetic.) Make each implementation a subroutine, e.g.:
subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
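For concreteness, the ijk ordering might look like the following C sketch. The column-major layout and the interpretation of the arguments (a is m-by-k, b is k-by-n, c is m-by-n, with leading dimensions lda/ldb/ldc) are assumptions about the intended interface, chosen to match the Fortran-style argument list above:

```c
#include <stddef.h>

/* ijk loop ordering: c(i,j) += a(i,p) * b(p,j), accumulated into c.
   Matrices are stored column-major (Fortran convention) with the
   given leading dimensions; a is m x k, b is k x n, c is m x n. */
void ijk(const double *a, int m, int n, int lda,
         const double *b, int k, int ldb,
         double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                c[i + (size_t)j * ldc] += a[i + (size_t)p * lda]
                                        * b[p + (size_t)j * ldb];
}
```

The other five variants (ikj, jik, jki, kij, kji) differ only in the nesting order of the three loops; their performance differs because each ordering strides through a, b, and c with a different memory-access pattern.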
Construct a driver program that generates random matrices and calls each matrix-multiply routine with square matrices of orders 50, 100, 150, 200, …, 500, timing the calls and computing the Mflop/s rate.
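The Mflop/s computation can be sketched as follows. This is a minimal sketch: the helper names are hypothetical, and clock() stands in for the PAPI timing routines the assignment asks for:

```c
#include <stdio.h>
#include <time.h>

/* A dense n x n matrix multiply performs 2*n^3 floating-point
   operations (one multiply and one add per innermost iteration),
   so the rate in Mflop/s is 2*n^3 / (seconds * 1e6). */
double mflops(int n, double seconds)
{
    return 2.0 * n * n * n / (seconds * 1.0e6);
}

/* Driver skeleton: time one multiply routine over the required
   orders.  clock() is a portable stand-in here; the assignment
   calls for PAPI timing routines instead. */
void time_orders(void (*matmul)(int n))
{
    for (int n = 50; n <= 500; n += 50) {
        clock_t t0 = clock();
        matmul(n);  /* fill random matrices and multiply */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("n = %3d  %8.1f Mflop/s\n", n, mflops(n, secs));
    }
}
```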
Run your program on a processor of the TORC cluster.
torc0.cs.utk.edu – torc8.cs.utk.edu
Intel Pentium II 550 MHz:
Use the highest level of compiler optimization. Include in your timing routine calls to PAPI routines for timing measurements, and calls to the BLAS matrix-multiply routine DGEMM from ATLAS.
Download and build ATLAS for this part.
call dgemm( 'No', 'No', n, n, n, 1.0d0, a, lda, b, ldb, 1.0d0, c, ldc )
For ATLAS see http://www.netlib.org/atlas/.
For PAPI see: http://icl.cs.utk.edu/papi/ and for an example of PAPI use see: http://www.cs.utk.edu/~dongarra/WEB-PAGES/SPRING-2001/fflop.F
To compile and link with PAPI:
f77 -I/usr/local/include fflops.F /usr/local/lib/libpapi.a -o fflops
cc -I/usr/local/include -o flops flops.c -L/usr/local/lib -lpapi
Write up a description of the timings and explain why the routines perform as they do.
The goal is to optimize matrix multiplication on these machines. Use whatever optimization techniques you can to improve the performance.
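One standard technique is cache blocking (tiling). The following is a sketch under assumed simplifications (square column-major matrices with leading dimension n, not the assignment's full subroutine interface): the loops walk NB-by-NB tiles so each tile of a and b is reused from cache across many updates to c.

```c
#include <stddef.h>

#define NB 32  /* block size: tune so three NB x NB tiles of
                  doubles fit in the processor's L1 data cache */

/* Cache-blocked multiply, c += a*b, square n x n column-major
   matrices.  The inner jki-style loops hoist b(p,j) into a scalar
   and stride unit-length down columns of a and c. */
void blocked(const double *a, const double *b, double *c, int n)
{
    for (int jj = 0; jj < n; jj += NB)
        for (int pp = 0; pp < n; pp += NB)
            for (int ii = 0; ii < n; ii += NB) {
                int jmax = jj + NB < n ? jj + NB : n;
                int pmax = pp + NB < n ? pp + NB : n;
                int imax = ii + NB < n ? ii + NB : n;
                for (int j = jj; j < jmax; j++)
                    for (int p = pp; p < pmax; p++) {
                        double bpj = b[p + (size_t)j * n];
                        for (int i = ii; i < imax; i++)
                            c[i + (size_t)j * n] +=
                                a[i + (size_t)p * n] * bpj;
                    }
            }
}
```

The best NB depends on the cache sizes of the target processor, so it is worth timing a few values rather than fixing one in advance.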