**CS 594 - Applications of Parallel Computing**

**Assignment 2**

**Due February 7th, 2001**

Part 1:

Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:

subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )

subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )

...
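The six orderings contain identical arithmetic and differ only in how the three loops are nested. As one possible sketch (in C, using column-major storage with leading dimensions to match the Fortran-style interface above; `double` satisfies the 64-bit requirement, and the loop variable over the shared dimension is renamed `p` to avoid clashing with the argument `k`):

```c
#include <stddef.h>

/* C = C + A*B.  A is m x k (leading dimension lda), B is k x n (ldb),
   C is m x n (ldc).  Column-major storage, Fortran style. */
void ijk(const double *a, int m, int n, int lda,
         const double *b, int k, int ldb,
         double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                c[i + (size_t)j * ldc] +=
                    a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
}

/* Same arithmetic in i-k-j order; only the memory-access pattern changes. */
void ikj(const double *a, int m, int n, int lda,
         const double *b, int k, int ldb,
         double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                c[i + (size_t)j * ldc] +=
                    a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
}
```

The remaining four orderings (jik, jki, kij, kji) follow the same pattern; all six produce the same result, but their cache behavior differs, which is what the timings in this part are meant to expose.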

Construct a driver program that generates random matrices and calls each matrix multiply routine with square matrices of orders 50, 100, 150, 200, …, 500, timing the calls and computing the Mflop/s rate.
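Multiplying two order-n square matrices takes n³ multiplications and n³ additions, i.e. 2n³ floating-point operations, so the rate for each timed call can be computed with a small helper like this (a sketch; the function name is ours):

```c
/* Mflop/s for one order-n matrix multiply: 2*n^3 flops over `seconds`. */
double mflops(int n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}
```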

Run your program on a processor of the TORC cluster:

torc0.cs.utk.edu – torc8.cs.utk.edu: Intel Pentium II, 550 MHz

Use the highest level of optimization. Include in your timing routine calls to PAPI routines for timing measurements, and a call to the BLAS matrix multiply routine DGEMM from ATLAS.

Download and build ATLAS for this part.

call dgemm( 'No', 'No', n, n, n, 1.0d0, a, lda, b, ldb, 1.0d0, c, ldc )
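DGEMM computes C := alpha·A·B + beta·C; with alpha = beta = 1.0d0 as above, the product is accumulated onto whatever C already holds, so initialize C (e.g. to zero) before the call. A plain-C reference of what that call computes (our own helper for checking results against DGEMM, not the ATLAS routine itself) might look like:

```c
#include <stddef.h>

/* Reference for C := alpha*A*B + beta*C with non-transposed, column-major,
   order-n square matrices -- the operation dgemm('N','N',...) performs. */
void dgemm_ref(int n, double alpha, const double *a, int lda,
               const double *b, int ldb, double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int p = 0; p < n; p++)
                s += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            c[i + (size_t)j * ldc] = alpha * s + beta * c[i + (size_t)j * ldc];
        }
}
```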

For ATLAS see http://www.netlib.org/atlas/.

For PAPI see http://icl.cs.utk.edu/papi/, and for examples of PAPI use see:

http://www.cs.utk.edu/~dongarra/WEB-PAGES/SPRING-2001/fflop.F

http://www.cs.utk.edu/~dongarra/WEB-PAGES/SPRING-2001/flops.c

To compile and run with PAPI you can:

f77 -I/usr/local/include fflops.F /usr/local/lib/libpapi.a -o fflops

cc -I/usr/local/include -o flops flops.c -L/usr/local/lib -lpapi

Write up a description of your timings and explain why the routines perform as they do.

Part 2:

The goal is to optimize matrix multiplication on these machines. Use whatever optimization techniques you can to improve the performance.
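One standard technique is cache blocking (tiling): partition the matrices into blocks small enough that a block each of A, B, and C fit in cache together, so each block is reused many times before being evicted. A hedged sketch in C (column-major square matrices; the block size `NB` is a tunable assumption, not a prescribed value):

```c
#include <stddef.h>

#define NB 64  /* block size: tune so three NB x NB blocks fit in cache */

static int imin(int x, int y) { return x < y ? x : y; }

/* Blocked C = C + A*B for column-major order-n square matrices. */
void blocked_mm(int n, const double *a, const double *b, double *c)
{
    for (int jj = 0; jj < n; jj += NB)
        for (int pp = 0; pp < n; pp += NB)
            for (int ii = 0; ii < n; ii += NB)
                /* multiply one NB x NB block of A into one block of C */
                for (int j = jj; j < imin(jj + NB, n); j++)
                    for (int p = pp; p < imin(pp + NB, n); p++) {
                        double bpj = b[p + (size_t)j * n];
                        for (int i = ii; i < imin(ii + NB, n); i++)
                            c[i + (size_t)j * n] +=
                                a[i + (size_t)p * n] * bpj;
                    }
}
```

Other directions worth trying include loop unrolling, hoisting invariant loads (as with `bpj` above), and comparing your best kernel against ATLAS's DGEMM, which applies these same ideas with machine-tuned parameters.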