Timing the
Collective Communications Module

This document describes the timing suite for the routines of the Collective Communications Module, CCM. The tests are performed using the programs:

aall2all allredricf allredrir gath redricf redrir
allredlif allredricr barrier redlif redricr scat
allredlir allredrif bcast redlir redrif

These programs time the CCM module calls shown in the list below. Each call is done a number of times with various amounts and types of data. The programs report the type of CCM call performed, the number of processors, the ranks of the input and output arrays, and the time taken for each call. A postprocessing program, format.f90, converts the data into HTML documents.

The barrier testing program works differently from the others. It calls ccm_barrier in a nested loop. The iteration count for the inner loop is increased until the time taken for the loop is one second. The time required for ccm_barrier is then calculated as (time for the loop / iteration count).
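
The calibration idea described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the code used by the barrier test: the names time_per_call, op_fn, clock_fn, fake_op and fake_clock are hypothetical, the iteration count is doubled on each pass (the text only says it is "increased"), and a simulated clock stands in for ccm_time so that the sketch runs without the CCM library.

```c
#include <assert.h>
#include <stddef.h>

typedef void   (*op_fn)(void);     /* operation being timed, e.g. a barrier call */
typedef double (*clock_fn)(void);  /* wall clock returning seconds */

/* Simulated clock for demonstration: every call to fake_op costs 1 ms. */
double fake_time = 0.0;
void   fake_op(void)    { fake_time += 1.0e-3; }
double fake_clock(void) { return fake_time; }

/* Repeat the inner loop with a growing iteration count until one pass takes
 * at least `target` seconds (or the count cannot be doubled without exceeding
 * `max_iters`), then return (time for the loop / iteration count). */
double time_per_call(op_fn op, clock_fn now, double target, long max_iters)
{
    assert(op != NULL && now != NULL && target > 0.0 && max_iters > 0);
    long iters = 1;
    double elapsed;
    for (;;) {
        double start = now();
        for (long i = 0; i < iters; i++)
            op();
        elapsed = now() - start;
        if (elapsed >= target || iters > max_iters / 2)
            break;
        iters *= 2;  /* assumption: doubling; the source does not give the step */
    }
    return elapsed / (double)iters;
}
```

With the simulated clock, a call such as time_per_call(fake_op, fake_clock, 1.0, 1L << 30) stops once the loop reaches 1024 iterations and returns roughly 0.001 seconds per call. In a real test the two function pointers would wrap ccm_barrier and the timer.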

The source for these tests is created from the files:

allredil.input allredri.input allredric.input alltoall.input
alltoallv.input bcast.input redil.input redri.input
redric.input scat_gat.input scat_gatv.input

These files contain data for a preprocessing program, make_timer.f90. Make_timer.f90 takes the *.input files and produces *.f90 files that contain the tests discussed above for user-selectable data types and input/output array ranks.

Consider the first 12 lines of the file bcast.input.

.true.
bcas
1 3 1 3
sp real(b4)
dp real(b8)
in integer
comp complex(c4)
dpcomp complex(c8)
logical logical
character character
!qp real(b16)
!qpcomp complex(c16)

The first line .true. indicates that we want to define this test only for input and output arrays of the same rank. (This is logical for ccm_bcast.)

The second line bcas gives the name of the output file produced by the test and a generic routine name to be used inside of the test program. The next line gives the ranks of the arrays for which this routine is defined, input ranks of 0 to 3 and output ranks of 0 to 3.

The next lines are:

sp real(b4)
dp real(b8)
in integer(def_int)
comp complex(c4)
dpcomp complex(c8)
logical logical
character character
!qp real(b16) 
!qpcomp complex(c16) 

These give information about the specific instances of the generic routine. The text in the first field is appended to the generic routine name to give base specific routine names. The base specific routine names then have the ranks of the input and output arrays appended to them to give a collection of specific routine names for a given generic routine.

The second field gives the data type for the routine. The kinds are defined in make_timer.f90 to map to the normal real and complex types, and the default integer type.

The last two lines are commented out. These lines can be used to test routines with quad precision values. If the module is not defined for quad precision values and these lines are uncommented the test program will not compile.

If you look at the rest of the bcast.input file you will see two-character strings of the form $1-$4, $h, $x. Make_timer.f90 replaces these with strings that are appropriate for the data type for which a routine is being built and the ranks of the input and output arrays. The mappings between these variables and their values are shown below.

$1 generic routine name "bcas" in this case
$2 input and output array ranks as integer pairs 00, 01, ... nm
$3 actual data type, one of the following:
 real(b4)
 real(b8)
 integer
 complex(c4)
 complex(c8)
 logical
 character
 real(b16)
 complex(c16)
$4 input data rank, replaced with one of the following:
"               " "(:)            "
"(:,:)          " "(:,:,:)        "
"(:,:,:,:)      " "(:,:,:,:,:)    "
"(:,:,:,:,:,:)  " "(:,:,:,:,:,:,:)"
$x output data rank, replaced with one of the following:
"               " "(1)            "
"(1,1)          " "(1,1,1)        "
"(1,1,1,1)      " "(1,1,1,1,1)    "
"(1,1,1,1,1,1)  " "(1,1,1,1,1,1,1)"
$h The first field of the header information, one of the following:
sp
dp
in
comp
dpcomp
logical
character
qp
qpcomp
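
The substitution step can be illustrated with a short C sketch. It is not the make_timer.f90 implementation (which is Fortran): the struct subst type, its field names, and expand_line are hypothetical, and the blank padding shown above for the $4 and $x values is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replacement strings for one generated routine (field names are hypothetical). */
struct subst {
    const char *name;     /* $1: generic routine name, e.g. "bcas"  */
    const char *ranks;    /* $2: rank pair, e.g. "11"               */
    const char *type;     /* $3: data type, e.g. "real(b4)"         */
    const char *in_dims;  /* $4: input data rank, e.g. "(:)"        */
    const char *out_idx;  /* $x: output data rank, e.g. "(1)"       */
    const char *header;   /* $h: first header field, e.g. "sp"      */
};

/* Copy `line` into `out`, replacing each two-character token by its value.
 * Other text, including a '$' followed by an unknown character, is copied
 * unchanged. Returns 0 on success, -1 if `out` is too small. */
int expand_line(const char *line, const struct subst *s, char *out, size_t outsz)
{
    assert(line != NULL && s != NULL && out != NULL && outsz > 0);
    size_t used = 0;
    while (*line != '\0') {
        const char *rep = NULL;
        if (line[0] == '$') {
            switch (line[1]) {
            case '1': rep = s->name;    break;
            case '2': rep = s->ranks;   break;
            case '3': rep = s->type;    break;
            case '4': rep = s->in_dims; break;
            case 'x': rep = s->out_idx; break;
            case 'h': rep = s->header;  break;
            default:  break;
            }
        }
        size_t n = rep ? strlen(rep) : 1;
        if (used + n >= outsz)          /* keep room for the terminator */
            return -1;
        memcpy(out + used, rep ? rep : line, n);
        used += n;
        line += rep ? 2 : 1;
    }
    out[used] = '\0';
    return 0;
}
```

For example, with the values in the comments above, the made-up template line "$3 :: a$4" expands to "real(b4) :: a(:)". The template lines in the real *.input files may look different.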

The preprocessing program also has some additional filters that allow conditional compilation. Lines that contain the text in the table shown below are only added to the final source if the routine that is being generated is for the prescribed rank.

Lines containing     are compiled if
! r1=0               input rank is 0
! r1>0               input rank is > 0
! r2=0               output rank is 0
! r2>0               output rank is > 0
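
A filter of this kind can be sketched in C as follows. This is an illustrative sketch only: keep_line is a hypothetical name, and it assumes the markers appear in the template lines exactly as spelled in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if a template line should be written to the generated source for
 * input rank r1 and output rank r2, or 0 if a rank marker excludes it.
 * Lines without any marker are always kept. */
int keep_line(const char *line, int r1, int r2)
{
    assert(line != NULL && r1 >= 0 && r2 >= 0);
    if (strstr(line, "! r1=0") != NULL && r1 != 0) return 0;
    if (strstr(line, "! r1>0") != NULL && r1 == 0) return 0;
    if (strstr(line, "! r2=0") != NULL && r2 != 0) return 0;
    if (strstr(line, "! r2>0") != NULL && r2 == 0) return 0;
    return 1;
}
```

Because the markers begin with !, they are ordinary Fortran comments in the lines that survive the filter, so the generated source still compiles.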

The programs generated by make_timer.f90 consist of a collection of subroutines to do the testing, one for each data type, input array rank and output array rank. Each program has a driver to call the subroutines and report results.

Make_timer takes three inputs from stdin on a single line: the name of the *.input file, the name of the *.f90 file, and an integer. The integer, the_max, determines the maximum array size for which a test will be run. Timings are done for sizes in the range ( ((2**m)*(10**n),m=0,3),n=0,the_max), or 1 to 8*(10**the_max). For the gather and scatter routines this is the size of the sending and receiving arrays, so the root allocates an array of this size times the number of processors. For the alltoall routines the arrays are also of this size times the number of processors.

The output of a test program begins as follows (here for the bcast test on 4 processors):

  10   3   4
          bcas       real(b4)     4    1    1
   1
ccm_bcast                                                                       
    0.000E+00    0.000E+00    0.000E+00    0.000E+00    0.000E+00
  999.928E-06    1.000E-03    0.000E+00  999.928E-06    0.000E+00
    0.000E+00    1.000E-03    0.000E+00  999.928E-06    1.000E-03
...
...

The first line gives the size of the array required to hold the data for a single test. The first number is the number of times a call was made. The second and third numbers give the range of array sizes for which a call was made; in this case the range is ( ((2**m)*(10**n),m=0,3),n=0,4). The next line gives a generic routine name, the data type for the test, the number of processors, and the ranks of the input and output arrays. The following line is an integer that gives the number of tests that were run for each routine; for example, ccm_reduce might be called with both the max and min operations. The next line is the name of the routine that is being tested. The times for the test follow. Times are in seconds, except for the barrier operation, for which they are in milliseconds. The barrier testing program reports a time of -1 if it determines that calling ccm_barrier a given large number of times will require longer than 1 second.
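
The size range ( ((2**m)*(10**n),m=0,3),n=0,the_max) can be enumerated with a few lines of C. This is a minimal sketch for checking which sizes a given the_max produces; test_sizes is a hypothetical name and the function is not part of the test suite. The limit of 8 on the_max is an assumption of the sketch that keeps the values within a 32-bit long.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `sizes` with (2**m)*(10**n) for m = 0..3 nested inside n = 0..the_max,
 * in the order of the implied loop (m varies fastest).
 * Returns the number of sizes written, 4*(the_max+1), or 0 if `cap` is too
 * small to hold them. */
size_t test_sizes(int the_max, long *sizes, size_t cap)
{
    assert(the_max >= 0 && the_max <= 8 && sizes != NULL);
    size_t needed = 4 * ((size_t)the_max + 1);
    if (cap < needed)
        return 0;
    size_t k = 0;
    long pow10 = 1;                         /* 10**n */
    for (int n = 0; n <= the_max; n++) {
        for (int m = 0; m <= 3; m++)
            sizes[k++] = (1L << m) * pow10; /* (2**m)*(10**n) */
        pow10 *= 10;
    }
    return k;
}
```

For the_max = 4 this produces 20 sizes: 1, 2, 4, 8, 10, 20, 40, 80, and so on up to 80000, which matches the stated range of 1 to 8*(10**the_max).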

The program format.f90 can take this output file and create an HTML file that contains the data in tables.

Running the tests

A script, do_test, is provided to run the tests. If the script is run with no inputs, or with the argument all, it will run all of the tests on 4 processors. You can select a single test by giving a test name. The script can take a second argument, the number of processors on which to run the tests.

This script assumes that the programs can be run with the command

mpirun -np #processors program_name

The script can be modified in the obvious places if this command is not available on your machine.

The script will make the programs before running them. The makefile included with the tests will compile against one of the two reference implementations. It assumes that the library is built in a subdirectory one level up from the testing programs. To use this makefile you must set the environment variable CCM_COM.

If you are running csh or a similar shell then before doing a make, do one of the following:

setenv CCM_COM  sgi_mpi
setenv CCM_COM  sgi_shmem
setenv CCM_COM  sv1_shmem
setenv CCM_COM  darwin
setenv CCM_COM  aix

depending on your platform and whether you are using the MPI or shmem reference implementation.

If you are running sh or a similar shell then do one of the following:

export CCM_COM=sgi_mpi
export CCM_COM=sgi_shmem
export CCM_COM=sv1_shmem
export CCM_COM=darwin
export CCM_COM=aix

Setting this variable selects one of the predefined make include files. Include files for other platforms can be created based on these files.

By default, the test suite is built for single and double precision real and complex values. To build the test suite for quad precision values also, do the following:

(1) In the *.input files remove ! from the lines

!qp real(b16)
!qpcomp complex(c16)

    The script, to_quad, can be used to perform these edits:

    to_quad *input

(2) In the file fill.f90 remove the initial ! from lines containing !qp

    The script, to_quad, can be used to perform these edits:

    to_quad fill.f90

(3) Make as discussed above

The makefile will run the program make_timer to create the source files for the tests. The makefile contains a line

SIZE=4

to set the maximum array size as discussed above.

On the Cray SV1 and some other platforms you might need to

make make_timer

before doing the make for the rest of the programs.

Timer resolution

The routines are timed using ccm_time. Ccm_time may not have sufficient resolution to get accurate times. You can replace the calls to ccm_time with another timer. The script, to_atime, will replace the calls to ccm_time with calls to a generic timing routine, "atime". It is invoked as

to_atime *input

If you make this replacement you need to supply your own version of "atime" and make the necessary changes to the makefile. Two example atime routines are

/* Works on most Unix systems, including IBM AIX. */
#include <stddef.h>
#include <sys/time.h>
double atime(void)
{
        const double six = 1.0e-6;      /* microseconds to seconds */
        struct timeval tb;
        gettimeofday(&tb, NULL);        /* the timezone argument is obsolete */
        return (double)tb.tv_sec + (double)tb.tv_usec * six;
}


/* Works on IBM AIX systems, better than gettimeofday. */
#include <sys/time.h>
#include <sys/types.h>
double atime();
double atime() {
  double nine=1.0e-9;
  struct timestruc_t tb;
  gettimer(TIMEOFDAY,&tb);
  return((double)tb.tv_sec+((double)tb.tv_nsec)*nine);
}
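
Before swapping timers it can be useful to measure how coarse a given timer actually is. The following is a minimal sketch, assuming a POSIX system; wall_clock and timer_resolution are hypothetical names that are not part of CCM or the test suite, and wall_clock simply repeats the gettimeofday approach so that the block is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Stand-in timer returning wall-clock seconds (gettimeofday based). */
double wall_clock(void)
{
    struct timeval tb;
    gettimeofday(&tb, NULL);
    return (double)tb.tv_sec + (double)tb.tv_usec * 1.0e-6;
}

/* Estimate the granularity of a timer: the smallest positive difference
 * observed between successive readings over `samples` clock ticks.
 * The function spins until the clock visibly advances, so it must only be
 * used with a timer that does advance. */
double timer_resolution(double (*now)(void), int samples)
{
    assert(now != NULL && samples > 0);
    double best = 0.0;
    for (int i = 0; i < samples; i++) {
        double t0 = now();
        double t1 = now();
        while (t1 <= t0)            /* wait for the next tick */
            t1 = now();
        if (best == 0.0 || t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling timer_resolution(wall_clock, 10) gives an estimate in seconds. If the estimate is comparable to the times being reported by the tests, a finer timer is probably needed.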

Cray SV1 warning message

The Cray SV1 might produce a warning message similar to the following:

     The length of common block '@DATA_in_CCM_NUMZ' has been redefined as
     larger by module 'ALLREDUCE_SP00_in_CCM_ALLREDUCE_' from file
    '../shmem_src/libccm.a'.

This warning message is related to a "use only" statement in the module and is of no consequence. The following example, using three files, f1.f90, f2.f90 and f3.f90, will generate the same message. If the files are combined into a single file, and that file is compiled, the warning message is not generated.

aurora % cat f1.f90
module ccm_numz 
    integer :: i 
    integer :: ccm_auto_print 
end module 

aurora % cat f2.f90
module ccm 
    use ccm_numz, only : ccm_auto_print 
end module 

aurora % cat f3.f90
program bonk 
  use ccm 
  write(*,*)"hello" 
end program 

aurora % f90 -a taskcommon -c f1.f90
aurora % f90 -a taskcommon -c f2.f90
aurora % f90 -a taskcommon f3.f90
 ldr-167 f90: CAUTION 
     The length of common block '@DATA_in_CCM_NUMZ' has been redefined as
     larger by module 'CCM_NUMZ' from file 'f1.o'.
aurora % 
aurora % 
aurora % cat f1.f90 f2.f90 f3.f90 > f4.f90
aurora % f90 -a taskcommon f4.f90
aurora %