GBIS Benchmark Header File: solver

   ===                                                            ===
   ===     GENESIS / PARKBENCH Distributed Memory Benchmarks      ===
   ===                                                            ===
   ===                           SOLVER                           ===
   ===                                                            ===
   ===                 Quark propagator generator                 ===
   ===                                                            ===
   ===                 Versions: Std F77, PARMACS, PVM 3.1        ===
   ===                                                            ===
   ===        PARKBENCH authors: Stephen Booth, Nick Stanford     ===
   ===             GENESIS mods: Ian Glendinning                  ===
   ===                                                            ===
   ===                Inquiries: HPC Centre                       ===
   ===                           Computing Services               ===
   ===                           University of Southampton        ===
   ===                           Southampton SO17 4BJ, U.K.       ===
   ===                                                            ===
   ===   Fax: +44 703 593939   E-mail:    ===
   ===                                                            ===
   ===            Last update: Jul 1994; Release: 3.0             ===
   ===                                                            ===
1. Description

SOLVER is part of an ongoing software development exercise carried out by
UKQCD (the United Kingdom Quantum Chromo-Dynamics collaboration) to develop a
new generation of simulation codes. The previous generation of codes was
highly tuned for a particular machine architecture, so a software development
exercise was started to design and develop a set of portable codes. This code
was developed by S. Booth and N. Stanford of the University of Edinburgh during
the course of 1993.  SOLVER is a benchmark code derived from the codes used to
generate quark propagators. It is designed to benchmark and validate the
computational sections of this operation. It differs from the production code
in that it initialises itself with non-trivial test data rather than performing
file access, because there is no accepted standard for parallel file
access.  The benchmark was originally developed as part of a national UK
procurement exercise.

The application generates quark propagators from a background gauge
configuration and a fermionic source. This is equivalent to solving
M psi = source, where psi is the quark propagator and M (a function
operating on psi) depends on the gauge fields.  The benchmark performs a
cut-down version of this operation.

The benchmark code initialises the gauge field to a unit gauge configuration.
(The results for a unit gauge can be calculated analytically, allowing a check
on the results.) A gauge transformation is then applied to the gauge field: a
unit gauge field consists only of zeros and ones, and applying a gauge
transformation generates non-trivial values. Quantities corresponding to
physical observables should be unchanged by such a transformation.  In the
application code the gauge field would have been read in from disk.  The
source field is initialised to a point source (a single non-zero point on one
lattice site). An iterative solver is called to generate the quark propagator.
The solver routine also generates timing information.  In the application code
this would then be dumped to disk.  In the benchmark we use the quark
propagator to generate a physically significant quantity (the pion
propagator). This generates a single real number for each timeslice of the
lattice. These values are printed to standard output.

This procedure requires a large number of iterations. For benchmarking we are
only interested in the time per iteration and some check on the validity of
the results. We therefore usually perform only a fixed number of iterations
(say 50) to generate accurate timing information, and verify the results by
comparison with other machines.

Memory as function of problem size :

The appropriate parameters for memory use are:
Max_body  (maximum number of data points per processor)
Max_bound (maximum number of data points on a single boundary between
   two processors)
Let LX, LY, LZ and LT be the local lattice sizes, obtained by dividing each
lattice dimension by the corresponding processor-grid dimension and rounding
up to the nearest integer. Then:
Max_body  = (LX*LY*LZ*LT)/2
Max_bound = MAX( LX*LY*LZ/2, LY*LZ*LT/2, LX*LZ*LT/2, LX*LY*LT/2 )

The code contains a number of build-time switches for variations
in the implementation that may be beneficial on some machines. The
memory usage depends on these switches but typical values are:
108 * Max_body + 36 * Max_bound Fpoints
16 * (Max_body + Max_bound) INTEGERS

Number of floating-point operations as function of problem size :

Each iteration performs 2760 floating-point operations per lattice site,
i.e. 50 iterations on a 24^3*48 lattice = 9.16e+10 floating-point operations.
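The operation count can be reproduced in a few lines of Python (a sketch; the
function name is illustrative):

```python
def total_flops(lattice_sites, iterations, flops_per_site=2760):
    # 2760 floating-point operations per lattice site per iteration
    return flops_per_site * lattice_sites * iterations

sites = 24**3 * 48               # a 24^3*48 lattice
print(total_flops(sites, 50))    # 91570176000, i.e. about 9.16e+10
```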

2. Operating Instructions
The problem size and number of processors are set in the file

For example, to run an 8^3*16 lattice on 4 processors use:
  C Set the problem size, these numbers MUST be even.
        PARAMETER( X_latt = 8,
       $           Y_latt = 8,
       $           Z_latt = 8,
       $           T_latt = 16)
  C Set the size of the processor grid.
  #ifndef FAKE
        PARAMETER( X_proc = 1,
       $           Y_proc = 1,
       $           Z_proc = 2,
       $           T_proc = 2)

  The preprocessor option FAKE can be used to select single node execution.
  How to do this is described in more detail below.

  The total number of processors used (not counting the front-end)
  is X_proc * Y_proc * Z_proc * T_proc.

  Any reasonable processor grid can be used, as the program will
  automatically use an irregular decomposition if a regular
  decomposition is not possible. I consider a processor grid
  where N_proc is more than half N_latt to be unreasonable.

  NB: the local lattice size in the X and T directions must be even, so
  an irregular decomposition will be used for these directions if
  these lattice dimensions are an odd multiple of the
  corresponding processor-grid width.
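The evenness rule above can be expressed as a small Python check (a sketch;
the function name is hypothetical and not part of the benchmark):

```python
from math import ceil

def needs_irregular(n_latt, n_proc):
    """True if a regular decomposition of n_latt sites over n_proc
    processors gives an odd local size, in which case the X or T
    direction would need an irregular decomposition."""
    local = ceil(n_latt / n_proc)  # regular local lattice size
    return local % 2 != 0

print(needs_irregular(8, 2))   # local size 4 is even -> False
print(needs_irregular(6, 2))   # local size 3 is odd  -> True
```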

The precision of the target machine must be set in the file precision.h.

Other compile-time switches are set in options.h and solver_options.h;
these should be left unchanged.

Compiling and Running the Benchmark:

To compile and link the benchmark for distributed execution, type:  make
The PVM version can be compiled for sequential, single-node execution by
defining the preprocessor option FAKE, as described above.

To run the benchmark type:     solver

Results are written to the file `solver.res'.  The flop/s figure that is
reported is per node.

The SOLVER program has been configured to use a fixed number of
iterations rather than to iterate to convergence. The solver routine is
run twice, and the free pion propagator is calculated after each run.
The first run is for 4 iterations and is used to verify the results.
In this case the residues and the non-zero elements of the free pion
propagator should be the same for all lattice sizes larger than 16^3*32:

 STATUS:solver:print_pion:0:Timeslice   0  0.84581123E+00
 STATUS:solver:print_pion:0:Timeslice   1  0.54563329E-01
 STATUS:solver:print_pion:0:Timeslice   2  0.83493473E-02
 STATUS:solver:print_pion:0:Timeslice   3  0.13466747E-02
 STATUS:solver:print_pion:0:Timeslice   4  0.17637214E-03
 STATUS:solver:print_pion:0:Timeslice   5  0.17347775E-04
 STATUS:solver:print_pion:0:Timeslice   6  0.13273219E-05
 STATUS:solver:print_pion:0:Timeslice   7  0.64078817E-07
 STATUS:solver:print_pion:0:Timeslice   8  0.12376473E-08
 STATUS:solver:print_pion:0:Timeslice   9  0.00000000E+00

There may be some variation due to rounding.

The second run is for an additional 50 iterations and should only be
used as a measure of performance.

$Id: ReadMe,v 1.1 1994/07/20 18:10:44 igl Exp igl $


High Performance Computing Centre

Submitted by Mark Papiani,
last updated on 10 Jan 1995.