From owner-pbwg-compactapp@CS.UTK.EDU Fri May 21 08:42:24 1993
Date: Fri, 21 May 1993 08:42:55 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9305211242.AA18681@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: Compact applications


Dear Compact Applications People,

At last I have roughed out some notes on compact applications to serve as
a basis for discussion at next week's meeting in Knoxville. See you there,

David
------------------ Latex file below --------------------------------
%file: compac2.tex
\chapter{Compact Applications}
\footnote{assembled by David Walker for Compact Applications subcommittee}

\section{Introduction}
\label{sec:compact.intro}
While kernel applications, such as those described in Chapter 4, provide
a fairly straightforward way of assessing the performance of parallel
systems, they are not representative of scientific applications in general,
since they do not reflect certain types of system behavior. In particular,
many scientific applications involve data movement between phases of
an application, and may also require significant amounts of I/O. These types
of behavior are difficult to gauge using kernel applications.

One factor
that has hindered the use of full application codes for benchmarking parallel
computers in the past is that such codes are difficult to parallelize and to
port between target architectures. In addition, full application codes that
have been successfully parallelized are often proprietary, and/or subject
to distribution restrictions. To minimize the negative impact of these factors
we propose to make use of compact applications in our benchmarking effort.

Compact applications are typical of those found in research environments 
(as opposed to production or engineering environments), and usually consist of 
up to a few thousand lines of source code. Compact applications are distinct 
from kernel applications since they are capable of producing scientifically
useful results. In many cases, compact applications are made up of several
kernels, interspersed with data movements and I/O operations between the 
kernels.

In this chapter we will discuss a number of compact applications in terms of 
their purpose, the algorithms used, the types of data movements required, 
the memory requirements, and
the amount of I/O. The compact applications below are not meant to form a
definitive or complete list.

\section{Proposed Compact Application Benchmarks}
\label{sec:compact.proposed}
To ensure that those areas of scientific computing that make the most use of
high performance computers are adequately represented in the benchmark
suite we shall classify compact applications by scientific field.

\subsection{Plasma Physics}
\label{subsec:plasmas}
Plasma physics is a large consumer of high performance computer cycles. Among
the areas studied are the design of tokamaks, high power microwave devices, and 
astrophysical plasmas. It would be nice to have a compact application from 
each of these three fields in the benchmark suite. Currently we have Hockney's
device simulation, LPM1, from the GENESIS suite.

\subsubsection{Electronic Device Simulation with LPM1}
\label{subsubsec:lpm1}
LPM1 is a time-dependent simulation of an electronic device
using a particle-mesh or PIC-type algorithm. It uses a two-dimensional
$(r,z)$ geometry with the fields being computed on a regular mesh
of size $33\times 75\cdot\alpha$, where $\alpha$ is a size parameter that can
take the values 1, 2, 4, or 8, corresponding to runs with between about 700 and
6000 particles.

\subsection{Quantum Chromodynamics}
\label{subsubsec:qcd}
Quantum Chromodynamics (QCD) is the gauge theory of the strong
interaction, which binds quarks and gluons into hadrons, the
constituents of nuclear matter. Analytical perturbation methods can be applied
to QCD only at high energies; hence computer simulations are necessary to study
QCD at lower, more realistic, energies. In these lattice gauge theory
simulations the quantum field is discretized onto a periodic, four-dimensional,
space-time lattice. Quarks are located at the lattice sites, and the gluons
that bind them are associated with the lattice links. The gluons are
represented by SU(3) matrices, which are a particular type of $3\!\times\! 3$
complex matrix. A major component of the QCD code involves updating these
matrices.
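
For reference, the defining conditions on a link matrix can be written out
explicitly. This is the standard definition of the group SU(3) and is not
specific to any of the codes discussed here; the symbol $U_{\ell}$ for the
matrix on link $\ell$ is notation introduced only for this illustration.
\begin{equation}
U_{\ell}^{\dagger} U_{\ell} = I, \qquad \det U_{\ell} = 1, \qquad
U_{\ell} \in {\bf C}^{3\times 3}.
\end{equation}
A general $3\!\times\! 3$ complex matrix has 18 real entries, but these
constraints leave only 8 independent real parameters per link.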

\subsubsection{Quenched QCD}
\label{subsubsec:quenched}
The QCD code in the Perfect benchmark suite is derived from the work of
Fox, Flower, Otto, and Stolorz at Caltech. The Perfect QCD code uses the 
Cabibbo-Marinari pseudo heat bath algorithm to update the SU(3) matrices on
the lattice links. This algorithm uses a Monte Carlo technique to generate a 
chain of configurations which are distributed with a probability proportional
to $\exp{(-S(U))}$, where $S(U)$ is the action of the configuration $U$.
If the only contributions to the action come from the gauge field then
the action is local. The inclusion of dynamical fermions gives rise to a
nonlocal action. This code ignores the effects of dynamical fermions, and so
represents a pure-gauge model in the quenched approximation.
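
The locality of the pure-gauge action can be made concrete with an example.
The text above does not say which action the code uses; the sketch below
assumes the commonly used Wilson plaquette action, and the symbols $\beta$,
$p$, and $U_p$ are introduced here for illustration only.
\begin{equation}
S(U) = \beta \sum_{p} \left( 1 - \frac{1}{3}\, \mbox{Re}\, \mbox{Tr}\, U_p \right),
\end{equation}
where the sum runs over all elementary plaquettes $p$, $U_p$ is the ordered
product of the four link matrices around $p$, and $\beta$ is the coupling
parameter. On a four-dimensional lattice each link belongs to only six
plaquettes, so under this assumption the change in $S(U)$ when one link is
updated depends only on neighboring links.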

A major component of this QCD code is the updating of the SU(3) matrices
associated with each link in the lattice, and it is this operation which
is benchmarked in the Perfect timings. Two basic operations are involved in
updating the lattice. The first is the multiplication of SU(3) matrices,
and the second is the generation of pseudo-random numbers.
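
As a rough guide to the cost of the first operation, the arithmetic in one
matrix product can be counted directly. The count below assumes a
straightforward product of two full $3\!\times\! 3$ complex matrices, with no
use of the SU(3) structure to save work, and with one complex multiplication
costed as 4 real multiplications plus 2 real additions; an actual
implementation may differ.
\begin{eqnarray*}
\mbox{complex multiplications} &=& 3 \times 3 \times 3 = 27, \\
\mbox{complex additions} &=& 3 \times 3 \times 2 = 18, \\
\mbox{real operations} &=& 27 \times 6 + 18 \times 2 = 198.
\end{eqnarray*}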

\subsubsection{Genesis QCD}
\label{subsubsection:dynamical}
Is the Genesis benchmark QCD1 similar to the Caltech QCD code? Which one
should be used?

\subsection{General Relativity}
\label{subsec:gr}
\subsubsection{Evolution of Gravitational Field}
The Genesis code GR1 solves a system of hyperbolic PDEs, derived from general
relativity, which describe the evolution of a gravitational field from an
initial state. Although conceptually similar to the solution of the wave
equation, the equations are long and complicated. This application solves the
axisymmetric problem to reduce the problem to a manageable size. Solution of
the general problem requires three orders of magnitude more compute power,
and is likely to become of substantial interest as more powerful parallel
machines are developed.

\subsubsection{Quantum Theory of Gravity}
\label{subsec:gravity}
This code, which derives from the work of Sorkin and Daughton of
Syracuse University, is part of an effort to provide a
satisfactory quantum theory of gravity by the use of causal set
theory$\ldots$whatever that is. The main computational task is the LU
factorization of large, dense matrices ($10000\times 10000$).

\subsection{Climate and Weather Prediction}
\label{subsec:climate}
Mesoscale weather prediction and global climate modeling have become
important application areas in recent years. They typically involve the
solution of nonlinear PDEs.

\subsubsection{Spectral Solver for the Shallow Water Equations}
\label{subsubsec:swe}
The spectral transform method 
is the standard numerical technique
used to solve partial differential equations on the sphere in
global climate modeling. For example, it is used in CCM1 
(the Community Climate Model 1), and its successor CCM2.
The solution of the shallow water equations on a sphere constitutes an 
important component in such global climate models.
The SSWMSB code uses the spectral transform method to solve the shallow water
equations on the surface of a sphere which is discretized as a regular
longitude-latitude grid. In each timestep the state variables of 
the problem are transformed
between the physical domain, where most of the physical forces are calculated,
and the spectral domain, where the terms of the differential equation
are evaluated. This transformation involves first the evaluation of FFTs along
lines of constant latitude, followed by Legendre integration (i.e., weighted
summation) over latitude.
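
The two stages of the forward transform can be written compactly as follows.
This is a sketch in generic notation that does not appear in the text above:
$\xi$ is a state variable, $\lambda_k$ ($k=0,\ldots,K-1$) are the equally
spaced grid longitudes, $\mu_j=\sin\theta_j$ ($j=1,\ldots,J$) are the sines of
the grid latitudes with quadrature weights $w_j$, and $P_n^m$ are the
normalized associated Legendre functions. The details in SSWMSB may differ.
\begin{eqnarray*}
\xi^m(\mu_j) &=& \frac{1}{K} \sum_{k=0}^{K-1}
                 \xi(\lambda_k,\mu_j)\, e^{-i m \lambda_k}, \\
\xi_n^m &=& \sum_{j=1}^{J} w_j\, \xi^m(\mu_j)\, P_n^m(\mu_j).
\end{eqnarray*}
The first line is a discrete Fourier transform over the longitude points on
each latitude circle, which is computed with an FFT; the second is a weighted
sum over the latitude points for each wavenumber $m$.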

\subsubsection{Helmholtz Solvers for Meteorological Modeling}
\label{subsubsec:helmholtz}
The Genesis suite includes two meteorological applications based on 
Helmholtz solvers. One uses a pseudo-spectral solution method, and the other
a multigrid algorithm.

\subsection{Molecular Dynamics}
\label{subsec:moldyn}

\subsubsection{Dislocation Studies in Crystals}
\label{subsubsec:dislocation}
A parallel Fortran 77 plus message passing code has been developed at ORNL to
study dislocation phenomena in crystals. This three-dimensional code divides
space into cells, with each processor being assigned a rectangular block of
cells. Each cell contains a set of particles. Communication is necessary to
exchange particles lying in cells on the boundary of a processor with a
neighboring processor. Particles must also be migrated between processors
as they move in space.

\subsubsection{The Genesis Molecular Dynamics Code}
\label{subsubsec:genesis_md}
I don't know much about this, but I expect it's similar to the ORNL code.

\subsubsection{The PERFECT Molecular Dynamics Code}
\label{subsubsec:perfect_md}
The Perfect benchmark suite included two molecular dynamics codes, both of
which use data sets that are too small to be used to evaluate current
parallel computers. BDNA, which simulates the hydration structure of potassium
counterions and water in a B-DNA molecule, involves 1500 water molecules and
20 counterions. MDG performs a molecular dynamics calculation on 343 water
molecules in the liquid state.

\subsection{Geophysics}
Two important geophysics computations are flow through porous media and
seismic migration. The Perfect suite includes a seismic migration code,
MG3D. This code is dominated by FFTs. A parallel code for modeling groundwater
flow is under development at ORNL and may be a good code to include in the
suite as an example of a flow through porous media code.

\subsection{Other Codes}
Clearly we would want to include CFD codes, astrophysics codes such as the
tree-based simulations of gravitating systems, quantum chemistry and
superconductor simulations. We also need to include codes from the NAS, NPAC,
PERFECT2, and SLALOM benchmark suites, as well as providing better 
descriptions of the codes above.

\section{Concluding Remarks}
There are probably two or three dozen compact applications that
we might consider for inclusion in the benchmark suite. We should consider
what is a reasonable number of codes to include, and the criteria for
accepting a code in terms of documentation, usefulness, and software quality.


From owner-pbwg-compactapp@CS.UTK.EDU Fri May 21 09:06:07 1993
Message-Id: <9305211306.AA01842@berry.cs.utk.edu>
To: walker@rios2.epm.ornl.gov (David Walker)
Cc: pbwg-compactapp@cs.utk.edu
Subject: Re: Compact applications 
In-Reply-To: Your message of "Fri, 21 May 1993 08:42:55 EDT."
             <9305211242.AA18681@rios2.epm.ornl.gov> 
Date: Fri, 21 May 1993 09:06:39 -0400
From: "Michael W. Berry" <berry@cs.utk.edu>

Fellow Compact Applic. Members: Here is a copy of the minutes
from the SPEC/Perfect meeting I attended in Huntsville.  Some
of this information may be useful to PBWG.

Mike B.
---------------------------------------------------------------

                        Draft Minutes: The SPEC Perfect Group
                                   11-13 May 1993

          The Perfect Club Steering Committee voted to merge with the  SPEC
          organization.  The first joint meeting with SPEC occurred  during
          11-13 May 1993.  The original SPEC organization has been modified
          so that  the name  "SPEC" refers  to the  non-profit  corporation
          which acts as  a financial umbrella  for benchmarking  subgroups.
          The original SPEC  group is now  known as the  SPEC Open  Systems
          Group.  The Perfect Club is now known as the SPEC Perfect Group.

          In accordance with the  vote taken by  David Schneider in  April,
          the initial  SPEC Perfect  Steering Committee  includes  Margaret
          Simmons (LANL), George Cybenko (Dartmouth), David Schneider (CSRD),
          John Larson (CSRD),  Mike Berry (U.of  Tenn), Satish Rege  (DEC),
          Joanne Martin (IBM), and Philip Tannenbaum (HNSX).  This  meeting
          was attended by David Schneider  (CSRD), Mike Berry (U.of  Tenn),
          Satish Rege  (DEC),  Philip  Tannenbaum  (HNSX),  Leo  Boelhouwer
          (IBM-Kingston,  representing   Joanne   Martin),   Jacob   Thomas
          (IBM-Austin), Larry Gray  (Chairman, SPEC BOD),  and Rod  Skinner
          (Treasurer, SPEC).   Hwa Lai (Fujitsu)  attended as an  observer.
          Various SPEC Open  Systems members  periodically sat  in.   David
          Schneider indicated  that  he  anticipated  Cray  Research  would
          rejoin because of marketing necessity.

          The meeting  began  with David  Schneider,  Larry Gray,  and  Rod
          Skinner presenting the framework for  the merger.  The SPEC  Open
          Systems Group  and  the SPEC  Perfect  Group will  be  autonomous
          subgroups within  SPEC.   SPEC  itself  will act  as  a  business
          umbrella organization.  Each Group will assess dues and  allocate
          budgets independently.   The  overhead which  SPEC Perfect  Group
          will  be  responsible  for   will  include  legal  retainer   and
          accounting fees  for  NCGA,  and additional  costs  of  printing,
          duplication,  distribution,  or  other  services  that  the  SPEC
          Perfect Group may elect  to utilize in the  future.  It was  also
          stated that the  SPEC organization was  flexible on many  issues,
          but the  underlying  requirement  was to  ensure  that  corporate
          non-profit  status  regulations  are  not  violated.    SPEC   is
          incorporated as a non-profit organization in California.

          It was  generally  agreed  by  all that  mutual  trust  would  be
          required from SPEC Open Systems  Group and SPEC Perfect Group  to
          minimize formality and unnecessary bureaucracy.

          The Perfect Group will be given one SPEC BOD seat on a  temporary
          basis until January 1994.  The  SPEC BOD currently consists of  5
          members that  include HP,  Intel, Sun,  ATT/NCR, and  IBM.   The
          Perfect Group seat will add 1 member to the BOD.  In January 1994
          this 6th BOD  seat will  be open for  voting by  the entire  SPEC
          membership (SPEC Perfect Group and  SPEC Open Systems Group).   A
          discussion about who should fill the temporary SPEC Perfect Group
          BOD seat resulted in agreement  that University people could  not
          practically take the  position because  of travel  expense.   IBM
          already was  represented  on the  SPEC  BOD, so  David  Schneider
          nominated Satish  Rege  (DEC)  and Philip  Tannenbaum  (HNSX)  as
          candidates for  the  BOD  seat.    Leo  Boelhouwer  seconded  the
          nomination  for  Philip  Tannenbaum;  Mike  Berry  seconded   the
          nomination for Satish Rege.   A vote will  be conducted by  email
          on/about 1 June 1993.  The initial  7 Steering Committee  members
          are the eligible voters.

          During June  a  press  announcement about  the  merger  would  be
          jointly written.

          There was discussion about  inclusion of academic and  government
          members.   As  a  result of  SPEC  non-profit  requirements,  all
          members must be  either full members  ($5,000/year) or  associate
          members ($1,000/year).    It was  agreed  that few  academics  or
          government members could  acquire funding for  membership.   SPEC
          Perfect Group  Steering  Committee  could elect  to  sponsor  the
          memberships of  selected  individuals;  and  certain  individuals
          could  be  included  by  creation  of  "SPEC  Fellows"  or  "SPEC
          Affiliates" whereby  specific services  could  be paid  for  with
          membership.     Seeking  industrial   sponsorship  for   academic
          participation was  discussed as  desirable.   Each  member  will
          initiate a "check is  in the mail"  process for their  membership
          fees.   Diane  Dean,  NCGA,  2722  Merrilee  Drive,  Fairfax,  VA
          22301-4499 (703-698-9600  x318) is  our contact  in this  regard.
          SPEC Open  Systems  Group  members  received  6  free  pages  for
          SPEC/OSG reporting  in the  publications; additional  pages  were
          billed at $500  each--it was  noted that DEC  purchased 60  extra
          pages in the last publication to kick off a new product line.

          The SPEC Perfect group organization was discussed.  It was agreed
          that the SPEC Perfect Group should have a Chairman, a  Secretary,
          and a Technical Coordinator.   The Chairman would be  responsible
          for interfacing  with  SPEC  and the  SPEC  Open  Systems  Group,
          organizing meetings, and general management.  The Secretary would
          be   responsible    for   generating    minutes   and    handling
          correspondence.  The Technical  Coordinator would be  responsible
          for benchmarking status,  benchmark production and  distribution,
          coordinating the benchmark subgroups,  and being the focal  point
          for technical issues.  Each benchmark subgroup would have its own
          leadership.

          Temporary assignments were accepted to fill these positions until
          the next SPEC Perfect Group  meeting, targeted for August at  ATT
          (Chicago).    Satish  Rege  is  the  temporary  Chairman,  Philip
          Tannenbaum  the  temporary  Secretary,  and  Leo  Boelhouwer  the
          temporary Technical Coordinator.   Specific action items for  the
          period include:

             Completing the benchmark codes
             Generating verification tests and timing instrumentation
             Publishing minutes
             Writing a  solicitation for  vendors  and industry  to  attract
             membership or sponsorship support

          A discussion about the benchmark rules and reporting resulted  in
          general  agreement  that  there  would  be  baseline  ("As   Is")
          executions which  allowed only  the minimal  changes required  to
          obtain correct  results.   There would  also be  an optimized  or
          alternative solution execution which would allow unlimited use of
          standard vendor libraries and unlimited rewriting in a high level
          language.  

          It was agreed  that the benchmark  programs would be  distributed
          via netlib  or  anonymous ftp.    Text  would be  added  to  each
          benchmark program  requiring that  any use  of benchmark  results
          from the program, which are  not formally accepted and  published
          by SPEC  Perfect  Group,  must   state  "these  results  are  not
          officially approved  and  reported  by  the  SPEC  Perfect  Group
          Steering Committee.    They may  not  be directly  comparable  to
          accepted and verified results."
          Only actual execution results would be permitted.  All executions
          must be  on  hardware  and  software  systems  that  are  current
          products or  which  will be  generally  available in  the  market
          within 6 months.  

          There was  a  spirited debate  on  the  metrics to  be  used  for
          reporting results.  Discussion about  the pros and cons of  using
          normalized  ratings,  MFLOPS,  wall  clock  times,  and  absolute
          numbers took  place. The  discussion  resulted in  the  benchmark
          publications including 1) elapsed wall clock time, 2) startup time,
          3) time step timing, 4) cleanup time, 5) total user cpu time
          accumulated, and 6) total system cpu time accumulated per program.
          No MFLOPS rate will be reported.  This was agreed to be the  most
          scientifically  sound  approach  that  would  be  meaningful  and
          unambiguous.

          All execution results presented for approval and publication must
          include  sufficient   detail  of   the  hardware   and   software
          configuration such that the  run could be essentially  duplicated
          with comparable  timings.   Acceptable  results will  have  valid
          answers and meet  SPEC Perfect Group  standards for code  changes
          and execution requirements.   Optimized and alternative  solution
          results must include the entire  program code as executed, and  a
          statement that the code  may be used,  without restriction, as  a
          SPEC Perfect Group baseline benchmark  code.  All vendor  library
          codes  used   must  include   copies  of   the  relevant   vendor
          documentation pages that include sufficient detail to describe the
          processes done within  the library routine.   New vendor  library
          routines   must   have    copies   of   equivalent    preliminary
          documentation.   All  library  routines used  must  be  generally
          available to all vendor customers, and must either be  documented
          products, or  become  documented  products  within  6  months  of
          benchmark submission.    Results on  prototype  or  preproduction
          systems could  be removed  from  publication if  the  benchmarked
          products were not released within the 6 month window.

          The goal  is to  provide  all codes  in  a FORTRAN77  version,  a
          FORTRAN90 version, and a message passing version.  It was  agreed
          that version control  should be  instituted so  that all  results
          would be grouped according to benchmark version.  If any one code
          in a  benchmark group  changed,  all codes  would receive  a  new
          version number.  The benchmark groups will be aligned to  address
          vertical industrial areas such as petroleum, chemistry,  finance,
          etc. 

          The codes available  for the initial  release include the  FDMOD,
          FKMIG, and  SEIS from  the ARCO  suite, QCD,  FALSE, PUEBLO,  and
          TURB3D.   The ARCO suite codes are farthest along.  All codes are
          expected  to  represent  scalable  problem  solutions  that   are
          appropriate to vector, vector parallel, and MPP architectures.  A
          goal is to  maintain the benchmark  set at a  level whereby  only
          supercomputer class  and extreme  high end  workstations/clusters
          could reasonably  execute the  problems.   There is  no  specific
          exclusion intended; this goal was stated in order to maintain the
          SPEC Perfect Group focus on  true supercomputing rather than  the
          broader high performance computing classification.  The goals may
          not all be addressed initially because of practical limitations in
          how much can be accomplished with available resources.

          Coding and  language standards  were discussed.   Proposals  were
          made.  John Larson's work in this area will be circulated.   Leo
          Boelhouwer will  edit  the  V1 execution  rules  and  present  an
          updated draft for approval during the next meeting.  

          Language standards  were  presented as  a  basis for  creating  a
          benchmark code  standard  by  David  Schneider.    They  included
          numerous items that were accepted by the group, and a few  (noted
          below) where no final conclusion was made.

               Variables could not exceed 31 characters
               No Pointers
               No DOUBLE PRECISION; REAL*8 and COMPLEX*16 should be used
               No CHARACTER-Floating Point equivalences
               No Hollerith constants or data
               No 128 bit requirements (REAL*16, COMPLEX*32)
               All 64 bit constants should be specified in D format
               All 32 bit constants should be specified in E format
               Machine constant limitations were discussed--no conclusions
                    agreed
               INTEGER*8 and LOGICAL*8 should not be used unless necessary
                    for execution
               Tests for floating point equality were discussed--no
                    conclusions agreed
               Known vector directive information will be translated to a
                    "C*PERFECT" syntax to preserve information; it will be
                    explicitly prohibited from implementing compiler
                    recognition of "C*PERFECT" information.
               DO WHILE and DO-ENDDO syntax is allowed
               "!" inlined comments were discussed--no conclusions agreed


          Additional action items were summarized:

               Distribute old by-laws for review (DS)
               Review old by-laws and offer suggestions for revision (all)
               Contact NCGA regarding our new status (DS)
               Present our proposals for membership specific issues to the
                    SPEC BOD (SR)
               Identify manpower requirements to complete V2 benchmark
                    suite (all)
               Transfer "Perfect Benchmark" trademark from U.Ill. to SPEC
                    (DS)
               Distribute Minutes (PT)
               Set up address and email lists (DS)
               Next meeting at ATT, Chicago, in August (with SPEC Open
                    Systems Group) (all)
               Schedule a benchathon to finalize all V2 initial codes (all).

---
Michael W. Berry     ___-___  o==o======   .   .   .   .   .
Ayres 114         =========== ||//         
Department of             \ \ |//__        
Computer Science          #_______/        berry@cs.utk.edu
University of Tennessee                    (615) 974-3838 [OFF]
Knoxville, TN 37996-1301                   (615) 974-4404 [FAX]
From owner-pbwg-compactapp@CS.UTK.EDU Wed May 26 17:49:18 1993
Message-Id: <9305262149.AA11808@berry.cs.utk.edu>
To: pbwg-compactapp@cs.utk.edu
Subject: We can get ARCO
Date: Wed, 26 May 1993 17:49:39 -0400
From: "Michael W. Berry" <berry@cs.utk.edu>

Here's a note I received from Mosher at ARCO - looks pretty good!
Mike

Date: Wed, 26 May 93 13:48:55 CDT
From: ccm@Arco.COM (Chuck Mosher (214)754-6468)
Message-Id: <9305261848.AA06937@Arco.COM>
To: berry@cs.utk.edu
Subject: ARCO/Perfect Seismic Benchmark


Version 1.0 of SeisPerf is due for Beta release June 1.  The
suite provides a working seismic processing executive with
examples of common industry algorithms.  Version 1.0 is built
over a simple message passing layer, which calls PVM, P4, or
native message passing services.  The applications call several
of the kernel routines mentioned in the PBWG minutes, including
3D FFTs, tri-diagonal and Toeplitz matrix solvers, convolutions,
and integral methods.  The codes are designed to be scalable
from single processor workstations to ~1000 processor MPP systems.

Verification tools include a simple X-windows frame viewer, and
a checksum table that is printed at the end of each run.  The 1.0
release is based on Fortran 77.  MasPar has provided a Fortran 90
port of the codes for their systems, which could form the basis for
an HPF version of the codes.

I'd be happy to participate in PARKBENCH and provide support for
including SeisPerf results.

Regards,
Chuck Mosher
ccm@arco.com

From owner-pbwg-compactapp@CS.UTK.EDU Thu May 27 12:54:03 1993
Message-Id: <9305271654.AA13805@berry.cs.utk.edu>
To: ccm@arco.com (Chuck Mosher (214)754-6468)
Cc: pbwg-compactapp@cs.utk.edu
Subject: Re: ARCO/Perfect Seismic Benchmark 
In-Reply-To: Your message of "Thu, 27 May 1993 06:59:31 CDT."
             <9305271159.AA15941@Arco.COM> 
Date: Thu, 27 May 1993 12:54:24 -0400
From: "Michael W. Berry" <berry@cs.utk.edu>


> An earlier release of the codes is available on the U of Illinois
> anonymous ftp server 'csrd.uiuc.edu' in the directory '/pub/perfect'.
> The file 'arco_beta.tar.Z' contains code, installation scripts,
> and documentation for an earlier f77 version for uniprocessors.
> You might want to get this file and have a look at the documentation
> and source structure.  The message-passing source is pretty close
> in structure to the f77 version.
> 
> We have a mailing list for discussion of the codes:
> 	'perfect_seismic@csrd.uiuc.edu'
> Let me know if you want to be on the list.  We'll announce the
> new codes there.
> 
> Regards,
> Chuck Mosher
 Yes, please add my email addr and pbwg-compactapp@cs.utk.edu to
the mailing list. Thanks Mike

From owner-pbwg-compactapp@CS.UTK.EDU Thu Sep 16 11:20:48 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA00187; Thu, 16 Sep 93 11:20:48 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA25374; Thu, 16 Sep 93 11:19:13 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Thu, 16 Sep 1993 11:19:10 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from sun4.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA25344; Thu, 16 Sep 93 11:19:07 -0400
Received: by sun4.epm.ornl.gov (4.1/1.34)
	id AA00634; Thu, 16 Sep 93 11:19:06 EDT
Date: Thu, 16 Sep 93 11:19:06 EDT
From: worley@sun4.epm.ornl.gov (Pat Worley)
Message-Id: <9309161519.AA00634@sun4.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: potential compact benchmark
Forwarding: Mail from 'MAILER-DAEMON (Mail Delivery Subsystem)'
      dated: Thu, 16 Sep 93 11:16:12 EDT

Ian Foster and I are just finishing version 1.0 of PSTSWM, a parallel
algorithm testbed and benchmark code developed for the climate modelling
community. It will be made available to this community via netlib, but it
may also be interesting as a PARKBENCH compact application. There are a
few difficulties with this though, and I would like some
feedback/suggestions on how to proceed.

Description
-----------
PSTSWM is a parallel implementation of a serial code (STSWM 2.0) written
by Jim Hack and Rudy Jakob at NCAR to solve the shallow water equations
on a sphere using the spectral transform method. It was originally
developed as a numerical algorithm testbed, to allow comparison of
spectral methods with finite difference methods with finite element
methods, etc., and has 6 runtime-selectable test cases in the code.
These test cases specify initial conditions, forcing, and analytic
solutions (for error analysis), and were chosen to test the ability of
the numerical methods to simulate important flow phenomena.

For PSTSWM, we completely rewrote STSWM to add vertical levels, in order
to get the correct communication and computation granularity for 3-D
climate codes, and to allow the problem size to be selected at runtime
without depending on such nonportable features as dynamic memory. 

PSTSWM is meant to be a compromise between paper benchmarks and the
usual fixed benchmarks by allowing a significant amount of
runtime-selectable algorithm tuning. Thus, the goal is to see how
quickly the numerical simulation can be run on different machines
without fixing the parallel implementation, but forcing all
implementations to execute the same numerical code (to guarantee
fairness). To enable this, PSTSWM supports:

a) 4 classes of parallel algorithms (distributed or transpose
   based for each of two major parallel phases)
b) each class has 3-4 specific parallel algorithms (e.g. using a
   recursive-halving vector sum, using a pipelined ring vector sum,
   etc.)
c) each algorithm has 2-4 variants 
d) each algorithm is built on top of two communication constructs,
   swap and sendrecv, and each of these has 5-6 different communication
   protocol options (synchronous, blocking, nonblocking, forcetypes,
   etc.)
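
To illustrate the kind of algorithm named in item (b), here is a minimal
sketch of a recursive-halving vector sum, simulated on a single machine.
It is not taken from PSTSWM: the function name recursive_halving_sum is
made up for this example, and the sketch assumes a power-of-two number
of processes and a vector length divisible by that number. At each step
a rank is paired with the rank whose number differs in one bit, keeps
half of its current segment, and adds in the partner's copy of that half.

```python
def recursive_halving_sum(vectors):
    """Simulate a recursive-halving vector sum over P = 2**k ranks.

    vectors[r] is the full-length local vector held by rank r.
    Returns one (lo, hi, segment) tuple per rank, where segment is the
    fully summed slice [lo, hi) that the rank owns at the end.
    """
    p = len(vectors)
    n = len(vectors[0])
    # Assumptions of this sketch: P is a power of two and divides n.
    assert p > 0 and p & (p - 1) == 0 and n % p == 0

    data = [list(v) for v in vectors]
    lo = [0] * p          # start of the segment each rank is still working on
    hi = [n] * p          # end (exclusive) of that segment

    dist = p // 2
    while dist >= 1:
        new_data = [row[:] for row in data]
        new_lo, new_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ dist                 # rank differing in one bit
            mid = (lo[r] + hi[r]) // 2
            if r & dist == 0:
                keep_lo, keep_hi = lo[r], mid  # keep the lower half
            else:
                keep_lo, keep_hi = mid, hi[r]  # keep the upper half
            # "Receive" the partner's copy of the kept half and add it in.
            for i in range(keep_lo, keep_hi):
                new_data[r][i] = data[r][i] + data[partner][i]
            new_lo[r], new_hi[r] = keep_lo, keep_hi
        data, lo, hi = new_data, new_lo, new_hi
        dist //= 2

    return [(lo[r], hi[r], data[r][lo[r]:hi[r]]) for r in range(p)]
```

After log2(P) steps each rank holds a distinct 1/P slice of the summed
vector, and the amount of data exchanged halves at every step. A
pipelined ring sum would instead pass partial sums around a ring in P-1
steps of equal size.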

We are quite happy with the code, and are getting good results with it.
Most interesting to us is how the best algorithm changes across
platforms and as the problem size changes on the same platform.

Problems
--------
There are a couple of issues to be dealt with in using this code as part
of PARKBENCH.

1) The code currently is in single precision with double precision
   parts. Single precision is sufficient for the problem sizes of
   interest, but the Legendre polynomial values and Gauss quadrature
   weights and nodes must be calculated in higher precision. For larger
   problem sizes, double precision computation will be appropriate, but
   the Gauss weights, etc., will then need to be calculated in quad
   precision. I do not think that this sort of mixed case has been
   discussed yet. 

2) In one sense, PSTSWM is not a single benchmark, but many of them.
   We can fix the problem and parallel algorithm specifications by
   providing (a set of) default input files, but which ones should we
   choose? All of them are arguably good algorithms in some setting, and
   I would hate to compare two machines when the algorithm is good for
   one and inappropriate for another.

3) PSTSWM is currently written using PICL (because that is what I
   normally use and because I have embedded instrumentation in the
   research version of the code). I made a real effort to isolate the
   message passing bits, so porting to anything else will be trivial.
   But the message passing interface that is used does affect the
   parallel algorithms that are supported. For example, PICL supports
   nonblocking send and receive and passes through forcetype message
   types. These are important to performance on some Intel machines.
   This is not a problem so much as something to be aware of. PSTSWM
   will also be available in its original form, but a pointer to some of
   the issues in cross-machine comparisons should be included. This may
   be an issue that should be mentioned in the methodology section as it
   pertains to compact applications. Unlike low-level benchmarks,
   compact applications are less likely to be "done right" by the vendor
   for their particular machines. 

Comments and suggestions would be appreciated. I imagine every proposed
compact application will be unsuitable in one form or another when it is
first submitted, and precise guidelines on what should or should not be
permitted are important. On the other hand, as a developer, I will not be
interested in doing too much work in modifying the code in order to
include it in the benchmark suite. Even with the best intentions, it
will not be a high priority item for me and is likely to be put off
(forever) if not fairly simple.

Thanks.

Pat Worley

From owner-pbwg-compactapp@CS.UTK.EDU Tue Sep 21 11:49:13 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA02710; Tue, 21 Sep 93 11:49:13 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA08554; Tue, 21 Sep 93 11:47:15 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Tue, 21 Sep 1993 11:47:14 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930125/2.8s-UTK)
	id AA08546; Tue, 21 Sep 93 11:47:13 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA12782; Tue, 21 Sep 1993 11:47:07 -0400
Date: Tue, 21 Sep 1993 11:47:07 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9309211547.AA12782@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: Application submission form


I'm trying to put together a submission form for people to use to submit 
applications for inclusion in the ParkBench Compact Applications suite. Also
I'd like to establish a procedure for submission. Below is a first stab at
these 2 things. Please send me feedback. Later this week I intend to send
out a filled in version of the submission form as an example.

David
                 PARKBENCH COMPACT APPLICATIONS SUBMISSION FORM

To submit a compact application to the ParkBench suite, follow this
procedure:

1. Complete the submission form below, and email it to David Walker
   at walker@msr.epm.ornl.gov. The data on this form will be reviewed 
   by the ParkBench Compact Applications Subcommittee, and you will
   be notified if the application is to be considered further for
   inclusion in the ParkBench suite.
   
2. If the ParkBench Compact Applications Subcommittee decides to consider
   your application further, you will be asked to submit the source code
   and input and output files, together with any documentation and papers
   about the application. Source code and input and output files should
   be submitted by email or ftp, unless the files are very large, in
   which case a tar file on a 1/4 inch cassette tape should be sent.
   Wherever possible, email submission is preferred for all documents, in
   man page, LaTeX, and/or PostScript format. These files, documents, and
   papers together constitute your application package. Your application
   package should be sent to:
		David Walker
                Oak Ridge National Laboratory
                Bldg. 6012/MS-6367
                P. O. Box 2008
                Oak Ridge, TN 37831-6367
                (615) 574-7401/0680 (phone/fax)
                walker@msr.epm.ornl.gov

   The street address is "Bethel Valley Road" if Fedex insists on this.
   The subcommittee will then make a final decision on whether to include 
   your application in the ParkBench suite.

3. If your application is approved for inclusion in the ParkBench suite
   you (or some authorized person from your organization) will be asked
   to complete and sign a form giving ParkBench authority to distribute,
   and modify (if necessary), your application package.

-------------------------------------------------------------------------------
Name of Program         :
-------------------------------------------------------------------------------
Submitter's Name        :
Submitter's Organization:
Submitter's Address     :


Submitter's Telephone # :
Submitter's Fax #       :
Submitter's Email       :
-------------------------------------------------------------------------------
Cognizant Expert(s)     :
CE's Organization       :
CE's Address            :



CE's Telephone #        :
CE's Fax #              :
CE's Email              :
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :


-------------------------------------------------------------------------------
Major Application Field :
Application Subfield(s) :
-------------------------------------------------------------------------------
Application "pedigree"  :




-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :


-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

        Integers :     bytes
	Floats   :     bytes

-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :



-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :



-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :



-------------------------------------------------------------------------------
Other relevant research papers:



-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :




-------------------------------------------------------------------------------
Total number of lines in source code:
Number of lines excluding comments  :
Size in bytes of source code        :
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :



-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if formatted) :



-------------------------------------------------------------------------------
Brief, high-level description of what application does:




-------------------------------------------------------------------------------
Main algorithms used:



-------------------------------------------------------------------------------
Skeleton sketch of application:




-------------------------------------------------------------------------------
Brief description of I/O behavior:




-------------------------------------------------------------------------------
Brief description of load balance behavior :




-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :



-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :




-------------------------------------------------------------------------------
Give parameters that determine the problem size :



-------------------------------------------------------------------------------
Give memory as function of problem size :


-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :


-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :




-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files,
memory required, and number of floating point operations) :






-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :



-------------------------------------------------------------------------------
From owner-pbwg-compactapp@CS.UTK.EDU Tue Oct  5 15:29:11 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA06534; Tue, 5 Oct 93 15:29:11 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA00420; Tue, 5 Oct 93 15:28:34 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Tue, 5 Oct 1993 15:28:29 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA00402; Tue, 5 Oct 93 15:28:23 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA20677; Tue, 5 Oct 1993 15:28:21 -0400
Message-Id: <9310051928.AA20677@rios2.epm.ornl.gov>
To: spb@epcc.edinburgh.ac.uk, mia@unixa.nerc-bidston.ac.uk,
        pbwg-compactapp@cs.utk.edu
Subject: Submission form for ParkBench compact applications
Date: Tue, 05 Oct 93 15:28:20 -0500
From: David W. Walker <walker@rios2.epm.ornl.gov>


Below is an example (prepared by Pat Worley of Oak Ridge National Lab) of
the use of the ParkBench Compact Applications submission form. This form (or
something like it) is intended to be used by all persons wishing to submit 
an application to be included in the suite. The first page or so explains
the submission procedure. Pat has been very thorough in filling out the form.
I don't think it is practical to expect every submission to be this detailed.

If you have applications that you would like to submit, please go ahead and
fill in the form. Also, any comments on the form would be appreciated. I hope
to give the form wider distribution in a couple of weeks so we can (I hope)
get a good number of submissions before the SC93 ParkBench meeting.

David

                 PARKBENCH COMPACT APPLICATIONS SUBMISSION FORM

To submit a compact application to the ParkBench suite, follow this
procedure:

1. Complete the submission form below, and email it to David Walker
   at walker@msr.epm.ornl.gov. The data on this form will be reviewed 
   by the ParkBench Compact Applications Subcommittee, and you will
   be notified if the application is to be considered further for
   inclusion in the ParkBench suite.
   
2. If the ParkBench Compact Applications Subcommittee decides to consider
   your application further, you will be asked to submit the source code
   and input and output files, together with any documentation and papers
   about the application. Source code and input and output files should
   be submitted by email or ftp, unless the files are very large, in
   which case a tar file on a 1/4 inch cassette tape should be sent.
   Wherever possible, email submission is preferred for all documents, in
   man page, LaTeX, and/or PostScript format. These files, documents, and
   papers together constitute your application package. Your application
   package should be sent to:
                David Walker
                Oak Ridge National Laboratory
                Bldg. 6012/MS-6367
                P. O. Box 2008
                Oak Ridge, TN 37831-6367
                (615) 574-7401/0680 (phone/fax)
                walker@msr.epm.ornl.gov

   The street address is "Bethel Valley Road" if Fedex insists on this.
   The subcommittee will then make a final decision on whether to include 
   your application in the ParkBench suite.

3. If your application is approved for inclusion in the ParkBench suite
   you (or some authorized person from your organization) will be asked
   to complete and sign a form giving ParkBench authority to distribute,
   and modify (if necessary), your application package.

-------------------------------------------------------------------------------
Name of Program         : PSTSWM 
                        : (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worley@msr.epm.ornl.gov
-------------------------------------------------------------------------------
Cognizant Expert(s)     : Patrick H. Worley
CE's Organization       : Oak Ridge National Laboratory
CE's Address            : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
CE's Telephone #        : (615) 574-3128
CE's Fax #              : (615) 574-0680
CE's Email              : worley@msr.epm.ornl.gov

Cognizant Expert(s)     : Ian T. Foster
CE's Organization       : Argonne National Laboratory
CE's Address            : MCS 221/D-235
                          9700 S. Cass Avenue
                          Argonne, IL 60439
CE's Telephone #        : (708) 252-4619
CE's Fax #              : (708) 252-5986
CE's Email              : itf@mcs.anl.gov
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :

Modulo other commitments, Worley is prepared to respond quickly to questions
and bug reports, but expects to be kept informed as to results of experiments
and modifications to the code.

-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree" (origin, history, authors, major mods) :

PSTSWM Version 1.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code follows
closely how CCM2, the NCAR Community Climate Model, handles the dynamical
part of the primitive equations, and the parallel algorithms implemented in
the model include those currently used in the message-passing parallel
implementation of CCM2. PSTSWM was written by Patrick Worley of Oak Ridge
National Laboratory and Ian Foster of Argonne National Laboratory, and is
based partly on previous parallel algorithm research by John Drake, David
Walker, and Patrick Worley of Oak Ridge National Laboratory. Both the code
development and parallel algorithms research were funded by the DOE Computer
Hardware, Advanced Mathematics, and Model Physics (CHAMMP) program. The
features of version 1.0 were frozen on 8/1/93, and it is this version we
would offer initially as a benchmark.  

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written
by James Hack and Ruediger Jakob at NCAR to solve the shallow water equations 
on a sphere using the spectral transform method. STSWM evolved from a
spectral shallow water model written by Hack (NCAR/CGD) to compare numerical
schemes designed to solve the divergent barotropic equations in spherical
geometry. STSWM was written partially to provide the reference solutions
to the test cases proposed by Williamson et al. (see citation [4] below),
which were chosen to test the ability of numerical methods to simulate
important flow phenomena. These test cases are embedded in the code and 
are selectable at run-time via input parameters, specifying initial conditions,
forcing, and analytic solutions (for error analysis). The solutions are also
published in a Technical Note by Jakob et al. [3]. In addition, this code is
meant to serve as an educational tool for numerical studies of the shallow
water equations. A detailed description of the spectral transform method, and
a derivation of the equations used in this software, can be found in the
Technical Note by Hack and Jakob [2].  

For PSTSWM, we rewrote STSWM to add vertical levels (in order to get the
correct communication and computation granularity for 3-D weather and climate
codes), to increase modularity and support code reuse, and to allow the
problem size to be selected at runtime without depending on dynamic memory
allocation. PSTSWM is meant to be a compromise between paper benchmarks and
the usual fixed benchmarks by allowing a significant amount of
runtime-selectable algorithm tuning. Thus, the goal is to see how quickly the
numerical simulation can be run on different machines without fixing the
parallel implementation, but forcing all implementations to execute the same
numerical code (to guarantee fairness). The code has also been written in
such a way that linking in optimized library functions for common operations
instead of the "portable" code will be simple.

-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and
Foster) and the program that supported the development of the code
(DOE CHAMMP program) in any resulting research or publications, and are
encouraged to send reprints of their work with this code to the authors.
Also, the authors would appreciate being notified of any modifications to 
the code. Finally, the code has been written to allow easy reuse of code in
other applications, and for educational purposes. The authors encourage this,
but also request that they be notified when pieces of the code are used.

-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses INTEGER, REAL, COMPLEX, and DOUBLE PRECISION
variables. The code should work correctly for any system in which COMPLEX is
represented as 2 REALs. The include file params.i has parameters that can be
used to specify the length of these. Also, some REAL and DOUBLE PRECISION
parameter values may need to be modified for floating point number systems with large
mantissas, e.g., PI, TWOPI. PSTSWM is currently being used on systems where

        Integers : 4   bytes
	Floats   : 4   bytes

The use of two precisions can be eliminated, but at the cost of a significant
loss of precision. (For 4-byte REALs, not using DOUBLE PRECISION increases
the error by approximately three orders of magnitude.) DOUBLE PRECISION
results are only used in set-up (computing Gauss weights and nodes and
Legendre polynomial values), and are not used in the body of the computation.
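
The set-up pattern described above (compute quadrature data in the
higher precision, then use it in the lower one) can be sketched as
follows. This is an illustrative sketch only, not code from PSTSWM: the
function names gauss_legendre and to_single are made up for the example,
Python floats stand in for DOUBLE PRECISION, and rounding through a
4-byte IEEE float stands in for REAL.

```python
import math
import struct


def gauss_legendre(n):
    """Return (nodes, weights) of the n-point Gauss-Legendre rule on [-1, 1],
    computed in Python floats (IEEE double precision)."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard initial guess for the i-th root of the Legendre polynomial P_n.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            # Three-term recurrence: after the loop, p1 = P_n(x), p0 = P_{n-1}(x).
            p0, p1 = 1.0, x
            for k in range(2, n + 1):
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = n * (x * p1 - p0) / (x * x - 1.0)  # derivative P_n'(x)
            dx = p1 / dp
            x -= dx                                 # Newton step
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def to_single(x):
    """Round a double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Set-up in double precision, then round once for use in single precision.
nodes, weights = gauss_legendre(8)
nodes_single = [to_single(x) for x in nodes]
weights_single = [to_single(w) for w in weights]
```

Rounding once at the end keeps the stored weights and nodes accurate to
the last single-precision digit, which is the point of doing the set-up
in the higher precision.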

-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The sequential code is documented in a file included in the distribution of the
code from NCAR:

Jakob, Ruediger, Description of Software for the Spectral Transform Shallow
Water Model Version 2.0. National Center for Atmospheric Research,
Boulder, CO 80307-3000, August 1992

and in 

Hack, J.J. and R. Jakob, Description of a global shallow water model based on
the spectral transform method, NCAR Technical Note TN-343+STR, January 1992. 

Documentation of the parallel code is in preparation, but extensive
documentation is present in the code.

-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

1) Browning, G.L., J.J. Hack and P.N. Swarztrauber, A comparison of
   three numerical methods for solving differential equations on
   the sphere, Monthly Weather Review, 117:1058-1075, 1989.

2) Hack, J.J. and R. Jakob, Description of a global
   shallow water model based on the spectral transform method,
   NCAR Technical Note TN-343+STR, January 1992.

3) Jakob, R., J.J. Hack and D.L. Williamson, Reference solutions to
   shallow water test set using the spectral transform method,
   NCAR Technical Note TN-388+STR (in preparation).

4) Williamson, D.L., J.B. Drake, J.J. Hack, R. Jakob and P.N. Swarztrauber,
   A standard test set for numerical approximations to the shallow
   water equations in spherical geometry, Journal of Computational Physics,
   Vol. 102, pp. 211-224, 1992.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

5) Worley, P. H. and J. B. Drake, Parallelizing the Spectral Transform Method,
   Concurrency: Practice and Experience, Vol. 4, No. 4 (June 1992), 
   pp. 269-291.

6) Walker, D. W., P. H. Worley, and J. B. Drake, Parallelizing the Spectral
   Transform Method. Part II, 
   Concurrency: Practice and Experience, Vol. 4, No. 7 (October 1992), 
   pp. 509-531.

7) Foster, I. T. and P. H. Worley,
   Parallelizing the Spectral Transform Method: A Comparison of Alternative
   Parallel Algorithms,
   Proceedings of the Sixth SIAM Conference on Parallel Processing for
   Scientific Computing (March 22-24, 1993), pp. 100-107.

8) Foster, I. T. and P. H. Worley,
   Parallel Algorithms for the Spectral Transform Method,
   (in preparation)

9) Worley, P. H. and I. T. Foster,
   PSTSWM: A Parallel Algorithm Testbed and Benchmark.
   (in preparation)

-------------------------------------------------------------------------------
Other relevant research papers:

10) I. Foster, W. Gropp, and R. Stevens, 
    The parallel scalability of the spectral transform method, 
    Mon. Wea. Rev., 120(5), 1992, pp. 835--850. 

11) Drake, J. B., R. E. Flanery, I. T. Foster, J. J. Hack, J. G. Michalakes,
    R. L. Stevens, D. W. Walker, D. L. Williamson, and P. H. Worley,
    The Message-Passing Version of the Parallel Community Climate Model,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992)
    Hoffman, G.-R and T. Kauranne, ed., 
    World Scientific Publishing Co. Pte. Ltd, Singapore, 1993, 
    pp. 500-513.

12) Sato, R. K. and R. D. Loft,
    Implementation of the NCAR CCM2 on the Connection Machine,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992)
    Hoffman, G.-R and T. Kauranne, ed., 
    World Scientific Publishing Co. Pte. Ltd, Singapore, 1993, 
    pp. 371-393.

13) Barros, S. R. M. and Kauranne, T.,
    On the Parallelization of Global Spectral Eulerian Shallow-Water Models,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992)
    Hoffman, G.-R and T. Kauranne, ed., 
    World Scientific Publishing Co. Pte. Ltd, Singapore, 1993, 
    pp. 36-43.

14) Kauranne, T. and S. R. M. Barros,
    Scalability Estimates of Parallel Spectral Atmospheric Models,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992)
    Hoffman, G.-R and T. Kauranne, ed., 
    World Scientific Publishing Co. Pte. Ltd, Singapore, 1993, 
    pp. 312-328.

15) Pelz, R. B. and W. F. Stern,
    A Balanced Parallel Algorithm for Parallel Processing,
    Proceedings of the Sixth SIAM Conference on Parallel Processing for
    Scientific Computing (March 22-24, 1993), pp. 126-128.

-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

The model code is primarily written in Fortran 77, but also uses
DO ... ENDDO and DO WHILE ... ENDDO, and the INCLUDE extension (to pull in
common and parameter declarations). It has been compiled and run on the Intel
iPSC/2, iPSC/860, Delta, and Paragon, the IBM SP1, and on Sun Sparcstation,
IBM RS/6000, and Stardent 3000/1500 workstations (as a sequential code).

Message passing is implemented using the PICL message passing system.
All message passing is encapsulated in 3 high-level routines:

BCAST0 (broadcast)
GMIN0  (global minimum)
GMAX0  (global maximum)

two classes of low level routines:
 SWAP, SWAP_SEND, SWAP_RECV, SWAP_RECVBEGIN, SWAP_RECVEND, SWAP1, SWAP2, SWAP3
 (variants and/or pieces of the swap operation)
and
 SENDRECV, SRBEGIN, SREND, SR1, SR2, SR3
 (variants and/or pieces of the send/recv operation)

and one synchronization primitive:
CLOCKSYNC0

PICL instrumentation commands are also embedded in the code.

Porting the code to another message passing library will be simple, although
some of the runtime communication options may become illegal then.
The PICL instrumentation calls can be stubbed out (or removed) without
changing the functionality of the code, but some sort of synchronization is
needed when timing short benchmark runs.

-------------------------------------------------------------------------------
Total number of lines in source code: 28,204
Number of lines excluding comments  : 12,434
Size in bytes of source code        : 994,299
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

problem:   23 lines, 559 bytes, ascii
algorithm: 33 lines, 874 bytes, ascii

-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if formatted) :

standard output: Number of lines and bytes is a function of the input
                 specifications, but for benchmarking would normally be
                 63 lines (2000 bytes) of meaningful output. (On the Intel
                 machine, FORTRAN STOP messages are sent from each processor
                 at the end of the run, increasing this number.)

timings:         Each run produces one line of output, containing approx.
                 150 bytes.

Both files are ascii.


-------------------------------------------------------------------------------
Brief, high-level description of what application does:

(P)STSWM solves the nonlinear shallow water equations on the sphere.
The nonlinear shallow water equations constitute a simplified
atmospheric-like fluid prediction model that exhibits many of the features of
more complete models, and that has been used to investigate numerical
methods and benchmark a number of machines.
Each run of PSTSWM uses one of 6 embedded initial conditions and forcing
functions. These cases were chosen to stress test numerical methods for this
problem, and to represent important flows that develop in atmospheric
modeling. STSWM also supports reading in arbitrary initial conditions, but
this was removed from the parallel code to simplify the development of the
initial implementation. 

-------------------------------------------------------------------------------
Main algorithms used:

PSTSWM uses the spectral transform method to solve the shallow water
equations. During each timestep, the state variables of the
problem are transformed between the physical domain, where most of the
physical forces are calculated, and the spectral domain, where the terms of
the differential equation are evaluated. The physical domain is a tensor
product longitude-latitude grid. The spectral domain is the set of spectral
coefficients in a spherical harmonic expansion of the state variables, and
is normally characterized as a triangular array (using a "triangular"
truncation of spectral coefficients). 

Transforming from physical coordinates to spectral coordinates involves
performing a real FFT for each line of constant latitude, followed by 
integration over latitude using Gaussian quadrature (approximating the
Legendre transform) to obtain the spectral coefficients. The inverse
transformation involves evaluating sums of spectral harmonics and inverse
real FFTs, analogous to the forward transform.

Parallel algorithms are used to compute the FFTs and to compute the 
vector sums used to approximate the forward and inverse Legendre transforms.
Two major alternatives are available for both transforms: distributed
algorithms, which use a fixed data decomposition and compute results where they
are assigned, and transpose algorithms, which remap the domains to allow the
transforms to be calculated sequentially. This translates to four major
parallel algorithms:

a) distributed FFT/distributed Legendre transform (LT)
b) transpose FFT/distributed LT
c) distributed FFT/transpose LT
d) transpose FFT/transpose LT

Multiple implementations are supported for each type of algorithm, and
the assignment of processors to transforms is also determined by input
parameters. For example, input parameters specify a logical 2-D processor
grid and define the data decomposition of the physical and spectral domains
onto this grid. If 16 processors are used, these can be arranged as
a 4x4 grid, an 8x2 grid, a 16x1 grid, a 2x8 grid, or a 1x16 grid.
This specification determines how many processors are used to calculate each
parallel FFT and how many are used to calculate each parallel LT.
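
As a minimal sketch of the grid choices described above, the following Python
fragment lists every logical 2-D grid available for a given processor count.
The function name processor_grids is hypothetical and is not part of PSTSWM;
the actual code takes the grid shape from its input files.

```python
def processor_grids(nprocs):
    """Return every (PX, PY) pair of positive integers with PX * PY == nprocs."""
    return [(px, nprocs // px) for px in range(1, nprocs + 1) if nprocs % px == 0]


if __name__ == "__main__":
    # For 16 processors this lists the same five shapes named in the text.
    for px, py in processor_grids(16):
        print(f"PX={px:2d} PY={py:2d}")
```

Each pair fixes how many processors cooperate in each parallel FFT and how
many in each parallel LT, so sweeping over the list is one way to compare
aspect ratios for a fixed machine size.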

-------------------------------------------------------------------------------
Skeleton sketch of application:

The main program calls INPUT to read problem and algorithm parameters
and set up arrays for spectral transformations, and then calls
INIT to set up the test case parameters. Routines ERRANL and
NRGTCS are called once before the main timestepping loop for
error normalization, once after the main timestepping for 
calculating energetics data and errors, and periodically during 
the timestepping, as requested. The prognostic fields are 
initialized using routine ANLYTC, which provides the analytic
solution. Each call to STEP advances the computed fields by a 
timestep DT. Timing logic surrounds the timestepping loop, so the
initialization phase is not timed. Also, a fake timestep is calculated before
beginning timing to eliminate the first time "paging" effect currently seen
on the Intel Paragon systems. 

STEP computes the first two time levels by two semi-implicit timesteps;
normal time-stepping is by a centered leapfrog-scheme. STEP calls COMP1,
which chooses between an explicit numerical algorithm, a semi-implicit
algorithm, and a simplified algorithm associated with solving the advection
equation, one of the embedded test cases. The numerical algorithm used is an
input parameter. 

The basic outline of each timestep is the following:
1) Evaluate non-linear product and forcing terms.
2) Fourier transform non-linear terms in place as a block transform.
3) Compute and update divergence, geopotential, and vorticity spectral
   coefficients. (Much of the calculation of the time update is "bundled"
   with the Legendre transform.)
4) Compute velocity fields and transform divergence, geopotential,
   and vorticity back to gridpoint space using 
   a) an inverse Legendre transform and associated computations and
   b) an inverse real block FFT.

PSTSWM has "fictitious" vertical levels, and all computations are duplicated
on the different levels, potentially significantly increasing the granularity
of the computation. (The number of vertical levels is an input parameter.)
For error analysis, a single vertical level is extracted and analyzed. 

-------------------------------------------------------------------------------
Brief description of I/O behavior:

Processor 0 reads in the input parameters and broadcasts them to the rest of
the processors. Processor 0 also receives the error analysis and timing
results from the other processors and writes them out.

-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. There are 3 domains to be
distributed:
 a) physical domain: tensor product longitude-latitude grid
 b) Fourier domain: tensor product wavenumber-latitude grid
 c) spectral domain: triangular array, where each column contains the
                     spectral coefficients associated with a given
                     wavenumber. The larger the wavenumber is, the shorter
                     the column is.
An unordered FFT is used, and the Fourier and spectral domains use the
"unordered" permutation when the data is being distributed.

I) distributed FFT/distributed LT
   1) The tensor-product longitude-latitude grid is mapped onto the 
      processor grid by assigning a block of contiguous longitudes 
      to each processor column and by assigning one or two blocks of
      contiguous latitudes to each processor row. The vertical dimension is
      not distributed.   
   2) After the FFT, the subsequent wavenumber-latitude grid is similarly
      distributed over the processor grid, with a block of the permuted
      wavenumbers assigned to each processor column.
   3) After the LT, the wavenumbers are distributed as before and the spectral
      coefficients associated with any given wavenumber are either
      distributed evenly over the processors in the column containing that
      wavenumber, or are duplicated over the column. What happens is a
      function of the particular distributed LT algorithm used.

II) transpose FFT/distributed LT
   1) same as in (I)
   2) Before the FFT, the physical domain is first remapped to
      a vertical layer-latitude decomposition, with a block of contiguous
      vertical layers assigned to each processor column and the longitude
      dimension not distributed. After the transform, the vertical
      level-latitude grid is distributed as before, and the wavenumber
      dimension is not distributed. 
   3) After the LT, the spectral coefficients for a given vertical layer are
      either distributed evenly over the processors in a column, or are
      duplicated over that column. What happens is a function of the
      particular distributed LT algorithm used. 

III) distributed FFT/transpose LT
   1) same as (I)
   2) same as (I)
   3) Before the LT, the wavenumber-latitude grid is first remapped to
      a wavenumber-vertical layer decomposition, with a block of contiguous
      vertical layers assigned to each processor row and the latitude
      dimension not distributed. After the transform, the spectral
      coefficients associated with a given wavenumber and vertical layer
      are all on one processor, and the wavenumbers and vertical layers are
      distributed as before.

IV) transpose FFT/transpose LT
   1) same as (I)
   2) same as (II)
   3) Before the LT, the vertical level-latitude grid is first remapped to
      a vertical level-wavenumber decomposition, with a block of the permuted 
      wavenumbers now assigned to each processor row and the latitude
      dimension not distributed. After the transform, the spectral
      coefficients associated with a given wavenumber and vertical layer
      are all on one processor, and the wavenumbers and vertical layers are
      distributed as before.
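
The physical-domain mapping in (I.1) can be sketched in Python as follows.
This is an illustration under simplifying assumptions, not the PSTSWM
implementation: each processor row receives a single block of latitudes
(the form allows one or two), any remainder is spread over the leading
blocks, and the names block_ranges and physical_decomposition are
hypothetical.

```python
def block_ranges(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous half-open (start, stop)
    blocks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def physical_decomposition(nlon, nlat, px, py):
    """Map (processor column, processor row) to its longitude and latitude
    blocks. The vertical dimension is not distributed, so it does not appear."""
    lon_blocks = block_ranges(nlon, px)
    lat_blocks = block_ranges(nlat, py)
    return {
        (col, row): (lon_blocks[col], lat_blocks[row])
        for col in range(px)
        for row in range(py)
    }


if __name__ == "__main__":
    layout = physical_decomposition(nlon=64, nlat=32, px=4, py=2)
    for (col, row), (lons, lats) in sorted(layout.items()):
        print(f"column {col}, row {row}: longitudes {lons}, latitudes {lats}")
```

When the grid dimensions do not divide the domain evenly, some blocks hold
one more item than others; with 9 items over 8 parts the largest block is
twice the smallest.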

-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The distribution is a function of the problem size (longitude, latitude,
vertical levels), the logical processor grid (PX, PY), and the algorithm
(transpose vs. distributed for FFT and LT).

-------------------------------------------------------------------------------
Brief description of load balance behavior :

The load is fairly well balanced. If PX and PY evenly divide the number of
longitudes, latitudes, and vertical levels, then all load imbalances are due
to the unequal distribution of spectral coefficients. As described above, the
spectral coefficients are laid out as a triangular array in most runs, where
each column corresponds to a different Fourier wavenumber. The wavenumbers are
partitioned among the processors in most of the parallel algorithms. Since
each column is a different length, a wrap mapping of the columns will
approximately balance the load. Instead, the natural "unordered" ordering of
the FFT is used with a block partitioning, which does a reasonable job of
load balancing without any additional data movement. The load imbalance is
quantified in Walker, et al [5]. 

If PX and PY do not evenly divide the dimensions of the physical domain,
then other load imbalances may be as large as a factor of 2 in the worst
case.
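
To illustrate why the column ordering matters, the sketch below compares a
wrap mapping with a block partitioning of the columns in their natural
(increasing wavenumber) order. It assumes a triangular truncation in which
the column for wavenumber m holds MM - m + 1 coefficients. It does not model
the "unordered" FFT permutation that PSTSWM actually combines with block
partitioning, and all function names are hypothetical.

```python
def column_lengths(mm):
    """Length of each spectral column for wavenumbers m = 0..mm, assuming a
    triangular truncation (column m holds mm - m + 1 coefficients)."""
    return [mm - m + 1 for m in range(mm + 1)]


def wrap_loads(lengths, nparts):
    """Deal columns out cyclically: column m goes to processor m mod nparts."""
    loads = [0] * nparts
    for m, length in enumerate(lengths):
        loads[m % nparts] += length
    return loads


def block_loads(lengths, nparts):
    """Give each processor one contiguous block of columns in natural order."""
    base, extra = divmod(len(lengths), nparts)
    loads = []
    start = 0
    for part in range(nparts):
        size = base + (1 if part < extra else 0)
        loads.append(sum(lengths[start:start + size]))
        start += size
    return loads


def imbalance(loads):
    """Ratio of the heaviest load to the mean load (1.0 is perfect balance)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    lengths = column_lengths(42)
    print("wrap :", wrap_loads(lengths, 4), imbalance(wrap_loads(lengths, 4)))
    print("block:", block_loads(lengths, 4), imbalance(block_loads(lengths, 4)))
```

A block partition in natural order concentrates the long columns on the
first processor, which is the effect that a permuted column order avoids.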

-------------------------------------------------------------------------------
Give parameters that determine the problem size :

MM, NN, KK - specify the number of Fourier wavenumbers and the spectral
             truncation used. For a triangular truncation, MM = NN = KK.
NLON, NLAT, NVER
           - number of longitudes, latitudes, and vertical levels. There
             are required relationships between NLON, NLAT, and NVER, and
             between these and MM. These relationships are checked in the
             code. We will also provide a selection of input files that
             specify legal (and interesting) problems.
DT         - timestep (in seconds). (Must be small enough to satisfy the
             Courant stability condition. Code warns if too large, but does
             not abort.)
TAUE       - end of model run (in hours)

-------------------------------------------------------------------------------
Give memory as function of problem size :

Executable size is determined at compile time by setting the parameters
COMPSZ in params.i. Per node memory requirements are approximately
(in REALs)

associated Legendre polynomial values:
   MM*MM*NLAT/(PX*PY)
physical grid fields: 
   8*NLON*NLAT*NVER/(PX*PY)
spectral grid fields: 
   3*MM*MM*NVER/(PX*PY) 
 or (if spectral coefficients duplicated within a processor column)
   3*MM*MM*NVER/PX
work space:
   8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/(PX*PY)
 or (if spectral coefficients duplicated within a processor column)
   8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/PX

where BUFS1 and BUFS2 are input parameters (number of communication buffers).
BUFS1 and BUFS2 can be as small as 0 and as large as PX or PY.

In standard test cases, NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1, so memory
requirements are approximately:

    (2 + 108*(1+BUFS1) + 3*(1+BUFS2))*(M**3)/(4*PX*PY)
  or
    (2 + 108*(1+BUFS1))*(M**3)/(4*PX*PY) + 3*(1+BUFS2)*(M**3)/(4*PX)
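
The per-term estimates above can be collected into one small Python helper,
shown here as a sketch rather than as part of PSTSWM. It reads the divisor
of the Legendre polynomial term as (PX*PY), as in the other terms, and the
function name and the flag "duplicated" are hypothetical.

```python
def memory_reals(mm, nlon, nlat, nver, px, py, bufs1=0, bufs2=0, duplicated=False):
    """Approximate per-node memory requirement in REALs.

    duplicated=True models spectral coefficients duplicated within a
    processor column, in which case the spectral terms are divided by PX
    instead of PX*PY.
    """
    nodes = px * py
    spectral_divisor = px if duplicated else nodes

    legendre = mm * mm * nlat / nodes          # associated Legendre values
    physical = 8 * nlon * nlat * nver / nodes  # physical grid fields
    spectral = 3 * mm * mm * nver / spectral_divisor
    work = (8 * nlon * nlat * nver * bufs1 / nodes
            + 3 * mm * mm * nver * bufs2 / spectral_divisor)
    return legendre + physical + spectral + work


if __name__ == "__main__":
    # Example sizes chosen only for illustration.
    for grid in [(1, 1), (2, 2), (4, 1), (1, 4)]:
        reals = memory_reals(42, 64, 128, 16, *grid, bufs1=1, bufs2=1)
        print(grid, round(reals))
```

Evaluating the helper for several (PX, PY) pairs with the same product shows
how the duplicated-coefficient option makes memory use depend on the aspect
ratio of the processor grid.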


-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

for a serial run per timestep (very rough):
  nonlinear terms:
        10*NLON*NLAT*NVER
  forward FFT:
        40*NLON*NLAT*NVER*LOG2(NLON)
  forward LT and time update:
       48*MM*NLAT*NVER + 7*(MM**2)*NLAT*NVER
  inverse LT and calculation of velocities:
       20*MM*NLAT*NVER + 14*(MM**2)*NLAT*NVER
  inverse FFT:
       25*NLON*NLAT*NVER*LOG2(NLON)

Using standard assumptions (NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1):

approx. 460*(M**3) + 348*(M**3)*LOG2(M) + 24*(M**4) flops per timestep.

For a total run, multiply by TAUE/DT.
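
The serial estimates above can be evaluated with the following Python
sketch. The function names are hypothetical. Since the parameter list gives
TAUE in hours and DT in seconds, the sketch assumes that TAUE/DT means the
number of timesteps and converts TAUE to seconds before dividing.

```python
from math import log2


def flops_per_timestep(mm, nlon, nlat, nver):
    """Very rough serial flop count for one timestep, term by term."""
    grid = nlon * nlat * nver
    nonlinear = 10 * grid
    forward_fft = 40 * grid * log2(nlon)
    forward_lt = 48 * mm * nlat * nver + 7 * mm**2 * nlat * nver
    inverse_lt = 20 * mm * nlat * nver + 14 * mm**2 * nlat * nver
    inverse_fft = 25 * grid * log2(nlon)
    return nonlinear + forward_fft + forward_lt + inverse_lt + inverse_fft


def total_flops(mm, nlon, nlat, nver, taue_hours, dt_seconds):
    """Flops for a whole run: timesteps times flops per timestep."""
    timesteps = taue_hours * 3600.0 / dt_seconds  # assumed unit conversion
    return timesteps * flops_per_timestep(mm, nlon, nlat, nver)


if __name__ == "__main__":
    # Example sizes chosen only for illustration.
    print(f"{total_flops(42, 128, 64, 16, 120.0, 2400.0):.3e} flops")
```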

-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :

This is a function of the algorithm chosen.

I) transpose FFT
   a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
        2*(PX-1) steps, D volume
      or
        2*LOG2(PX) steps, D*LOG2(PX) volume 

II) distributed FFT
   a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
        2*LOG2(PX) steps, D*LOG2(PX) volume

III) transpose LT

   a) forward LT:  let D = 8*NLON*NLAT*NVER/(PX*PY)
        2*(PY-1) steps, D volume
      or
        2*LOG2(PY) steps, D*LOG2(PY) volume 

   b) inverse LT:  let D = (3/2)*(MM**2)*NVER/(PX*PY)
        (PY-1) steps, D volume
       or
        LOG2(PY) steps, D*PY volume

IV) distributed LT

   a) forward + inverse LT:  let D = 3*(MM**2)*NVER/(PX*PY)
        2*(PY-1) steps, D*PY volume
       or
        2*LOG2(PY) steps, D*PY volume

These are per timestep costs. Multiply by TAUE/DT for total communication
overhead. 
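
As an illustration of how these estimates might be compared, the Python
sketch below evaluates the FFT entries (I) and (II) for a given processor
grid. The function name and the algorithm labels "transpose_linear",
"transpose_log", and "distributed" are hypothetical and are not PSTSWM input
options; the returned volume is in the same units as D.

```python
from math import log2


def fft_comm_cost(nlon, nlat, nver, px, py, algorithm):
    """Return (steps, volume) per timestep for the forward + inverse FFT."""
    d = 13 * nlon * nlat * nver / (px * py)
    if algorithm == "transpose_linear":
        # Transpose FFT, first variant: 2*(PX-1) steps, D volume.
        return 2 * (px - 1), d
    if algorithm in ("transpose_log", "distributed"):
        # Transpose FFT, second variant, and distributed FFT:
        # 2*LOG2(PX) steps, D*LOG2(PX) volume.
        return 2 * log2(px), d * log2(px)
    raise ValueError(f"unknown algorithm: {algorithm}")


if __name__ == "__main__":
    # Example sizes chosen only for illustration.
    for name in ("transpose_linear", "transpose_log", "distributed"):
        steps, volume = fft_comm_cost(64, 32, 8, 4, 2, name)
        print(f"{name:17s} steps={steps:g} volume={volume:g}")
```

The first transpose variant uses more steps but moves less data as PX grows,
which is the trade-off the table expresses.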

-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files,
memory required, and number of floating point operations) :

Standard input files will be provided for 

T21: MM=KK=NN=21      T42: MM=KK=NN=42        T85: MM=NN=KK=85
     NLON=32               NLON=64                 NLON=128
     NLAT=64               NLAT=128                NLAT=256
     NVER=8                NVER=16                 NVER=32
     ICOND=2               ICOND=2                 ICOND=2
     DT=4800.0             DT=2400.0               DT=1200.0
     TAUE=120.0            TAUE=120.0              TAUE=120.0

These are 5 day runs of the "benchmark" case specified in Williamson, et al
[3]. Flops and memory requirements for serial runs are as follows (approx.):

T21:           500,000 REALs
         2,000,000,000 flops
     
T42:         4,000,000 REALs
        45,000,000,000 flops

T85:        34,391,000 REALs
     1,000,000,000,000 flops

Both memory and flops scale well, so, for example, the T42 run fits in
approx. 4MB of memory for a 4 processor run. But different algorithms and 
different aspect ratios of the processor grid use different amounts of memory.

-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand (looking primarily at inner loops, but eliminating common
subexpressions that compiler is expected to find).

-------------------------------------------------------------------------------
Other relevant information:



-------------------------------------------------------------------------------
From owner-pbwg-compactapp@CS.UTK.EDU Fri Oct  8 09:17:11 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA29750; Fri, 8 Oct 93 09:17:11 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA00426; Fri, 8 Oct 93 09:16:23 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Fri, 8 Oct 1993 09:16:22 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA00418; Fri, 8 Oct 93 09:16:20 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA20027; Fri, 8 Oct 1993 09:16:19 -0400
Message-Id: <9310081316.AA20027@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: Compact applications chapter
Date: Fri, 08 Oct 93 09:16:19 -0500
From: David W. Walker <walker@rios2.epm.ornl.gov>


I just sent the following to Mike Berry, but some of you might also like to make
suggestions.

David

Mike,
	I am at a bit of a loss as to what to put into the ParkBench report
for Compact Applications since we haven't had any codes submitted (except
for maybe 2 or 3).  It seems to me that we can't really say much without
the codes, apart from very general requirements.

David
From owner-pbwg-compactapp@CS.UTK.EDU Fri Oct  8 10:17:35 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA00610; Fri, 8 Oct 93 10:17:35 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA06069; Fri, 8 Oct 93 10:17:05 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Fri, 8 Oct 1993 10:17:03 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from haven.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA06059; Fri, 8 Oct 93 10:17:02 -0400
Received: by haven.EPM.ORNL.GOV (4.1/1.34)
	id AA15407; Fri, 8 Oct 93 10:16:56 EDT
Date: Fri, 8 Oct 93 10:16:56 EDT
From: worley@haven.EPM.ORNL.GOV (Pat Worley)
Message-Id: <9310081416.AA15407@haven.EPM.ORNL.GOV>
To: walker@rios2.epm.ornl.gov, pbwg-compactapp@cs.utk.edu
Subject: Re: Compact applications chapter
In-Reply-To: Mail from 'David W. Walker <walker@rios2.epm.ornl.gov>'
      dated: Fri, 08 Oct 93 09:16:19 -0500
Cc: worley@haven.EPM.ORNL.GOV

>I just sent the following to Mike Berry, but some of you might also like to make
>suggestions.
>
>David
>
>Mike,
>>I am at a bit of a loss as to what to put into the ParkBench report
>for Compact Applications since we haven't had any codes submitted (except
>for maybe 2 or 3).  It seems to me that we can't really say much without
>the codes, apart from very general requirements.
>
>David

Since I imagine that there will always be a dearth of (good) compact
applications, a requirements document (or, at least, a wish list) would be a
useful contribution, particularly if the wishlist were prioritized by what is
most important for the code to have, e.g.,

1) scientific relevance (does anyone care about this type of problem)
2) numerical relevance (are the numerical algorithms representative or
   interesting) 
3) algorithmic relevance (are the parallel algorithms representative or
   interesting)
4) portability (language, parallel programming model, etc.)
5) runability (easy to run, easy to validate results, easy to use for
   benchmarking)
6) ...

This can probably be broken into requirements and desirable features.

Pat

From owner-pbwg-compactapp@CS.UTK.EDU Thu Oct 14 13:38:54 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA16662; Thu, 14 Oct 93 13:38:54 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA04580; Thu, 14 Oct 93 13:37:31 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Thu, 14 Oct 1993 13:37:29 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA04571; Thu, 14 Oct 93 13:37:28 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA19646; Thu, 14 Oct 1993 13:37:27 -0400
Date: Thu, 14 Oct 1993 13:37:27 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9310141737.AA19646@rios2.epm.ornl.gov>
To: berry@cs.utk.edu
Subject: ParkBench compact applications
Cc: pbwg-compactapp@cs.utk.edu


Mike,
	Below is the latest version of the Compact Application section of the
ParkBench document. I also intend to send a latex version of the submission 
form to you later today for inclusion as Appendix A. I hope there will
be some comments back from the other members of the subcommittee about
this section, so I hope there will be an opportunity to update it.

David
%file: compac3.tex
%date: October 14, 1993
\chapter{Compact Applications}
\footnote{assembled by David Walker for Compact Applications subcommittee}

\section{Introduction}
\label{sec:compact.intro}
While kernel applications, such as those described in Chapter 3, provide
a fairly straightforward way of assessing the performance of parallel
systems, they are not representative of scientific applications in general
since they do not reflect certain types of system behavior. In particular,
many scientific applications involve data movement between phases of
an application, and may also require significant amounts of I/O. These types
of behavior are difficult to gauge using kernel applications. 

One factor
that has hindered the use of full application codes for benchmarking parallel
computers in the past is that such codes are difficult to parallelize and to
port between target architectures. In addition, full application codes that
have been successfully parallelized are often proprietary, and/or subject
to distribution restrictions. To minimize the negative impact of these factors
we propose to make use of compact applications in our benchmarking effort.

Compact applications are typical of those found in research environments 
(as opposed to production or engineering environments), and usually consist of 
up to a few thousand lines of source code. Compact applications are distinct 
from kernel applications since they are capable of producing scientifically
useful results. In many cases, compact applications are made up of several
kernels, interspersed with data movements and I/O operations between the 
kernels.

In this chapter the criteria for selecting compact applications
for the ParkBench suite will be discussed. In addition, the general 
research areas that will be represented in the suite are outlined.

%In this chapter we will discuss a number of compact applications in terms of 
%their purpose, the algorithms used, the types of data movements required, 
%the memory requirements, and
%the amount of I/O. The compact application below are not meant to form a 
%definite or complete list.

\section{Criteria for Selection}
\label{sec:criteria}
The three main criteria for inclusion of a parallel code
in the Compact Applications suite are:
\begin{enumerate}
\item
The code must be a complete application and be capable of producing results
of research interest. These two points distinguish a compact application from
a kernel. For example, a code that only solves a randomly-generated, dense, 
linear system by LU factorization should be considered a kernel. Even though 
the code is complete, it does not produce results of research interest. 
However, if the LU factorization is embedded in an application that uses
the boundary element method to solve, for example, a two-dimensional
elastodynamics problem, then such an application could legitimately be
considered a compact application. 
Compact applications and full production codes are distinguished by their
software complexity, which is difficult to quantify. Software complexity gives
an indication of how hard it is to write, port and maintain an application, 
and may be gauged very roughly by the length of the source code. However, there
is no hard upper limit on the length of a code in the Compact Applications 
suite.  It is expected that the source code (excluding comments and repeated 
common blocks) for most compact applications will be between 2000 and 10000 
lines, but some may be longer.

\item
The code must be of high quality. This means it must have been extensively
tested and validated, preferably on a wide selection of different parallel
architectures. The problem size and number of processors used must not be
hard-coded into the application, and should be specified at runtime as input 
to the program. Ideally, the parallel code should not impose restrictions on 
the problem size that are not applicable for the corresponding sequential code.
Thus, the parallel code should not require that the problem size be exactly 
divisible by the number of processors, or that the number of processors be 
a power of two. In some cases this latter requirement may have to be relaxed.
For example, most parallel fast Fourier transform routines require the number
of processors to be a power of two. It is preferable that the code be
written so that it works correctly for
an arbitrary one-to-one mapping between the logical process topology of the
application and the hardware topology of the parallel computer.
This is desirable so
that the assignment of a location in the logical process topology to a
physical processor can be easily adjusted when porting
the application between platforms. For example, a Gray code assignment may
be best for a hypercube, and a natural ordering for a mesh architecture.

\item
The application must be well documented. The source code itself should 
contain an adequate number of comments, and each module should begin
with a comment section that describes what the routine does, and the
arguments passed to it. In addition, there should be a ``Users' Guide''
to the application that describes the input and output, the parameterization
of the problem size and processor layout, and details of what the application
does. The Users' Guide should also contain a bibliography of related
papers.
\end{enumerate}

In addition to the three criteria discussed above, there are a number of
other desirable features that a ParkBench Compact Application should have.
These are discussed in the following subsections.

\subsection{Self Checking Applications}
\label{subsec:checking}
The application should be self-checking. That is, at the end of the computation
the application should perform a check to validate the results of the run.
The application may also output a summary of performance results for the run,
such as the Mflop rate, and other pertinent information.

\subsection{Programming Languages}
\label{subsec:languages}
The code should be written in Fortran 77, Fortran 90, High Performance Fortran,
or C. Data should be passed between processors by explicit message passing.
ParkBench does not specify which message passing system should be used, but
one that is available on a number of parallel platforms is preferable. 
Eventually it is expected that MPI will become the message passing system
of choice, but in the meantime portable systems such as PVM, PICL, Express,
PARMACS, and P4 are acceptable alternatives. The codes in the
Compact Applications suite should not contain any assembly coded portions,
although assembly code may be used in optimized versions of the code.

\section{Proposed Compact Application Benchmarks}
\label{sec:compact.proposed}
At the time of writing (October 1993) the ParkBench organization is in
the process of soliciting submission of applications for inclusion in
the Compact Applications suite. Thus, the applications that comprise the suite
cannot yet be listed here. However, in this section the main application areas
that are expected to be in the suite are outlined. The intention is that
these areas should be representative of the fields in which parallel
computers are actually used. The codes should exercise a number of different
algorithms, and possess different communication and I/O characteristics.
Initially the Compact Applications suite will
consist of no more than ten codes. This restriction is imposed so that
the resources needed to manage and distribute the suite can be assessed. The
suite may be enlarged in the future if this seems manageable.
Below is a list of the application areas that are expected to be
represented in the suite. This is
not meant to be an exclusive list; submissions from other application areas
will be considered for inclusion in the suite.
\begin{itemize}
\item
Climate and meteorological modeling
\item
Computational fluid dynamics (CFD)
\item
Finance, e.g., portfolio optimization
\item
Molecular dynamics
\item
Plasma physics
\item
Quantum chemistry
\item
Quantum chromodynamics (QCD)
\item
Reservoir modeling
\end{itemize}

\section{Submitting to the Compact Application Suite}
\label{sec:submit}
The procedure for submitting codes to the ParkBench Compact Applications suite
is as follows.
\begin{enumerate}
\item
Complete the submission form in Appendix A, and email it to David Walker
at walker@msr.epm.ornl.gov. The data on this form will be reviewed
by the ParkBench Compact Applications Subcommittee, and the submitter will
be notified if the application is to be considered further for
inclusion in the ParkBench suite.
\item
If the ParkBench Compact Applications Subcommittee decides to consider
the application further, the submitter will be asked to submit the source code
and input and output files, together with any documentation and papers
about the application. Source code and input and output files should
be submitted by email or ftp, unless the files are very large, in
which case a tar file on a 1/4 inch cassette tape should be sent. Wherever
possible, email submission is preferred for all documents, in man page, LaTeX,
and/or PostScript format. These files, documents, and papers together
constitute the application package. The application package should
be sent to the following address, and the subcommittee will then make a final 
decision on whether to include the application in the ParkBench suite.\par
\smallskip
\indent David W. Walker\par
\indent Oak Ridge National Laboratory\par
\indent Bldg.~6012/MS-6367\par
\indent P. O. Box 2008\par
\indent Oak Ridge, TN 37831-6367\par
\indent (615) 574-7401/0680 (phone/fax)\par
\indent walker@msr.epm.ornl.gov\par

\item
If the application is approved for inclusion in the ParkBench suite
an authorized person from the submitting organization will be asked
to complete and sign a form giving ParkBench authority to distribute,
and modify (if necessary), the application package.
From owner-pbwg-compactapp@CS.UTK.EDU Thu Oct 28 08:51:57 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA11600; Thu, 28 Oct 93 08:51:57 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07295; Thu, 28 Oct 93 08:51:33 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Thu, 28 Oct 1993 08:51:32 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07287; Thu, 28 Oct 93 08:51:31 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA13437; Thu, 28 Oct 1993 08:51:41 -0400
Date: Thu, 28 Oct 1993 08:51:41 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9310281251.AA13437@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: Compact Appl. Submissions


So far I've received 3 submissions for the ParkBench Compact
Applications suite. I'm sending you the completed forms in 3 
separate email messages.

David
From owner-pbwg-compactapp@CS.UTK.EDU Thu Oct 28 08:52:38 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA11616; Thu, 28 Oct 93 08:52:38 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07341; Thu, 28 Oct 93 08:52:14 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Thu, 28 Oct 1993 08:52:13 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07333; Thu, 28 Oct 93 08:52:11 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA11913; Thu, 28 Oct 1993 08:52:21 -0400
Date: Thu, 28 Oct 1993 08:52:21 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9310281252.AA11913@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: POLMP Compact Application


-------------------------------------------------------------------------------
Name of Program         : POLMP
                 (Proudman Oceanographic Laboratory Multiprocessing Program)
-------------------------------------------------------------------------------
Submitter's Name        : Mike Ashworth
Submitter's Organization: NERC Computer Services
Submitter's Address     : Bidston Observatory
			  Birkenhead, L43 7RA, UK
Submitter's Telephone # : +44-51-653-8633
Submitter's Fax #       : +44-51-653-6269
Submitter's Email       : mia@ua.nbi.ac.uk
-------------------------------------------------------------------------------
Cognizant Expert 	: Mike Ashworth
CE's Organization	: NERC Computer Services
CE's Address     	: Bidston Observatory
			  Birkenhead, L43 7RA, UK
CE's Telephone # 	: +44-51-653-8633
CE's Fax #       	: +44-51-653-6269
CE's Email       	: mia@ua.nbi.ac.uk
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :

Bearing in mind other commitments, Mike Ashworth is prepared to respond 
quickly to questions and bug reports, and expects to be kept informed as 
to results of experiments and modifications to the code.

-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Ocean and Shallow Sea Modeling
-------------------------------------------------------------------------------
Application "pedigree" (origin, history, authors, major mods) :

     The POLMP project was created to develop numerical
     algorithms for shallow sea 3D hydrodynamic models that run
     efficiently on modern parallel computers. A code was
     developed, using a set of portable programming conventions
     based upon standard Fortran 77, which follows the wind
     induced flow in a closed rectangular basin including a number
     of arbitrary land areas. The model solves a set of
     hydrodynamic partial differential equations, subject to a set of
     initial conditions, using a mixed explicit/implicit forward time
     integration scheme. The explicit component corresponds to a
     horizontal finite difference scheme and the implicit to a
     functional expansion in the vertical (Davies, Grzonka and
     Stephens, 1989).

     By the end of 1989 the code had been implemented on the RAL
     4 processor Cray X-MP using Cray's microtasking system,
     which provides parallel processing at the level of the Fortran
     DO loop. Acceptable parallel performance was achieved by
     integrating each of the vertical modes in parallel, referred to
     in Ashworth and Davies (1992) as vertical partitioning. In
     particular, a speed-up of 3.15 over single processor execution
     was obtained, with an execution rate of 548 Megaflops
     corresponding to 58 per cent of the peak theoretical
     performance of the machine. Execution on an 8 processor Cray
     Y-MP gave a speed-up of 7.9 and 1768 Megaflops or
     67 per cent of the peak (Davies, Proctor and O'Neill, 1991).
     The latter resulted in Davies and Grzonka being awarded a
     prize in the 1990 Cray Gigaflop Performance Awards.

     The project has been extended by implementing the shallow
     sea model in a form which is more appropriate to a variety of
     parallel architectures, especially distributed memory
     machines, and to a larger number of processors. It is especially
     desirable to be able to compare shared memory parallel
     architectures with distributed memory architectures. Such a
     comparison is currently relevant to NERC science generally
     and will be a factor in the considerations for the purchase of
     new machines, bids for allocations on other academic
     machines, and for the design of new codes and the
     restructuring of existing codes.

     In order to simplify development of the new code and to ensure
     a proper comparison between machines, a restructured version
     of the Davies and Grzonka rectangle was designed which will
     perform partitioning of the region in the horizontal dimension.
     This has the advantage over vertical partitioning that the
     communication between processors is limited to a few points
     at the boundaries of each sub-domain. The ratio of interior
     points to boundary points, which determines the ratio of
     computation to communication and hence the efficiency on
     message passing, distributed memory machines, may be
     increased by increasing the size of the individual sub-domains.
     This design may also improve the efficiency on shared memory
     machines by reducing the time of the critical section and
     reducing memory conflicts between processors. In addition, the
     required number of vertical modes is only about 16, which,
     though well suited to a 4 or 8 processor machine, does not
     contain sufficient parallelism for more highly parallel
     machines.

     The code has been designed with portability in mind, so that
     essentially the same code may be run on parallel computers
     with a range of architectures. 

-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Ashworth and
Davies) in any resulting research or publications, and are
encouraged to send reprints of their work with this code to the authors.
Also, the authors would appreciate being notified of any modifications to 
the code. 

-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

Some 8 byte floating point numbers are used in some of the initialization
code, but calculations on the main field arrays may be done using
4 byte floating point variables without grossly affecting the solution.
Nevertheless, precision conversion is facilitated by a switch supplied
to the C preprocessor. By specifying -DSINGLE, variables will be declared
as REAL, normally 4 bytes, whereas -DDOUBLE will cause declarations to be
DOUBLE PRECISION, normally 8 bytes.

-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The README file supplied with the code describes how the various versions
of the code should be built. Extensive documentation, including the 
definition of all variables in COMMON, is present as comments in the code.

-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

1) Davies, A.M., Formulation of a linear three-dimensional hydrodynamic
   sea model using a Galerkin-eigenfunction method, Int. J. Num. Meth.
   in Fluids, 1983, Vol. 3, 33-60.

2) Davies, A.M., Solution of the 3D linear hydrodynamic equations using
   an enhanced eigenfunction approach, Int. J. Num. Meth. in Fluids,
   1991, Vol. 13, 235-250.

-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

1) Ashworth, M. and Davies, A.M., Restructuring three-dimensional
   hydrodynamic models for computers with low and high degrees of
   parallelism, in Parallel Computing '91, eds D.J.Evans, G.R.Joubert
   and H.Liddell (North Holland, 1992), 553-560.
   
2) Ashworth, M., Parallel Processing in Environmental Modelling, in
   Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
   Meteorology (Nov. 23-27, 1992)
   Hoffman, G.-R and T. Kauranne, ed., 
   World Scientific Publishing Co. Pte. Ltd, Singapore, 1993.

3) Ashworth, M. and Davies, A.M., Performance of a Three Dimensional
   Hydrodynamic Model on a Range of Parallel Computers, in
   Proceedings of the Euromicro Workshop on Parallel and Distributed
   Computing, Gran Canaria 27-29 January 1993, pp 383-390, (IEEE
   Computer Society Press)
   
4) Davies, A.M., Ashworth, M., Lawrence, J., O'Neill, M.,
   Implementation of three dimensional shallow sea models on vector
   and parallel computers, 1992a, CFD News, Vol. 3, No. 1, 18-30.
   
5) Davies, A.M., Grzonka, R.G. and Stephens, C.V., The implementation
   of hydrodynamic numerical sea models on the Cray X-MP, 1992b, in
   Advances in Parallel Computing, Vol. 2, edited D.J. Evans.
   
6) Davies, A.M., Proctor, R. and O'Neill, M., "Shallow Sea
   Hydrodynamic Models in Environmental Science", Cray Channels,
   Winter 1991.

-------------------------------------------------------------------------------
Other relevant research papers:

-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

Code is initially passed through the C preprocessor, allowing a 
number of versions with different programming styles, precisions
and machine dependencies to be generated.

Fortran 77 version

     A sequential version of POLMP is available, which conforms
     to the Fortran 77 standard. This version has been run on a
     large number of machines from workstations to supercomputers 
     and any code which caused problems, even if it conformed to 
     the standard, has been changed or removed. Thus its conformance 
     to the Fortran 77 standard is well established.

     In order to allow the code to run on a wide range of problem
     sizes without recompilation, the major arrays are defined
     dynamically by setting up pointers, with names starting with
     IX, which point to locations in a single large data array: SA.
     Most pointers are allocated in subroutine MODSUB and the
     starting location passed down into subroutines in which they
     are declared as arrays. For example :

     IX1 = 1
     IX2 = IX1 + N*M
     CALL SUB ( SA(IX1), SA(IX2), N, M )

     SUBROUTINE SUB ( A1, A2, N, M )
     DIMENSION A1(N,M), A2(N,M)
     END

     Although this is probably against the spirit of the Fortran 77
     standard, it is considered the best compromise between
     portability and utility, and has caused no problems on any of
     the machines on which it has been tried. 
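
The pointer scheme above can be sketched outside Fortran as follows. This is
only an illustration in Python of the offset arithmetic (IX2 = IX1 + N*M);
the function name and the dictionary of shapes are hypothetical and do not
appear in POLMP.

```python
def allocate_offsets(shapes, start=1):
    """Assign each named 2-D array a 1-based start index in one shared workspace.

    shapes maps a (hypothetical) array name to its (rows, cols) extent.
    Returns the start index of each array and the total number of words used.
    """
    offsets = {}
    next_free = start
    for name, (rows, cols) in shapes.items():
        offsets[name] = next_free      # plays the role of IX1, IX2, ...
        next_free += rows * cols       # next array starts after this one
    return offsets, next_free - start


if __name__ == "__main__":
    # Two N x M arrays with N = 3, M = 4, as in the Fortran fragment.
    print(allocate_offsets({"A1": (3, 4), "A2": (3, 4)}))
```

Because the extents are ordinary run-time values, the same workspace can
serve different problem sizes without recompilation, provided it is large
enough for the total returned.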

     The code has been run on a number of traditional vector
     supercomputers, mainframes and workstations. In addition,
     key loops can be parallelized automatically by some
     compilers on shared (or virtual shared) memory MIMD machines, 
     allowing parallel execution on the Convex C2 and C3, Cray X-MP, 
     Y-MP, and Y-MP/C90, and Kendall Square Research KSR-1. Cray 
     macrotasking calls may also be enabled for an alternative
     mode of parallel execution on Cray multiprocessors.

Message passing version

     POLMP has been implemented on a number of message-passing machines:
     Intel iPSC/2 and iPSC/860, Meiko CS-1 i860 and CS-2, and nCUBE 2.
     Code is also present for the PVM and Parmacs portable message
     passing systems, and POLMP has run successfully, though not 
     efficiently, on a network of Silicon Graphics workstations. 
     Calls to message passing routines are concentrated 
     in a small number of routines for ease of portability and 
     maintenance. POLMP performs housekeeping tasks on one node of the 
     parallel machine, usually node zero, referred to in the code as the 
     driver process, the remaining processes being workers. For Parmacs
     version 5, which requires a host program, a simple host program has 
     been provided which loads the node program onto a two dimensional 
     torus and then takes no further part in the run, other than to 
     receive a completion code from the driver, in case terminating the 
     host early would interfere with execution of the nodes.

Data parallel versions

     A data parallel version of the code has been run on the
     Thinking Machines CM-2, CM-200 and MasPar MP-1 machines.

     High Performance Fortran (HPF) defines extensions to the
     Fortran 90 language in order to provide support for parallel
     execution on a wide variety of machines using a data parallel
     programming model. 

     The subset-HPF version of the POLMP code has been written
     to the draft standard specified by the High Performance
     Fortran Forum in the HPF Language Specification version 0.4
     dated November 6, 1992. Fortran 90 code was developed on a
     Thinking Machines CM-200 machine and checked for
     conformance with the Fortran 90 standard using the
     NAGWare Fortran 90 compiler. HPF directives were inserted
     by translating from the CM Fortran directives, but have not
     been tested due to the lack of access to an HPF compiler. The
     only HPF features used are the PROCESSORS, TEMPLATE,
     ALIGN and DISTRIBUTE directives and the system inquiry
     intrinsic function NUMBER_OF_PROCESSORS.

-------------------------------------------------------------------------------
Total number of lines in source code: 26,699
Number of lines excluding comments  : 11,313
Size in bytes of source code        : 756,107

-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

steering file:   13 lines, 250 bytes, ascii (typical size)

-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if formatted) :

standard output: 700 lines, 62,000 bytes, ascii (typical size)

-------------------------------------------------------------------------------
Brief, high-level description of what application does:

POLMP solves the linear three-dimensional hydrodynamic equations 
for the wind induced flow in a closed rectangular basin of constant depth
which may include an arbitrary number of land areas. 

-------------------------------------------------------------------------------
Main algorithms used:

The discretized form of the hydrodynamic equations is solved for the field
variables z (surface elevation) and u and v (the horizontal components of
velocity. The fields are represented in the horizontal by a staggered 
finite difference grid. The profile of vertical velocity with depth
is represented by the superposition of a number of spectral components.
The functions used in the vertical are arbitrary, although the 
computational advantages of using eigenfunctions (modes) of the eddy
viscosity profile have been demonstrated (Davies, 1983). Velocities
at the closed boundaries are set to zero.

Each timestep in the forward time integration of the model involves
successive updates to the three fields, z, u and v. New field values 
computed in each update are used in the subsequent calculations. A
five point finite difference stencil is used, requiring only nearest 
neighbours on the grid. 

A number of different data storage and data processing methods are
included, mainly for handling cases with significant amounts of land, 
e.g. index array, packed data. In particular the program may be 
switched between masked operation, more suitable for vector processors, 
in which computation is done on all points, but land and boundary points
are masked out, and strip-mining, more suitable for scalar and RISC 
processors, in which calculations are only done for sea points.
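
The difference between the two modes can be illustrated with a minimal
sketch in Python. It assumes a single row of grid points, a placeholder
update (adding a constant), and that a "strip" is a contiguous run of sea
points; none of the names below come from POLMP, and the real time-stepping
routines are considerably more involved.

```python
def masked_update(field, sea_mask, increment):
    """Masked operation: compute at every point, then mask out land points."""
    return [value + increment * (1.0 if is_sea else 0.0)
            for value, is_sea in zip(field, sea_mask)]


def sea_strips(sea_mask):
    """Return (start, end) index pairs for contiguous runs of sea points."""
    strips = []
    start = None
    for i, is_sea in enumerate(sea_mask):
        if is_sea and start is None:
            start = i                      # a new strip begins here
        elif not is_sea and start is not None:
            strips.append((start, i))      # strip ended at the previous point
            start = None
    if start is not None:
        strips.append((start, len(sea_mask)))
    return strips


def strip_mined_update(field, sea_mask, increment):
    """Strip-mined operation: loop only over the strips of sea points."""
    result = list(field)
    for start, end in sea_strips(sea_mask):
        for i in range(start, end):
            result[i] = field[i] + increment
    return result
```

Both functions produce the same field; the masked form does redundant work
on land points in exchange for one long, regular loop, while the strip-mined
form does no work on land at the cost of shorter loops.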

-------------------------------------------------------------------------------
Skeleton sketch of application:

The call chart of the major subroutines is represented thus:

  AAAPOL -> APOLMP -> INIT
                   -> RUNPOL -> INIT2  -> MAP
                                       -> DIVIDE
                                       -> PRMAP
                                       -> GENSTP
                                       -> SPEC   -> ROOTS  -> TRANS
                             -> SNDWRK
                             -> RCVWRK
                             -> SETUP
                             -> MODSUB -> MODEL  -> ASSIGN -> GENMSK
                                                           -> GENSTP
                                                           -> GENIND
                                                           -> GENPAC
                                                           -> METRIC
                                                 -> CLRFLD
                                                 -> TIME*  -> SNDBND
                                                           -> RCVBND
                                                 -> RESULT
                             -> SNDRES
                             -> RCVRES
                             -> MODOUT -> OZUVW  -> OUTFLD -> GETRES
                                                           -> OUTARR
                                                           -> GRYARR
                                       -> WSTATE

AAAPOL is a dummy main program calling APOLMP. APOLMP calls INIT which
reads parameters from the steering file, checks and monitors them.
RUNPOL is then called which calls another initialization routine INIT2.
Called from INIT2, MAP forms a map of the domain to be modelled, DIVIDE
divides the domain between processors, PRMAP maps sub-domains onto
processors, GENSTP counts indexes for strip-mining and SPEC, ROOTS
and TRANS set up the coefficients for the spectral expansion.

SNDWRK on the driver process sends details of the sub-domain to be
worked on to each worker. RCVWRK receives that information. SETUP
does some array allocation and MODSUB does the main allocation of array 
space to the field and ancillary arrays. MODEL is the main driver 
subroutine for the model. ASSIGN calls routines to generate masks,
strip-mining indexes, packing indexes and measurement metrics.
CLRFLD initializes the main data arrays. Then one of seven
time-stepping routines, TIME*, is chosen dependent on the vectorization
and packing/indexing method used to cope with the presence of land.
SNDBND and RCVBND handle the sending and reception of boundary
data between sub-domains. After the required number of time-steps
is complete, RESULT saves results from the desired region, and 
SNDRES on the workers and RCVRES on the driver collect the result data.
MODOUT handles the writing of model output to standard output and disk
files, as required.

For a non-trivial run, 99% of time is spent in whichever of the 
timestepping routines, TIME*, has been chosen.

-------------------------------------------------------------------------------
Brief description of I/O behavior:

The driver process, usually processor 0, reads in the input parameters 
and broadcasts them to the rest of the processors. The driver also receives 
the results from the other processors and writes them out.

-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. The simulation domain
is divided into a number of sub-domains which are allocated, one sub-domain
per processor.

-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The number of processors, p, and the number of sub-domains are provided 
as steering parameters, as is a switch which requests either one-dimensional
or two-dimensional partitioning. 

Partitioning is only actually carried out for the message passing versions
of the code. For two-dimensional partitioning p is factored into px and py 
where px and py are as close as possible to sqrt(p). 
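
One way such a factorization could be computed is sketched below in Python.
This is an assumption about the approach, not the code used in POLMP; in
particular the function name and the convention px <= py are hypothetical.

```python
import math


def factor_processors(p):
    """Return (px, py) with px * py == p and px the largest divisor <= sqrt(p)."""
    if p < 1:
        raise ValueError("number of processors must be positive")
    px = math.isqrt(p)          # start at floor(sqrt(p)) and search downwards
    while p % px != 0:
        px -= 1
    return px, p // px


if __name__ == "__main__":
    for p in (1, 7, 12, 16, 64, 128):
        print(p, factor_processors(p))
```

For a prime p the search falls through to px = 1, which is equivalent to
one-dimensional partitioning.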

For the data parallel version the number of sub-domains is set to one 
and decomposition is performed by the compiler via data distribution 
directives.

-------------------------------------------------------------------------------
Brief description of load balance behavior :

Unless land areas are specified, the load is fairly well balanced. 
If px and py evenly divide the number of grid points, then the
model is perfectly balanced except that boundary sub-domains have 
fewer communications.
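
The effect of uneven division can be estimated with a short Python sketch.
It assumes, hypothetically, that any remainder is spread one extra point at
a time over the leading blocks; how POLMP's DIVIDE routine treats the
remainder is not stated here.

```python
def block_sizes(n, parts):
    """Split n grid points into `parts` contiguous blocks differing by at most one."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def load_imbalance(nx, ny, px, py):
    """Ratio of the largest sub-domain's point count to the mean point count."""
    largest = max(block_sizes(nx, px)) * max(block_sizes(ny, py))
    mean = (nx * ny) / (px * py)
    return largest / mean
```

A ratio of 1.0 corresponds to the perfectly balanced case described above;
larger values indicate how much longer the most heavily loaded processor
computes than the average, ignoring land areas and communication.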

No tests with land areas have yet been performed with the parallel 
code, and more sophisticated domain decomposition algorithms have
not yet been included.

-------------------------------------------------------------------------------
Give parameters that determine the problem size :

nx, ny      Size of horizontal grid
m           Number of vertical modes
nts         Number of timesteps to be performed

-------------------------------------------------------------------------------
Give memory as function of problem size :

See below for specific examples.

-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

Assuming standard compiler optimizations, there is a requirement for
29 floating point operations (18 add/subtracts and 11 multiplies) per
grid point per vertical mode per timestep, so the total computational
load is

          29 * nx * ny * m * nts
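
As a quick check, the expression can be evaluated directly. The Python
helper below is only a convenience sketch; the function and constant names
are not part of POLMP.

```python
FLOPS_PER_POINT = 29  # 18 add/subtracts + 11 multiplies, as stated above


def total_flops(nx, ny, m, nts):
    """Floating point operations for an nx x ny grid, m modes and nts timesteps."""
    return FLOPS_PER_POINT * nx * ny * m * nts


if __name__ == "__main__":
    # The v200 case: 512 x 512 grid, 16 modes, 200 timesteps.
    print(total_flops(512, 512, 16, 200))
```

For the v200 case this gives roughly 2.4e10 operations, in line with the
24 Gflop entry in the table of problem sizes.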

-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :

During each timestep each sub-domain of size nsubx=nx/px by nsuby=ny/py 
requires the following communications in words :

             nsubx * m     from N
             nsubx         from S
             nsubx * m     from S
             nsuby * m     from W
             nsuby         from E
             nsuby * m     from E
             m             from NE
             m             from SW

making a total of 

             (2 * m + 1)*(nsubx + nsuby) + 2*m words 

in eight messages from six directions.
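
The per-timestep volume can be tallied by summing the eight messages listed
above. The Python sketch below assumes an interior sub-domain and that px
and py divide nx and ny exactly; the function names are illustrative only.

```python
def boundary_messages(nx, ny, m, px, py):
    """List (direction, words) for each message a sub-domain receives per timestep."""
    nsubx = nx // px   # sub-domain extent in x (assumes exact division)
    nsuby = ny // py   # sub-domain extent in y (assumes exact division)
    return [
        ("N", nsubx * m),
        ("S", nsubx),
        ("S", nsubx * m),
        ("W", nsuby * m),
        ("E", nsuby),
        ("E", nsuby * m),
        ("NE", m),
        ("SW", m),
    ]


def words_per_timestep(nx, ny, m, px, py):
    """Total words received by one sub-domain in a single timestep."""
    return sum(words for _, words in boundary_messages(nx, ny, m, px, py))
```

Comparing this total with the per-sub-domain operation count
29 * nsubx * nsuby * m gives a rough computation-to-communication ratio for
a chosen decomposition.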

-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files,
memory required, and number of floating point operations) :

     The data sizes and computational requirements for the various
     problems supplied are :

     Name      nx x ny x m x nts        Computational    Memory
                                        Load (Gflop)     (Mword)

     dbg        10 x   10 x  1 x 2      Small debugging test case

     dbg2d      10 x   10 x  1 x 2      Small debugging test case
                                        for a 2 x 2 decomposition

     v200      512 x  512 x 16 x 200        24             14 

     wa200    1024 x 1024 x 40 x 200       226            126

     xb200    2048 x 2048 x 80 x 200      1812            984

     The memory sizes are the number of Fortran real elements
     (words) required for the strip-mined case on a single processor.
     For the masked case the memory requirement is approximately doubled 
     for the extra mask arrays. For the message passing versions, the 
     total memory requirement will also tend to increase slightly (<10%) 
     with the number of processors employed.

-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand looking at inner loops and making reasonable assumptions
about common compiler optimizations.

-------------------------------------------------------------------------------
Other relevant information:



-------------------------------------------------------------------------------

-- 
                                    ,?,
                                   (o o)
|------------------------------oOO--(_)--OOo----------------------------|
|                                                                       |
| Dr Mike Ashworth                          NERC Computer Services      |
| NERC Supercomputing Consultant            Bidston Observatory         |
| Tel:         +44 51 653 8633              BIRKENHEAD                  |
| Fax:         +44 51 653 6269              L43 7RA                     |
| email:       mia@ua.nbi.ac.uk             United Kingdom              |
| alternative: M.Ashworth@ncs.nerc.ac.uk                                |
|-----------------------------------------------------------------------|

From owner-pbwg-compactapp@CS.UTK.EDU Thu Oct 28 08:52:55 1993
Received: from CS.UTK.EDU by netlib2.cs.utk.edu with SMTP (5.61+IDA+UTK-930125/2.8t-netlib)
	id AA11653; Thu, 28 Oct 93 08:52:55 -0400
Received: from localhost by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07365; Thu, 28 Oct 93 08:52:35 -0400
X-Resent-To: pbwg-compactapp@CS.UTK.EDU ; Thu, 28 Oct 1993 08:52:34 EDT
Errors-To: owner-pbwg-compactapp@CS.UTK.EDU
Received: from rios2.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61+IDA+UTK-930922/2.8s-UTK)
	id AA07357; Thu, 28 Oct 93 08:52:32 -0400
Received: by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA16524; Thu, 28 Oct 1993 08:52:41 -0400
Date: Thu, 28 Oct 1993 08:52:41 -0400
From: walker@rios2.epm.ornl.gov (David Walker)
Message-Id: <9310281252.AA16524@rios2.epm.ornl.gov>
To: pbwg-compactapp@cs.utk.edu
Subject: PSTSWM Compact Application


Received: from msr.EPM.ORNL.GOV by rios2.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA20602; Tue, 5 Oct 1993 09:58:22 -0400
Received: from haven.EPM.ORNL.GOV by msr.epm.ornl.gov (4.1/1.34)
	id AA09050; Tue, 5 Oct 93 09:58:21 EDT
Received: by haven.EPM.ORNL.GOV (4.1/1.34)
	id AA13369; Tue, 5 Oct 93 09:58:14 EDT
Date: Tue, 5 Oct 93 09:58:14 EDT
From: worley@haven.epm.ornl.gov (Pat Worley)
Message-Id: <9310051358.AA13369@haven.EPM.ORNL.GOV>
To: walker@msr.epm.ornl.gov

                 PARKBENCH COMPACT APPLICATIONS SUBMISSION FORM

To submit a compact application to the ParkBench suite, follow this
procedure:

1. Complete the submission form below, and email it to David Walker
   at walker@msr.epm.ornl.gov. The data on this form will be reviewed 
   by the ParkBench Compact Applications Subcommittee, and you will
   be notified if the application is to be considered further for
   inclusion in the ParkBench suite.
   
2. If the ParkBench Compact Applications Subcommittee decides to consider
   your application further you will be asked to submit the source code
   and input and output files, together with any documentation and papers
   about the application. Source code and input and output files should
   be submitted by email, or ftp, unless the files are very large, in
   which case a tar file on a 1/4 inch cassette tape should be sent.
   Wherever possible email submission is preferred for all documents in
   man page, Latex and/or PostScript format. These files, documents, and
   papers together constitute your application package. Your application
   package should be sent to:
                David Walker
                Oak Ridge National Laboratory
                Bldg. 6012/MS-6367
                P. O. Box 2008
                Oak Ridge, TN 37831-6367
                (615) 574-7401/0680 (phone/fax)
                walker@msr.epm.ornl.gov

   The street address is "Bethel Valley Road" if Fedex insists on this.
   The subcommittee will then make a final decision on whether to include 
   your application in the ParkBench suite.

3. If your application is approved for inclusion in the ParkBench suite
   you (or some authorized person from your organization) will be asked
   to complete and sign a form giving ParkBench authority to distribute,
   and modify (if necessary), your application package.

-------------------------------------------------------------------------------
Name of Program         : PSTSWM 
                        : (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worley@msr.epm.ornl.gov
-------------------------------------------------------------------------------
Cognizant Expert(s)     : Patrick H. Worley
CE's Organization       : Oak Ridge National Laboratory
CE's Address            : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
CE's Telephone #        : (615) 574-3128
CE's Fax #              : (615) 574-0680
CE's Email              : worley@msr.epm.ornl.gov

Cognizant Expert(s)     : Ian T. Foster
CE's Organization       : Argonne National Laboratory
CE's Address            : MCS 221/D-235
                          9700 S. Cass Avenue
                          Argonne, IL 60439
CE's Telephone #        : (708) 252-4619
CE's Fax #              : (708) 252-5986
CE's Email              : itf@mcs.anl.gov
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :

Modulo other commitments, Worley is prepared to respond quickly to questions
and bug reports, but expects to be kept informed as to results of experiments
and modifications to the code.

-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree"  :

PSTSWM Version 1.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code follows
closely how CCM2, the NCAR Community Climate Model, handles the dynamical
part of the primitive equations, and the parallel algorithms implemented in
the model include those currently used in the message-passing parallel
implementation of CCM2. PSTSWM was written by Patrick Worley of Oak Ridge
National Laboratory and Ian Foster of Argonne National Laboratory, and is
based partly on previous parallel algorithm research by John Drake, David
Walker, and Patrick Worley of Oak Ridge National Laboratory. Both the code
development and parallel algorithms research were funded by the DOE Computer
Hardware, Advanced Mathematics, and Model Physics (CHAMMP) program. The
features of version 1.0 were frozen on 8/1/93, and it is this version we
would offer initially as a benchmark.  

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written
by James Hack and Ruediger Jakob at NCAR to solve the shallow water equations 
on a sphere using the spectral transform method. STSWM evolved from a
spectral shallow water model written by Hack (NCAR/CGD) to compare numerical
schemes designed to solve the divergent barotropic equations in spherical
geometry. STSWM was written partially to provide the reference solutions
to the test cases proposed by Williamson et al. (see citation [4] below),
which were chosen to test the ability of numerical methods to simulate
important flow phenomena. These test cases are embedded in the code and 
are selectable at run-time via input parameters, specifying initial conditions,
forcing, and analytic solutions (for error analysis). The solutions are also
published in a Technical Note by Jakob et al. [3]. In addition, this code is
meant to serve as an educational tool for numerical studies of the shallow
water equations. A detailed description of the spectral transform method, and
a derivation of the equations used in this software, can be found in the
Technical Note by Hack and Jakob [2].  

For PSTSWM, we rewrote STSWM to add vertical levels (in order to get the
correct communication and computation granularity for 3-D weather and climate
codes), to increase modularity and support code reuse, and to allow the
problem size to be selected at runtime without depending on dynamic memory
allocation. PSTSWM is meant to be a compromise between paper benchmarks and
the usual fixed benchmarks by allowing a significant amount of
runtime-selectable algorithm tuning. Thus, the goal is to see how quickly the
numerical simulation can be run on different machines without fixing the
parallel implementation, but forcing all implementations to execute the same
numerical code (to guarantee fairness). The code has also been written in
such a way that linking in optimized library functions for common operations
instead of the "portable" code will be simple.

-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and
Foster) and the program that supported the development of the code
(DOE CHAMMP program) in any resulting research or publications, and are
encouraged to send reprints of their work with this code to the authors.
Also, the authors would appreciate being notified of any modifications to 
the code. Finally, the code has been written to allow easy reuse of code in
other applications, and for educational purposes. The authors encourage this,
but also request that they be notified when pieces of the code are used.

-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses INTEGER, REAL, COMPLEX, and DOUBLE PRECISION
variables. The code should work correctly for any system in which COMPLEX is
represented as 2 REALs. The include file params.i has parameters that can be
used to specify the length of these. Also, some REAL and DOUBLE PRECISION
parameter values may need to be modified for floating-point number systems with large
mantissas, e.g., PI, TWOPI. PSTSWM is currently being used on systems where

        Integers : 4   bytes
        Floats   : 4   bytes
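
The remark about PI and TWOPI can be illustrated with a short sketch. It is
written in Python rather than in the language of PSTSWM, and it assumes IEEE
754 formats (4-byte single, 8-byte double), which the text above does not
state. The helper to_real4 and the 7-digit literal are hypothetical and are
not taken from params.i. The sketch compares the error of a short decimal
literal for pi with the rounding error of a 4-byte REAL.

```python
import math
import struct

def to_real4(x):
    """Round an 8-byte IEEE double to the nearest 4-byte IEEE single."""
    # "<f" is the standard-size 4-byte IEEE 754 single format.
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical 7-digit literal, of the kind a parameter statement might hold.
pi_literal_short = 3.141593

# Error introduced purely by storing pi in 4 bytes.
err_real4 = abs(to_real4(math.pi) - math.pi)

# Error of the short literal, measured against the 8-byte value of pi.
err_short = abs(pi_literal_short - math.pi)

print("4-byte rounding error of pi:", err_real4)
print("error of 7-digit literal   :", err_short)
```

Under these assumptions the literal's error is already larger than the 4-byte
rounding error, so a system with a longer mantissa would need a literal with
more digits to make use of the extra precision.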

The use of two precisions can be eliminated, but at the cost of a significant
loss of precision. (For 4-byte REALs, not using DOUBLE PRECISION increases
the error by approximately three orders of magnitude.) DOUBLE PRECISION
results are only used in set-up (computing Gauss weights and nodes and
Legendre polynomial values), and are not used in the body of the computation.
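
As an illustration of the kind of set-up computation mentioned above, the
following sketch computes Gauss-Legendre nodes and weights by Newton iteration
on the Legendre three-term recurrence. It is not the PSTSWM routine: it is
written in Python, whose floats are 8-byte doubles on common systems, and the
function names legendre and gauss_legendre are hypothetical.

```python
import math

def legendre(n, x):
    """Return (P_n(x), P_n'(x)) for n >= 1 and |x| < 1."""
    p_prev, p = 1.0, x  # P_0(x) and P_1(x)
    for k in range(2, n + 1):
        # Three-term recurrence: k P_k = (2k-1) x P_{k-1} - (k-1) P_{k-2}
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # Derivative identity: (x^2 - 1) P_n' = n (x P_n - P_{n-1})
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def gauss_legendre(n):
    """Return (nodes, weights) of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes = [0.0] * n
    weights = [0.0] * n
    for i in range((n + 1) // 2):
        # Standard cosine initial guess for the i-th largest root of P_n.
        x = math.cos(math.pi * (i + 0.75) / (n + 0.5))
        for _ in range(50):
            p, dp = legendre(n, x)
            dx = p / dp
            x -= dx  # Newton step
            if abs(dx) < 1e-15:
                break
        _, dp = legendre(n, x)
        w = 2.0 / ((1.0 - x * x) * dp * dp)
        # Roots are symmetric about zero; store them in ascending order.
        nodes[i], nodes[n - 1 - i] = -x, x
        weights[i] = weights[n - 1 - i] = w
    return nodes, weights

nodes, weights = gauss_legendre(8)
print(sum(weights))  # the weights of any Gauss-Legendre rule sum to about 2
```

An n-point rule integrates polynomials up to degree 2n-1 exactly (to rounding
error), which gives a simple check on the computed values.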

---------------------------