Minutes of the Message Passing Interface Standard Meeting
Dallas, January 6-8, 1993

The MPI Standards Committee met in Dallas on January 6-8, 1993, at the
Bristol Suites Hotel in North Dallas.  This was the third meeting of the
MPI committee, but the first following the format used by the High
Performance Fortran Forum.  There were both general meetings of the
committee as a whole and meetings of several of the subcommittees.
Because interest in the Point-to-Point communications and the Collective
communications was so general, these met as committees of the whole.

No formal decisions were taken at this meeting, but a number of straw
votes were taken in the subcommittees.  These are included as part of
the reports on the work of the subcommittees.

These minutes were taken by Rusty Lusk (lusk@mcs.anl.gov) with some
additions by Bob Knighten.  Marc Snir's notes on the point-to-point
subcommittee meetings are included here as well.

These minutes are quite long.  If you want to see the important topics,
you can search for --- and this will quickly lead to each topic (and a
few other things).

January 6
---------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

The meeting was called to order by Jack Dongarra at 1:30.  Jack Dongarra
presented the rules and procedures that had been circulated in the
mailing list.  In general, they say that we intend to operate in a very
open fashion, following the example set by the High-Performance Fortran
Committee.  He also described the subcommittee structure.  For details,
see the mailing list.

A tentative schedule for future meetings was presented, which was
amended on the last day (see there).  All meetings will be in Dallas at
the Bristol Suites.

Steve Otto will coordinate the production of the document.  He will
obtain a set of LaTeX macros from the HPF Committee and distribute them
to the subcommittee heads.
It was suggested by Bob Knighten that the Executive Director arrange for
copies of all pertinent documents to be provided at the meetings.
Dennis Weeks, who is somewhat local (Convex), volunteered to help with
the relevant copying.

The attendees were:

  Ed Anderson          Cray Research             eca@cray.com
  James Cownie         Meiko                     jim@meiko.co.uk
  Jack Dongarra        UT/ORNL                   dongarra@cs.utk.edu
  Jim Feeney           IBM-Endicott              feeneyj@gdlvm6.vnet.ibm.com
  Jon Flower           ParaSoft                  jwf@parasoft.com
  Daniel Frye          IBM-Kingston              danielf@kgnvma.vnet.ibm.com
  Al Geist             ORNL                      gst@ornl.gov
  Ian Glendinning      Univ. of Southampton      igl@ecs.soton.ac.uk
  Adam Greenberg       TMC                       moose@think.com
  Bill Gropp           ANL                       gropp@mcs.anl.gov
  Robert Harrison      PNL                       rj_harrison@pnl.gov
  Leslie Hart          NOAA/FSL                  hart@fsl.noaa.gov
  Tom Haupt            Syracuse U.               haupt@npac.syr.edu
  Rolf Hempel          GMD                       hempel@gmd.de
  Tom Henderson        NOAA/FSL                  hender@fsl.noaa.gov
  C. T. Howard Ho      IBM Almaden               ho@almaden.ibm.com
  Steven Huss-Lederman SRC                       lederman@super.org
  John Kapenga         Western Michigan Univ.    john@cs.wmich.edu
  Bob Knighten         Intel SSD                 knighten@ssd.intel.com
  Bob Leary            SDSC                      leary@sdsc.edu
  Rik Littlefield      PNL                       rj_littlefield@pnl.gov
  Rusty Lusk           ANL                       lusk@mcs.anl.gov
  Barney Maccabe       Sandia                    abmacca@cs.sandia.gov
  Phil McKinley        Michigan State            mckinlehy@cps.msu.edu
  Chuck Mosher         ARCO                      ccm@arco.com
  Dan Nessett          LLNL                      nessett@llnl.gov
  Steve Otto           Oregon Graduate Institute otto@cse.ogi.edu
  Paul Pierce          Intel                     prp@ssd.intel.com
  Peter Rigsbee        Cray Research             par@cray.com
  Ambuj Singh          UC Santa Barbara          ambuj@cs.ucsb.edu
  Marc Snir            IBM                       snir@watson.ibm.com
  Robert G. Voigt      NSF                       rvoigt@nsf.gov
  David Walker         ORNL                      walker@msr.epm.ornl.gov
  Dennis Weeks         Convex                    weeks@convex.com
  Stephen Wheat        Sandia NL                 srwheat@cs.sandia.gov

-------------------------------------------------------------------------------
Point-to-point subcommittee
-------------------------------------------------------------------------------

Marc Snir called the meeting to order at 1:40 p.m.  It adjourned at
4:10 p.m.  It resumed the following morning at 9:10 a.m.
and adjourned at 4:15 p.m.

Marc Snir began by summarizing the decisions that we have to make:

  * which operations?
      send receive
      channels?  sendreceive?
      info arguments
      operations on queues
      probe?
  * operation modes
      sync async
      local and/or global termination
      interrupt-driven?
  * message types (data types)
      structure of data in core
      buffer packing
  * send-receive matching
      type (We later decided to call this "tag".)
      sender?
  * correctness criteria (See Marc Snir's paper in handouts)
  * heterogeneous operations
  * name space
      how processes are addressed
      flat?  structured?  implicit/explicit
  * error handling
  * interaction with threads, interrupt handlers, remote signalling
  * special operations for high performance
      ready receiver?
  * process startup
  * syntax/style (The plan is to postpone this for this meeting.)

We will prioritize this list and then go through them one by one.  (The
priorities assigned were more or less in the order listed above.)

Two preliminary questions were then discussed:

A. Must we worry about multithreaded environments?  James Cownie
   pointed out that threads were coming, in almost all new systems.
   Most systems have threads now.  It was proposed that a process,
   which could send and receive messages, should be an address space,
   so that individual threads would not be (MPI-) addressable.

B. What about signals?  Paul Pierce suggested that we discuss signals
   first: do we want to support send/receive from interrupt handlers?

These two questions were then discussed at length.  Dealing with
threads argues against the notion of "last message", since that implies
state is maintained by the system.  There was general agreement that
"state" was a bad thing, but arguments in favor of state are:

  Sometimes one doesn't want all of the information available after an
  operation, so it shouldn't be returned.

  Having lots of arguments to calls is bad, especially inout arguments.
Ways to avoid state are:

  Structures could be returned

  Return individual arguments

  Return a tag to do queries on (but then one needs to free it)

  Additional out arguments (OK in Fortran 90, but not in C or f77)

  User passes in storage to be used (so he knows the address), and MPI
  provides inquiry functions

[For more details, see Jim Cownie's mail message of January 4, 1993
entitled: Multifarious]

There was general agreement that system state decreases portability and
manageability, and we should decrease it when we can.  James Cownie
said that we need a reentrant style, and Marc Snir suggested that we
try to make all function calls reentrant.  When queried, no one in the
group objected to trying to make all the functions that are introduced
in MPI reentrant.

Now we began going through the above-mentioned major topics.

Which Operations?
----- ----------

We have send and receive.  How about send-receive (also called shift)?
It can be efficiently implemented, and the buffer can be reused.  There
was a discussion of the "two-body" send-receive (exchange) and the
"three-body" version (ring-shift).  Variations include those in which
the send-buffer is required to be the same as the receive-buffer and
those in which it is required to be disjoint from the receive-buffer.

Al Geist: We should focus on *required* operations.  Steve Otto replied
that send-receive *is* a required operation.  Using "exchange" can help
avoid deadlock.  It was agreed that there was no consensus on these
issues and it was decided to defer this to the collective communication
subcommittee.

Operation Modes
--------- -----

The next topic that Marc Snir raised for discussion was when send and
receive return.  Marc described several options:

For send:
  1) return as soon as possible
  2) return when send-buffer is clear
  3) return when the corresponding receive has completed

For receive:
  1) return as soon as possible
  2) return when the receive-buffer is full

"Receive has completed" means "when the user knows".
In other words, when the sender has returned from send, the receiver
has returned from receive.

There was a general discussion about whether 3) was necessary?
dangerous?  Robert Harrison said he believed that 3) was the minimal
version that was truly portable.  Steve Otto pointed out that 3) is
CSP-like.  Rusty Lusk said that 3) would be easier to prove things
about than the others.  Adam Greenberg and Paul Pierce pointed out that
neither TMC nor Intel has implemented an operation depending on the
behavior of the receiver.  A straw vote was taken and the vote was 17-3
in favor of having 3) as an option.

Marc Snir pointed out that in his original proposal send returns a
handle and the status of the handle is then tested for completion of
the send operation, and asked if this is desirable.  There was general
agreement that something of this sort was desirable, but a variety of
alternatives were mentioned.  It was pointed out that sometimes one
wants to wait on multiple outstanding operations.  Al Geist prefers
separating "wait" into "sendwait" and "receivewait" for code
readability.  Bill Gropp suggested that instead of using handles, one
could supply a routine to be called when an operation completes.  James
Cownie: "This gets really hairy in Fortran".

There was a discussion of probing multiple outstanding receives.  If
the receives return handles,

  h1 = recv( ... )
  h2 = recv( ... )
  wait ( h1 or h2 ) ?

wait ( h1 and h2 ) is not needed.  James Cownie proposed that we supply
an operation to *wait* on a vector of handles, which would return one
of those that have succeeded.  It would return the handle, not the
status.  A straw vote was taken on this proposal, which passed 17-0.
So we have:

  status (handle)
  wait (array of handles)

The send specifies what completion of send means.  Handles need to be
freed.  It was pointed out that only the existence of such an operation
has been decided; the semantics are yet unspecified - e.g.
issues such as fairness or what wait returns when several complete are
not yet specified.

There was a long discussion of cancellation of send and receive.  It
was observed that there are serious implementation problems because of
race conditions, freeing resources, etc.  A straw poll was taken on
including cancel in the initial MPI.  It failed 7-19.

This was the end of the Wednesday afternoon point-to-point meeting.

January 7
---------

The point-to-point subcommittee (now a Committee of the Whole) resumed
at 9:15 a.m. on Thursday morning.  Marc Snir opened the meeting and
summarized the progress so far:

  3 ways in which send can terminate
  sendreceive postponed
  no cancel of incomplete send operation
  status and wait (successful status accomplishes same as wait)

We did not get to:

  channels (the idea of trying to bind as soon as possible as many
    parameters as possible, so that they can be reused)
  probe
  readyreceive

Marc noted that channels and readyrecv address similar issues.
Probably we want only one of these.  Do we want either?

Rolf Hempel observed that we don't need channels - we can depend on the
operating system to cache the connection information when doing
synchronous communication.  Adam Greenberg replied: NO!  We want to be
able to do this all at user level without a "smart" OS.

Channel creation and use might look like:

  handle = send_init( ... )
  start(handle)
  wait(handle)
  free(handle)

This is an intermediate point between bundled send/receive and full
named channels.  Indeed there are many intermediate points based on
various early bindings.  Is there enough experience to justify a
standard?  Bob Knighten observed that there has been substantial
experience with channels on the iWarp system.

There was next a discussion of the ready-receiver semantics proposed by
Lusk and Gropp in the handouts.  Steve Huss-Lederman said that such
operations could make a difference of as much as 25% for matrix
multiplication on the Delta.
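The channel life cycle sketched above (send_init / start / wait / free)
can be mocked up as a toy Python class to show the early-binding idea:
the destination and tag are bound once, then reused on every start.
Everything here is an illustrative stand-in for whatever MPI eventually
adopts; a plain list plays the role of the transport.

```python
# Toy model of persistent communication handles ("channels"):
# parameters are bound once at init time, then reused on every start().
# This is an illustrative sketch, not real MPI.

class SendChannel:
    def __init__(self, network, dest, tag):
        # Early binding: destination and tag are fixed once, up front.
        self.network, self.dest, self.tag = network, dest, tag
        self.active = False

    def start(self, data):
        # Reuse the pre-bound parameters for each message.
        self.network.append((self.dest, self.tag, data))
        self.active = True

    def wait(self):
        # In this toy model the "send" completes immediately.
        self.active = False

    def free(self):
        self.network = None  # release the bound resources


network = []
ch = SendChannel(network, dest=3, tag=7)   # handle = send_init( ... )
for value in ("a", "b"):
    ch.start(value)                        # start(handle)
    ch.wait()                              # wait(handle)
ch.free()                                  # free(handle)
print(network)   # two messages, both carrying dest=3, tag=7
```

The point of the exercise is that per-message argument processing
disappears: after init, each start() touches only the payload.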
Some doubt was expressed about the universality of this optimization.
The question of use of readyrecv by naive users came up again.  Cownie
mentioned experience again.  Greenberg: facilities for efficiency
should not make it difficult to write correct programs.  Wheat: Don't
penalize users who do understand and can take advantage of efficient
procedures.  General back and forth discussion.

Two straw votes were taken:

  Ready-receiver operations passed 13-10
  Channels passed 19-2 (Marc Snir will write up a detailed proposal)

The next topic discussed was the probe operation.  Do we want such an
operation, and if so, what should be its semantics?  Probing must
"lock" the message that it finds, else the information returned by the
probe may be unreliable.  (Consider the multithreaded environment.)
Bill Gropp pointed out that probe is often used to find out the length
of a message so that a buffer of the appropriate size can be allocated.
Marc Snir pointed out that this is a problem with the November
document, that we need to know the length of a message ahead of time.
Jon Flower suggested the need for a blocking probe.  What is needed is
to probe and then to receive the message found via the probe:

  handle = probe(params)
  . . .
  recv(handle)
  release(handle)

Marc Snir pointed out that the handle serves as a lock on the message.

James Cownie pointed out that while we agreed not to have a cancel for
a send, we do need to be able to cancel receives, since an outstanding
receive is permission for the system to write in the user's address
space, which is a permission the user may want to revoke.

A straw vote was taken on the existence of some form of probe, and it
passed 25 to 0.

Send-Receive Matching
------------ --------

The next topic is the matching of send and receive.  Currently we have
to discuss matching on:

  tag
  sender
  group id
  context id

We will also need to discuss the name space issue for messages.
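The probe-then-receive sequence above can be mocked up in Python to show
why the handle must act as a lock: between the probe and the receive, no
other receive may claim the message, and the length can be inspected so
a buffer of the right size can be allocated.  All names here are
hypothetical, and this is a single-threaded toy, not real MPI.

```python
# Toy model of probe-with-lock: probe() finds a matching message and
# locks it so its length can be inspected before it is received.
# Illustrative sketch only; names are hypothetical.

class Mailbox:
    def __init__(self):
        self.messages = []          # list of (tag, payload)
        self.locked = set()         # indices locked by an outstanding probe

    def deliver(self, tag, payload):
        self.messages.append((tag, payload))

    def probe(self, tag):
        # Return a handle (an index) to a matching, unlocked message, or None.
        for i, (t, _) in enumerate(self.messages):
            if t == tag and i not in self.locked:
                self.locked.add(i)   # the handle acts as a lock
                return i
        return None

    def length(self, handle):
        # Safe to query: the locked message cannot be taken by anyone else.
        return len(self.messages[handle][1])

    def recv(self, handle):
        # Only the holder of the lock may receive this message.
        self.locked.discard(handle)
        return self.messages.pop(handle)[1]

box = Mailbox()
box.deliver(tag=5, payload=b"hello")
h = box.probe(tag=5)          # handle = probe(params)
n = box.length(h)             # e.g. to allocate a buffer of size n
data = box.recv(h)            # recv(handle)
print(n, data)
```

(The toy works for a single pending message; a real implementation would
use stable message identifiers rather than list indices.)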
Here are three proposals for the predicate that determines whether a
message matches a particular receive:

  1) simple matching on fields
  2) more general, with mask, etc.
  3) user-defined function

Adam Greenberg said that at TMC a user-defined function is used by the
system whenever a message is received by a node to decide if it is
actually to be received by the application.  The parameters to the
user-defined receive predicate are tag and group.

Issue: If most information is encoded in the tag, then the tag protocol
must be understood by all users involved in writing a particular
application.  True, but not a serious problem.  Best to identify a
small class of specific matching parameters (e.g. group) and use the
tag for everything else.

James Cownie pointed out that the matching function, if not too
complicated, can be (and is, on many systems) done by special
communications processors.  There was further discussion of the
difficulties of having the system call user code for screening
messages.  Paul Pierce pointed out that receipt of a message by the
hardware is a crucial point for performance.  There was general
discussion of alternative approaches to getting at least some of this.
The question of need for this generality was also raised.  TMC has a
user who wants and uses his own predicate function.

Possibilities:
  (a) select on mask for fields (including a don't care)
  (b) simple static logical operations on fields
  (c) user defined

(b) might be

  match = AND (( message(i) = pattern(i) ) OR mask(i))
          fields

A straw vote was taken on whether to pursue allowing user-defined
predicates.  It was decided 26-1 not to allow user-defined functions
for this purpose.  (b) was deferred until a proposal is available.

Marc Snir summarized that matching by tag is generally agreed on and
that this is not the only item for selection.  After some discussion,
matching by sender was also generally agreed on.  So now, how do we
identify a sender?
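Option (b) above, a static don't-care mask ANDed over the matching
fields, can be written out in a few lines of Python.  The field order
(tag, sender, context) is an assumption for illustration only, and the
example deliberately never sets the mask on the context field, matching
the later agreement that context admits no don't care.

```python
# Sketch of option (b): static matching on fields with a per-field
# "don't care" mask.  mask[i] == True means field i always matches.
# match = AND over fields of ((message(i) = pattern(i)) OR mask(i))

def matches(message, pattern, mask):
    return all(m == p or dont_care
               for m, p, dont_care in zip(message, pattern, mask))

msg = (42, 3, 1)                                        # (tag, sender, context)
print(matches(msg, (42, 0, 1), (False, True, False)))   # any sender: True
print(matches(msg, (42, 3, 2), (False, False, False)))  # wrong context: False
```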
Rusty Lusk spoke in favor of a flat name space, so that processes could
be addressed independently of group, etc.  There ensued a general
discussion of groups, contexts, and the name space.  It was pointed out
that the name space expected by send could be flat and groups could be
implemented by a function that converted any structured name into a
flat integer id.  Other proposals were to have name=(rank,gid) with the
restriction that this name be usable only within the given group (gid)
and the sender must be a member of this group.  By default the group
would be ALL.  Other alternatives mentioned were name=(rank,ALL)=pid
and name=(pid,context).

This led to a general discussion of context and the relation to
groups.  Marc Snir pointed out that we could have

  pid
  pid,context

in which context did not change the meaning of pid.  Paul Pierce said
that tags and contexts should be separated since they need to be
handled in different ways.  Marc Snir pointed out that there should be
no "don't care" on context.  There was a discussion of servers that can
process "any" message.  This also led into a discussion of flat name
space vs. hierarchical name space, where we would have a
pid(group, rank) function.  We can use context to define groups, but
there are other uses as well.

Why groups as well as context?  What is the difference between context
and groups?  Cownie: Context is just another integer used in the same
manner as tag.  Not quite - it is reserved, but what is the meaning of
"reserved"?  Greenberg was concerned about connecting send/receive
behavior with groups.  Snir: Suppose a user wants to have two
independently written subroutines that use the usual rank notation.
Wheat: Similarly, we want to use rank notation when partitioning the
machine.  Snir: Both contexts and groups are nice, but do we need both?
Gropp: The problem with mixing two applications both of which use
0-based indexing is that they will need a larger common name space when
they need to communicate.
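Snir's scenario of two independently written subroutines can be made
concrete with a toy Python model: both libraries use tag 0, but distinct
context values keep their traffic apart.  The context values here are
just hand-picked integers; how they would really be allocated was left
open at the meeting, and all names are illustrative.

```python
# Sketch of why a separate context field helps: two independently
# written libraries both use tag 0, but with distinct context values
# their messages cannot be matched against each other.
# Illustrative only; context allocation is hand-waved here.

inbox = []  # shared message queue: (context, tag, payload)

def send(context, tag, payload):
    inbox.append((context, tag, payload))

def recv(context, tag):
    # Matching requires equality on context; a "don't care" on context
    # is deliberately not offered.
    for i, (c, t, p) in enumerate(inbox):
        if c == context and t == tag:
            return inbox.pop(i)[2]
    return None

LIB_A, LIB_B = 100, 101   # contexts handed out by some allocator

send(LIB_A, 0, "a-data")
send(LIB_B, 0, "b-data")
# Library B receives only its own tag-0 message, despite A's being first:
print(recv(LIB_B, 0))   # "b-data"
```

Without the context field, B's receive on tag 0 would have matched A's
message, which is exactly the safety problem being discussed.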
There was a general discussion of the cost of contexts.  Cownie
observed that context is cheap if only used to distinguish code -
obtain a unique context id for the code by means of the "one-dollar
random number generator": each author obtains a one-dollar bill, copies
the serial number, and then burns the bill.  But in general context is
not cheaper than groups.

Someone asked about spawning additional processes while the program is
running.  Various people raised the question: If we use
name=(pid,context), does context change the meaning of the pid (i.e. is
pid context {or gid or ???} relative)?

There was some discussion of message registration.  Paul Pierce
observed that tag vs. context is only a matter of registration.  He
wants to divorce tag and context for safety.  This implies that one
cannot use a wild card for selecting on context.  Various people noted
difficulties with mixing tag and context.

Adam Greenberg offered:

  Proposal - always separate tag and context.  Have a context, NONE, so
  that pid with context NONE is unmodified, but with other contexts the
  pid may be relative.  [NONE, GLOBAL, BASE]
  tag, context - must match on context

Several people noted that there are two very different uses of context
- identification of distinct code and identification of a group of
processors.  There is state, even distributed state, associated with
remapping of processors with groups.

POSSIBLE FIELDS FOR SEND/RECEIVE:

  tag
  context id                        group
  - no wild card                    - set of processes
  - registration management         - receive only from group
  - managed by system

Marc Snir asked whether we could agree on what would be carried with a
message:

  tag
  context (like tag, except no wild card; management to be determined)

Two straw votes were taken:

  Having contexts passed unanimously.
  Having the context *not* modify the process id passed unanimously.

Groups
------

Three alternatives:

  no groups (use send(pid(group, rank), ...)
instead)
  group as explicit parameter in operations
  use contexts to implement groups

The basic difference is: do we want to be able to select on group?

Straw vote: yes: 10  no: 11 on the capability of selecting by group.

(Thursday lunch occurred here)

Message Data Types
------- ---- -----

WHAT IS A BUFFER?  (Language bindings are going to be important here.)
There are many options to consider:

  a) contiguous bytes (non-controversial)
     General agreement that 0-length messages should be allowed.
  b) contiguous buffers of other (implementation specific?) units?
  c) stride?  (parameters: base-address, element-length, stride,
     number-of-elements)
  d) basic data types?
  e) arbitrary structures?
  f) How will we specify data to be communicated in a heterogeneous
     environment?
  g) iovec structures (array of pointers and lengths, as in un*x iovec
     for reads and writes)

Marc Snir pointed out that one possibility is to have separate
pack/unpack routines and then just send contiguous buffers.  Rusty
Lusk pointed out that this requires a copy that may be unnecessary on
certain machines.  Two choices - pack the scattered buffer and send it,
OR send the scattered buffer.  If the second, then we may need a pack
that produces the descriptor of a scattered buffer to be used by the
send-scattered-buffer operation.

Straw poll: Use IOVEC type send.  Passed 18-1.

Basic data types were deferred.

Marc Snir observed that up to this point, a message is a set of bytes
in storage, but now we are about to consider more meanings:

  message = sequence of *values*

Should we use the same calls for homogeneous and heterogeneous
communication?  Can we have a fast homogeneous implementation of the
heterogeneous calls?  Bill Gropp pointed out that the current testbed
implementation does this.

SEND vs. SENDCHAR, SENDREAL, . . .

To be compliant with F77 we need to have at least SENDCHAR for
correctness (and this is a real issue, e.g. on VAX).  Strictly we need
to have a different call for each basic data type (but in practice this
is not an issue).
But for other than CHARACTER there is also an efficiency issue.

  1. F77 conformance
  2. Special problem of CHARACTER
  3. Performance
  4. Heterogeneity (?)

Postpone to language binding discussion.

This led into the issue of the general problem of converting types
between languages and machines!  This in turn led to a discussion of
XDR (and mention of other systems such as NDR, ...).  XDR supports the
basic types (INT, REAL, COMPLEX, CHAR, etc.), array constructors,
pack/unpack routines, etc.

Do we use the same calls for homogeneous and heterogeneous systems?
Can we have a fast implementation of heterogeneous procedures for a
homogeneous system?  What about a "message envelope" that specifies the
environmental aspects of messages (e.g. heterogeneity features such as
XDR)?  When we talk about heterogeneity, do we expect MPI libraries
from different vendors on different machines to cooperate?  Should we
include a general SEND as SENDBYTES?  It was agreed that we do not want
SEND in a homogeneous environment to require the type information
needed for a heterogeneous environment.

There was a discussion of whether we have to pick an interchange
format, for example XDR.  There seemed to be some agreement that we do
(as MPI implementations from different vendors have to be able to
communicate with one another), but no vote was taken.

Error Handling
----- --------

The main issue here is whether an error detected by an MPI routine
should result in the calling of an error handler or the return of a
return code.  Other issues are how much of error handling should be
standardized as opposed to implementation-dependent, and how much user
control there should be over error handling.

There are two types of error environments - soft (recoverable) and
hard (unrecoverable).  In a soft error environment there is the
opportunity for cleanup on the part of both the "application" and the
system, while in a hard error environment the system will clean up and
terminate the application.
Choices:

  An MPI routine always returns (though it may return with an error
  code).

  An MPI routine may call an exception handler.  There may be a default
  exception handler, and there could be a user-installable one as well.

Library writers may want to handle errors differently from how a user
program wants to handle them (or have them handled by the system).

Robert Harrison described the error modes used in TCGMSG and p4: A
process has a user-settable state that determines whether an error
should result in a (hopefully) graceful takedown of the program or in
an error return code.  Paul Pierce described the Intel method, which
uses two syntactically distinct classes of functions.  For one class an
error results in a message being printed and the process in which the
error occurred terminating.  For the other class an error code is set.
There was some discussion of the problem of maintaining state in a
multithreaded environment.

Two straw votes were taken:

  Do we want a version of MPI that calls an exception handler:
    yes: 23  no: 0
  Do we want a version with return codes:
    yes: 19  no: 1

Specific discussion of modes or "shadow" routines was deferred.

Correctness Criteria
----------- --------

This concerns defining what is a correct implementation of MPI.  An
assumption that had to be restated several times during the meeting is
that MPI assumes a reliable underlying communication system, i.e. MPI
does NOT address what happens if that fails.  Two specific topics are
order of messages and resource bounds.

There was discussion about whether order preservation is required; that
is, for messages from one process to another, messages are received in
the order they are sent.  Maintaining message ordering is troublesome,
but seems essential for conveniently writing reliable portable
programs.  But then comes the question of what exactly this means,
particularly with multithreaded processes!  What is the effect of probe
on the ordering of messages?
A straw vote was taken in favor of requiring order preservation:
  yes: 23  no: 4

On the issue of correctness with regard to resource exhaustion, Marc
Snir suggested the following example:

  Process 1      Process 2
  ---------      ---------
  send to 2      send to 1
  recv           recv

What should an implementation be required to do with this program?  On
the CM-5 this will always deadlock.  On Intel and Meiko machines this
will "usually" work (but how does one specify exactly when it will
work?).  Exchange is an even nastier case.

------------------------------------------------------------------------------
Summary of both Wednesday and Thursday point-to-point subgroup meetings
by Marc Snir

1. Multithreaded systems and signal handlers.

Should these be of concern to us?  No vote was taken, but the general
feeling was that we should try to define the various communication
calls so that they do not rule out the case where the communicating
process is multithreaded.  The implication seems to be that all calls
should be made reentrant, and the communication subsystem is, from the
viewpoint of the application code, stateless.  (With one obvious
exception, namely the state due to posted receive or send buffers, and
perhaps additional exceptions having to do with global "modes", like
the error handling mode.)

2. Small or large library?

No vote taken.  The general feeling is that we should provide many
options for the advanced programmer that wants to optimize code
(otherwise, all "interesting" codes will use non-portable features),
but set the syntax so that the user that uses only the "core" functions
need not be burdened by the availability of advanced options.

3. What functions?

Clearly, SEND and RECEIVE.  There was general sentiment that a combined
send-receive would be nice ("most used function on CM"), but discussion
was postponed until we have a proposed definition:

  Do we want an exchange (source=dest), or a 3-body function
  (source != dest), or allow for both?
  Do we want send_buffer identical to receive_buffer, or disjoint from
  receive_buffer, or allow arbitrary overlap between the two?

  What attributes are shared by the sent message and the received
  message, if at all?

WAIT, STATUS and PROBE functions, and persistent handles are discussed
later.

4. What modes?

We want blocking and nonblocking sends and receives (blocking --
returns when the operation has terminated; nonblocking -- returns as
soon as possible, and a different call is needed to terminate the
operation).  We want synchronous and asynchronous modes (synchronous --
the operation is terminated when terminated at all participating nodes;
asynchronous -- the operation is terminated when terminated at the
calling node; e.g. a send terminates asynchronously when the sender
buffer can be reused.  Please let me know if you dislike this
terminology and prefer something like "local" and "global".)

The vote went 17-2 toward having a synchronous SEND (completes when
RECEIVE has completed, i.e. when the corresponding WAIT has returned,
or STATUS has returned successfully).

We did not discuss whether we want all 4 combinations of
blocking-nonblocking and synchronous-asynchronous, or just 3 (blocking
synchronous, blocking asynchronous and nonblocking asynchronous).  We
did not discuss explicitly, but "kind of assumed", that any SEND mode
can match any RECEIVE mode.

5. How does one complete a nonblocking operation?

The SEND and RECEIVE nonblocking operations return a handle that can be
used to query for completion.  WAIT(handle) blocks until the operation
has completed; STATUS(handle) returns as soon as possible, and returns
an indication of successful completion.  In addition, these operations
return information on completed RECEIVEs: tag, message length, etc. for
the received message.  The information is returned in a structure
provided by the caller.
After the return of a WAIT or the successful return of a STATUS, the
operation handle is freed; the system has no more information on the
completed operation, and has freed all associated resources.

A more complex WAIT is needed, one that waits for the completion of one
out of several pending operations.  The proposed syntax is
WAIT(array_of_handles), which returns information on which operation
succeeded and its parameters (voted 17 to 0).

No CANCEL operation -- once a SEND or RECEIVE is posted, it must
complete.  (Voted 19 to 7.  Some people asked to reconsider at least
canceling posted RECEIVEs, even if posted SENDs must complete.)

6. Additional operations

"Ready-receive" SEND: a SEND with a promise that a matching RECEIVE is
already posted.  (A program where such a SEND occurs with no preceding
matching RECEIVE is erroneous and, hopefully, the implementation
detects this error.)  The justification is "it exists on some machines"
and "it can improve performance by 25% on the Delta".  Accepted by 13
against 10.

Persistent handles: created by SEND_INIT(params) (resp.
RECV_INIT(params)).  A handle can now be repeatedly used to
send/receive messages with these parameters, and is then explicitly
destroyed.  Supported by 19 against 2.

PROBE: allows probing for messages available to receive.  The
justification is that it "provides a mechanism to allocate memory to a
message of unknown length, before it is received".  The proposed
mechanism is PROBE(params), which returns a lock to a matching message
if there is a matching message that can be received.  This message is
now locked and can only be received using this lock.  This was voted
25 to 0.  There was some level of uncertainty whether we should also
allow unlocking without receiving (why should one want to do this?).

7. What is the buffer argument in SENDs and RECEIVEs?

A message is a sequence of values; as a particular case, which is of
most interest for homogeneous systems and for which the syntax ought to
be simpler, a message is a sequence of bytes.
There are various ways of specifying this sequence of bytes:

  a. Contiguous message: starting address and length.

  b. Regular stride message: starting address, number of blocks, length
     of blocks, stride.  Voted with no opposition.

  c. IOVEC: a list of entries, each of which describes a type a or
     type b message.  Voted 18 against 1.

There was no discussion of a concrete proposal for typed messages,
short of agreement that there should be such.  The standard is not
going to propose a concrete encoding of typed messages, nor a concrete
mechanism for message exchange in heterogeneous systems.

8. Matching of SENDs and RECEIVEs.

A SEND operation associates with a message the following attributes:

  a. Sender id
  b. Tag
  c. Context

The idea of associating a group id, too, was rejected 11 to 10.

The RECEIVE criterion is a Boolean predicate on these attributes of the
form

  (SENDER_ID = param1) and (TAG = param2) and (CONTEXT = param3).

Don't cares are allowed for sender_id and tag, but not for context.
Sender_id is determined by the system, in the obvious manner, and is
absolute (not relative to a group or a context).  Tag is under sender
control.  Context is under sender control, but a yet-to-be-determined
mechanism is used to allocate valid context values to processes so as
to prevent conflicts.  All this was approved with no opposition.

The idea of allowing the user to provide their own Boolean function as
a receive predicate was rejected 26 to 1.  (Reason: "hard to do if the
matching is done by a communication coprocessor".)

9. Error handling

  a. We need a version of MPI where errors generate exceptions (the
     user program halts when an error is detected in an MPI call, or a
     specific exception handling mechanism is invoked).  Voted 19 to 1.

  b. We need to provide a version of MPI where calls return error
     codes, and do not cause exceptions, whenever possible.  Voted
     23 to 0.

10. Ordering of messages

Messages sent from the same source to the same destination "arrive in
the order they were sent".
Voted 23 to 0. The exact implications in terms of the order in which RECEIVEs can occur have to be worked out. It was pointed out that this condition may be somewhat hard to define in a multithreaded environment.

End of Marc Snir's summary

---------------------------------------------------------------------------
Collective Communication Subcommittee
---------------------------------------------------------------------------

The Collective Communication Subcommittee was called to order by Al Geist at 4:30 p.m. on Wednesday. It continued until 6:40 p.m., when there was a break for dinner. The meeting resumed at 8:25 p.m. and finally adjourned at 10:10 p.m.

Al Geist introduced this as the first meeting, since no real discussion of groups and collective communication took place in Minneapolis. One goal of this committee is to maintain consistency with the point-to-point operations. Any discussion of groups necessarily involves this subcommittee. Collective communication operations can be constructed out of the point-to-point primitives, but are desired because they can be implemented efficiently and they are convenient for programmers.

The committee then went through the set of collective communication primitives that had been proposed by Al Geist during the email discussions.

Broadcast: info = MPI_BCAST(buf,bytes,tag,gid,root)

On return, the contents of buf for root are in buf for all processes. Al Geist pointed out that the group id here is explicit. Root has to be a member of the group. It was at this point that the committee decided that it would use the word "tag" for message type from now on, to distinguish it from "type", which will now always mean type of data. Marc Snir pointed out that, for consistency with point-to-point operations, there should be both local-termination (the operation returns when the local process has done its part) and global-termination (the operation returns when all processes have finished their participation) versions.
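As a rough illustration of the broadcast semantics just described (on return, the root's buffer contents are in every group member's buffer), here is a minimal sketch. The function name, argument order, and error behavior are only stand-ins, since the actual MPI_BCAST binding was still under discussion at this meeting.

```python
# Toy model of the proposed broadcast semantics.  "buffers" stands in for
# each process's buf argument; all names and conventions are hypothetical.

def bcast(buffers, group, root):
    if root not in group:
        raise ValueError("root must be a member of the group")
    data = buffers[root]
    for rank in group:          # on return, every member holds root's data
        buffers[rank] = data
    return buffers

bufs = {0: "payload", 1: None, 2: None, 3: "other"}
bcast(bufs, group=[0, 1, 2], root=0)   # bufs[1] and bufs[2] become "payload"
```

Note that this copies data but implies no synchronization: as Al Geist observed later in the session, broadcast is not a barrier.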
There followed a discussion of the fact that the point-to-point committee seems to be adopting many different versions of send and receive, and that total compatibility will require many different versions of broadcast.

There was a discussion of the reason for the tag parameter in the call. It is needed to disentangle multiple broadcasts occurring at approximately the same time. Paul Pierce described how the system can do this by generating sequence numbers. Others argued that the tag was useful for the programmer in any case, particularly for verifying program correctness.

Marc Snir argued that there is a problem (because of the intuition that bcast provides a barrier):

    1            2            3
    send(3)      bcast        rec(don't care)
    bcast        send(3)      bcast
                              rec(don't care)

Note that 3 may receive from 2 before 1, i.e. there is no barrier. Al Geist replied that we need a barrier, but broadcast is NOT a barrier.

James Cownie initiated a general discussion of whether broadcasts could be received by general receives. This would make it simpler to inherit some of the point-to-point semantics. Al Geist said that broadcast should be consistent with the other collective operations, all of which are symmetric.

Paul Pierce suggested we specify collective communication routines in terms of a model point-to-point implementation. This has consequences in terms of what options can be supported. Marc Snir pointed out that one can't actually specify collective communication in terms of point-to-point operations, because they need dynamically allocated additional space.

It was decided to postpone a straw vote on whether all processes participating in a broadcast should do "broadcast", or only the root should "broadcast" and the others should "receive", because of concern about remaining issues, e.g. different varieties of receives. The discussion of "error code" was deferred until the issue is settled in the Point-to-point Communication Subcommittee.
MPI_GATHER: (see mail archives for details)

It was proposed to have a version in which each participant contributes a different amount of information (a general "concatenate" function). Issues raised: How to handle the situation where the number of bytes on each processor is different? How to specify the type of data? For example, one needs to know the size of the data type for various purposes, e.g. when doing recursive bisection.

MPI_GLOBAL_OP: (see archives for definition)

This does not include the data types. There was a discussion of how the forwarding processors know where to break buffers if the data type is not specified. Paul Pierce suggested that we should separate the case of user-defined combining operations from the system ones, which could be optimized. Robert Harrison suggested that the buffer be specified as (#items, length), at least for the user-defined operations. (Tag would be retained.) Someone noted that "bytes" would be different on each processor in the heterogeneous case.

Back to GATHER. Many agreed that the interface should be changed, but no proposal was offered. Straw vote on having a separate general concatenation, to go along with the gather operation: yes: 18, no: 0.

MPI_SYNCH

There was general agreement that "BARRIER" would be a better name. James Cownie suggested that a tag argument would be helpful for debugging. There was also some discussion of failure of such a barrier, e.g. because some node fails. It was agreed that this was not a problem peculiar to this particular function. One individual nonetheless argued strongly for some kind of timeout for the barrier.

Groups
------

gid = MPI_MKGRP (list of processes)

There was much discussion of the format of the process list. As defined, MKGRP defines a group as a subset of a pre-existing group. One alternative would be to allow creating a group consisting of processes from a number of other groups. (NB Identification of processes is unspecified.
This is a task for the Point-to-point Communication Subcommittee.)

MKGRP provides an implicit barrier among the processes joining the group. There are a number of problems with making sure that the gid is uniform and known across the system. This is an efficiency issue.

Should it be possible to SEND to a (gid, rank) pair? Marc argued that one should do point-to-point communication only within a group, not between groups. Note that groups are constant: one cannot add or delete members from a group. Also, group creation is a barrier for the processes that are part of the group. This raises the question of how the processes joining the group know that they are joining. What is the utility of groups? Certainly at present the only commonly used group is ALL.

MPI_FREEGRP(gid)
MPI_GRPSIZE
MPI_MYRANK

There was a general discussion of how group id's would be generated. There was also a discussion of the mapping information: how to map back from my_rank and gid to rank in ALL? (In order to actually do a SEND.)

----- At this point the group broke for dinner -----

The continuation after dinner was an informal general discussion. There were some general questions about experience, from Al Geist to Paul Pierce. Adam Greenberg expressed interest in discussing channels. Channels are seen as an early binding (currying) of various of the SEND/RECV functions, which offers a number of gains in efficiency.

There was a discussion of Fortran language bindings (F77, F90, HPF) of MPI. It was agreed by those knowledgeable in the area that there are no special issues in regard to HPF.

Steve Wheat discussed the Sandia implementation of channels on the Ncube. It sounds very similar to iWarp channels, except that they are dynamic in creation.

Jim Cownie noted that global ops are going to result in non-determinism in numeric routines. Jim also elaborated on Meiko's BAD experience with the ready_receive function -- lots of user problems. Commonly users try it on small problems, and it works and speeds up.
But then on large problems things erratically break, and the user complains bitterly. Paul Pierce noted that this is essentially Intel's force type, and the Intel experience has not been so bad. In particular, it is harder to use and does not generally work easily on small problems.

Cownie: In general, what to do when a ready_receive fails? There is no reasonable way to raise an error. Response: Use a signal. Cownie: GAACK! That is implementation-specific and not viable on all systems.

John Kapenga listed six collective communication issues that he considers particularly important. [Missed the list]

Other desirable collective communication features that were mentioned: global exchange; all-to-all communication. What are the criteria for inclusion? Proposal: difficulty of implementation; frequency of use; efficiency gain.

John Kapenga asked about 2-D and 3-D mesh operations, e.g. shifts. Adam Greenberg said this should be left to compilers. John: No way! Adam argued that the compiler can recognize the opportunity to avoid memory copies; unless that same facility is available to the user, the compiler can do much better.

The group adjourned at 10:10 p.m.

---------------------------------------------------------------------------
Topologies Subcommittee
---------------------------------------------------------------------------

The Topologies Subcommittee was called to order by Rolf Hempel at 4:00 on Wednesday. It lasted until dinner.

---------------------------------------------------------------------------
Other Subcommittees
---------------------------------------------------------------------------

The other subcommittees (Introduction, Formal Semantics, Environmental Enquiry, Language Binding) met informally after dinner on Wednesday.
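The group machinery discussed before dinner in the collective-communication session -- an ordered, constant subset of ALL, with MPI_GRPSIZE/MPI_MYRANK-style inquiry and some way to map a rank back to ALL in order to do a SEND -- can be illustrated with a toy model. The class and method names below are invented for illustration only; none of the actual calls or their semantics had been settled.

```python
# Toy model of the group ideas from the collective-communication session.
# A group is an ordered, constant subset of the ALL group.  All names here
# are hypothetical stand-ins, not proposed MPI bindings.

class Group:
    def __init__(self, members):            # in the spirit of MPI_MKGRP
        self._members = tuple(members)      # constant: no add/delete later

    def size(self):                         # in the spirit of MPI_GRPSIZE
        return len(self._members)

    def rank_of(self, process):             # MPI_MYRANK-style inquiry
        return self._members.index(process)

    def to_all(self, rank):                 # map (gid, rank) back to ALL,
        return self._members[rank]          # e.g. to actually do a SEND

g = Group([4, 9, 2])   # an ordered subset of ALL; process 9 has rank 1
```

The `to_all` translation is exactly the mapping-information question raised in the discussion: a SEND ultimately needs a process id in ALL, not a group-relative rank.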
---------------------------------------------------------------------------
Meeting of the Whole Committee
---------------------------------------------------------------------------

Thursday, January 7, 4:30

The agenda for the rest of the meeting was presented:

 Introduction subgroup report
 Collective-communications subgroup report
 Process Topology subgroup report
 Environmental Inquiry subgroup report
 Formal Language subgroup report
 Language Binding subgroup report
 Profiling (Jim Cownie)
 Dates for future meetings

Report of the Introduction Subcommittee:
------ -- --- ------------ ------------

Jack Dongarra presented the results of the subcommittee meeting that took place Wednesday night. This is essentially the draft that has been available from netlib for the last six weeks. There was some on-the-fly editing by the group at large.

The goal of the Message Passing Interface, simply stated, is to develop a *de facto* standard for writing message-passing programs. As such, the interface should establish a practical, portable, efficient, and flexible standard for message passing.

Goals
-----

Design an application programming interface (not necessarily for compilers or a system implementation library).

Allow efficient communication: avoid memory-to-memory copying, allow overlap of computation and communication, and offload to a communication coprocessor, where available.

Allow (but not mandate) extensions for use in heterogeneous environments.

Allow convenient C, Fortran 77, Fortran 90, and C++ bindings for the interface.

Provide a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.

Define an interface that is not too different from current practice, such as PVM, Express, P4, etc.

Define an interface that can be quickly implemented on many vendors' platforms, with no significant changes in the underlying communication and system software.
The interface should not contain more functions than are really necessary. (Based on the latest count of send/receive variants, this drew a large laugh from the crowd.)

Focus on a proposal that can be agreed upon in 6 months.

Added: The semantics of MPI should be programming-language independent.

Who Should Use This Standard?
--- ------ --- ---- ---------

This standard is intended for use by all those who want to write portable message-passing programs in Fortran 77 and/or C. This includes individual application programmers, developers of software designed to run on parallel machines, and creators of higher-level programming languages, environments, and tools. In order to be attractive to this wide audience, the standard must provide a simple, easy-to-use interface for the basic user while not semantically precluding the high-performance message-passing operations available on advanced machines.

What Platforms Are Targets For Implementation?
---- --------- --- ------- --- ---------------

The attractiveness of the message-passing paradigm at least partially stems from its wide portability. Programs expressed this way can run on distributed-memory multiprocessors, networks of workstations, and combinations of all of these. In addition, shared-memory implementations are possible. The paradigm will not be made obsolete by architectures combining the shared- and distributed-memory views, or by increases in network speeds. It thus should be both possible and useful to implement this standard on a great variety of machines, including "machines" consisting of collections of other machines, parallel or not, connected by a communication network.

It was agreed that explicit remarks that MPI is intended to be usable with multithreaded processes and with MIMD (not just SPMD) programs should be added somewhere.

What Is Included In The Standard?
---- -- -------- -- --- ---------

The standard includes:

 Point-to-point communication in a variety of modes, including modes that allow fast communication and heterogeneous communication
 Collective operations
 Process groups
 Communication contexts
 A simple way to create processes for the SPMD model
 Bindings for both Fortran and C

In addition, a model implementation and a formal specification will be provided. It was proposed that explanation and rationale for the standard would also be provided, as would sample programs and a validation suite. This is getting very ambitious.

Jim Cownie also wants wrappers available for use by, for example, profiling. The suggestion is to provide a "name shift", e.g. __MPI_SEND, etc., so that the profiler can have MPI_SEND call __MPI_SEND after doing whatever is useful for profiling.

What Is Not Included In The Standard?
---- -- --- -------- -- --- ---------

The standard does not specify:

 Explicit shared-memory operations
 Operations that require more operating system support than is currently standard; for example, interrupt-driven receives, remote execution, or active messages
 Program construction tools
 Debugging facilities
 Tracing facilities

Features that are not included can always be offered as extensions by specific implementations.

Report of the Collective Communication Subcommittee:
------ -- --- ---------- ------------- ------------

Al Geist summarized the meeting that took place Wednesday afternoon (described above). Global functions beyond those discussed by the subcommittee, such as all2all or total_exchange, await written proposals. The (whole) committee added that Fortran 90 and HPF would be a good place to look for more combining functions (other than max, min, sum, etc.). It was agreed that a way to supply user-defined functions would be useful.

Issues mentioned include: What is a group? How are groups formed? Are group elements addressable, and if so, how? Are groups ordered (e.g. for prefix/suffix operations)?
Is a group always an ordered subset of the ALL group? Partitioning? Connection with virtual topologies? This will be discussed when the topology group reports.

Friday, January 8
------ ------- -

Jack Dongarra called the meeting to order at 9:00.

Report of the Process Topologies Subcommittee:
------ -- --- ------- ---------- ------------

Rolf Hempel reported on the meeting held Wednesday afternoon.

Motivation:

 Applications have structures of processes
 Most natural way to address processes
 Processor topology is valuable to the user
 Creation of subgroups is a natural way to implement topologies

A draft proposal for MPI functions in support of process topologies (by Rolf Hempel) is in the handout bundle. The subcommittee made some changes to the draft.

What functions should MPI contain?

 specification of logical process structure
 lookup functions for process id's
 clean interface to other parts of MPI (process groups)

What should it not contain?

 any reference to particular hardware architectures
 algorithms for mapping of processes to processors

If it does this, the user program will be portable, but will contain full information for process mapping at the logical level. Claim: the use of process topologies is not an obstruction to quick implementation of MPI, since a first implementation can make random assignments.

A process topology is assigned to a process group. Copying groups can be used to overlay different topologies on the same processes. All processes in a group call the topology definition function. Inquiry functions provide the translation of logical process location to process id.

Supported topologies:

General graph structure: for each process, define its complete set of neighbors. In principle this is sufficient, as it covers all topologies. But it is not scalable, since all processes must have knowledge of all others; we should investigate a scalable version.
However, important special cases should be treated explicitly, because:

 regular structures can be specified in a scalable way
 it is easier to implement the mapping
 they cover a large number of applications

A special case: Cartesian structures

 grids/tori
 hypercube is a special case

Support for the creation of subgroups for regular structures will be useful. Special treatment for trees? Deferred. User-defined topology definition functions? Deferred.

It will be necessary for the inquiry functions to provide information on the hardware topology, so that a user can provide his own mapping function.

Marc Snir: We need to consider consistency of mapping alignments, for example an octree for image processing with a grid structure.

Al Geist: What is the connection between group and topology? Recall that a group is a linear *ordered* array, which is a kind of topology.

There was a general discussion of copying topologies and groups. The proposal is to have at most one topology per group, so that the group id can be used as a name for the topology. This is the reason there must be a group copy.

David Walker: We need closer coordination between the collective communication subcommittee and the topology subcommittee, since groups are central to both.

Report of the Environmental Enquiry Subcommittee:
------ -- --- ------------- ------- ------------

Bill Gropp reported that the Environmental Enquiry subcommittee needs to wait and get a better picture of what MPI will contain. Jon Flower again asked for cpu_time. This was discussed, and we were reminded that such functions were more-or-less rejected at the Minneapolis meeting as not being part of MPI; standardization should come from POSIX.

Marc Snir: Part of the subcommittee's job should be to decide *what* can be enquired about, as well as how it will be done. There was general discussion about inquiring about both MPI parameters and implementation parameters, and also about whether parameter *setting* as well as enquiry should be supported (buffer pool sizes, for example).
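The Cartesian structures favored in the topology report lend themselves to the kind of inquiry functions mentioned there, translating a logical grid position to a process rank and back. The following sketch uses invented function names and a simple row-major convention, purely for illustration; the subcommittee's draft functions were still being revised.

```python
# Sketch of rank <-> coordinate translation for a Cartesian process grid.
# Function names and the row-major convention are illustrative assumptions,
# not part of the draft proposal.

def coords_to_rank(coords, dims):
    """Row-major mapping of grid coordinates to a linear process rank."""
    rank = 0
    for c, d in zip(coords, dims):
        assert 0 <= c < d, "coordinate out of range"
        rank = rank * d + c
    return rank

def rank_to_coords(rank, dims):
    """Inverse mapping: recover grid coordinates from a rank."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return tuple(reversed(coords))

# a 3 x 4 logical grid of 12 processes
assert coords_to_rank((2, 1), (3, 4)) == 9
assert rank_to_coords(9, (3, 4)) == (2, 1)
```

A first implementation could place ranks on processors arbitrarily, as the report notes; the point is that the user program only ever deals with the logical coordinates.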
Jon Flower also asked about system hints. He suggested it should be possible to tell the system about implementation-specific tuning in a system-independent way.

Report of the Formal Specification Subcommittee:
------ -- --- ------ ------------- ------------

Rusty Lusk reported that the committee was without its chairman, Steven Zenith, but that it viewed its mission as trying to formalize what the other subcommittees decide on. It will probably use CSP, for lack of experience with any other formal specification language. Bob Knighten suggested that the subcommittee look into the LIS (Language Independent Specification) that POSIX defined in order to separate semantics from language bindings.

Report on MPI -1 (minus one)
------ -- ------ -----------

James Cownie presented an MPI anti-specification. Ya hadda be there, but in case you weren't, or just want to be reminded, here is a transcription of Jim's slides.

MPI -1 (Jim Cownie)
In the spirit of LPF (Low Performance Fortran)

* Bindings ONLY for Mathematica, Occam, ML
* No function takes arguments or returns a result
* Point to Pointless communication
* 1024 different sends, NO receives
* Full support for 0-dimensional topologies
* User data in a message limited to 1 byte (of 6 data types), BUT 1 KByte of TAG, CONTEXT
* Informal semantics - Formal Syntax
* All groups are contexts
* All contexts are groups
* Non-blocking wait
* Non-blocking barrier
* All user programs are unsafe & erroneous; they therefore do all their work in the exception handler.

---------------------------------------------------------------------------

A Profile/Instrumentation subgroup was formed with Jim Cownie as chairman.

Steve Otto, as general editor, will contact subgroup chairmen to begin discussion of editing concerns.

Discussion of meeting format. The following was proposed as a format for subsequent meetings, based on the experience with this meeting:

 Wed. afternoon: point-to-point
 Wed. night: all subcommittees other than pt-to-pt and collective comm.
 Thurs. morning: collective communication
 Thurs. afternoon: subcommittee reports
 Fri. afternoon: subcommittee reports

Meeting Dates: It was decided to move the next two meetings up a week from when they were tentatively scheduled. The next meeting will be Feb 17-19; the one after that will be Mar 31-Apr 2. The currently scheduled May 19-21 and June 30-July 2 meetings may also be moved up. Note that July 2 will be a holiday in the United States.