The MPI Forum met March 31 - April 2, 1993, at the Bristol Suites Hotel in
North Dallas.  This was the fifth meeting of the MPIF and the third of the
now regular meetings in Dallas.  There were both general meetings of the
committee as a whole and meetings of several of the subcommittees.  For
the first time a number of formal votes were taken at this meeting.  All
of these are recorded in these minutes (and can be found by searching for
VOTE) and have also been published (to the mpi-core mailing list) in a
summary of all the formal votes and all of the straw votes for the
committee as a whole.

The notes for these minutes were taken by Bob Knighten
(knighten@ssd.intel.com) and Rusty Lusk (lusk@mcs.anl.gov).

These minutes are quite long.  If you want to see the important topics you
can search for --- and this will quickly lead to each topic (and a few
other things.)

Attendees:
---------

  Joe Baron              IBM Austin                 jbaron@vnet.ibm.com
  Eric Barszcz           NASA Ames                  barszcz@nas.nasa.gov
  Harry Scott Berryman   Yale Univ.                 berryman@cs.yale.edu
  Rob Bjornson           SCA                        bjornson@sca.com
  Lyndon Clarke          EPCC, U. Edinburgh         lyndon@epcc.ed.ac.uk
  James Cownie           Meiko                      jim@meiko.co.uk
  Jack Dongarra          UT/ORNL                    dongarra@cs.utk.edu
  Anne C. Elster         Cornell U.                 elster@cs.cornell.edu
  Sam Fineberg           NASA Ames                  fineberg@nas.nasa.gov
  Jon Flower             ParaSoft                   jwf@parasoft.com
  Ian Glendinning        U. of Southampton          igl@ecs.soton.ac.uk
  Adam Greenberg         TMC                        moose@think.com
  Bill Gropp             ANL                        gropp@mcs.anl.gov
  Leslie Hart            NOAA/FSL                   hart@fsl.noaa.gov
  Tom Haupt              Syracuse U.                haupt@npac.syr.edu
  Rolf Hempel            GMD                        hempel@gmd.de
  Tom Henderson          NOAA/FSL                   hender@fsl.noaa.gov
  C. T. Howard Ho        IBM Almaden                ho@almaden.ibm.com
  Steven Huss-Lederman   SRC                        lederman@super.org
  Rusty Lusk             ANL                        lusk@mcs.anl.gov
  John Kapenga           Western Michigan U.        john@cs.wmich.edu
  Bob Knighten           Intel SSD                  knighten@ssd.intel.com
  Rik Littlefield        PNL                        rj_littlefield@pnl.gov
  Peter Madams           nCube                      pmadams@ncube.com
  Arthur B. Maccabe      U. of New Mexico           maccabe@cs.unm.edu
  Oliver McBryan         U. Colorado                mcbryan@cs.colorado.edu
  Dan Nessett            LLNL                       nessett@llnl.gov
  Steve Otto             Oregon Graduate Institute  otto@cse.ogi.edu
  Peter Pacheco          U. of San Francisco        peter@sun.math.usfca.edu
  Paul Pierce            Intel                      prp@ssd.intel.com
  Sanjay Ranka           Syracuse U.                ranka@top.cis.syr.edu
  Arch Robison           Shell Development          robison@shell.com
  Mark Sears             Sandia                     mpsears@cs.sandia.gov
  Anthony Skjellum       Mississippi State U.       tony@cs.msstate.edu
  Marc Snir              IBM, T.J. Watson           snir@watson.ibm.com
  Alan Sussman           U. of Maryland             als@cs.umd.edu
  David Walker           ORNL                       walker@msr.epm.ornl.gov
  Dennis Weeks           Convex                     weeks@convex.com
  Stephen Wheat          Sandia NL                  srwheat@cs.sandia.gov

Wednesday, March 31
---------  --------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

Jack Dongarra called the meeting to order at 1:30.  The first topic for
discussion was the agenda.  David Walker had mailed out the following:

  Provisional Agenda for MPI Meeting, March 31-April 2, 1993

  Wednesday
    1:30-6:00    Discussion of Snir, Gropp, Lusk point-to-point proposal
                 (everyone)  (Snir)
    6:00-7:30    Unofficial dinner break
    7:30-10:30   Break up for subcommittee meetings

  Thursday
    9:00-12:00   Discussion of Snir & Geist collective communication
                 proposal  (everyone)  (Otto?)
    12:00-1:30   Lunch (provided)
    1:30-3:00    Full group meeting for presentation of alternate
                 approaches to groups and contexts, dynamic vs. static
                 process models, and other issues  (Volunteer?)
    3:00-6:00    Full group meeting for presentation of process topology
                 subcommittee ideas and proposals.  (Hempel)
    6:00-8:00    Dinner (attendees pay, but hotel provides transport to
                 area restaurant)
    8:00-10:00   Continued informal subcommittee meetings if necessary

  Friday
    9:00-11:00   Full group meeting with the intent of taking binding
                 votes on point-to-point and collective communication
                 proposals, or sending proposals back to subcommittees for
                 revision.  (Snir?)
    11:00-12:00  Full group meeting for defining timetable for producing
                 MPI (or subset) by deadline in July.  (Dongarra)

Following the discussion on the mpi-core mailing list, the question was
raised of moving the discussion of the timetable for producing MPI to the
beginning of the meeting.  After a brief discussion it was decided to
proceed first with reports from the Communication Context and
Point-to-Point Communication subcommittees, in order to have a basis for
discussing the schedule.  (The schedule was discussed on Thursday
afternoon, following the completion of the Point-to-Point subcommittee
report.)  The Context subcommittee was allotted two hours, from 1:30 to
3:00, with the Point-to-Point Communication subcommittee scheduled from
3:00 to 6:00 and on Thursday morning.

-------------------------------------------------------------------------------
Report From the Communication Context Subcommittee
-------------------------------------------------------------------------------

Tony Skjellum presided.

There was a large volume of activity on the mpi-context mailing list
before this meeting, and so there were five proposals available for
consideration, labeled:

  I     (Marc Snir)
  III   (Tony Skjellum)
  VII   (Lyndon Clarke/Rik Littlefield)
  VIII  (Mark Sears)
  X     (Tony Skjellum/Lyndon Clarke)

Twenty-five minutes was allotted for presentation of each of these, in the
order I (Marc Snir), VII+X (Lyndon Clarke), VIII (Mark Sears), III (Tony
Skjellum), X, followed by general discussion.  Tonight there will be a
subcommittee meeting to produce a single proposal.

Proposal I (Marc Snir)
---------- -----------

{Marc used overhead projector slides and these notes are largely a
transcription of those slides.}

Group=Context Proposal

Goals:
  + Keep it simple (and keep MPI small)
  + Keep it efficient

Minimal needs:
  + Protection mechanism
  + Local name space

Group = Context = Ordered set of processes.
-- Method for protecting communication between e.g. libraries. --

  + All point-to-point communication is WITHIN a group and uses a
    (group,rank) address.
  + All collective communication is by a group (which is a context).

OPERATIONS:
  + Group copy
  + Group partition
  + Group creation by list
  + Group deletion
  + ALL group preexists

Group handle has only local (i.e. within group) use and meaning.
-- There is no reason to pass the handle of a group outside the group -
it has no use. --

1. Impact on "current practice":  Need additional argument ALL in all p-p
   calls.
2. Overhead for p-p:
     send    - no impact when ALL is used.  One lookup for other groups.
     receive - context id match
   Overhead at creation.
   Loosely synchronized collective communication within a group is
   affected.
   Storage: member table (good protection).
3. Compatibility with dynamic process creation and deletion:
   Process creation/deletion requires the same for the group.
   {What is the ALL group after process creation or deletion?}
4. Interaction with topology:
   A group has no topology information (but it can be used as a peg for
   such information.)
5. Inter-group communication (e.g. client-server models):
   +--------------------------------------+        +---+
   |  *                                   |  --->  | * |
   |  *                                   |  (---) | * |
   +--------------------------------------+        +---+

   Do the communication within an encompassing group.
    + Encompassing group needed for protection
    + May not be convenient for naming (e.g. send(server[5], ...))
    + Inconvenience does not warrant a change in the p-p layer
    + Can be handled by creating and explicitly passing arrays of ranks:
        MPI_LIST_RANKS(list, subgroup, group)
      returns the list of the rank of each subgroup member within group.

Discussion:

How to have both subgroup and group available?  There must be an
encompassing group which has full knowledge, e.g. a server that is a
member of both the group and the subgroup.

This proposal is orthogonal to the question of attaching additional
information (e.g. topology information, caching, etc.)

It can't deal with the situation of contacting an independent pre-existing
server.  Marc's approach is that dynamically adding processes requires
dynamically creating a group containing all of the processes.

Opacity vs. accessibility of the mechanism.

=============================================================================

Lyndon - VII and X

VII: Context is a higher level mechanism than a group.  It is basically a
unique identifier together with a reference to a group.  This means that
as a group changes, all contexts that reference that group change as well.
Same ability to hang on facilities (e.g. topology, caching) as the others.

Relation to p-p:  Three forms: "closed form", "null form", "open form".
The open form is to allow communication between different groups.
Experience is that creating encompassing groups is difficult - disagrees
with Snir's claim.  Addressing is via (context, rank).  Need a "context
allocation" mechanism - this implies global communication.

Relation to c-c:  Works very cleanly for all-all using the closed form.
Two-group communications - for MPI-2.

Discussion:

How to establish communication between groups?  Can send a context.
What is opaque?  Lyndon - not important.
Startup/bootstrap - everyone starts in the ALL group.  Can use a common
ancestor or a name registry.
Power relative to I?  Lyndon claims that this is more convenient once
communication is set up.  The basic idea is to be able to communicate via
(my_context, remote_context, rank).

ORTHOGONAL ISSUES - caching, tag selection, transfer of ????

X: Attempt to synthesize III & VII.

A CONTEXT is a space of tags.  A GROUP is a set of process references.
[What does this mean?]  The idea is to give a method for combining groups
and contexts for the purpose of communication.

COMMUNICATORS (see pp. 3-4).  Silly names, but a serious proposal.
  Floopy - arbitrary communication between processes, allowing wild card
           on tag.
  Bongo  - basically like Marc's proposal - communication within a group
           using rank naming ("closed").
  Bingo  - communication between groups.

Question: Why do we want ANY of these proposals?  Performance, and the
ability to build large scale software safely.  We need examples for all of
these proposals!

Collective -> group; expressing collective in terms of p-p implies a need
to discriminate messages -> context.  But there are reasons to have groups
and contexts that have nothing to do with collective communication.

Argument that in Marc's proposal the need for a context means having to
have an additional group - but is this a problem?  Lyndon argues that
there are good reasons to separate group and context.  Static vs. dynamic
groups.  Ability to move context.
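{Since examples were asked for, here is a minimal sketch of the safety
argument common to all of these proposals, in C with entirely hypothetical
names - Context, ctx_dup, send_ctx and recv_ctx are placeholders, not
proposed bindings.  A library that communicates in a private context (a
"bongo" in X's terms, its own group in I's) can never have its internal
messages intercepted by a wildcard receive posted in user code.}

    /* Hypothetical handles and calls -- placeholders, not MPI names. */
    typedef int Context;
    Context ctx_dup(Context parent);   /* collective: private context  */
    void send_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void recv_ctx(Context c, int rank, int tag, void *buf, int nbytes);

    static Context lib_ctx;            /* the library's private context */

    /* Called once by every process.  Afterwards the library's internal
     * traffic cannot match any receive posted on the caller's context,
     * whatever tags or DONTCAREs the user employs. */
    void lib_init(Context user_ctx)
    {
        lib_ctx = ctx_dup(user_ctx);
    }

    void lib_exchange(int partner, double *work, int n)
    {
        send_ctx(lib_ctx, partner, 0, work, n * (int)sizeof(double));
        recv_ctx(lib_ctx, partner, 0, work, n * (int)sizeof(double));
    }

{Proposal I expresses the same isolation by giving the library its own
group; VII and X express it by a context that references the group.}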
Proposal VIII (Mark Sears) - Group and Context Proposal
-------- ---- ------------

Contexts and groups are orthogonal:
  + Orthogonal purpose
  + Orthogonal functionality
  + Orthogonal implementation

Contexts
--------

Purpose - promote software modularity by allowing construction of
independent tag spaces.

Definition:  A CONTEXT is an integer-valued extension to the tag component
of the message envelope, and must match exactly between sender and
receiver.

Model:
  + Contexts are global.
  + No concept of a process belonging to a context.
  + Contexts are scarce resources (e.g. 16).
  + Context allocation is a rare event.
  + MPI p-p requires no reference to groups.

Context allocation/deallocation:

  ALLOCATION:     int MPI_getcontext()
    + called synchronously by all (EVERY SINGLE ONE) processes
    + signals to MPI that use of the context is now allowed
  DEALLOCATION:   void MPI_free_context(value)

  MPI_DEFAULT_CONTEXT
    + Preallocated; can't be freed.
    + Solves initialization.
    + Free-for-all.

But Sears believes that allocation/deallocation is not truly needed - one
could have an entirely static system.

GROUPS
------

Purpose - provide tools for organizing subsets of processes in a parallel
task (i.e. an MPI program.)

Definition:  A group is a 1-1 mapping from (0..n-1) to another set of
integers.  A group is a collection of processes only in so far as the
elements are process addresses.  Groups have no associated context or
tags, default or otherwise.

Group implementation:
  + local to each process, based on the information needed to construct
    the mapping
  + the group type is local and opaque
  + groups can be sent in a message only by sending the information
    needed to construct the group
  + groups are objects in the OOP sense

Usage:
  MPI_SEND(n,buf,process,tag,context)
  MPI_BROADCAST(n,buf,group,tag,context)

The group identifier is a local opaque type, thought of as a pointer to
one of many possible group structures.

  MPI_SEND(..., element(group,rank), tag, context)

GROUP FUNCTIONS:
  int order(group)
  int range(group)
  int element(group, int rank)
  int iselement(group, int element)
  int rank(group, int element)

CLASSES:  identity, permutation, linear, list, bilinear, composition,
cartesian

CONSTRUCTORS:  group makelineargroup(order,start,delta)

Two kinds of 3rd party code:
  1. Code that inherits context and tag space from the caller.
     Example: MPI collective communication.
  2. Code which allocates and manages its own context and tag space.
MPI should allow both of these.

Topology:

  Global topology - mapping of processes to processors.  Provide an
  inquiry function returning a string describing this mapping:
      char * MPI_global_topology()
  Examples of output:
      "N 564"      - random network of 564 processes
      "H 5"        - 5 dimensional hypercube
      "R 2 16 13"  - 2D mesh, 16x13

  Local topology - implicit within a group; no additional functions
  needed.

ADVANTAGES:
  + Ease of implementation
  + Close to hardware
  + Good use of resources
  + Flexibility in implementation of higher level concepts
  + MPI p-p requires no reference to groups
  + MPI c-c can be layered on top of MPI p-p

Discussion - Serious problem with global communication: this destroys the
software modularity.  How to do global operations using groups?  The
responsibility is on the code to ensure there is disambiguation,
synchronization, etc.
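{Proposal VIII's group functions are concrete enough to sketch.  Here is
the linear class in C; the names order, element, iselement, rank and
makelineargroup are from the slides, but the struct layout is invented
here and delta is assumed positive.}

    #include <stdlib.h>

    /* A linear group: member i maps to process address start+i*delta. */
    typedef struct {
        int order;           /* number of members               */
        int start, delta;    /* assume delta > 0 for simplicity */
    } group;

    group *makelineargroup(int order, int start, int delta)
    {
        group *g = malloc(sizeof *g);
        g->order = order;
        g->start = start;
        g->delta = delta;
        return g;
    }

    int order(group *g)              { return g->order; }
    int element(group *g, int rank)  { return g->start + rank * g->delta; }

    int iselement(group *g, int e)
    {
        int off = e - g->start;
        return off >= 0 && off % g->delta == 0
                        && off / g->delta < g->order;
    }

    int rank(group *g, int e)        { return (e - g->start) / g->delta; }

{A p-p send to member 3 of such a group is then just
MPI_SEND(..., element(g,3), tag, context), exactly as in the usage shown
above.}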
Tony - III

  + Tag/context partition the message space for "safe" software.
  + Groups encapsulate the scope of operations for 1) notation
    2) optimization 3) performance.

  {Slide showed a sketch of GROUPS and CONTEXTS as two overlapping
  spaces, with the question "lower, higher?"}

Relation between groups and contexts?  Groups can be orthogonal except
for group creation.

Forms of communication:
  + collective communication ON groups (compatible with I)
  + p-p   A] (group,rank,tag)  - analogous to I
          B] (context,pid,tag)

Models:
  1) Contexts created/destroyed.
  2) Contexts can be published - a dynamic server is implied, or a shared
     address area.

Contexts & groups interrelate when creating new groups, not necessarily
from the LCA.  {{{WHAT DOES THIS MEAN?}}}

Dispute/reply regarding optimization.  Argument that a group can be used
to provide information about special situations (e.g. shared memory) that
can be used for optimization.

Dynamic groups are more feasible using III or VII.

Can contexts be sent in III?  Yes.

VII is more complex than III because it offers more layers.

Tony believes that dynamic groups are essential for the heterogeneous
case, and so believes that I is inappropriate.

John K. notes that the other proposals can be built on top of VIII.

Proposal to defer the straw vote until tomorrow, to give people time to
ponder.

Is global synchronization an essential part of VIII?  Sears - no, there
are various possibilities.

Why is it important that contexts are global?  Because a context is not
associated with a group of processes.  This loses much of the safety.  It
also loses the local addressing within a group.  Sears argues that this
complicates p-p, but Snir says that something like his proposal has been
implemented and is not complicated or expensive.

What level of protection is needed/desirable?  Picture of using context
for safety - at startup send a context to each library used, which it
then uses for internal safety.  Problem - libraries using libraries using
libraries, etc.

Importance of receiving on wild cards and relation.

Host/Node model?  Must support this, and all do.  (What about loading a
program?  Not part of THIS discussion.)

The subcommittee will meet tonight and present a more unified proposal
tomorrow morning.

Rik asked for an example showing how to implement a safe barrier using
p-p, group, and context.

Adam - How can we evaluate these proposals if we have not agreed on what
we want from the context concept?

----- break 4-4:20 pm -----
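{One hedged answer to Rik's request: a dissemination barrier built purely
from point-to-point calls, made safe by running in a context private to
the barrier.  All names (Context, Handle, isend_ctx, recv_ctx, wait) are
placeholders, not proposed bindings; the immediate send is used so the
barrier cannot deadlock even when sends are unbuffered.}

    /* Placeholders -- not proposed MPI names. */
    typedef int Context;
    typedef int Handle;
    Handle isend_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void   recv_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void   wait(Handle h);

    /* Dissemination barrier: ceil(log2(nprocs)) rounds; in round k each
     * process signals the process 2^k ahead and waits for the one 2^k
     * behind.  The private context keeps these token messages from ever
     * matching user receives; the round number doubles as the tag. */
    void safe_barrier(Context barrier_ctx, int myrank, int nprocs)
    {
        char out = 0, in;
        for (int dist = 1; dist < nprocs; dist <<= 1) {
            int to   = (myrank + dist) % nprocs;
            int from = (myrank - dist + nprocs) % nprocs;
            Handle h = isend_ctx(barrier_ctx, to, dist, &out, 1);
            recv_ctx(barrier_ctx, from, dist, &in, 1);
            wait(h);
        }
    }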
-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee
-------------------------------------------------------------------------------

First reading of the p-p proposal - Marc Snir presiding.

Presentation with some minimal assumptions about the use of context.  No
language binding included.  Use of handles and opaque objects.  [Ignore
text comments on implementation - issues for the language experts.]

Discussion of ephemeral/persistent.  Greenberg asked about "error to free
a handle with pending operation" - change to "handle becomes ephemeral" or
some such.

How to deal with lists - included length, separate length, EOL marker.

States.  [Implementation again - not part of the current proposal.]

If we have only one send/receive, what is the type of the buffer and what
to do about this?  Marc's proposal is to accept breaking F77 in this case.

Skipping 2.4 (Contexts) for now (including error handling.)

2.5.1: What about C/Fortran compatibility for messages?  Also skipped for
now.

2.6: Greenberg wants at least an escape hatch to allow functions
(MPI_ADD_?) that add, e.g., other F90 objects.

Discussion of len in MPI_ADD_BLOCK as "number of elements" rather than as
bytes.  There are language and portability issues.

Marc mentioned the issues in the middle of p. 14 (negative displacements,
...) and noted that these must be settled.

Note that the delete/commit functions discussed last time are not in this
proposal.

2.6.1 (Data Conversion):  It just happens - this needs to be stated
clearly.

-- back to 2.2 - vote on each section --

[[NOTE: A proposal, in the form of a proposed chapter, is offered.  Votes
are on amendments and on accepting particular parts of the chapter.]]

How are handles allocated?  User or system.  There are efficiency
advantages to user allocation, e.g. on the stack.

PRP argues for a create for each data type rather than a generic
MPI_CREATE.  At the moment there is an admixture in the proposal.

====== Discussion of voting rules.  Organizations voting: 24 ======

VOTE: Separate mpi_create for each type
----
      Yes: 20  No: 0  Abstain: 4

Possible meanings of free:
  1.1  free can always be done
  1.2  free can be done only if the user will not use the handle -
       no pending operation after free
  2    if no pending operation, deallocate; otherwise free when done
1.2 = current proposal; 2 is the Greenberg proposal.

Greenberg explanation - a common thing done in their system is to have
handlers, with free only taking effect after the handler completes.
PRP - Has a primitive that does essentially 2 (which is not "mark
ephemeral").  It does not imply cancel.  Fire and forget.
Cownie: Two arguments - Paul's fire and forget; Adam's handler for
messages.
Snir: Is the buffer available after free?  PRP/Adam: No.
Why not just have another kind of free?  Multiplicity of functions.

VOTE: mpi_free is valid even if the handle is in use, but the effect is
----  to free the object when the operation completes.
      Yes: 7  No: 8  Abstain:

List of handles - change the name to array of handles.

VOTE: list_length explicit (rather than included in array or EOL)?
----
      Yes: 22  No: 0

Cownie wants a method to provide cheap allocation, e.g. on the user's
stack rather than in the system heap.  Need a concrete proposal.

RLK wants an explicit statement of what is erroneous, checked, etc.

VOTE: Accept 2.2
----
      Yes: 23  No: 1

2.3 & 2.4 skipped - they will be considered elsewhere.

2.5:  Cownie proposes that there be both CHAR and BYTE data types rather
than just the basic data types of the host language.  Another proposal is
to have an MPI_STRING data type.  But what length?  Null terminated?

VOTE: BYTE data type.
----
      Yes: 24  No: 0

Pending action of the Context Subcommittee is the content of the
envelope.

VOTE: Accept 2.5 (minus 2.5.2)
----
      Yes: unanimous

Proposals for units of bytes and for consistency:
  1. Units are elements.
  2. Bytes in C and elements in F77.  (No one favors this.  Against: 14)
  3. Bytes everywhere (and provide some way of getting the size of basic
     types).
     a) index stays as is, but bytes elsewhere
          Pro: 9   Con: 13   (VOTE 1)
     b) truly bytes everywhere, including indices
          Pro: 10  Con: 8    (VOTE 2)

Greenberg will bring forward an additional proposal for 2.6 at the next
meeting.

VOTE: In vector, stride may be negative?
----
      Yes: 10  No: 2

VOTE: Allow repetition?
----
      Yes: 13  No: 1

VOTE: Can multiple components overlap?
----
      vote not taken

PRP proposes that in vector, len is a count of the number of blocks.

VOTE: total length is an integer multiple of the block size
----
      Yes: 4  No: 10

Tony proposes adding a COMMIT operation.
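{To make the COMMIT proposal concrete, here is the descriptor life cycle
it implies, as a hedged C sketch.  Only MPI_ADD_BLOCK appears in the
draft; mpi_handle, MPI_CREATE_BUFFER (one create per kind of object, per
the vote above), MPI_COMMIT, MPI_SEND and MPI_FREE are placeholder names
and signatures.}

    /* Hypothetical declarations -- sketch only, not proposed bindings. */
    typedef void *mpi_handle;
    typedef enum { MPI_REAL, MPI_INTEGER } mpi_type;
    mpi_handle MPI_CREATE_BUFFER(void);
    void MPI_ADD_BLOCK(mpi_handle d, void *start, int len, mpi_type t);
    void MPI_COMMIT(mpi_handle d);
    void MPI_SEND(mpi_handle d, int dest, int tag, int context);
    void MPI_FREE(mpi_handle d);

    void example_send(int dest, int tag, int context)
    {
        float x[4];
        int   n = 4;

        mpi_handle d = MPI_CREATE_BUFFER();    /* per-type create      */
        MPI_ADD_BLOCK(d, x,  4, MPI_REAL);     /* 4 REAL elements      */
        MPI_ADD_BLOCK(d, &n, 1, MPI_INTEGER);  /* 1 INTEGER element    */
        MPI_COMMIT(d);       /* descriptor now frozen: no modification
                                while it is in use                     */
        MPI_SEND(d, dest, tag, context);
        MPI_FREE(d);         /* per the free vote above: only after the
                                operation completes                    */
    }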
Note: The current proposal does not clearly state it, but a handle cannot
be modified after it is in use.  There is nothing that explicitly
specifies that the modifications are complete.

VOTE: Commit?
----
      Yes: 11  No: 6

VOTE: Accept 2.6
----
      Yes: 12  No: 4

----- Break for dinner - 6:15 p.m. -----

Subcommittees tonight:
  Here:  Vendor caucus
  Rm 1:  Process topology
  Rm 2:  Collective Communications Subcommittee
  Rm 3:  Context Subcommittee
  BAR:   formal, language binding, profile - meet at 8 p.m.

-----------------------------------------------------------------------------

Revised agenda:
  Thursday
    9      p-p (cont)
    12     lunch
    1:30   (registration)
    1:30   collective communication
    3      process topology
    6-8    dinner

Future: May 12-14; June 23-25; August ?

=============================================================================

Thursday, April 1, 1993    9:10 a.m. -

-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee (continued)
-------------------------------------------------------------------------------

Point-to-Point Communication - First Reading (continued):
-------------- ------------- ----- -------

Marc Snir presiding.

2.7 (Receive Criteria) & 2.8 (Communication Mode)
-------------------------------------------------

Suggestion that we need more than one DONTCARE: a SOURCE_DONTCARE and a
TAG_DONTCARE.

Lyndon Clarke proposed a "secure" communication mode, where the send
returns once the system can guarantee that the receive will actually
complete.  (During the discussion of this there was a suggestion that the
word REGULAR on p. 16 of Marc Snir's draft be changed to STANDARD, so the
first can always be used.  Marc remarked that he welcomes all manner of
stylistic improvements, but asked that they be sent to him, not voted on
here.)

Why such a secure communication mode?  To provide a portable manner to
write programs that are guaranteed to be safe, even without buffering.
This is similar to, but weaker than, the synchronous functions that were
rejected in a straw vote last time.  The unease with a proliferation of
functions was again mentioned.  Adam Greenberg suggested that one could
be against this and still specify equivalent function by requiring no
buffering throughout the system.

Rik Littlefield (as a pseudo Tony Skjellum) proposed receive criteria
based on an intag and mask.  Variations on this have been discussed in
the past.

VOTE: Typed DONTCARE?
----
      Yes: 19  No: 1

After the vote, a count showed 25 organizations present.

Proposal: Receive selection based on (tag & mask) = (intag & mask).
Why do this if we have context?  Discussion of efficiency.

PRP proposes sending tag (exact match by system) and extra-info (for
recognition of message category by the application) in the envelope.
Cownie is unhappy because of the effect on the latency of small messages.

Alternatives being considered:
  (1) RECEIVE(..., tag, info, ...)   no DONTCARE for tag
  (2) RECEIVE(..., tag, mask, ...)   no DONTCARE for tag
  (3) RECEIVE(..., tag, ...)         DONTCARE for tag

Does this imply small tags?  PRP: it is reasonable for an implementation
to limit the size of tag, category, and context.  Greenberg argues against
the PRP proposal because of existing practice and because it could force
the user to duplicate some system function.  Eric argues in favor of
wildcarding because it often allows reuse of a buffer and so fewer
resources.

VOTE: (1) fails for lack of a second.
----  (2) Yes: 6  No: 12
      so (3) remains.
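{For reference, the selection rule of alternative (2), written out as C;
the helper name is ours, not a proposed binding.}

    /* A message with tag `tag` matches a receive posted with
     * (intag, mask) iff the bits selected by mask agree. */
    int tag_matches(int tag, int intag, int mask)
    {
        return (tag & mask) == (intag & mask);
    }
    /* mask == ~0 gives exact matching; mask == 0 accepts any tag,
     * recovering DONTCARE without a distinguished tag value. */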
Proposal: Secure communication mode.

Rusty - P4 experience argues against this.
Cownie - this is useful for reasoning about program correctness.

An alternative is to have a "secure mode".  But what happens if one is
using secure mode, but some library may have been written without secure
mode?

VOTE: Do we want a "secure" communication mode?
----
      Yes: 7  No: 6

VOTE: Delete send/ready_receive?
----
      Yes: 10  No: 10  [Amendment tabled]

VOTE: Accept amended versions of 2.7 & 2.8
----
      Yes: 25  No: 1

Discussion of 2.9 (Communication Objects):
-----------------------------------------

Marc Snir began with an overview of the four subsections.  He asked if
there should be handles for user space objects - this is not in the
current proposal.  He noted that the choice between using STATUS_ALL and
using STATUS in a loop is one of convenience.

Jim Cownie argued: WAIT_ANY should be told where to start the scan,
because of fairness concerns.  [Discussion - this is a strong
implementation requirement.  MPI requires "fairness", though the
requirement is so weak as to be untestable.  WAIT_ANY is a small part of
the fairness problem.  It is easier for the user to pass in this
information than for the system to maintain it.  The user can always
guarantee fairness - but at some cost.  Postpone until we discuss
correctness and fairness in general.]

The current draft has MPI_RETURN_STAT providing the free space in the
buffer.  Current practice appears to be to return the number of bytes
received.

Issues: byte count of data received; where do handles come from;
partially specified handles; make explicit that a message can be sent to
oneself.

VOTE: The return status for a receive operation should be the number of
----  bytes received
      Yes: 17  No: 3  Abstain: 7

The return status handle should be allocated by the user:
--------------------------------------------------------

Typically the user knows exactly what needs to be done, and the lifetime
is often short.  Thread safety says one can't use global storage, so ...
Suggestion that there be an overall proposal to deal with handles in user
space.  But this is not a general handle - this is a special situation.
Postponed for general consideration.

Partial handles
---------------

return_status_handle
  + part of communication handle
  + separate user space object
  + separate system object

VOTE: Accept 2.9 (with amendments) excluding 2.9.1 and WAIT_ANY
----
      Yes: 19  No: 2  Abstain: 3

----- break 10:48 - 11:08 -----

Discussion of sections 2.10-2.12
--------------------------------

Marc presented the following table for discussion:

                  |general|contig|vector|contig|
                  |buffer |byte  |byte  |type  |
    --------------+-------+------+------+------+
    blocking      |       |      |      |      |
    send          |   *   |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+
    blocking      |       |      |      |      |
    ready-receive |   *   |      |      |      |
    send          |       |      |      |      |
    --------------+-------+------+------+------+
    immediate     |       |      |      |      |
    send          |   *   |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+
    immediate     |       |      |      |      |
    ready-receive |   *   |      |      |      |
    send          |       |      |      |      |
    --------------+-------+------+------+------+
    secure        |       |      |      |      |
    send          |       |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+

The same buffer types are used in Collective Communication.  Discuss
probe.

Issues: Counting units?  Vector type?  Contig type includes contig byte
because we have a byte type?

Lyndon - There should be a secure-receive (for optimization of the
protocol.)  Marc - Is the system or the user responsible for ensuring
this works?  Lyndon - User.  Various - What is the value of
secure-receive?
Gropp - Because these are different protocols, we lose performance if we
don't have different functions, as the general function must always deal
with the worst case.

Marc: error?
    ssend(2) ----------------------------  recv(1)

The proposal is that it is erroneous to attempt to receive a message sent
by a secure-send by anything other than a secure-receive.  Rusty argues
that this should not be erroneous because ???

Possibilities:
  1) secure-send {can, cannot} be received by receive
  2) Enforced by MPI, or user responsibility

Adam argues that this is confusing the secure-send features with
pre-acknowledgement, and these are independent issues.  The purpose of
secure-receive is entirely performance - it does not have any semantic
content.  secure-send/secure-receive can be implemented on top of regular
p-p.

Note that the proposal is to add 2 secure-sends and 2 secure-receives.

VOTE: Add both blocking and immediate secure-send/secure-receive, with
----  failure of the program to match being erroneous.
      Yes: 10  No: 8  Abstain: 9

VOTE: 2.10 as amended
----
      Yes: 20  No: 4  Abstain: 3

VOTE: 2.11
----
      Yes: 26  No: 0  Abstain: 1

2.12
----

Are blocks just blocks of bytes, or are they typed?  How do we count the
size of blocks?

Lusk - Wants to have blocks of typed data for use in a heterogeneous
environment.

PRP - Offers a precise proposal: use the same parameters as in
MPI_ADD_BLOCK.

VOTE: Use exactly the same parameters as in MPI_ADD_BLOCK, as amended
----
      Yes: 26  No: 0  Abstain: 1

Adam - Proposal to have functions for strided messages using blocks.
Rusty - Argues against, because of the problem of proliferation at this
level.  [The argument was clearly a matter of taste.  This is syntactic
sugar, as the most general low level routines certainly can be used for
this.]

VOTE: Have strided block message functions.
----
      Yes: 5  No: 9  Abstain: 12

----- lunch 12:00 - 1:35 -----

Continuation of the discussion of Chapter 2.  Proposal that more time for
discussing 2.10-2.14 is needed, so the p-p subcommittee will meet after
dinner tonight.

Discussion of Schedule
----------------------

Future meetings:
  4. May 12-14         set
  5. June 23-25        set
  6. August 11-13      tentative
  7. September 22-24   tentative

Draft to be available: November 15-19, SC '93, Portland.

Reading Schedule
----------------
  P-P                  Snir        April & May
  Collective           Otto/Geist  April & May
  Profiling            Cownie      April & May
  Process Topology     Hempel      May & June
  Environmental Mgmt   Gropp       May & June
  Lang. Bind. gen.     Berryman    May & June
  Context              Skjellum    May & June
  Formal Spec          Zenith      June & August ???

Specific language bindings will follow the general material by one
meeting.  Where is the language binding material?  There will be a
general principles chapter and also the actual bindings.  These will be
separate votes.
  General language       Berryman    June
  Specific lang. bind.   Berryman

Is anything coming in the Formal Spec?  Does anyone care?  Rusty - one
participant and two observers.  Zenith told Rusty that he was working on
something, but nothing has appeared and he is not here.

What about public comment?  Discussion of the HPFF model of two
opportunities for comment.  Proposal to have only one round of public
comment, with the draft released to the public at Supercomputing '93.
Reference implementation - Gropp/Lusk effort.
Test suite - Greenberg and Haupt will lead an effort.

Subset - "Implementation Order Recommendation" - Huss

Goals:
  + Define a reasonable subset of MPI that is recommended for initial
    implementation.
  + Only a minimum - implementors are welcome to implement more.
  + Allow MPI to begin to show up in a timely fashion, while still
    consistent across vendors.
  + Consistent with the complete standard.

Method:
  + Create a new "subset committee".
  + Write an Annex (like HPF).
  + Present a fleshed out proposal/annex at the next meeting.
  + Hope the "other" committee will create an initial test suite for a
    minimal implementation :-)  Might motivate implementors.

First shot list - NOT IN SUBSET:
  1. No persistent handles
  2. No multicomponent buffer descriptors {only one item described}
  3. No indexed component
  4. No waitany or waitall
  5. No name server model

Subcommittee members wanted; discussion will begin via e-mail.  If
interested, send mail to Huss.

-------------------------------------------------------------------------------
Report From the Collective Communication Subcommittee
-------------------------------------------------------------------------------

Collective Communications - First Reading
-----------------------------------------

Steve Otto.

A subcommittee of three met last night (Otto, Ho, El...).

Propose a discussion about the safety, semantics and function of
collective communication.
  NOT: contexts/groups
  NOT: detailed questions of data types, lang. bind., p-p

Semantic Warm-up
----------------

Barrier implies a time synchronization of processes, but NOT an emptying
of all message buffers.  I.e. p-p messages may span (in time) a barrier:

       1                       2
    post-send(2)            post-receive(1)
    barrier                 barrier
    complete-send(2)

A barrier does not imply that message queues were emptied.  Object to the
statement on p. 8.  {{{QUOTE}}}

If we want a collective function that does ensure all message queues were
emptied, let's invent it:  wait-for-all-global.

Example 3, p. 46.  Is this safe?

       0                       1
    bcast       /---------- receive(0)
    send(1) ---/             bcast

Does this deadlock?  It depends on the implementation of bcast:
  If bcast is implemented with buffered context-unique messages, it
  probably won't deadlock.
  If bcast synchronizes strongly, it probably will deadlock.

Otto is unhappy about defining the semantics of the collective
communication routines in terms of operational p-p routines, because this
depends on side effects such as emptying message queues.
Cownie/Snir argue that this is not true because ... {{{MISSED}}}

Conservative Proposal:  Instead of mandating that the example above is
"safe", we propose that no messages are allowed to be "in the air" upon
entry to a collective communication call.  So: we require that c-c
routines be used AS IF they implied barrier synchronization, but the user
cannot assume that they actually provide barrier synchronization.

Confusion about what is actually intended.  Unhappiness about the phrase
"in the air".  The example above is unsafe, but is it erroneous?  The
claim is that it is unsafe, but an implementation that allows it is
compliant.  Now what is the situation of the example on p. 46?  It is
there because Marc wants people to be aware that the behavior of a valid
(if unsafe) program may be surprising.

Related point: May want to have a "barrier mode" for c-c, so that they do
behave as barriers when this is on.  Useful - can detect many erroneous
programs.
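{Example 3 from p. 46, rendered as code to make the two outcomes explicit.
The names (bcast, send, recv) are pseudo-MPI shorthand, not bindings; only
the ordering matters.}

    /* Shorthand declarations -- not proposed bindings. */
    void bcast(void *buf, int n, int root, int group);
    void send(int to, int tag, void *msg, int len);
    void recv(int from, int tag, void *msg, int len);

    void example3(int myrank, int group, double *buf, int n,
                  char *msg, int len, int tag)
    {
        if (myrank == 0) {
            bcast(buf, n, /*root=*/0, group);  /* 0 enters bcast first */
            send(1, tag, msg, len);
        } else if (myrank == 1) {
            recv(0, tag, msg, len);       /* posted before 1's bcast   */
            bcast(buf, n, /*root=*/0, group);
        }
        /* If bcast buffers (context-unique messages), rank 0's bcast
         * returns, its send matches rank 1's receive, and rank 1
         * reaches the bcast: no deadlock.  If bcast synchronizes
         * strongly, rank 0 waits in bcast for rank 1 while rank 1
         * waits in recv for rank 0's send: deadlock.  The conservative
         * proposal simply declares such programs unsafe. */
    }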
Keeping multiple c-c's separated
--------------------------------

      /group 1      /collective operation 2
     [         (         ]         )
                \collective operation 1  \group 2
            ^
            |  can messages in here go to the wrong destination?

Intersecting spanning trees ... can they catch each other's messages?
It seems that good implementations can be constructed so that they don't.
==> NOT dependent on user-tags.

TAGS ARE SUPERFLUOUS FOR C-C
  + keep for consistency with p-p?
  + but, what is the implementor supposed to do with them?
  + ignore them?
  + what do we tell users about tags in c-c?
We propose - NO TAGS in c-c!

Why have them?  For debugging.  Anything else?  Does the lack of tags
affect non-blocking c-c?  Yes.

Steve Huss: Wants to make sure that the system must ensure that
successive broadcasts will never result in confusion of messages.

Marc: What is the exact semantics of c-c?  In particular, when is the
burden on the user of MPI to ensure no ambiguity vs. when is this
guaranteed by the system?  This is particularly important if we
generalize to allow non-blocking c-c.

PRP: Might want to guarantee that ordering is preserved between c-c.
Snir: Certainly want to guarantee that if the parameters in c-c are
different, then there is no ambiguity.  Further, we may guarantee this
even with the same parameters.

Proposal - no tags and no ambiguity (so matching is via parameters and
sequencing.)  [System responsibility vs. user responsibility]

Steve: Wants MPI to guarantee preservation of order.
Kapenga/Pierce: Order may not be meaningful.
Gropp: We already have a similar issue for p-p.
Snir: So what is the behavior when order is not meaningful?
Cownie: Kapenga made the important point that multithreaded libraries
cannot use c-c without context information.
Various: Might be able to do this by using duplicated groups.

Again, the issue is what responsibility falls on the system and what on
the user.  One possible solution is to require the user to provide an
ordering of c-c.

Functions
---------
  bcast
  barrier
  gather
  global reductions
  cshift/eoshift
  scans
  all-to-all bcast
  index

3.1 Introduction
----------------

Note that non-blocking c-c is not included.

VOTE: Include non-blocking c-c?
----
      Yes: 0  No: uncounted  Abstain: 9

Note that groups carry no topology information.

Weeks proposes adding a perfect shuffle c-c function.  Ranka proposes
adding a permutation c-c function.  Mention of variations on this.

Rik observes that the arbitrary buffer descriptor versions will be
extremely complex in implementation.  He also objected that this cannot
be implemented using p-p.  Marc responded that it is possible using p-p,
because one can send a message to oneself.  Discuss this in detail when
we get to the reduction section.

3.2 (Group Functions) & 3.3
---------------------------

LATER

3.4 Synchronization
-------------------

Tag is removed, so the examples are now wrong and must be replaced.  Jon
wants the examples removed, as they contain semantics that are not
guaranteed.  Marc suggests moving them to an appendix, with the hope that
this would eventually contain a full specification of c-c in terms of
p-p.  Agreed.

VOTE: Remove tag in all c-c?
----
      Yes: 15  No: 5  Abstain: 5

VOTE: Accept just the semantics of 3.4
----
      Yes: 23  No: 0  Abstain: 2

--- break 3:15 p.m. - 3:35 p.m. ---

Ho has a paper on a collective-communication library.  He will make it
generally available.

3.5 (Data move functions)
-------------------------

Otto: max size to shift?  NO

Issues:
  + What if inbuf is outbuf?
  + Periodicity in topology?  [Hempel]
  + Tie topology to cshift  [Hempel]
  + cshift as sendreceive(source, destination, ...)  [Flower]

Allowing inbuf and outbuf to be the same violates F77.
Rik - user-level double buffering is a pain.
Various: back and forth on the responsibility of user vs. system.

Steve Huss: Proposes a cshift with only 1 buffer, disallowing partially
overlapping buffers.
Note that this new cshift will have an INOUT argument.  Does this same
argument apply to other reduction functions?  Have to check one by one.

VOTE: Allow a 2nd cshift with only one buffer.  No partial overlap on the
----  original cshift.
      Yes: 21  No: 0  Abstain: 6

Note that the earlier encompassing vote implies that we will have types
in cshift.  Marc remarks that the change to measuring size in bytes
affects the statement in the text that buffers have the same number of
units.

eoshift
-------

Second single buffer form?  Zero filling - another argument?

VOTE: Accept the cshift and eoshift proposals, with the amendment of a
----  2nd form of each.
      Yes: 20  No: 2  Abstain: 4

bcast
-----

VOTE: Accept bcast
----
      Yes: 22  No: 0  Abstain: 2

gather
------

Note - len is the number of bytes, not what is written.

Long, general, rambling discussion of gather.
Proposal - separate IN argument to gather of sizes; separate functions to
find sizes.
Proposal - a version of gather with a list of outbufs.
Proposal [Flower] - all-to-all gather(???)

Cownie moved to direct gather and scatter back to the subcommittee for
further consideration.  Accepted.

3.6 (Global Compute Operations)
-------------------------------

Issues:
  + inbuf=outbuf problem?  2nd version
  + have types, so don't need (R,I)MAX, etc.
  + return value to all?  3rd version
  + maxloc (etc.) return location and value
  + restriction on user defined functions
  + vectorized user function

Returned to the subcommittee for further work.

scan - nothing to say for today.
correctness - nothing to say for today.

Finished with collective communication for today.

-------------------------------------------------------------------------------
Report From the Process Topology Subcommittee
-------------------------------------------------------------------------------

Rolf Hempel presided.  No presentation, but need to discuss the direction
to go.

First question - is topology going to be part of group management at all?

Rolf remarks that the vast majority of applications have a natural
topology.  There are implementation efficiencies - e.g. avoiding tables
for mapping to processors.

Marc: A user visible mapping of processes to processors is not likely to
be valuable.  The trend in hardware is to hide the hardware topology.

What is the advantage of topology?  Convenience in writing programs when
the topology is natural to the problem.  This information may be useful
for implementation on particular systems.

What do vendors have to say about the utility of such information?
Cownie: The point is to be as flat as possible.
Snir: Topology can certainly be built as a superstructure.  What is the
value of integration?  Convenience, safety.  Safety does not appear
important, and he does not see sufficient value in the convenience.

The major issue is the relation of topologies and groups.  One can store
topology as part of the group (current proposal) or as a superstructure
(Marc's preference.)  It is certainly convenient to have a standardized
method to build e.g. a row group, column group, etc.  This is what is in
the current draft, but the issue is one of integration.  But is it more
efficient, or even substantially more convenient?

Three possibilities: not in MPI; in MPI but not integrated; integrated
into the group mechanism in MPI.  Various repetitive arguments for each
of these positions.

Straw vote: Topologies in MPI?  If yes, integrated with groups (e.g.
----------  eoshift), OR separate library, environmental inquiry.
      In: 25           Out: 4         Abstain: 5
      Integrated: 4    Separate: 23   Abstain: 7

Back to discussion in the subcommittee.
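{What the "separate library" outcome leaves to the user is routine but
fiddly arithmetic.  A self-contained illustration of deriving mesh
coordinates and periodic shift partners from a flat rank - the
computation a standardized row-group/column-group facility would
package; the mesh dimensions here are made up.}

    #include <stdio.h>

    int main(void)
    {
        int nrows = 4, ncols = 5;          /* a 4x5 process mesh      */
        int rank = 7;                      /* e.g. this process       */
        int row = rank / ncols, col = rank % ncols;

        /* east/west partners for a periodic cshift along the row */
        int east = row * ncols + (col + 1) % ncols;
        int west = row * ncols + (col + ncols - 1) % ncols;

        printf("rank %d = (%d,%d): east %d, west %d\n",
               rank, row, col, east, west);
        return 0;
    }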
Meetings tonight:
  p-p   room 1
  c-c   room 2

=============================================================================

Friday, April 2, 1993

-------------------------------------------------------------------------------
Report From the Profiling Subcommittee
-------------------------------------------------------------------------------

Profiling - Jim Cownie
----------------------

Based on draft section ?? of the document.

This is a single level approach and is basically static, i.e. based on
selecting the actual functions at link time.

Questions:
  1) Is one level OK? - the other option is a chain of function pointers
  2) Debugging support?
      - Dump all message envelopes
      - Status of active handles

A single level has no extra cost, but it limits, as one might want to use
the single level for another purpose, e.g. as a network intermediary.
Yes, that limit could be significant.

Provide a multi-level interface in the same manner (i.e. publishing
alternative names for each level)?  [The problem is setting up the
multiple interfaces.]  Provide environmental facilities for exporting a
multi-level dynamic approach?

General agreement that a single level is better than extra cost for
everyone.  Note that the single level actually can support a full dynamic
approach, as the MPI routines can be replaced by functions that do
function pointer swizzling.

Debugging support?
------------------

A small amount of discussion last night.  What is needed and what can be
provided?  Besides the items mentioned above, one needs to be able to
decode the opaque objects in MPI.  It would be useful to have a recording
mechanism.

-------------------------------------------------------------------------------
Report From the Language Binding Subcommittee
-------------------------------------------------------------------------------

Scott Berryman

Gross assumptions:
  + No Fortran 90 binding
  + C++ binding = C binding
  + Specification says nothing about language interoperability

{The confusion bomb went off!}  [sender deals with native message; XDR in
general; underspecified buffer descriptors - incl. lengths or incl. lang.
spec.; general vs. limited translation; know transl. in hetero - don't
know in homo; this is not just a language issue - it is a language
implementation issue]  [Need concrete proposals]

-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee (continued)
-------------------------------------------------------------------------------

Marc Snir

Byte vs. Element Count
  + Need ADD_VEC, ADD_INDEX with byte displacement
  + Most usage will be element displacement

    [R] [III] [R] [III] [R] [III] [R] [III]    (array of records)
    so odd displacement

Have two different components:
  + BLOCK   start
            len - number of elements
            data type
  + VEC     start
            len - total number of elements
            stride - number of elements between blocks
            lenblk - number of elements per block
            data type
  + HVEC    start
            len - total number of elements
            stride - number of BYTES between blocks
            lenblk - number of elements per block
            data type

    [///] [////] [   ] [///] [////] [   ] [///] [////] [   ]

    VEC(..., 5, 3, 2, REAL)
    HVEC(..., 5, 3*size_of_real, 2, REAL)

  + INDEX   start
            array_of_indices - element index (start has index zero)
            type
  + HINDEX  start
            array_of_indices - element displacement in bytes
            type

HVEC and HINDEX are more general, but less convenient and more error
prone.
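{A worked instance of why the byte-stride forms exist, using the
[R][III] record layout sketched above.  The struct is ours, and the HVEC
call in the comment is hypothetical syntax following the slide's
argument order.}

    #include <stdio.h>

    /* The [R][III] record from the slide: one REAL, three INTEGERs. */
    struct rec { float r; int i[3]; };

    int main(void)
    {
        struct rec a[4];
        /* The REAL fields recur every sizeof(struct rec) BYTES, which
         * is not in general a whole number of REALs, so VEC (element
         * stride) cannot describe them.  HVEC (byte stride) can:
         *   HVEC(&a[0].r, 4, sizeof(struct rec), 1, REAL)
         *        start,  len, stride in bytes,  lenblk, type         */
        printf("byte stride between REAL fields: %d\n",
               (int)((char *)&a[1].r - (char *)&a[0].r));
        printf("sizeof(float) = %d\n", (int)sizeof(float));
        return 0;
    }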
Returned length of received messages
------------------------------------

The number of elements IS always meaningful (even for a message
containing multiple types).  And the number of elements sent is always
the same as the number of elements received, while the number of bytes
sent/received need not be the same.  Moreover the number of elements is
not too hard to compute.

Proposal: Use the element count, except for displacements in the Hxx
buffer components.

Organizations present: 22

VOTE: proposal as above
----
      Yes: 19  No: 0  Abstain: 3

Probe
-----

  1) Use to decide where to receive a message (allocate memory).
  2) Use to debug.
Propose to support 1.

Do we need probe?  Why not receive into a system buffer and return a
pointer?  Problems - the system buffer is untyped; memory management.

Proposal:
  MPI_PROBE(source, tag, context, flag, return_status_handle)
  MPI_RETURN_STATUS(handle, len, source, tag)

Assuming no other concurrent receive (single thread), an MPI_RECV
executed with the source/tag returned by PROBE and the same context will
return the message found by the probe.  Multithreaded programs need a
suitable critical region.

What is returned in the len field?  It should be the number of elements,
but unfortunately we don't know this without a buffer descriptor.  So
return the number of bytes?  This may not be useful for deciding the size
of the buffer in a heterogeneous environment.  So the possibilities:
  1) len = -1, and provide DECODE(buff_desc, msg_status_handle)
  2) Number of bytes ("on the wire")
  3) Number of elements (cost of including this info in the envelope)

Could have 2) and also provide DECODE.  The only inconvenience is the
difference in units between DECODE and RECEIVE.  Another alternative is
to provide a data type as part of the PROBE rather than a
buffer_descriptor - this would cover the most common case of uniform
buffer_descriptors.  Cost of rebuilding the buffer_descriptor for the
receive, because one has to add in the pointer for the actual storage
space.  {{???Is this right}}

Oliver objected, so another proposal:
  4) PROBE(source, tag, context, type, return_status_handle) returning
     the number of elements.

Straw Vote: Probe?
----------
      Yes: 23  No: 2

Straw Vote:
----------
      1) probe - no length            8
      2) probe - simple type length  17
      3) probe - byte count          11
      4) probe - element count        6
      5) decode function             22
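{Use 1) in action: a hedged sketch of probe followed by a matching
receive, using the argument lists proposed above.  The byte-count len
(option 2) is assumed for sizing the buffer, and MPI_RECV's signature is
illustrative only.}

    #include <stdlib.h>

    /* Hypothetical declarations following the proposal's argument
     * lists -- a sketch, not a binding. */
    typedef void *mpi_handle;
    extern const int SOURCE_DONTCARE, TAG_DONTCARE;
    void MPI_PROBE(int source, int tag, int context,
                   int *flag, mpi_handle *status);
    void MPI_RETURN_STATUS(mpi_handle status,
                           int *len, int *source, int *tag);
    void MPI_RECV(void *buf, int len, int source, int tag, int context);

    void receive_unknown_size(int context)
    {
        int flag, len, source, tag;
        mpi_handle status;

        /* typed DONTCAREs per the earlier vote */
        MPI_PROBE(SOURCE_DONTCARE, TAG_DONTCARE, context, &flag, &status);
        if (flag) {
            MPI_RETURN_STATUS(status, &len, &source, &tag);
            char *buf = malloc(len);   /* len as byte count (option 2) */
            /* single-threaded, same source/tag/context: this receive
             * returns the message the probe found */
            MPI_RECV(buf, len, source, tag, context);
            /* ... use buf ... */
            free(buf);
        }
    }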
Cancel
------

MPI_CANCEL(handle)

Either the communication succeeds or the CANCEL succeeds, but not both
nor neither.  (This is to cancel a non-blocking send.)

  + Need CANCEL to recover committed resources.
  + Implementation is not trivial.

Is it valid for CANCEL to always fail?  No; if there is a send with no
corresponding posted receive, then CANCEL must succeed.

Recognition that CANCEL may be a very expensive operation.  For example,
it may require an interrupt driven mechanism.  What is the effect (cost)
on normal communication?

Straw Vote: CANCEL?
----------
      Yes: 14  No: 6

Type Mismatch
-------------

Suppose we send 4 bytes and receive an integer, or conversely.  Is this
unsafe or erroneous?  Various proposals: type mismatch is always
erroneous; type mismatch is never erroneous; the BYTE type is an escape
hatch.  Another proposal is to allow type conversion as well.

Straw Vote:
----------
      1. type mismatch always erroneous          2
      2. type mismatch erroneous except BYTE    12
      3. never erroneous                         9

-------------------------------------------------------------------------------
Report on MPI -1.2
-------------------------------------------------------------------------------

Jim Cownie presented.

Procedure clarification: All official votes require a majority of
ABSTENTIONS.

Added Features:
  + Insecure send, for users who lack confidence.
  + Tags are opaque objects.
  + Message data is opaque.
  + All lengths are in bits, measured as floats (for sufficient
    precision).
  + Messages are not passed in envelopes (they are too small), but in
    packing crates.
  + Context proposal: a number server - returns a 64 bit integer
    (expected to last at least 1 week).
  + Group simplification: the maximum number of elements in a group is 1.
  + All communications occur in a group.