Mon Nov 23 16:03:12 EST 1992
From owner-mpi-collcomm@CS.UTK.EDU  Tue Nov 24 23:07:28 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA15625; Tue, 24 Nov 92 23:07:28 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA26099; Tue, 24 Nov 92 22:57:48 -0500
Received: from gstws.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA26095; Tue, 24 Nov 92 22:57:45 -0500
Received: by gstws.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA13365; Tue, 24 Nov 1992 22:57:44 -0500
Date: Tue, 24 Nov 1992 22:57:44 -0500
From: geist@gstws.epm.ornl.gov (Al Geist)
Message-Id: <9211250357.AA13365@gstws.epm.ornl.gov>
To: mpi-collcomm@cs.utk.edu
Subject: MPI collective communication...


Collective communication subcommittee.

Welcome. We have our work cut out for us - first because collective
communication was not included in the first iteration of the MPI draft
and second because "groups" caused the most resistance in the last meeting.

In the next 6 weeks we need to come up with and agree on the definition 
of a set of routines that fall under the jurisdiction of collective
communication. As I see it these routines fall into two categories.

- routines that require the cooperation of a group of processes.
  This includes collective communication like multicast 
  and cooperative routines like synchronization.

- routines that create groups of processes and potentially modify these groups.
  This also needs to include group information routines 
  that we feel are required like who am I in the group.

Two items that need to be coordinated with the pt2pt subcommittee
heterogeneity - it's not in the present MPI draft. If we want to be able
                to execute across heterogeneous networks, then we have to 
                think about how a process is identified in MPI and
                also how a message buffer can get encoded/decoded.
                For the latter we will need to know the type of the
                data in pack/unpack routines. 
                (or specified directly in the send/recv)

Inter-group communication - point to point communication between two members
                of a group.

As a first step I would like to get everyone's ideas out on the table
so we can see what type of consensus we have. And so we don't miss any
good ideas. So what basic routines (functions) do you think are required?
I would like to get your input to this first step by December 5.
----------------------
Since I got the short straw, I'll go first.
My basic philosophy about MPI and our standards effort is to
KEEP THINGS SIMPLE. It is easier to add a function later if
we see lots of users combining the basic routines in standard ways.
It is a waste to support a bunch of routines only 1% of the users ever call.

General:
I would like to see all the routines be functions that return error code(s)
as opposed to subroutines.

=======================================================================
Groups:
=======================================================================
Groups could be implemented separate from the collective communication routines.
The collective routines could take an integer array list of task IDs
and there could be a group routine that returned such a list.
There are efficiency factors here since the list of members of a group
would not have to be looked up every time a collective routine was called.
FUNCTIONS: groupsize()
           groupmembers()

GID: groups could be user named and addressed by name
or they could be addressed by a system supplied (unique) integer group ID.

Question - should groups be allowed to overlap?
Question - should we let groups be dynamic or restrict them to be static?

Group member IDs: There should be a notion of the members of a group
being addressable either directly or indirectly by [0 -- num_of_members-1]
There needs to be a routine to return mygroupINDEX (at least) and maybe
a more general routine that can return any process' group index.
FUNCTIONS: gettaskID( given GID and group index )
           getindex( given GID and taskID )

Creating groups: Here are three alternative methods. 
Method 1 (dynamic)
Most general case is to allow any task to join or leave
any group at any time without the consent of the other group members.
While this creates a simple and flexible user interface, it can be 
difficult to implement because of the potential race conditions.
FUNCTIONS: joingroup()
           leavegroup()

Method 2 (static)
A group could be defined by any single task by listing the task IDs.
Or alternatively all the future members of a group have to simultaneously
define the same group.
FUNCTION: makegroup()

Method 3 (dynamic)
Another method which met with some resistance when presented at the 
last MPI meeting was the notion of creating groups by partitioning
an existing group. The negative comments were the large number of routines
involved and the lack of usefulness of a tree of groups.
I am not keen on this method but for completeness.
FUNCTIONS: from MPI draft
           partition()
           root()
           children()
           parent()
           siblings()
           pushg()
           popg()

==============================================================================
Collective Routines:
==============================================================================
One problem we can get into is defining many different 
collective communication routines gmax, gsum, gadd, etc.
I propose that we have only a handful of routines based 
on the underlying communication logic.
All participating tasks call the same function.

FUNCTIONS:

broadcast()  broadcast a message from one task to all tasks in a group.

reduce()     inverse of broadcast. Data from all tasks in a group
			 is reduced using a predefined function or a user function
			 and the result is placed in a specified task.
			 Function name is specified in the argument list.
			 Pre-defined functions should include: max, min, add, mult,
			 and optionally AND, OR, XOR. (others?)

scatter()    a single task contains different messages for each task.
			 Scatter these messages to all tasks in a group.

gather()     inverse of scatter. gather distinct messages from each task
			 in a group and collect them in a specified task.

synchronize()  barrier synchronization of a group of tasks.

shift()      assume group members form a (logical) ring.
			 shift the message in each task to its right (or left) neighbor.
			 (useful in matrix multiply shift and roll algorithm)

exchange()   equivalent to every task in a group calling scatter.
             (routine used for matrix transpose)

all2all()    equivalent to every task in a group calling broadcast.

                          -----------------------------
   __o        /\          Al Geist
 _`\<,_    /\/  \         Oak Ridge National Laboratory
(_)/ (_)  /      \        (615) 574-3153   gst@ornl.gov
* * * * * * * * * *       -----------------------------
From owner-mpi-collcomm@CS.UTK.EDU  Wed Nov 25 13:41:20 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA21743; Wed, 25 Nov 92 13:41:20 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA09805; Wed, 25 Nov 92 13:14:39 -0500
Received: from relay2.UU.NET by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA09790; Wed, 25 Nov 92 13:14:31 -0500
Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP 
	(5.61/UUNET-internet-primary) id AA22179; Wed, 25 Nov 92 13:14:32 -0500
Received: from kailand.UUCP by uunet.uu.net with UUCP/RMAIL
	(queueing-rmail) id 131339.4005; Wed, 25 Nov 1992 13:13:39 EST
Received: from brisk.kai.com (brisk) by kailand.kai.com via SMTP
  (5.65d-92031301) id AA12688; Wed, 25 Nov 1992 12:06:04 -0600
Received: by brisk.kai.com
  (920330.SGI-92101201) id AA08958; Wed, 25 Nov 92 12:06:02 -0600
Date: Wed, 25 Nov 92 12:06:02 -0600
Message-Id: <9211251806.AA08958@brisk.kai.com>
To: mpi-pt2pt@cs.utk.edu, mpi-collcomm@cs.utk.edu, mpi-formal@cs.utk.edu,
        mpi-ptop@cs.utk.edu
Reply-To: William.Gropp's.message.of.Wed@kai.com,
        25 Nov 92 09:28:43 CST <9211251528.AA12985@godzilla.mcs.anl.gov>
Subject: Nonblocking functions and handlers.
From: Steven Ericsson Zenith <zenith@kai.com>
Sender: zenith@kai.com
Organization: 	Kuck and Associates, Inc.
		1906 Fox Drive, Champaign IL USA 61820-7334,
		voice 217-356-2288, fax 217-356-5199


Bill Gropp writes:

    (Warning: radical position that I'm not sure even I hold follows:)
    An interesting issue is whether we should defer all nonblocking communications
    to a thread-based execution model.

I'm not so sure this is a radical position Bill since even
nonsynchronized communication will need to be defined formally this way.
Nonsynchronized communication is in effect creating a parallel process
that has the job of passing the communication on. Al Geist earlier asked
the question wheather buffers used by nonsynchronized communication
should be accessible after the communication has started - the answer
should be - no, unless by some explicit mechanism that formally amounts
to a communication with the process mentioned above.  Any nonexplicit
interaction (e.g. a write to the buffer) would have to be specified as
formally equivalent to an explicit interaction.

Also, there is quite a range of terminology in use.  One common error:
"Asynchronous" and "synchronous" has quite a particular meaning in EE
and when CS people use the terms in relation to message passing they
usually mean NONSYNCHRONIZED and SYNCHRONIZED. Also BLOCKING =
SYNCHRONIZED. Let us begin a glossary that defines the terms we use - if
no-one else volunteers I'll take this to be the responsibility of the
Formal Specification Subcommittee. So I'm looking for volunteers from
that subcommittee.

Steven

From owner-mpi-collcomm@CS.UTK.EDU  Wed Nov 25 15:37:50 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA25014; Wed, 25 Nov 92 15:37:50 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA12301; Wed, 25 Nov 92 15:15:34 -0500
Received: from relay2.UU.NET by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA12297; Wed, 25 Nov 92 15:15:32 -0500
Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP 
	(5.61/UUNET-internet-primary) id AA26923; Wed, 25 Nov 92 15:15:35 -0500
Received: from kailand.UUCP by uunet.uu.net with UUCP/RMAIL
	(queueing-rmail) id 151429.20685; Wed, 25 Nov 1992 15:14:29 EST
Received: from brisk.kai.com (brisk) by kailand.kai.com via SMTP
  (5.65d-92031301 for <mpi-collcomm@cs.utk.edu>) id AA15317; Wed, 25 Nov 1992 13:22:32 -0600
Received: by brisk.kai.com
  (920330.SGI-92101201) id AA09015; Wed, 25 Nov 92 13:22:31 -0600
Date: Wed, 25 Nov 92 13:22:31 -0600
Message-Id: <9211251922.AA09015@brisk.kai.com>
To: geist@gstws.epm.ornl.gov
Cc: mpi-collcomm@cs.utk.edu
In-Reply-To: Al Geist's message of Tue, 24 Nov 1992 22:57:44 -0500 <9211250357.AA13365@gstws.epm.ornl.gov>
Subject: MPI collective communication...
From: Steven Ericsson Zenith <zenith@kai.com>
Sender: zenith@kai.com
Organization: 	Kuck and Associates, Inc.
		1906 Fox Drive, Champaign IL USA 61820-7334,
		voice 217-356-2288, fax 217-356-5199


   Date: Tue, 24 Nov 1992 22:57:44 -0500
   From: geist@gstws.epm.ornl.gov (Al Geist)

	<discussion on groups>

Is this discussion not the domain of the process and topology subcommittee?

   ==============================================================================
   Collective Routines:
   ==============================================================================
   One problem we can get into is defining many different 
   collective communication routines gmax, gsum, gadd, etc.
   I propose that we have only a handful of routines based 
   on the underlying communication logic.
   All participating tasks call the same function.

   FUNCTIONS:

   broadcast()  broadcast a message from one task to all tasks in a group.

Agreed. So, using my earlier suggestion where communications are
"channels" (logically shared objects .. whatever) with logical names, a
broadcast channel called G

	/* example of a declaration */
	communication broadcast_type(N) <datatype> G

"(N)" identifies the number of participants in the broadcast -
would be broadcast to by

	broadcast(G, expression)

(actually this would be formally equivalent to "send(G, expression)"
since G carries the broadcast semantics)

and each process (task) would recieve the message by

	receive(G, variable)

You must clearly identify the meaning of parallel broadcasts to the same
group. I would choose the constuction

	(... broadcast(G, x) ...) ||
	(... broadcast(G, y) ...) ||
	(... receive(G, v1) -> receive(G, v2) ...) ||
	...
to mean
	v1 = x | y
	v2 = x iff v1 = y
	v2 = y iff v1 = y

   reduce()     inverse of broadcast. Data from all tasks in a group
			    is reduced using a predefined function or a user function
			    and the result is placed in a specified task.
			    Function name is specified in the argument list.
			    Pre-defined functions should include: max, min, add, mult,
			    and optionally AND, OR, XOR. (others?)

I'm not sure I like the introduction of the function name. The inverse
of broadcast though is in effect a many-to-one. So

	/* example of a declaration */
	communication reduce_type(N) <datatype> R

would be written to by 

	send(R, e)

in N processes, and

	reduce(R, v, f)

is equivalent to

	receive( R, result )
	receive( R, v)
	v = f( result, v )
	receive(R, result)
	v = f(result, v)
	... until N receive times

In both broadcast and reduce cases we have left it to the implementation
to count the distinct communication instances.

Again we must concern ourselves with the meaning of parallel reduce constuctions

	(... reduce(R, v1, f) ...) ||
	(... reduce(R, v2, f) ...) ||
	(... send(R, x) -> send(R, y) ...)

It would be simplest to restrict this case and say reduce can only
appear in one process for each reduce type, but what about 

	(... reduce(R, v2, f) ...) ||
	(... send(R, x) -> send(R, y) ...)

does each send in sequence apply to one reduce or subsequent reduces. To
be the inverse of broadcast it would be the former.

   scatter()    a single task contains different messages for each task.
			    Scatter these messages to all tasks in a group.

Isn't this an abbreviation for a sequence of sends on an array of
one-to-one channels? So an array of channels S

	/* example of a declaration */
	communication one-to-one (N) S

where 

	scatter(S, A)

such that A is an array of size N, and the scatter is equivalent to

	parallel do i
		send(S[i], A[i])
	end parallel do

and the corresponding receive looks like

	receive(S[i], v)

   gather()     inverse of scatter. gather distinct messages from each task
			    in a group and collect them in a specified task.

Similarly, this an abbreviation for a sequence of recieves on an array of
one-to-one channels. So an array of channels G

	/* example of a declaration */
	communication one-to-one (N) G

where 

	gather(G, A)

such that A is an array of size N, and the gather is equivalent to

	parallel do i
		receive(G[i], A[i])
	end parallel do

and the corresponding send looks like

	send(G[i], e)

   synchronize()  barrier synchronization of a group of tasks.

This is also a many-to-one where the one is a synchronization process
created by the declaration (yes, I know this sounds odd).

		/* example of a declaration */
	communication sync SYNC
and
	synchronize(SYNC)

is equivalent to the output

	send(SYNC)

i.e. send with no output value.

   shift()      assume group members form a (logical) ring.
			    shift the message in each task to its right (or left) neighbor.
			    (useful in matrix multiply shift and roll algorithm)

This can be constructed from the above.

   exchange()   equivalent to every task in a group calling scatter.
		(routine used for matrix transpose)

This is tricky, and isn't as simple as is implied. I have no trouble
with it if we can specify a deadlock free implementation, but frankly I
think it is out of place here.

   all2all()    equivalent to every task in a group calling broadcast.

Why doesn't this cause deadlock in the group? Nah! It does cause deadlock.

Steven


From owner-mpi-collcomm@CS.UTK.EDU  Wed Nov 25 18:12:56 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA26783; Wed, 25 Nov 92 18:12:56 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA15993; Wed, 25 Nov 92 18:07:19 -0500
Received: from relay1.UU.NET by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA15989; Wed, 25 Nov 92 18:07:17 -0500
Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay1.UU.NET with SMTP 
	(5.61/UUNET-internet-primary) id AA23256; Wed, 25 Nov 92 18:07:14 -0500
Received: from kailand.UUCP by uunet.uu.net with UUCP/RMAIL
	(queueing-rmail) id 180640.18440; Wed, 25 Nov 1992 18:06:40 EST
Received: from brisk.kai.com (brisk) by kailand.kai.com via SMTP
  (5.65d-92031301 for <mpi-collcomm@cs.utk.edu>) id AA24413; Wed, 25 Nov 1992 16:21:00 -0600
Received: by brisk.kai.com
  (920330.SGI-92101201) id AA09165; Wed, 25 Nov 92 16:20:58 -0600
Date: Wed, 25 Nov 92 16:20:58 -0600
Message-Id: <9211252220.AA09165@brisk.kai.com>
To: zenith@kai.com
Cc: geist@gstws.epm.ornl.gov, mpi-collcomm@cs.utk.edu
In-Reply-To: Steven Ericsson Zenith's message of Wed, 25 Nov 92 13:22:31 -0600 <9211251922.AA09015@brisk.kai.com>
Subject: MPI collective communication...
From: Steven Ericsson Zenith <zenith@kai.com>
Sender: zenith@kai.com
Organization: 	Kuck and Associates, Inc.
		1906 Fox Drive, Champaign IL USA 61820-7334,
		voice 217-356-2288, fax 217-356-5199


An typo. error crept into my last message.

	v1 = x | y
	v2 = x iff v1 = y
	v2 = y iff v1 = y

should, of course, be

 	v1 = x | y
	v2 = x iff v1 = y
	v2 = y iff v1 = x

And in the examples all sends are synchronized (blocking).

Steven


From owner-mpi-collcomm@CS.UTK.EDU  Wed Nov 25 19:37:42 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA28059; Wed, 25 Nov 92 19:37:42 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA16782; Wed, 25 Nov 92 19:14:32 -0500
Received: from relay2.UU.NET by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA16778; Wed, 25 Nov 92 19:14:29 -0500
Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP 
	(5.61/UUNET-internet-primary) id AA25540; Wed, 25 Nov 92 19:14:34 -0500
Received: from kailand.UUCP by uunet.uu.net with UUCP/RMAIL
	(queueing-rmail) id 191312.8393; Wed, 25 Nov 1992 19:13:12 EST
Received: from brisk.kai.com (brisk) by kailand.kai.com via SMTP
  (5.65d-92031301 for <mpi-collcomm@cs.utk.edu>) id AA26829; Wed, 25 Nov 1992 17:33:21 -0600
Received: by brisk.kai.com
  (920330.SGI-92101201) id AA09251; Wed, 25 Nov 92 17:33:20 -0600
Date: Wed, 25 Nov 92 17:33:20 -0600
Message-Id: <9211252333.AA09251@brisk.kai.com>
To: geist@gstws.epm.ornl.gov, mpi-collcomm@cs.utk.edu
In-Reply-To: Steven Ericsson Zenith's message of Wed, 25 Nov 92 13:22:31 -0600 <9211251922.AA09015@brisk.kai.com>
Subject: MPI collective communication...
From: Steven Ericsson Zenith <zenith@kai.com>
Sender: zenith@kai.com
Organization: 	Kuck and Associates, Inc.
		1906 Fox Drive, Champaign IL USA 61820-7334,
		voice 217-356-2288, fax 217-356-5199


Observation on the following point:

	    synchronize()  barrier synchronization of a group of tasks.

	 This is also a many-to-one where the one is a synchronization process
	 created by the declaration (yes, I know this sounds odd).

			 /* example of a declaration */
		 communication sync SYNC
	 and
		 synchronize(SYNC)

	 is equivalent to the output

		 send(SYNC)

	 i.e. send with no output value.

I should clarify this. Given

	(P||Q);R

This reads P and Q in parallel followed by R; i.e., there is a barrier
at the semicolon. To implement this barrier using Al's primitive the
compiler in effect places a send(SYNC) at the end of P and Q and the
corresponding receive(SYNC);receive(SYNC) at the start of R. Using
something, perhaps more familiar

	begin parallel
	   section
		P
	   end section
	   section
		Q
	   end section
	end parallel
	R

translated using MPI might become the following three programs executed
on three nodes of a distributed memory machine

	program Node0
		P
		synchronize(SYNC)
	end program

	program Node1
		Q
		synchronize(SYNC)
	end program

	program Node2
		receive(SYNC)
		receive(SYNC)
		R
	end program

But now I'm less convinced we need a separate synchronize primitive and
should just permit "empty" messages in send and receive for their
synchronization characteristics. (An implementation may, of course,
choose to send a dummy value to gain the same effect).

Steven
	



From owner-mpi-collcomm@CS.UTK.EDU  Fri Nov 27 12:08:43 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA25724; Fri, 27 Nov 92 12:08:43 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA08742; Fri, 27 Nov 92 12:06:12 -0500
Received: from relay1.UU.NET by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA08738; Fri, 27 Nov 92 12:06:10 -0500
Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay1.UU.NET with SMTP 
	(5.61/UUNET-internet-primary) id AA02767; Fri, 27 Nov 92 12:06:08 -0500
Received: from kailand.UUCP by uunet.uu.net with UUCP/RMAIL
	(queueing-rmail) id 120540.25488; Fri, 27 Nov 1992 12:05:40 EST
Received: from brisk.kai.com (brisk) by kailand.kai.com via SMTP
  (5.65d-92031301) id AA12937; Fri, 27 Nov 1992 10:25:06 -0600
Received: by brisk.kai.com
  (920330.SGI-92101201) id AA11158; Fri, 27 Nov 92 10:25:05 -0600
Date: Fri, 27 Nov 92 10:25:05 -0600
Message-Id: <9211271625.AA11158@brisk.kai.com>
To: mpi-collcomm@cs.utk.edu
Cc: mpi-formal@cs.utk.edu
In-Reply-To: Steven Ericsson Zenith's message of Wed, 25 Nov 92 13:22:31 -0600 <9211251922.AA09015@brisk.kai.com>
Subject: MPI collective communication...
From: Steven Ericsson Zenith <zenith@kai.com>
Sender: zenith@kai.com
Organization: 	Kuck and Associates, Inc.
		1906 Fox Drive, Champaign IL USA 61820-7334,
		voice 217-356-2288, fax 217-356-5199


Observation on the following:

	   all2all()    equivalent to every task in a group calling broadcast.

	Why doesn't this cause deadlock in the group? Nah! It does cause deadlock.

I was thinking about this yesterday over my stuffed Tofu :-). Even if we
permit the broadcast to be nonsynchronized we have the problem I
described earlier with defining the behavior of parallel broadcasts. If
all2all is nonsynchronized then the order of received values must be
nondeterministic.

(|| i for N: broadcast(C, e[i])) || (|| k for N:|| j for N: receive(C, v[k, j]))

i.e., the order of values from e in v is nondeterministic. Now maybe I'm
missing something that has to do with the TMC perspective - in any case,
I have never seen the use of such a construction in an application. If
we do specify a deadlock free behavior for all2all is it desirable given
this nondeterminism? I know it's implementation will be tricky to get
right. Can we have some vendor comments please?

I have assumed here that the values broadcast are the same type.

Steven

Footnote: The syntax 

(|| i for N: broadcast(C, e[i])) || (|| k for N:|| j for N: recieve(C, v[k, j]))

illustrates N broadcasts implementing the all2all, where N is the number
of participants, in parallel with N parallel groups of N (parallel) receives.

From owner-mpi-collcomm@CS.UTK.EDU  Fri Nov 27 12:37:48 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA25951; Fri, 27 Nov 92 12:37:48 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA08855; Fri, 27 Nov 92 12:17:02 -0500
Received: from sampson.ccsf.caltech.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA08851; Fri, 27 Nov 92 12:16:57 -0500
Received: from elephant by sampson.ccsf.caltech.edu with SMTP id AA24714
  (5.65c/IDA-1.4.4 for mpi-collcomm@cs.utk.edu); Fri, 27 Nov 1992 09:16:50 -0800
Received: by elephant (4.1/SMI-4.1)
	id AA13810; Fri, 27 Nov 92 08:13:55 PST
Date: Fri, 27 Nov 92 08:13:55 PST
From: jwf@parasoft.com (Jon Flower)
Message-Id: <9211271613.AA13810@elephant>
To: mpi-collcomm@cs.utk.edu

To: mpi-collcomm@cs.utk.edu
Re: A few ideas

In response to Al Geist's request here are a few more or less
random ideas about collective communication based on my own
experience......

Some comments about groups:

    I think that we should not describe the "group" concept in
    collective communication in terms of lists of task ID's. It
    might be implemented that way but I think the underlying
    concept should be related to the user application topology.

    I think the key to really optimizing collective 
    communication routines is to be to match the system's geometric
    knowledge of the hardware topology with the geometric
    behavior of the "logical topology" of the user code. 
    So, for example, you can do a lot better on a column-restricted
    broadcast if you know that the user's logical topology actually
    matches the hardware of the DELTA (for example).

    Similarly the exchange primitive doesn't make too much
    sense when defined only in terms of a list of nodes since
    at best the user is left with the responsibility of forming
    the list in the right order.

    I would like to see groups described in conjunction with
    the topological info and, in general, I though Rolf's idea
    was pretty good except that I didn't see how to deal with
    a very common case; a broadcast from a "host" program
    to all of its "nodes" or a reduction from all the "nodes"
    into the "host". These can both be represented as the
    combination of a "node" only operation and a point-to-point
    operation but it would be nice to encapsulate them somehow
    since they come up all the time.

    We could leave open a loophole for "expert" users to make their
    own groups from lists of task ID's but I don't know how we'd
    optimize their behavior.

Some comments about the individual functions:

broadcast:
---------
    I'm not sure who to address this issue to - it probably
    falls outside our domain of comment, but how do you
    deal with the case of a "master" program broadcasting
    to all of its slaves? (Actually read "host" for "master"
    and "node" for "slaves".) Does MPI1 even support this
    concept? It comes up all the time in our applications.
    I suppose this is an application of the group concept
    but it's one that I would like to see very streamlined
    because of its generality.

reduce:
------
    The comments for "reduce" and "gather" indicate that only
    one task in the group can get at the result? I would hope
    that there was some way for all the tasks in a group to 
    get the answer too, without following the reduce/gather
    with a broadcast operation since this looses a lot of
    efficiency.

    I like the idea of a function pointer for reduce.
    Is this done by having a general facility for the user
    with a function pointer argument and then providing a list
    of pre-defined "external" functions that do the common tasks?
    This would be my preference since the heterogeneity is then
    taken care of by the system. However, how do you express
    a reduce on a standard data type using a user-specified 
    function? Is there an argument to reduce that says the data
    type so that the system can still byte swap or are we
    going to restrict reduce (and possibly all collective ops)
    to the "byte-stream" data type and force the user to
    deal with it themselves. This latter is horrible because
    putting the byte swapping in the right places for a reduction
    operation is hard.

    I would like to add "average" to the list of predefined 
    functions even though it's a triviality.

gather:
------
    As for reduce - I would hope that all tasks can get at the result
    too.

synchronize:
-----------
   This one seems to be a real thorn. I would like to have a 
   non-blocking synchronize - you call the function to say that
   you're interested in synchronizing a particular group of
   tasks and then later check to see whether they've all done
   or not. This is very valuable in certain types of event-driven 
   simulation, for example, where you might start each time 
   step by invoking the sync. function and then go off and
   respond to incoming events. Periodically you then check to
   see if everyone in your group has checked in and if so, 
   increase global virtual time for the next step.

   A non-blocking sync. also allows a single (master) task to wait for
   the completion of either/or subtasks in two disjoint slave 
   groups. Obviously this can be done in another way but is very
   elegant and simple to code with non-blocking syncs.

   I would propose both a blocking and a non-blocking "wait for
   sync to complete" function in the same way that the point-to-
   point style has both.

shift:
-----
   How do you specify the (non-)periodicity of the edge elements?
   In fact what does left and right actually mean - is there an implied
   ordering in the entries of a group?

exchange, all2all:
-----------------
    These are life savers in my opinion since they encapsulate
    the biggest problem that I've seen in user codes. Writing
    these with point-to-point message passing primitives almost
    guarantees that the code doesn't scale and that it runs out
    of memory as you go to more nodes or even bigger problems.

    On the downside I agree with Steve Zenith that implementations
    of these functions are hard. I would also say that the ways that
    user's use these functions often because they don't want to
    think about a better decomposition method and so it's possible
    that my supporting these functions we are contributing to less
    than optimal coding at the user level. I would still vote
    them in, however, on the grounds that I would get fewer
    phone calls from customers!

Generalities:
============
I think the set of functions listed is rich enough 
for most applications. It would be interesting to see how many
arguments these things end up with when you try to write down
functional specs. I wonder if it might be worth having two
functions in each category; one with very few arguments that
does what most user's will probably want and another that
has all the arguments and flexibility. This might reduce the
number of "simple" mistakes that can be made. 

For example, I often forget the "EXTERN MAX" that you need 
to pass MAX as a function pointer in FORTRAN programs. Perhaps 
the simple form of the reduce operation could have a variable 
indicating the operation type instead?

Do the collective routines have message types like the 
point-to-point routines? In general I don't think they need to
since everyone is participating at once. On the other hand if
you make a mistake in this regard having a different message
type for each one sometimes facilitates looking them up in
a debugger. The one area where a message type might be 
interesting is in regard to the "synchronize" primitive
as discussed in the comments above.

	Jon Flower, jwf@parasoft.com
	ParaSoft Corp.
	818-792-9941
From owner-mpi-collcomm@CS.UTK.EDU  Sat Dec  5 22:08:57 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA22904; Sat, 5 Dec 92 22:08:57 -0500
Received:  by CS.UTK.EDU (5.61++/2.8s-UTK)
	id AA19180; Sat, 5 Dec 92 21:55:26 -0500
Received: from msr.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA19176; Sat, 5 Dec 92 21:55:23 -0500
Received: by msr.EPM.ORNL.GOV (5.61/1.34)
	id AA02074; Sat, 5 Dec 92 21:55:20 -0500
Date: Sat, 5 Dec 92 21:55:20 -0500
From: geist@msr.EPM.ORNL.GOV (Al Geist)
Message-Id: <9212060255.AA02074@msr.EPM.ORNL.GOV>
To: mpi-collcomm@cs.utk.edu
Subject: A proposal for collective communication interface. Opinions?

Collective Communication Proposal.

After reading Marc Snir's point-to-point outline, I think our 
work in the collective communication subcommittee is more clear.
A few of the Goals from the outline that I felt were particularly relevant:

1. Design an application programming interface.

2. Design an interface that is not too different from current practice

3. Define an interface that can be quickly implemented on many vendor platforms.

4. Focus on a proposal that can be agreed upon in 6 months.

5. Provide a reliable communication interface.

===============================================================================

Primary Requirement.
--------------------------------------------------------------------
The collective communication interface should be an extension of the
point-to-point interface. 
--------------------------------------------------------------------
As Marc points out on page 2 
 "SEND and RECV are a particular case of broadcast in a group of size 2;
 this observation can be used to check if the definition of collective
 communication semantics are consistent with the definition of 
 point-to-point communication."

This leads to the following points:
a. Collective routines like broadcast should provide the same 
   message data format as point-to-point routines.
   Be that [from page 7] scalar, contiguous, buffer with stride, typed
   or a union of these.

b. Collective communication should follow the same message context paradigms
   and recognize the same context control functions.

c. By using a structured name space (described on pages 7-8) 
   where all processes are identified by a (group,rank) pair
   for both the point-to-point and collective routines,
   then the users will have a consistent naming scheme
   across all the MPI communication routines.
   And those desiring a flat name space can have it by 
   using the default group "ALL".

d. Syntax of collective routines should follow the point-to-point scheme,
   whatever that turns out to be.

Collective communication is a matter of convenience for the user
and a matter of efficiency for the implementer. We must not
lose track of the fact that ANY collective communication function
can be implemented using only the MPI point-to-point routines.
I bring this up because in the spirit of simplicity and robustness
The following proposal contains only the most commonly used
and currently available functions.

=========================================================================

I propose the following minimum set of collective routines
be presented at the next committee meeting.

1. info = MPI_BCAST( buf, bytes, type, gid, root )

   Function:
   Called by all members of the group "gid" 
   using the same argument for "bytes", "type", "gid", and "root".
   On return the contents of "buf" on "root" is contained in "buf"
   on all group members.
   On return "info" contains the error code.

2. info = MPI_GATHER( buf, bytes, type, gid, root )

   Function:
   Called by all members of the group "gid" 
   using the same argument for "bytes", "type", "gid", and "root".
   On return all the individual "buf" are concatenated into the "root" buf,
   which must be of size at least gsize*bytes.
   The data is laid in the "root" buf in rank order that is
   | gid,0 data | gid,1 data | ...| gid, root data | ...| gid, gsize-1 data |
   Other member's "buf" are unchanged on return.
   On return "info" contains the error code.

3. info = MPI_GLOBAL_OP( inbuf, bytes, type, gid, op, outbuf )

   Function:
   Called by all members of the group "gid"
   using the same argument for "bytes", "type", "gid", and "op".
   On return the "outbuf" of all group members contains the 
   result of the global operation "op" applied pointwise to
   the collective "inbuf". For example, if the op is max and
   inbuf contains two float point numbers then 
	 outbuf(1) = global max( inbuf(1)) and 
	 outbuf(2) = global max( inbuf(2)) 
   A set of standard operations are supplied with MPI including:
     global max - for each data type
     global min - for each data type
	 global sum - for each data type
	 global mult- for each data type
	 global AND - for integer and logical type
	 global OR  - for integer and logical type
	 global XOR - for integer and logical type
   Optionally the users may define their own global functions for this routine.
   On return "info" contains the error code.

4. info = MPI_SYNCH( gid )

   Function:
   Called by all members of the group "gid"
   Returns only when all members have called this function.
   On return "info" contains the error code.

5. gid = MPI_MKGROUP( list_of_processes )

   Function:
   Called by all processes in the list.
   Forms a logical group containing the listed processes
   and assigns each process a unique rank in the group.
   The ranks are consecutively numbered from 0 to gsize-1.
   On return "gid" is an MPI assigned group ID (or error code if < 0)

6. gsize = MPI_GROUPSIZE( gid )

   Function:
   Can be called by any process.
   On return "gsize" is the number of members in the group "gid"
   (or error code if < 0).

7. rank = MPI_MYRANK( gid )

   Function:
   Can be called only by members of group "gid".
   On return "rank" is the rank of the calling process in group "gid"
   (an integer between 0 and gsize-1) or error code if < 0.

===========================================================================
Comments?
From owner-mpi-collcomm@CS.UTK.EDU  Mon Dec 14 15:48:54 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA24894; Mon, 14 Dec 92 15:48:54 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17266; Mon, 14 Dec 92 15:48:41 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Mon, 14 Dec 1992 20:48:40 GMT
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from THUD.CS.UTK.EDU by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17227; Mon, 14 Dec 92 15:48:19 -0500
From: Jack Dongarra <dongarra@cs.utk.edu>
Received:  by thud.cs.utk.edu (5.61++/2.7c-UTK)
	id AA03749; Mon, 14 Dec 92 15:48:17 -0500
Date: Mon, 14 Dec 92 15:48:17 -0500
Message-Id: <9212142048.AA03749@thud.cs.utk.edu>
To: mpi-collcomm@cs.utk.edu, mpi-pt2pt@cs.utk.edu
Subject: Re: Message Passing Interface Forum
Forwarding: Mail from '"Dr. C.D. Wright" <CDW10@LIVERPOOL.AC.UK>'
      dated: Mon, 14 Dec 92 12:16:10 GMT

---------- Begin Forwarded Message ----------
>From @ibm.liv.ac.uk:CDW10@LIVERPOOL.AC.UK Mon Dec 14 07:20:05 1992
Return-Path: <@ibm.liv.ac.uk:CDW10@LIVERPOOL.AC.UK>
Received: from mail.liv.ac.uk by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22696; Mon, 14 Dec 92 07:19:55 -0500
Received: from ibm.liverpool.ac.uk by mailhub.liverpool.ac.uk via JANET 
          with NIFTP (PP) id <21042-0@mailhub.liverpool.ac.uk>;
          Mon, 14 Dec 1992 12:19:28 +0000
Received: from UK.AC.LIVERPOOL by MAILER(4.4.t); 14 Dec 1992 12:20:02 GMT
Date: Mon, 14 Dec 92 12:16:10 GMT
From: "Dr. C.D. Wright" <CDW10@LIVERPOOL.AC.UK>
Subject: Re: Message Passing Interface Forum
To: dongarra@edu.utk.cs
Message-Id: <"mailhub.li.044:14.11.92.12.19.28"@liverpool.ac.uk>
Status: RO

Hi.

Since I am in the UK it is clear that I can't actively participate
in the MPI Forum.  I do, however, have one particular problem with
every comms library I have used so far that I would like to see
addressed in any new "standard", and I hope you can pass this on to
whoever is the appropriate person to deal with it.

In many packages such as PVM, PARMACS, p4, etc, it is possible to
probe for and/or receive messages selectively, the selection being
based on the message type (usually in integer) and/or the sender.
This is overly restrictive.  It would be far more useful if the
message's format were sufficiently well defined for the user to be
able to provide their own selection function to be passed in and
used as the basis for reception and/or probing.

That's it.  Hope you can do something with this gripe/suggestion.

Colin.
----------- End Forwarded Message -----------

From owner-mpi-collcomm@CS.UTK.EDU  Tue Dec 15 19:28:40 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA21809; Tue, 15 Dec 92 19:28:40 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA15597; Tue, 15 Dec 92 19:28:32 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Wed, 16 Dec 1992 00:28:32 GMT
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from helios.llnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA15573; Tue, 15 Dec 92 19:28:06 -0500
Received: by helios.llnl.gov (4.1/LLNL-1.18)
	id AA11599; Tue, 15 Dec 92 16:30:03 PST
Date: Tue, 15 Dec 92 16:30:03 PST
From: tony@helios.llnl.gov (Anthony Skjellum)
Message-Id: <9212160030.AA11599@helios.llnl.gov>
To: dongarra@cs.utk.edu, mpi-collcomm@cs.utk.edu, mpi-pt2pt@cs.utk.edu
Subject: Re: Message Passing Interface Forum

That is what we have been talking about in Zipcode for a long time.
- Tony
From owner-mpi-collcomm@CS.UTK.EDU  Thu Dec 31 22:14:12 1992
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA14668; Thu, 31 Dec 92 22:14:12 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA12447; Thu, 31 Dec 92 22:14:01 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Fri, 01 Jan 1993 03:14:00 GMT
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from pnlg.pnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA12429; Thu, 31 Dec 92 22:13:45 -0500
Received: from carbon.pnl.gov (130.20.65.121) by pnlg.pnl.gov; Thu, 31 Dec 92
 19:09 PST
Received: from fermi.pnl.gov by carbon.pnl.gov (4.1/SMI-4.1) id AA21172; Thu,
 31 Dec 92 19:08:30 PST
Received: by fermi.pnl.gov (4.1/SMI-4.1) id AA11537; Thu, 31 Dec 92 19:08:29 PST
Date: Thu, 31 Dec 92 19:08:29 PST
From: d3g681@fermi.pnl.gov
To: littlefield@fermi.pnl.gov, mpi-collcomm@cs.utk.edu, mpi-ptop@cs.utk.edu
Message-Id: <9301010308.AA11537@fermi.pnl.gov>
X-Envelope-To: mpi-ptop@cs.utk.edu, mpi-collcomm@cs.utk.edu

Posted to mpi-collcomm and mpi-ptop.

I have just taken the archived discussion from netlib@ornl and not
found anything more recent than december 15 (collcomm) and 21 (ptop).
Since I asked for my name to be on the mailing lists and have seen
nothing I assume that things have been quiet since then.

Al Geist's proposal (Dec. 5) for collective communication and the
reasoning behind it seems to provide a resonable starting point for
the discussion of interface and functionality.  I have only a few
minor comments in this regard, but given that the efficiency of
collective communications is critically sensitive to hardware topology
it *must* be essential to more closely integrate the definition of
process groups with topology.  I restrict my comments here to
this subject.

For example, on the Touchstone Delta efficient sub-group global-ops
would suggest that process groups map as best possible to square
sub-meshes, on the iPSC as sub-cubes, on the KSR as sub-rings.
Currently, if one's interest is in performing efficient collective
communication in subgroups, there is no way of performing this mapping
in a portable way.  In this instance one might want something that
functions along these lines

  Create NG process groups with P(0), P(1), ..., P(NG-1) processes in each
  group and assign each process to one of these groups so that collective
  communication within each (and perhaps also between all) subgroup is
  optimized.

Such a mapping might also be readily accomodated as a sub-partitioning
of an existing process group, with the default being ALL.  I could
envisage writing, for instance, a fast-multipole integration using this
functionality.

Comments?

Robert J. Harrison

Mail Stop K1-90                             tel: 509-375-2037
Battelle Pacific Northwest Laboratory       fax: 509-375-6631
P.O. Box 999, Richland WA 99352          E-mail: rj_harrison@pnl.gov





From owner-mpi-collcomm@CS.UTK.EDU  Fri Jan  1 11:54:06 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA16862; Fri, 1 Jan 93 11:54:06 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA13835; Fri, 1 Jan 93 11:53:57 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Fri, 01 Jan 1993 16:53:56 GMT
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from msr.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA13817; Fri, 1 Jan 93 11:53:46 -0500
Received: by msr.EPM.ORNL.GOV (5.61/1.34)
	id AA04566; Fri, 1 Jan 93 11:53:35 -0500
Date: Fri, 1 Jan 93 11:53:35 -0500
From: geist@msr.EPM.ORNL.GOV (Al Geist)
Message-Id: <9301011653.AA04566@msr.EPM.ORNL.GOV>
To: d3g681@fermi.pnl.gov, littlefield@fermi.pnl.gov, mpi-collcomm@cs.utk.edu,
        mpi-ptop@cs.utk.edu
Subject: Re: groups and topology.


>I have only a few
>minor comments in this regard, but given that the efficiency of
>collective communications is critically sensitive to hardware topology
>it *must* be essential to more closely integrate the definition of
>process groups with topology.

>Currently, if one's interest is in performing efficient collective
>communication in subgroups, there is no way of performing this mapping
>in a portable way.

It is critical that MPI be portable even if efficiency suffers.
Portability is primary reason for having a standard.

Efficiency is important and tightly coupled to the implementation
on a given vendor's machine. My feeling is that our MPI work
should specify the functionality at the user level
and not dictate how MPI is implemented underneath.

Mapping is the key word in integrating topology and groups,
and mapping is not defined (so far) in MPI. It is related to
the spawning and placement of tasks. I can envision some implementations
allowing tasks to migrate to improve load balance and fault tolerance.
This greatly compounds the mapping problem, but I don't think MPI
should exclude such implementations.
The hope would be that vendors would supply MPI implementations
that map process number to node number in a way that their
collective routines would be efficient with default ALL group
AND that the vendor's mapping would be documented so that
a user could specify subgroups that could exploit this same efficiency.

Al Geist
From owner-mpi-collcomm@CS.UTK.EDU  Sat Jan 16 06:36:12 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA18118; Sat, 16 Jan 93 06:36:12 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA01632; Sat, 16 Jan 93 06:35:42 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sat, 16 Jan 1993 06:35:41 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from sol.cs.wmich.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA01612; Sat, 16 Jan 93 06:35:37 -0500
Received: from id.wmich.edu (id.cs.wmich.edu) by cs.wmich.edu (4.1/SMI-4.1)
	id AA06149; Sat, 16 Jan 93 06:30:31 EST
Date: Sat, 16 Jan 93 06:30:31 EST
From: john@cs.wmich.edu (John Kapenga)
Message-Id: <9301161130.AA06149@cs.wmich.edu>
To: mpi-collcomm@cs.utk.edu
Subject: A Collection of Primitives


\section{Introduction}
This is a description of some topology independent combined
communication primitives. These primitives are more commonly
referred to relative to "nodes" (eg Single Node Broadcast),
rather than the politically correct "process" (eg Single Process
Broadcast). The names below might look better with the word
Process deleted. These primitives often go under other names
as well I've put the names used in previous posts next to the
names below. The MP_* names appear in Al Geist's post with calls, 
the names further to the right appear in Al Geist's earlier post and
Jon Flowers post.

Some things below were developed in discussion at the last MPI
meeting (IE don't give me credit but you can give me blame). 
This is a list of primitives for discussion.
%
\section{Topology Independent Collective Communication Primitives}
Assume there is a group of $N$ processes. The following collective 
communication primitives can be defined.

Barrier:                  BARRIER       MPI_SYNCH        synchronize
    Every process blocks at the barrier until all processes reach it.
    (unless we have a non-blocking version too, then I would prefer the
     name synchronize)
Collective Operator:      COP           MPI_GLOBAL_OP
    START: every process $i$ has a value $m_i$.
    STOP: a single designated process has the combine of all values
          $m_i: 0 <= i < N$.
    The operations supported in COP are fixed, they include
        add, multiply, min, max, and, or, xor. 
        Types supported include: int, float and double.
Global Operator:          GOP           
    START: every process $i$ has a value $m_i$.
    STOP: every process has the combine of all values $m_i: 0 <= i < N$.
    The operations supported in COP are fixed, they include
        add, multiply, min, max, and, or, xor. 
        Types supported include: int, float and double.
Single Process Broadcast-    SPB        MPI_BCAST
    START: a single designated process $i$ has a message $m$.
    STOP: every process has message $m$.
Multiple Process Broadcast-  MPB	
    START: every process $i$ has a message $m_i$.
    STOP: every process has all messages $m_i: 0 <= i < N$.
Single Process Accumulate-   SPA                          reduce
    START: every process $i$ has a message $m_i$.
    STOP: a single designated process $i$ contains the combine of all the $m_i$.
          Any process can combine two messages into a single new message. (This
          makes the most sense when the combine is associative and commutative.)
Multiple Process Accumulate- MPA
    START every process $i$ has $N$ messages $m_{i,j}: 0 <= j < N$.
    STOP: every process $j$ has the combine of the $N$ messages
          $m_{i,j}: 0 <= i < N$
Single Process Scatter-      SPS
    START: a single designated process $i$ has $N$ messages $m_j: 0 <= j < N$.
    STOP: every process $j$ has message $m_j$.
Single Process Gather-       SPG        MPI_GATHER        gather
    START: every process $i$ has 1 message $m_i$.
    STOP: a single designated process $i$ has all messages $m_i: 0 <= i < N$.
Total Process Exchange-      TPE                          all2all
    START: every process $i$ has $N$ messages $m_{i,j}:  0 <= j < N$.
    STOP: every process $j$ has $N$ messages $m_{i,j}: 0 <= i < N$.

Note a Multiple Process Gather would be the same as a Multiple Process Scatter,
this is called a Total Process Exchange or all2all.
%
\section{Some Background}
For some background, the following simple relationships are known.

Theorem 1:
Assume no computation time and unit communication time per hop for all
messages.  For any network the following diagram holds. A directed arrow
from A to B indicates an algorithm for solving A also solves B and the
optimal time for solving B is not more than the optimal time for solving A.
Horizontal double arrows indicate the relationship holds in both directions.

                             Total Process Exchange
                                      |
                                      V
Multiple Process Broadcast    <----------------->   Multiple Process Accumulate
        |                                                   |
        V                                                   V
Single Process Gather         <----------------->   Single Process Scatter
        |                                                   |
        V                                                   V
Single Process Accumulate     <----------------->   Single Process Broadcast


Theorem 2:
The following optimal complexities can be proven (the log is base 2).
The tree is a balanced binary tree and the times for a linear array are
the same as the ring. p is the number of processors. (W means to a constant)

Problem                     ring      tree          mesh            hypercube
-------------------------------------------------------------------------
single process broadcast    W(p)      W(log p)      W(p ** (1/d))   W(log p)
single process scatter      W(p)      W(p)          W(p)            W(p/log p)
multiple process broadcast  W(p)      W(p)          W(p)            W(p/log p)
total process exchange      W(p**2)   W(p**2)       W(p**((d+1)/d)) W(p)     

Theorem 3:
Additionally, assuming a process can only send one message at a time
(even if it has many links) Some optimal complexities for the above
communications primitives can again be determined.

Problem                    ring      tree          mesh             hypercube
------------------------------------------------------------------------
single process broadcast   W(p)      W(log p)      W(p ** (1/d))    W(log p)
single process scatter     W(p)      W(p)          W(p)             W(p)     
multiprocess broadcast     W(p)      W(p)          W(p)             W(p)     
total process exchange     W(p**2)   W(p**2)       W(p**((d+1)/d))  W(p log p)     
Some results in this direction are also known for wormhole routing.
Cluster architecture machines and be included as well.
%
\section{Remarks}
SPB and SPA
These require a spanning tree of the group.  One difference between
the COP and a SPA followed by a SPB is that the COP uses fixed operations,
while the combine functions should be user supplied. The user supplied function
must be run as a user process on data in user memory on a computation
processor. The COP on the other hand is safe in a system process and my be
able to be run of the communication processor directly.

I tend to use the GOP more often than the COP.

For Steve Ericsson Zeinth's question on non-deterministic order of receives.
My implementations of such primitives have been very deterministic. They 
loosely synchronize to protect the message system. A receiving node on an
all2all (TPE) knows who sent each message, so even if it could be implemented
by N parallel scatters (SPS) the receiver would know where to put each of the 
N incoming messages.

For the primitives above SPB and SPA it becomes important to be very careful
not to overload most current message systems.

We talked about the combine function. Should it be strictly binary or
expect to combine a list of size n? I'll claim binary is enough because
fan in any reasonable implementation is likely to be low at any node.

Two of Jon Flower's requests are for a the GOP (note gop() was such a function
even in an iPSC-1 library) and the  MNB, which is the same as a SNG followed
by a SNB.

I prefer the form of global communication primatives shown by AL Geist,
where all processes make the same call.

The BARRIER and the other primitives could share many of the "512 variations"
currently proposed for the send. In particular a non-blocking BARRIER does
make sense (as requested by Jon Flower). 

There are many questions about details of any colcom primitives, most
of those questions should be clearer as the pt2pt specification matures.
We can discuss the colcom primitives we would propose. 

I would expect BARRIER, COP, GOP, SNA, SNB, SNG and SNS.

I have used 2 of the 3 others (and know where the other might be used).
But If I'm the only one who uses them ... :-)

We could provide (ALL) these primitives based on MPI pt2pt primitives for
groups of with actually topology : Hypercube, Mesh, 2-level Cluster and Generic.
These could be ready a few weeks after the pt2pt specification is stable.
Note these would be much slower than kernel based primitives, but better
than many user codes.

john

From owner-mpi-collcomm@CS.UTK.EDU  Sat Jan 16 06:45:00 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA19956; Sat, 16 Jan 93 06:45:00 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA02116; Sat, 16 Jan 93 06:44:37 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sat, 16 Jan 1993 06:44:36 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from sol.cs.wmich.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA02108; Sat, 16 Jan 93 06:44:35 -0500
Received: from id.wmich.edu (id.cs.wmich.edu) by cs.wmich.edu (4.1/SMI-4.1)
	id AA06158; Sat, 16 Jan 93 06:39:29 EST
Date: Sat, 16 Jan 93 06:39:29 EST
From: john@cs.wmich.edu (John Kapenga)
Message-Id: <9301161139.AA06158@cs.wmich.edu>
To: mpi-collcomm@cs.utk.edu
Subject: groups and architectures


I strongly agree with Jon Flowers in that topology is important in the
group definition. It seems that any effort to define a group on a large
machine, say 64K nodes, would be futile without using a very regular structure.

I would hope that there are group defining functions which require topology 
and carry that information with them. Most applications on large machines
I know of treat the machine as a unit of a given topology for each stage
of the computation. Whatever else the MPI dose, it must support that mode of
operation efficiently.

For example, a program might do an inquire to find out what kind of machine
topology the machine really is, and then request a 2d-mesh group of a given
size, knowing it will be well laid out on the machine. I know this is against
the architecture independent spirit. If that type of facility is not to be
allowed then it must be shown that on current machines the same effect can
still be achieved.

I would suggest we need the ability to map standard structures onto current
large machines. If we have some primitives that can be safely ignored on later
machines there is no harm.

john
From owner-mpi-collcomm@CS.UTK.EDU  Mon Jan 25 15:20:44 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA25035; Mon, 25 Jan 93 15:20:44 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA18241; Mon, 25 Jan 93 15:20:12 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Mon, 25 Jan 1993 15:20:11 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from beagle.cps.msu.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA18223; Mon, 25 Jan 93 15:20:06 -0500
Received: from uranium.cps.msu.edu by beagle.cps.msu.edu (4.1/rpj-5.0); id AA05995; Mon, 25 Jan 93 15:19:58 EST
Received: by uranium.cps.msu.edu (4.1/4.1)
	id AA12809; Mon, 25 Jan 93 15:19:58 EST
Date: Mon, 25 Jan 93 15:19:58 EST
From: huangch@cps.msu.edu
Message-Id: <9301252019.AA12809@uranium.cps.msu.edu>
To: mpi-intro@cs.utk.edu
Subject: Subscription 
Cc: mpi-collcomm@cs.utk.edu


Please add my name into your mailing list.

Thanks,

--Chengchang Huang
From owner-mpi-collcomm@CS.UTK.EDU  Mon Feb 15 06:51:52 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA26138; Mon, 15 Feb 93 06:51:52 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA12752; Mon, 15 Feb 93 06:51:16 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Mon, 15 Feb 1993 06:51:15 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from daedalus.epcc.ed.ac.uk by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA12744; Mon, 15 Feb 93 06:51:11 -0500
Date: Mon, 15 Feb 93 11:51:03 GMT
Message-Id: <21574.9302151151@subnode.epcc.ed.ac.uk>
From: L J Clarke <lyndon@epcc.ed.ac.uk>
Subject: Re: A Collection of Primitives
To: mpi-collcomm@cs.utk.edu
Reply-To: lyndon@epcc.ed.ac.uk

Dear MPI Colleagues

I found the "Collection of Primitives" most useful. We have a similar
suite of communication routines which we find most useful.

a) When considering finding maxima/minima of distributed data items, it
is often useful to also be able to locate the maxima/minima either in
terms of the process holding that data value, or its position within a
distributed data structure.  The approach we have taken to this is to
introduce a set of procedures which choose a value from a set, rather
than combining a set of values.  The programmer provides an integer
identifier associated with each data value, this may be simply a process
number or a position within a distributed data set such as matrix row
number, and the routine provides the maxima/minima and there
identifiers.  (Ties are resolved by choosing the lowest identifer value,
and all identifiers must be unique.) I propose that we should add a
routine, or routines, of this nature.

b) After some discussion with other interested persons locally, I come
to the conclusion that we should take time at the meeting to consider
what the collcomm operations involving a mixture of communications plus
calculations, such as combination, mean in a heterogeneous environment -
bith in terms of mixed language applications and mixed processor types. 

c) John poses the question of which operations to retain.  I have never
seen an application which uses a large number of these kinds of
functions, but on the other hand I have seen applications which between
them use all of the functions we have implemented.  I therefore suggest
that we retain all of them. 

Best Wishes
Lyndon

         /--------------------------------------------------------\
    e||) | Lyndon J Clarke    Edinburgh Parallel Computing Centre | e||) 
    c||c | Tel: 031 650 5021  Email: lyndon@epcc.edinburgh.ac.uk  | c||c 
         \--------------------------------------------------------/


From owner-mpi-collcomm@CS.UTK.EDU  Sat Feb 20 10:11:10 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA29389; Sat, 20 Feb 93 10:11:10 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22263; Sat, 20 Feb 93 10:10:14 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sat, 20 Feb 1993 10:10:13 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from vnet.ibm.com by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22249; Sat, 20 Feb 93 10:10:10 -0500
Message-Id: <9302201510.AA22249@CS.UTK.EDU>
Received: from KGNVMA by vnet.ibm.com (IBM VM SMTP V2R2) with BSMTP id 7138;
   Sat, 20 Feb 93 10:07:58 EST
Date: Sat, 20 Feb 93 09:43:04 EST
From: "Daniel D. Frye" <DANIELF@KGNVMA.VNET.IBM.COM>
To: mpi-collcomm@cs.utk.edu

I would recommend we add the following collective communication
routines

  mpi-index - Each process sends a distinct message to all the other
              processes in the group, aka - all-to-all personalized
              communication. Each process in the calling group partitions
              its local buffer into N blocks of equal size, where N is
              the number of processes in the group.  The ith process in
              the sends its jth block in the out buffer to the jth process
              and this block is stored at the ith block in its in buffer.
              Therefore the ith block of the out buffer will be copied
              locally to the ith block of the in buffer.  The only
              arguments necessary are out buffer, in buffer, length of
              the block, gid, and tag/context/whatever.

  mpi-shift - Perform a shift or rotation within a group.  Send a
              block of data any specified number of steps along the
              group either up or down.  The difference between shift
              and rotation is whether or not there is "wrap-around".
              The arguments necessary are out buffer, in buffer, length
              of the block, gid, # of steps, and (perhaps) a flag to
              decide shift or rotation (possibly we want 2 routines?),
              and tag/context/whatever.

  mpi-prefix - Apply parallel prefix (aka scan) with respect to an
               associative reduction operation on data distributed across
               a across and place the corresponding result in each process
               in the group (necessary, I believe, for the generalized
               combine operation we invented in Dallas.)  The operation
               can be any of the functions used in the mpi-reduce operation.


Has anyone taken a shot at a list of reduce operations?


Furthermore, before I forget, given non-blocking collective communication
operations (head-shaking here), we need to define order.  It's more
complicated than ptp message-passing but probably still possible.  I'm
sure we can guarantee order for (e.g.) two successive broadcasts in the
same group with the same root, but not if they have different roots.
Similarly for the cases with a particular destination.   More tricky are
the cases where every process gets a different result.  Can order be
defined for mpi-combine and still preserver some performance?

Thanks.
Dan Frye

From owner-mpi-collcomm@CS.UTK.EDU  Sun Feb 21 11:09:09 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA09315; Sun, 21 Feb 93 11:09:09 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17311; Sun, 21 Feb 93 11:08:33 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sun, 21 Feb 1993 11:08:32 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from Aurora.CS.MsState.Edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17303; Sun, 21 Feb 93 11:08:31 -0500
Received:  by Aurora.CS.MsState.Edu (4.1/6.0s-FWP);
	   id AA18416; Sun, 21 Feb 93 10:07:17 CST
Date: Sun, 21 Feb 93 10:07:17 CST
From: Tony Skjellum <tony@Aurora.CS.MsState.Edu>
Message-Id: <9302211607.AA18416@Aurora.CS.MsState.Edu>
To: DANIELF@KGNVMA.VNET.IBM.COM
Subject: hi
Cc: mpi-collcomm@cs.utk.edu

Dan,

It is not obvious to me that we can require the same order for two
successive broadcasts from the same root.  I say this because hardware
implementations (which would be fast) might not support this form of
determinism.  Second, performance characteristics might be better on
average if a different apparent permutation of the participants were
used (for the same root) each time.  I would furthermore add that an
algorithm might like to control that question.

In broadcasts, I see that there are four reasonable cases, modulo
the permutations just discussed.  An algorithm with the root node
sending ceil(log N) messages, an algorithm with each node sending at most
two messages; same algorithms, with the root node off-loading its
data to another node (hot-spot reduction), and then sending no
other messages.

- Tony

From owner-mpi-collcomm@CS.UTK.EDU Sat Feb 20 09:13:40 1993
Received: from Walt.CS.MsState.Edu by Aurora.CS.MsState.Edu (4.1/6.0s-FWP);
	   id AA17871; Sat, 20 Feb 93 09:13:40 CST
Received: from CS.UTK.EDU by Walt.CS.MsState.Edu (4.1/6.0s-FWP);
	   id AA13806; Sat, 20 Feb 93 09:14:36 CST
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22263; Sat, 20 Feb 93 10:10:14 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sat, 20 Feb 1993 10:10:13 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from vnet.ibm.com by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22249; Sat, 20 Feb 93 10:10:10 -0500
Message-Id: <9302201510.AA22249@CS.UTK.EDU>
Received: from KGNVMA by vnet.ibm.com (IBM VM SMTP V2R2) with BSMTP id 7138;
   Sat, 20 Feb 93 10:07:58 EST
Date: Sat, 20 Feb 93 09:43:04 EST
From: "Daniel D. Frye" <DANIELF@KGNVMA.VNET.IBM.COM>
To: mpi-collcomm@cs.utk.edu
Status: RO
Content-Length: 2441
X-Lines: 47

I would recommend we add the following collective communication
routines

  mpi-index - Each process sends a distinct message to all the other
              processes in the group, aka - all-to-all personalized
              communication. Each process in the calling group partitions
              its local buffer into N blocks of equal size, where N is
              the number of processes in the group.  The ith process in
              the sends its jth block in the out buffer to the jth process
              and this block is stored at the ith block in its in buffer.
              Therefore the ith block of the out buffer will be copied
              locally to the ith block of the in buffer.  The only
              arguments necessary are out buffer, in buffer, length of
              the block, gid, and tag/context/whatever.

  mpi-shift - Perform a shift or rotation within a group.  Send a
              block of data any specified number of steps along the
              group either up or down.  The difference between shift
              and rotation is whether or not there is "wrap-around".
              The arguments necessary are out buffer, in buffer, length
              of the block, gid, # of steps, and (perhaps) a flag to
              decide shift or rotation (possibly we want 2 routines?),
              and tag/context/whatever.

  mpi-prefix - Apply parallel prefix (aka scan) with respect to an
               associative reduction operation on data distributed across
               a across and place the corresponding result in each process
               in the group (necessary, I believe, for the generalized
               combine operation we invented in Dallas.)  The operation
               can be any of the functions used in the mpi-reduce operation.


Has anyone taken a shot at a list of reduce operations?


Furthermore, before I forget, given non-blocking collective communication
operations (head-shaking here), we need to define order.  It's more
complicated than ptp message-passing but probably still possible.  I'm
sure we can guarantee order for (e.g.) two successive broadcasts in the
same group with the same root, but not if they have different roots.
Similarly for the cases with a particular destination.   More tricky are
the cases where every process gets a different result.  Can order be
defined for mpi-combine and still preserver some performance?

Thanks.
Dan Frye


From owner-mpi-collcomm@CS.UTK.EDU  Thu Mar  4 10:37:50 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA29481; Thu, 4 Mar 93 10:37:50 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA00367; Thu, 4 Mar 93 10:37:05 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Thu, 4 Mar 1993 10:37:03 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from marge.meiko.com by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA00357; Thu, 4 Mar 93 10:37:00 -0500
Received: from hub.meiko.co.uk by marge.meiko.com with SMTP id AA19768
  (5.65c/IDA-1.4.4 for <mpi-collcomm@cs.utk.edu>); Thu, 4 Mar 1993 10:36:56 -0500
Received: from float.co.uk (float.meiko.co.uk) by hub.meiko.co.uk (4.1/SMI-4.1)
	id AA21448; Thu, 4 Mar 93 15:36:53 GMT
Date: Thu, 4 Mar 93 15:36:53 GMT
From: jim@meiko.co.uk (James Cownie)
Message-Id: <9303041536.AA21448@hub.meiko.co.uk>
Received: by float.co.uk (5.0/SMI-SVR4)
	id AA01959; Thu, 4 Mar 93 15:34:12 GMT
To: mpi-collcomm@cs.utk.edu
Cc: jim@meiko.co.uk
Subject: Synchronisation semantics
Content-Length: 1419

Sorry if you get this twice, I sent something similar yesterday, but
didn't get it back myself, so I guess it's disappeared into the great
bit-bucket in the sky.

As I understand the current collective communication proposal, the
synchronisation semantics of the global operations are only weakly
specified. Either 
1) each process can continue as soon as its contribution to the global
   operation is complete 
or 
2) they can be implemented as if there were a group synchronisation.

The first case allows code like this to execute

	Process 1	Process 2	Process 3

	broadcast(rx)   receive from 1	broadcast(tx)
	send to 2	broadcast(rx)	

the second would cause it to deadlock.

I don't believe we should leave this an open issue, since in the
absence of a specification, the user MUST assume that a group
synchronisation occurs. (And if the assume it does they'll get bitten
when it doesn't).

I believe that we should assert that the synchronisation happens.

Those users who explicitly do NOT want it can then make use of the
non-blocking forms of the collective operations (whichever we allow
in) to relax the synchronisation point.

-- Jim
James Cownie 
Meiko Limited			Meiko Inc.
650 Aztec West			Reservoir Place
Bristol BS12 4SD		1601 Trapelo Road
England				Waltham
				MA 02154

Phone : +44 454 616171		+1 617 890 7676
FAX   : +44 454 618188		+1 617 890 5042
E-Mail: jim@meiko.co.uk   or    jim@meiko.com


From owner-mpi-collcomm@CS.UTK.EDU  Thu Mar  4 12:12:41 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA02263; Thu, 4 Mar 93 12:12:41 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA05340; Thu, 4 Mar 93 12:10:21 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Thu, 4 Mar 1993 12:10:19 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from gstws.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA05321; Thu, 4 Mar 93 12:10:18 -0500
Received: by gstws.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA14943; Thu, 4 Mar 1993 12:10:17 -0500
Date: Thu, 4 Mar 1993 12:10:17 -0500
From: geist@gstws.epm.ornl.gov (Al Geist)
Message-Id: <9303041710.AA14943@gstws.epm.ornl.gov>
To: mpi-collcomm@cs.utk.edu
Subject: Re: Synchronisation semantics



>I believe that we should assert that the synchronisation happens.

I one the other hand would like to declare the example you give
as an erroneous program and put it in the (growing larger) class
of the errroneous programs that can now be written in pt2pt.
And
I would prefer that the user's applications not be forced to wait
on synchronization to occur. It is a mixed bag in existing interfaces
some use method 1 some use method 2. Method 1 is faster
and I don't hear user's complaining about their codes breaking
when using the existing method 1 interfaces.
So I am inclined to specify:
1) each process can continue as soon as its contribution to the global
   operation is complete 

Do other people in this subcommittee have an opinion?

Al Geist
From owner-mpi-collcomm@CS.UTK.EDU  Fri Mar  5 03:32:04 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA23652; Fri, 5 Mar 93 03:32:04 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22953; Fri, 5 Mar 93 03:31:40 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Fri, 5 Mar 1993 03:31:39 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from pnlg.pnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22945; Fri, 5 Mar 93 03:31:36 -0500
Received: from fermi.pnl.gov (130.20.182.50) by pnlg.pnl.gov; Thu, 4 Mar 93
 12:10 PST
Received: by fermi.pnl.gov (4.1/SMI-4.1) id AA02720; Thu, 4 Mar 93 12:09:19 PST
Date: Thu, 04 Mar 93 12:09:17 -0800
From: Robert J Harrison <d3g681@fermi.pnl.gov>
Subject: Re: Synchronisation semantics
To: mpi-collcomm@cs.utk.edu
Message-Id: <9303042009.AA02720@fermi.pnl.gov>
In-Reply-To: Your message of "Thu, 04 Mar 93 12:10:17 EST."
 <9303041710.AA14943@gstws.epm.ornl.gov>
X-Envelope-To: mpi-collcomm@cs.utk.edu

In message <9303041710.AA14943@gstws.epm.ornl.gov> you write:
> 
> 
> >I believe that we should assert that the synchronisation happens.
> 
> I one the other hand would like to declare the example you give
> as an erroneous program and put it in the (growing larger) class
> of the errroneous programs that can now be written in pt2pt.
> And
> I would prefer that the user's applications not be forced to wait
> on synchronization to occur. It is a mixed bag in existing interfaces
> some use method 1 some use method 2. Method 1 is faster
> and I don't hear user's complaining about their codes breaking
> when using the existing method 1 interfaces.
> So I am inclined to specify:
> 1) each process can continue as soon as its contribution to the global
>    operation is complete 
> 
> Do other people in this subcommittee have an opinion?
> 
> Al Geist


I do not think that one can define what this

> 1) each process can continue as soon as its contribution to the global
>    operation is complete 

means without reference to an implementation.  Also, some implementations
may require synchronization (e.g. for efficiency, or due to h/w or s/w 
limitations).  Other implementations may not.  With proper use
of tagging etc. no synchronization is required for correct execution
no matter what order messages arrive in, apart from the usual
concerns about available buffer space.

Thus, from consideration of orthogonality of function and efficiency,
I would suggest that

1) The synchronization properties of global operations be left
   undefined where this is not required for their termination
   with correct numerical results (e.g. a global summation).
   Any constraints on tags, etc., for correct execution should
   also be defined, though I think we should work very hard to
   remove any such contraints.

2) A separate primitve that acts as a barrier or synchronization
   be provided (I think this is the case already).

Primitive 2 might be provided as a special form of primitive 1, so
that unecessary communication is avoided.  However, this seems
to me a minor optimization.

Robert.

Robert J. Harrison

Mail Stop K1-90                             tel: 509-375-2037
Battelle Pacific Northwest Laboratory       fax: 509-375-6631
P.O. Box 999, Richland WA 99352          E-mail: rj_harrison@pnl.gov





From owner-mpi-collcomm@CS.UTK.EDU  Mon Mar  8 07:07:24 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA01919; Mon, 8 Mar 93 07:07:24 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17798; Mon, 8 Mar 93 07:06:48 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Mon, 8 Mar 1993 07:06:47 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from marge.meiko.com by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA17790; Mon, 8 Mar 93 07:06:43 -0500
Received: from hub.meiko.co.uk by marge.meiko.com with SMTP id AA03142
  (5.65c/IDA-1.4.4 for <mpi-collcomm@cs.utk.edu>); Mon, 8 Mar 1993 07:06:39 -0500
Received: from float.co.uk (float.meiko.co.uk) by hub.meiko.co.uk (4.1/SMI-4.1)
	id AA10106; Mon, 8 Mar 93 12:06:35 GMT
Date: Mon, 8 Mar 93 12:06:35 GMT
From: jim@meiko.co.uk (James Cownie)
Message-Id: <9303081206.AA10106@hub.meiko.co.uk>
Received: by float.co.uk (5.0/SMI-SVR4)
	id AA02376; Mon, 8 Mar 93 12:03:45 GMT
To: geist@gstws.epm.ornl.gov
Cc: mpi-collcomm@cs.utk.edu
In-Reply-To: Al Geist's message of Thu, 4 Mar 1993 12:10:17 -0500 <9303041710.AA14943@gstws.epm.ornl.gov>
Subject: Synchronisation semantics
Content-Length: 1001

Jim> I believe that we should assert that the synchronisation happens.
OK, so maybe I was a bit stronger than I meant to be.

I actually don't mind too much one way or the other, as long as we
understand what it is that we're doing. 

Therefore are we specifying
> 1) each process CAN continue as soon as its contribution to the global
>    operation is complete 

or

1) each process MUST continue as soon as its contribution to the global
   operation is complete 

(In other words is an implementation free to treat all global operations
as a global synchronisation or not ?) I'm happy with the first of
these statements, but not the second. (However it should be re-worded to make
the possiblity clearer in a draft).

-- Jim
James Cownie 
Meiko Limited			Meiko Inc.
650 Aztec West			Reservoir Place
Bristol BS12 4SD		1601 Trapelo Road
England				Waltham
				MA 02154

Phone : +44 454 616171		+1 617 890 7676
FAX   : +44 454 618188		+1 617 890 5042
E-Mail: jim@meiko.co.uk   or    jim@meiko.com


From owner-mpi-collcomm@CS.UTK.EDU  Mon Mar  8 09:14:45 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA05445; Mon, 8 Mar 93 09:14:45 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22382; Mon, 8 Mar 93 09:14:09 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Mon, 8 Mar 1993 09:14:07 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from msr.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA22373; Mon, 8 Mar 93 09:14:06 -0500
Received: by msr.EPM.ORNL.GOV (5.67/1.34)
	id AA13188; Mon, 8 Mar 93 09:13:48 -0500
Date: Mon, 8 Mar 93 09:13:48 -0500
From: geist@msr.EPM.ORNL.GOV (Al Geist)
Message-Id: <9303081413.AA13188@msr.EPM.ORNL.GOV>
To: jim@meiko.co.uk
Subject: Re:  Synchronisation semantics
Cc: mpi-collcomm@cs.utk.edu

The draft will read:
1) each process CAN continue as soon as its contribution to the global
   operation is complete.

Cheers,
 Al
From owner-mpi-collcomm@CS.UTK.EDU  Wed Mar 10 08:13:16 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA28509; Wed, 10 Mar 93 08:13:16 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA11051; Wed, 10 Mar 93 08:11:02 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Wed, 10 Mar 1993 08:11:01 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from super.super.org by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA11039; Wed, 10 Mar 93 08:10:59 -0500
Received: from b125.super.org by super.super.org (4.1/SMI-4.1)
	id AA20486; Wed, 10 Mar 93 08:10:57 EST
Received: by b125.super.org (4.1/SMI-4.1)
	id AA01741; Wed, 10 Mar 93 08:10:56 EST
Date: Wed, 10 Mar 93 08:10:56 EST
From: lederman@b125.super.org (Steve Huss-Lederman)
Message-Id: <9303101310.AA01741@b125.super.org>
To: mpi-collcomm@cs.utk.edu
Subject: non-blocking routines

I casually raised the issue at the last meeting of whether we were
going to make the collective communications completely compatible with
the point-to-point standard.  Specifically, I raised the issue of
whether there would be a non-blocking broadcast.  This started a chain
of events that ultimately lead to a non-blocking wait.  I can only hope
that in the official minutes the originator's name will be lost.  I
already see the laughing occurring when one first hears this suggestion
out of context :-)

But seriously, since I started this thing I would like to now see it
resolved.  The next points involve the global picture and then some
details of non-blocking collective communications follow.  I have not
filled in a lot of details until I think the global picture is
resolved.

The way this whole thing started was for symmetry with
point-to-point.  I think people agree that you could take advantage of
a non-blocking collective communication in the same way you can a
non-blocking send.  It was even pointed out that collective
communications are generally more expensive in terms of latency and
time so there might be an even bigger justification.  So the group
voted for a non-blocking broadcast by a fairly large majority (if my
recollection is correct).  Now the slippery slope argument sets in.  If
you have a non-blocking broadcast, you need a non-blocking gather,
scan, etc.  This finally led to the non-blocking wait or more
appropriately called a non-blocking barrier.  As the votes progressed,
there were fewer total votes and fewer yes votes for the non-blocking
version.  I interpret this as people starting to understand the
consequences of the first vote and starting to have second thoughts.
Do people agree with this interpretation?  Is my memory/notes correct?

So the big picture question is whether we should have non-blocking
collective communications calls at all.  Here is my current feelings.
I think that they have merit and can be useful.  However, they add a
lot of complexity to routines that are already difficult to do
correctly and efficiently.  I think it is unlikely that we can specify
these routines and get done in 3 more meetings.  (I am also posting
this idea in a more general context to the whole committee.)  It also
falls outside current practice.  If we decide to pursue a more complex
standard and extend the deadline, then we should include this too.
However, I would think a more manageable first standard that can get
done quickly would be better.

Given that, I raise a few of the issues involved in non-blocking
collective communications.  I only list some to show what is involved.
If we decide to continue down this path, then I will be more explicit
and get involved more in details.

If we have non-blocking calls, then we need all the routines like
point-to-point has.  For example, we need a wait, probe and either two
calls or an option to choose between blocking and not.  Another issue
is dealing with two non-blocking calls in a row.  For example, suppose
you do two non-blocking broadcasts in a row but use a different root.
It seems to me that an intermediate node could get two different
messages from another intermediate node and have trouble telling which
broadcast it is supposed to be for.  Are we going to allow this?  If
so, the coding of the broadcast may be much harder on some systems.
If not, you restrict the user in a way that is unnatural.

Steve

P.S. - The moral is: never make a casual suggestion at an MPI
meeting.  You'll probably live to regret it :-).
From owner-mpi-collcomm@CS.UTK.EDU  Thu Mar 11 12:49:49 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA02571; Thu, 11 Mar 93 12:49:49 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA06487; Thu, 11 Mar 93 12:48:54 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Thu, 11 Mar 1993 12:48:53 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from canidae.cps.msu.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA06461; Thu, 11 Mar 93 12:48:46 -0500
Received: from pit-bull.cps.msu.edu by canidae.cps.msu.edu (4.1/rpj-5.0); id AA01661; Thu, 11 Mar 93 12:48:43 EST
Received: by pit-bull.cps.msu.edu (4.1/4.1)
	id AA04285; Thu, 11 Mar 93 12:48:42 EST
Date: Thu, 11 Mar 93 12:48:42 EST
From: kalns@cps.msu.edu
Message-Id: <9303111748.AA04285@pit-bull.cps.msu.edu>
To: mpi-collcomm@cs.utk.edu
Subject: reduction and gather

Dear Collective Communications Subcommittee:

I have not participated in this forum in the past;
however, I have been an active MPI reader for the past two
months.  I would like to comment on the following:

1. Reduction
   a. each participating process gets the result
   b. additional ops

2. Gather
   a. concatenation in rank order
----------

Al Geist proposed the following interface for reduction:

>  info = MPI_GLOBAL_OP( inbuf, bytes, type, gid, op, outbuf )
>
>  Function:
>  Called by all members of the group "gid"
>  using the same argument for "bytes", "type", "gid", and "op".
>  On return the "outbuf" of all group members contains the
>  result of the global operation "op" applied pointwise to
>  the collective "inbuf". For example, if the op is max and
>  inbuf contains two float point numbers then
>        outbuf(1) = global max( inbuf(1)) and
>        outbuf(2) = global max( inbuf(2))
>  A set of standard operations are supplied with MPI including:
>    global max - for each data type
>    global min - for each data type
>    global sum - for each data type
>    global mult- for each data type
>    global AND - for integer and logical type
>    global OR  - for integer and logical type
>    global XOR - for integer and logical type

Every process receives the result of the reduction operation.

John Kapenga proposed two different reductions in
"Collection of Primitives" where in one
case all processes receive the result, the other only a
single process receives the result.

I concur with John's more flexible approach since for some
applications, only a single process needs the result.
Consider Gaussian Elimination with columns of the coefficient
matrix distributed to processors.  The following code illustrates.
This code must be translated into message-passing (SPMD) code
for each processor. (Assuming one process/processor)

s1:  DO I=1,N
s2:    LOC = MAXLOC(A[I,I:N])              /* max location in row */
s3:    EXCHANGE(A[1:N,I],A[1:N,LOC])       /* exchange columns */
s4:    A[I,I:N] = A[I,I:N] / A[I,I]
s5:    DO J=I+1,N
s6:       DO K=I+1,N
s7;          A[J,K] = A[J,K] - A[J,I] * A[I,K]
s8:       END DO
s9:    END DO
s10: END DO

The only processes that need to know the max location are
the process which owns column I and the process which
owns column LOC, in order to exchange columns.

The above code also illustrates where MAXLOC (and MINLOC)

Al Geist proposed the following interface for gather:
>  info = MPI_GATHER( buf, bytes, type, gid, root )
>
>  Function:
>  Called by all members of the group "gid"
>  using the same argument for "bytes", "type", "gid", and "root".
>  On return all the individual "buf" are concatenated into the "root" buf,
>  which must be of size at least gsize*bytes.
>  The data is laid in the "root" buf in rank order that is
>  | gid,0 data | gid,1 data | ...| gid, root data | ...| gid, gsize-1 data |
>  Other member's "buf" are unchanged on return.
>  On return "info" contains the error code.

Why must the data be laid out in "rank order"? This may not
always be necessary.  There is certainly additional overhead
in arranging it this way instead of just concatenating messages (with
GCPID) as they arrive. Perhaps there could be an option to obtain
in rank order when necessary.

Regards,
Edgar

======================================================================
| Edgar T. Kalns                     | Internet: kalns@cps.msu.edu   |
| Advanced Computing Systems Lab     | Tel: (517) 353-8666           |   
| Department of Computer Science     |                               |
| Michigan State University          |                               |
| East Lansing, MI 48824, USA        |                               |
======================================================================

From owner-mpi-collcomm@CS.UTK.EDU  Thu Mar 11 13:25:26 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA04034; Thu, 11 Mar 93 13:25:26 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA09574; Thu, 11 Mar 93 13:24:28 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Thu, 11 Mar 1993 13:24:27 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from deepthought.cs.utexas.edu by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA09566; Thu, 11 Mar 93 13:24:25 -0500
From: rvdg@cs.utexas.edu (Robert van de Geijn)
Received: from grit.cs.utexas.edu by deepthought.cs.utexas.edu (5.64/1.2/relay) with SMTP
	id AA23445; Thu, 11 Mar 93 12:24:26 -0600
Received: by grit.cs.utexas.edu (5.64/Client-v1.3)
	id AA13089; Thu, 11 Mar 93 12:24:16 -0600
Date: Thu, 11 Mar 93 12:24:16 -0600
Message-Id: <9303111824.AA13089@grit.cs.utexas.edu>
To: kalns@cps.msu.edu
Cc: mpi-collcomm@cs.utk.edu
In-Reply-To: kalns@cps.msu.edu's message of Thu, 11 Mar 93 12:48:42 EST <9303111748.AA04285@pit-bull.cps.msu.edu>
Subject: reduction and gather

   Dear Collective Communications Subcommittee:

   Al Geist proposed the following interface for reduction:
n
   >  info = MPI_GLOBAL_OP( inbuf, bytes, type, gid, op, outbuf )
   >
 
   Every process receives the result of the reduction operation.

   John Kapenga proposed two different reductions in
   "Collection of Primitives" where in one
   case all processes receive the result, the other only a
   single process receives the result.

   I concur with John's more flexible approach since for some
   applications, only a single process needs the result.
   Consider Gaussian Elimination with columns of the coefficient
   matrix distributed to processors.  The following code illustrates.
   This code must be translated into message-passing (SPMD) code
   for each processor. (Assuming one process/processor)

There are a number of reasons to have two versions: Indeed, the
"Fan-in" is often used, and can be implemented on most systems
requiring half the time of the GSUM to all (for large vectors).
Indeed, I propose a third version: A combine leaving the result in
pieces distributed among the nodes.  (This would be the inverse of the
GCOLX routine, with a combine added, in Intel Lingo).  an integer
array would indicate the size of the piece to be left at each node.
Again, there are performance issues behind the need for this last
operation, since the GSUM to all performs this operation, and more.

Robert




=====================================================================
  Robert A. van de Geijn                     rvdg@cs.utexas.edu  
  Assistant Professor
  Department of Computer Sciences            (Work)  (512) 471-9720
  The University of Texas                    (Home)  (512) 251-8301 
  Austin, TX 78712                           (FAX)   (512) 471-8885 
=====================================================================
From owner-mpi-collcomm@CS.UTK.EDU  Thu Mar 11 13:44:15 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA04337; Thu, 11 Mar 93 13:44:15 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA10540; Thu, 11 Mar 93 13:43:30 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Thu, 11 Mar 1993 13:43:29 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from msr.EPM.ORNL.GOV by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA10532; Thu, 11 Mar 93 13:43:16 -0500
Received: by msr.EPM.ORNL.GOV (5.67/1.34)
	id AA01838; Thu, 11 Mar 93 13:43:03 -0500
Date: Thu, 11 Mar 93 13:43:03 -0500
From: geist@msr.EPM.ORNL.GOV (Al Geist)
Message-Id: <9303111843.AA01838@msr.EPM.ORNL.GOV>
To: mpi-collcomm@cs.utk.edu
Subject: Re: Edgar's questions
Cc: kalns@cps.msu.edu


Hi Edgar,

>John Kapenga proposed two different reductions in
>all processes receive the result
>single process receives the result
>I concur with John's more flexible approach

I also agree that we can have both functions,
and the collective communication draft I  am maddly writing
contains both. (and some others submitted by Frye.)

>Why must the data be laid out in "rank order"? This may not
>always be necessary.

It is a convience to the user so that he may quickly
find data from a particular task. Since bytes is constant
root can place each message in the correct location in buf
with no extra overhead. So there is no incentive to have 
a random order.

Al Geist
From owner-mpi-collcomm@CS.UTK.EDU  Fri Mar 12 11:23:01 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA23453; Fri, 12 Mar 93 11:23:01 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA08903; Fri, 12 Mar 93 11:22:14 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Fri, 12 Mar 1993 11:22:12 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from [128.219.8.54] by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA08885; Fri, 12 Mar 93 11:22:09 -0500
Received: by gstws.epm.ornl.gov (AIX 3.2/UCB 5.64/4.03)
          id AA15629; Fri, 12 Mar 1993 11:21:55 -0500
Date: Fri, 12 Mar 1993 11:21:55 -0500
From: geist@gstws.epm.ornl.gov (Al Geist)
Message-Id: <9303121621.AA15629@gstws.epm.ornl.gov>
To: mpi-collcomm@cs.utk.edu
Subject: First draft of Collective Communication section of MPI.


\documentstyle[12pt]{article}
\begin{document}

\section{Collective Communication}

[I have placed comments and questions in square braces.]

\subsection{Introduction}

This section is a draft of the current proposal for collective communication.
Collective communication is defined to be communication that involves
a group of tasks. Examples are broadcast and global sum.
Because of the need to deal with groups of tasks, this section will also
present a proposal for the formation, partitioning, and managing
of basic groups. 
A basic group has two properties:
It has a group identifier which is associated with a set of tasks.
Each task in a group has a unique rank $0 - (p-1)$ in the group.
There is an initial default group {\bf ALL} that contains
all the tasks.
Giving or forming groups with topological features is presented in section 4.
[by the Topology subcommittee]

The collective communication routines are built above the point-to-point
routines. While vendors may optimize certain collective routines for
their architectures, a complete library of the collective communication
routines written entirely in point-to-point will be available.
The following communication functions are proposed.
\begin{itemize}
\item
Broadcast from one member to all members of a group.
\item
Barrier across all group members
\item
Gather data from all group members to one member.
\item
Scatter data from one member to all members of a group.
\item
Global operations such as sum, max, min, etc., were the result
is known by all group members and a variation where the result is
known by only one member. The ability to have user defined
global operations.
\item
Simultaneous shift of data around the group, the simplest example
being all members sending their data to (rank+1) with wrap around.
For portability, the topology section provides routines for 
defining who is a member's neighbor in a given direction and hops away.
\item
Scan across all members of a group (also called parallel prefix).
\item
Broadcast from all members to all members of a group.
\item
Scatter data from all members to all members of a group
(also called complete exchange or index).

To simplify the collective communication interface it is
designed with two layers. The low level routines have all the
generality of, and make use of, the buffer descriptor routines
of the point-to-point section which allows arbitrarily complex
messages to be constructed. The second level routines are
similar to the upper level point-to-point routines in that they send
only a contiguous buffer.

\section{Group Functions}

Before defining a collective operation between a group of tasks,
it is necessary to create and manage a group.
A group is identified by a group name that is supplied by the user.
[It is sufficient with static groups to only have an opaque group ID,
which is returned to the user during group formation.
But if we allow dynamic groups in some (future) version of MPI,
then there is no way for a new task to join a group since the 
user doesn't have the opportunity to label the group.
To allow for future extensibility of the group concept
the present draft specifies that groups be named. 
The underlying implementation can map this label to any
type of group ID that is convenient or fast. This could be
an elaborate structure or a simple integer.]
Each member of a group has a unique rank in the group. 
The rank values are the integers 0 to number-of-members minus 1.
Each group has a topology associated with it. 
The collective communication routines are implemented in terms
of the topology associated with a given group.
Athough the function would be the same, a broadcast in a group 
with a ring topology could be implemented differently from a broadcast in
a group with hypercube topology.

The default topology for a group is fully connected. 
Existing groups including the {\bf ALL} group
can switch their associated topology using the functions described in
section 4. This allows the user to match the group topology to the
algorithm executed by the group or the underlying hardware.

The debate rages on about whether groups should be dynamic or static.
A static group is defined to be a group where once it is formed
its membership never changes.
Static groups are just a subset of dynamic groups. The added generality
of dynamic groups is perceived as a useful property to have in MPI
at some future time. One of the most important properties dynamic
groups allows is the development of fault tolerant applications.
Given the time constraints for MPI-1,
the following proposal is written so that dynamic groups 
are possible, but MPI-1 only specifies the restricted case
where the groups are static. 

my\_rank = MPI\_PARTGROUP(group, newgroup)

     Returns only after all members of group have called it.
     The newgroup argument is used as a key. All members with
     the same newgroup argument are placed in the same group
     and their rank in this new group is returned.

old\_group = MPI\_LVGROUP(group)

     In the restricted case of MPI-1, returns only after all
     members of group have called it. 
     Frees all memory and system resources used by group. 
     Returns the name of the old group
     from which they were last partitioned (and of which
     they are still a member). It is an error to call lvgroup(ALL).

size    = MPI\_GSIZE(group)

     Returns the (instantaneous) size of group. [can be called by any task?]

rank    = MPI\_GETRANK(group,pid)

     Given that pid is the unique (possibly opaque) task identifier,
     returns the rank of pid in group.

pid     = MPI\_GETPID(group,rank)

     Given that pid is the unique (possibly opaque) task identifier,
     returns the pid of the task identified by (group,rank).

pid = MPI\_MYPID()

This is included here for completeness to show how a task could
get its rank in a group.

my\_rank = MPI\_JOINGROUP(group)

     Dynamic group function available in MPI-2. 
     Can be called by an individual task with any argument for group.
     If group doesn't exist, then it is created and this task
     becomes its first member.
     If the group exists, then this task is placed in the group
     and given the lowest available rank. For example, if there is
     a gap in the ranks due to a process failure, then this task
     would fill the gap.

\section{Communication Functions}

The proposed communication functions are divided into two layers.
The lowest level uses the same buffer descriptor routines 
available in point-to-point to create noncontiguous, multiple data type
messages. The second level handles only contiguous single data type
messages. Like the point-to-point high level interface, the second
level of collective communication routines handles heterogeneity.

There has been discussion about the synchronization properties
of the collective communication routines. In this proposal
routines can (but are not required to) return as soon as their 
participation in the collective communication is complete.

Each of the following functions returns an error code 
in the info argument.

\subsection{Level 2 routines}

info = MPI\_BCAST( buf, nitems, type, tag, group, from\_rank )

MPI\_BCAST broadcasts a message to all members of a group.
It is called by all members of group using the same arguments for
nitems, type, tag, group, and from\_rank.
On return the contents of the array buf on the member with from\_rank
is contained in buf on all group members.
type is the data type to be sent, nitems is the number of 
these items, tag is a user supplied message tag.

info = MPI\_BARRIER( group, tag )

MPI\_BARRIER blocks the calling task until all group members have called it
using the same tag, 
MPI\_BARRIER returns only when all group members have called this function.

info = MPI\_GATHER( inbuf, outbuf, nitems, type, tag, group, to\_rank~)

MPI\_GATHER gathers the nitems in each group member's inbuf
and places these items in rank order in the to\_rank member's outbuf.
It is called by all members of group using the same arguments for
nitems, type, tag, group, and to\_rank.
The receiving member must declare outbuf to be at least
(nitems * sizeof(type)) * (gsize(group)).
outbuf is unchanged on all the other group members.


info = MPI\_SCATTER( inbuf, outbuf, nitems, type, tag, group, from\_rank~)

MPI\_SCATTER sends different pieces of the from\_rank member's inbuf
to each of the other group members.
The routine is called by all members of the group using the same arguments for
nitems, type, tag, group, and from\_rank.
The data is laid in the from\_rank member's inbuf in rank order.
The other member's inbuf is unchanged by the routine.
On return each member's outbuf contains its nitems piece of the
originators inbuf.

info = MPI\_GLOBAL\_OP( inbuf, outbuf, nitems, type, tag, group, op~)

MPI\_GLOBAL\_OP performs a global operation on the inbuf and
returns the result in outbuf.
The routine is called by all group members using the same arguments
for nitems, type, tag, group, and op.
On return the outbuf of each member contains the result of 
the global operation op applied pointwise the the collective inbuf.
For example, if the op is max and inbuf contains two floating point numbers,
then outbuf(1) $=$ global max(inbuf(1)) and outbuf(2) $=$ global max(inbuf(2)).
A set of standard operations are supplied with MPI including:
\begin{itemize}
\item global max for each data type
\item global min for each data type
\item global sum for each data type
\item global mult for each data type
\item global AND for integer and logical
\item global OR for integer and logical
\item global XOR for integer and logical
\item global scalar max and who has it
\item global scalar min and who has it
\end{itemize}

info = MPI\_USER\_OP( inbuf, outbuf, nitems, type, tag, group, func~)

Same as the global operation function above except the user
supplies the function that is performed on each member rather
than using the standard operations.

info = MPI\_REDUCE(inbuf, outbuf, nitems, type, tag, group, to\_rank, op~)

Same as the global operation function above except only the 
to\_rank member receives the result in its outbuf. The outbuf
of all other routines is unchanged.

info = MPI\_SHIFT( inbuf, outbuf, nitems, type, tag, group, steps~)

Simultaneous shift of data a given number of steps around the group, 
the simplest example
being all members sending their data to (rank+1) with wrap around.
For portability, the topology section provides routines for
defining who is a member's neighbor in a given direction and hops away.

info = MPI\_SCAN( inbuf, outbuf, nitems, type, tag, group, op )

MPI\_SCAN is used to perform a parallel prefix with respect to
an associative reduction operation on data distributed across the group. 
The same standard operations as found in MPI\_GLOBAL\_OP are supplied
with MPI.

info = MPI\_ALLCAST( inbuf, outbuf, nitems, type, tag, group )

Broadcast from all members to all members of a group.

info = MPI\_ALLSCATTER( inbuf, outbuf, nitems, type, tag, group~)

Each process sends a distinct message to all the other
processes in the group, aka - all-to-all personalized
communication. Each process in the calling group partitions
its local buffer into N blocks of equal size, where N is
the number of processes in the group.  The ith process in
the sends its jth block in the out buffer to the jth process
and this block is stored at the ith block in its in buffer.
Therefore the ith block of the out buffer will be copied
locally to the ith block of the in buffer.

\subsection{Level 1 routines}

[I suggest that the level 1 routines be deferred to MPI-2
as well as the buffer descriptor versions of point-to-point.
But if point-to-point includes bd versions then it will be
easy to include comparable version of collective communication routines.
I like the bd version of point-to-point and collective, but I feel
it deviates too far from common practice for MPI-1.]

Level 1 routines allow the user to communicate noncontiguous messages
containing multiple data types. The present proposal is for the 
collective routines to use the same routines that are in the
point-to-point interface to create these arbitrary messages.
Not all collective operations make sense in this context.
The following functions are provided in level 1:

\begin{tabular}{l}
info = MPI\_BCASTBD( bd, tag, group, from\_rank )            \\
info = MPI\_GATHERBD( inbd, outbd, tag, group, to\_rank )    \\
info = MPI\_SCATTERBD( inbd, outbd, tag, group, from\_rank ) \\
info = MPI\_USER\_OPBD( inbd, outbd, tag, group, func )     \\
info = MPI\_SHIFTBD( inbd, outbd, tag, group, steps )       \\
info = MPI\_ALLCASTBD( inbd, outbd, tag, group )            \\
info = MPI\_ALLSCATTERBD( inbd, outbd, tag, group )         \\
\end{tabular}

The descriptions of the functions is the same as in level 2
with the exception that instead of a contiguous block of data
of the same data type each block of data is described by a
buffer descriptor for both input and output buffers.
data types.

\subsection{Nonblocking Communication}

[There was discussion at the last meeting about having nonblocking
variants of the collective communication routines.
They are not presented here because a formal proposal was never 
submitted to the collective communication subcommittee for discussion.
The proposal must explain how the routines work, how they are
used in an application preferably with an example, and if 
possible how the routines could be implemented with discussion
about message order guarantees, robustness, and cancellation.
I feel that the nonblocking routines are far too complex for MPI-1,
and should not be discussed in the present proposal.]

\end{document}
From owner-mpi-collcomm@CS.UTK.EDU  Sun Mar 14 13:59:12 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA05206; Sun, 14 Mar 93 13:59:12 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA16184; Sun, 14 Mar 93 13:58:44 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sun, 14 Mar 1993 13:58:42 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from pnlg.pnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA16158; Sun, 14 Mar 93 13:58:04 -0500
Received: from carbon.pnl.gov (130.20.188.38) by pnlg.pnl.gov; Sun, 14 Mar 93
 10:57 PST
Received: from sodium.pnl.gov by carbon.pnl.gov (4.1/SMI-4.1) id AA29374; Sun,
 14 Mar 93 10:55:26 PST
Received: by sodium.pnl.gov (4.1/SMI-4.0) id AA05149; Sun, 14 Mar 93 10:55:22
 PST
Date: Sun, 14 Mar 93 10:55:22 PST
From: rj_littlefield@pnlg.pnl.gov
Subject: proposal to mpi-collcomm
To: d39135@sodium.pnl.gov, geist@gstws.epm.ornl.gov, gropp@mcs.anl.gov,
        jim@meiko.co.uk, lusk@mcs.anl.gov, lyndon@epcc.ed.ac.uk,
        mpi-collcomm@cs.utk.edu, mpi-context@cs.utk.edu, ranka@top.cis.syr.edu,
        tony@Aurora.CS.MsState.Edu
Message-Id: <9303141855.AA05149@sodium.pnl.gov>
X-Envelope-To: mpi-context@cs.utk.edu, mpi-collcomm@cs.utk.edu

Al & Tony, et.al.:

I am about to send to mpi-collcomm, two notes regarding changes I
propose to the collective communication specification.  (One note
summarizes the changes; the other discusses the reasons for them.)

I am also sending these notes to mpi-context and friends because
they relate to other discussions going on there.

Thought you'd like to know...
--Rik

----------------------------------------------------------------------
rj_littlefield@pnl.gov               Rik Littlefield
Tel: 509-375-3927                    Pacific Northwest Lab, MS K1-87
                                     P.O.Box 999, Richland, WA  99352
From owner-mpi-collcomm@CS.UTK.EDU  Sun Mar 14 15:04:40 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA06334; Sun, 14 Mar 93 15:04:40 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA18143; Sun, 14 Mar 93 15:04:09 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sun, 14 Mar 1993 15:04:08 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from pnlg.pnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA18125; Sun, 14 Mar 93 15:03:46 -0500
Received: from carbon.pnl.gov (130.20.188.38) by pnlg.pnl.gov; Sun, 14 Mar 93
 12:01 PST
Received: from sodium.pnl.gov by carbon.pnl.gov (4.1/SMI-4.1) id AA29382; Sun,
 14 Mar 93 11:59:17 PST
Received: by sodium.pnl.gov (4.1/SMI-4.0) id AA05208; Sun, 14 Mar 93 11:59:13
 PST
Date: Sun, 14 Mar 93 11:59:13 PST
From: rj_littlefield@pnlg.pnl.gov
Subject: collcomm changes, summary
To: geist@gstws.epm.ornl.gov, gropp@mcs.anl.gov, jim@meiko.co.uk,
        lusk@mcs.anl.gov, lyndon@epcc.ed.ac.uk, mpi-collcomm@cs.utk.edu,
        mpi-context@cs.utk.edu, ranka@top.cis.syr.edu,
        tony@Aurora.CS.MsState.Edu
Cc: d39135@sodium.pnl.gov
Message-Id: <9303141959.AA05208@sodium.pnl.gov>
X-Envelope-To: mpi-context@cs.utk.edu, mpi-collcomm@cs.utk.edu

SUMMARY OF SUGGESTED CHANGES TO COLLECTIVE COMMUNICATION PROPOSAL

The draft proposal that Al Geist distributed several days ago
contains some features that would prevent it from being
implemented as a layer on top of MPI point-to-point facilities.

The purpose of this note is to propose changes to the group
control routines in order to permit layering, and to propose
other changes for better and more predictable performance.

A discussion of the rationale for these proposed changes 
will be distributed separately because of its length.

The main changes introduced in this note are:

. The concept of group identification is firmed up.  Most
  operations use a "group handle" that is local to the process.
  (Think of the group handle as being just the address of a
  potentially large and complex "group descriptor".)  There is
  still a "group ID" that is globally unique, but it has only a
  secondary role and can be ignored by most applications.  The
  "group name" is entirely removed from MPI-1.  (Group names are
  still anticipated in MPI-2, but upward-compatibility is
  maintained in a different way from the draft proposal.)

. A semantic restriction is introduced, that a process can access
  information about a group only if the process holds a group
  handle for it.  Group handles can be obtained in two ways: 1)
  they are produced by group formation routines, and 2) a process
  can explicitly distribute copies of its group handles to other
  processes, using new routines introduced specifically for that
  purpose.

. A cacheing mechanism is introduced, that allows modules to
  attach arbitrary information to a group descriptor in such a
  way that it can be quickly retrieved.  Cacheing facilitates the
  construction of collective communication routines that are
  "fast after the first execution in a group", no matter how the
  other group operations are implemented.

. A new group formation routine is introduced, that is less
  synchronous and more general than MPI_PARTGROUP.

Specifically, the following routines are proposed to be added or
modified:

1. Arbitrary group formation:

    newgrp_handle = MPI_FORMGROUP (grouptag,groupsize,knownmembers)

    where
     grouptag     is a user-provided integer tag, sufficiently unique
                  to disambiguate overlapping groups that might be
                  formed simultaneously (say by multiple threads).

     groupsize    is the number of members that will compose the group.

     knownmembers is a set of pid's of some or all members of the group.
                  Each member of the group must provide the same
                  set of knownmembers.

     newgrp_handle  is a group handle for the newly formed group

    This new routine must be called synchronously, but only by those
    processes forming the group.

2. Group partitioning:

    newgrp_handle = MPI_PARTGROUP (oldgrp_handle,grouptag)

    where the semantics are the same as the draft proposal except that
    the return value is now a new group handle instead of a rank.
    (The rank can be determined by a separate call to
    MPI_GETRANK(group_handle,pid) .)

3. Group disbanding:

    MPI_LVGROUP (group_handle)

    where the semantics are the same as the draft proposal except that
    MPI_LVGROUP now does not return any result.  (Since groups can now
    be formed arbitrarily, not just by partitioning, it is not obvious
    what MPI_LVGROUP could return in general.)  This routine can be
    called only by members of the group.

4. Distribution of group handles and disposition of distributed handles:

    MPI_SendGroupHandle (pid,context,tag,old_group_handle)

    new_group_handle = MPI_RecvGroupHandle (pid,context,tag)

    MPI_FreeGroupHandle (group_handle)

    (The latter routine is similar to MPI_LVGROUP except that
    it can be called only for distributed group handles.  This is
    solely for semantic clarity; a single interface routine would do.)

5. Cacheing group-specific process-local information:

    The following routines get and free keys for use with group
    cacheing.

      key = MPI_GetAttributeKey ()
      MPI_FreeAttributeKey ()

    The following routines cache and retrieve information.

      MPI_SetGroupAttribute  (grouphandle,key,value,destructor_routine)
      status = MPI_TestGroupAttribute (grouphandle,key,&value)

    where
      key         must be unique within the group
      value       is anything the size of a pointer
      destructor_routine   is an application-provided routine that
                           is called by MPI_LVGROUP, with arguments
                           being the group handle, cached key and value.

    Cached information is stripped from the new group handle
    returned by MPI_SendGroupHandle.

    In a conforming implementation, MPI_TestGroupAttribute must
    be no slower than a point-to-point communication call.

6. Retrieving global group ID:

    global_id = MPI_GetGlobalGroupID (grouphandle)

7. Other collective communications:

   Consistently substitute "grouphandle" in place of "group".

----------------------------------------------------------------------
rj_littlefield@pnl.gov               Rik Littlefield
Tel: 509-375-3927                    Pacific Northwest Lab, MS K1-87
                                     P.O.Box 999, Richland, WA  99352
From owner-mpi-collcomm@CS.UTK.EDU  Sun Mar 14 15:50:50 1993
Received: from CS.UTK.EDU by surfer.EPM.ORNL.GOV (5.61/1.34)
	id AA06925; Sun, 14 Mar 93 15:50:50 -0500
Received: from localhost by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA19504; Sun, 14 Mar 93 15:50:24 -0500
X-Resent-To: mpi-collcomm@CS.UTK.EDU ; Sun, 14 Mar 1993 15:50:22 EST
Errors-To: owner-mpi-collcomm@CS.UTK.EDU
Received: from pnlg.pnl.gov by CS.UTK.EDU with SMTP (5.61++/2.8s-UTK)
	id AA19411; Sun, 14 Mar 93 15:49:35 -0500
Received: from carbon.pnl.gov (130.20.188.38) by pnlg.pnl.gov; Sun, 14 Mar 93
 12:48 PST
Received: from sodium.pnl.gov by carbon.pnl.gov (4.1/SMI-4.1) id AA29389; Sun,
 14 Mar 93 12:46:59 PST
Received: by sodium.pnl.gov (4.1/SMI-4.0) id AA05301; Sun, 14 Mar 93 12:46:57
 PST
Date: Sun, 14 Mar 93 12:46:57 PST
From: rj_littlefield@pnlg.pnl.gov
Subject: collcomm changes, rationale
To: geist@gstws.epm.ornl.gov, gropp@mcs.anl.gov, jim@meiko.co.uk,
        lusk@mcs.anl.gov, lyndon@epcc.ed.ac.uk, mpi-collcomm@cs.utk.edu,
        mpi-context@cs.utk.edu, ranka@top.cis.syr.edu,
        tony@Aurora.CS.MsState.Edu
Cc: d39135@sodium.pnl.gov
Message-Id: <9303142046.AA05301@sodium.pnl.gov>
X-Envelope-To: mpi-context@cs.utk.edu, mpi-collcomm@cs.utk.edu

RATIONALE FOR SUGGESTED CHANGES TO COLLECTIVE COMMUNICATION PROPOSAL

In a related summary, I outlined a set of suggested changes to
the concepts and routines in the collective communication proposal.

The purpose of this note is to present the rationale for those
suggestions and to discuss possible alternatives.

The discussion is organized into 5 areas, flagged with "----- Topic #".

Entries flagged with > are from my summary of suggested changes.
Entries flagged with >>> are from the draft proposal sent out by Al Geist.

----- Topic #1: Group Identification -----

> . The concept of group identification has been firmed up.  Most
>   operations use a "group handle" that is local to the process.
>   (Think of the group handle as being just the address of a
>   potentially large and complex "group descriptor".)
>        ...
> . A semantic restriction is introduced, that a process can access
>   information about a group only if the process holds a group
>   handle for it.  Group handles can be obtained in two ways: 1)
>   they are produced by group formation routines, and 2) a process
>   can explicitly distribute copies of its group handles to other
>   processes, using new routines introduced specifically for that
>   purpose.

There are two issues here: one of being able to layer collective
communications on top of point-to-point at all, and a secondary
one of efficiency.

The more fundamental issue is layering.  Given only MPI point-
to-point functionality, how can a group identifier (whatever it
is) be transmitted between processes so as to be useful to the
receiver?

Presumably we want to allow group identifiers to be passed around so
that any process holding the group identifier can use it for purposes
like translating between (group,rank) and pid.  We also want to allow
this translation to be done asynchronously, i.e., without requiring
the explicit cooperation of any other MPI process at the time of
translation.  Since MPI pt-pt does not support asynchronous servers or
an interrupt receive capability, this implies that the group
identifier must come complete with enough information to resolve all
translations without communication.

This prompts the concept that the group identifier must be associated
with a "group descriptor" that is large and complex enough to
fully describe the group.

How is the association done?  This is a question of efficiency.  If
the identifier is allowed to be process-local, the group descriptor
can be located very quickly -- just make the identifier be a pointer
to the group descriptor.  Requiring the identifier to have global
scope would not be so good.  In that case, either the identifier has
to be carefully constructed or the association has to be done with
some sort of table search.  These issues also arise with global pid's.
However, groups can be formed much more often and in greater numbers
than processes.  I doubt that careful construction tricks could be
assured to be adequate, and if not, then a table search would be
required on each collective communication call.

The conclusion is that, for most purposes, a process-local
identifier generated by the system is preferred.  Such things are
typically called "handles", hence the term "group handle".

> 4. Distribution of group handles and disposition of distributed handles:
> 
>     MPI_SendGroupHandle (pid,context,tag,old_group_handle)
> 
>     new_group_handle = MPI_RecvGroupHandle (pid,context,tag)
> 
>     MPI_FreeGroupHandle  (group_handle)

The next question is how group handles should be distributed.

Implicit distribution is out because MPI pt-pt doesn't support
a server capability, and presumably we aren't willing to synchronize
all of the processes whenever somebody creates a group handle.

So, explicit distribution is required.  How do we handle it?

Two ideas that I do not like are the following.  MPI might provide
routines to translate to and from some machine- and process-
independent format, so that the translated information could be sent
using normal point-point primitives.  This strategy requires that the
user program manage the storage of indefinite-length objects, which
makes for an ugly Fortran interface.  Or, group descriptors (and their
translation routines) might be built into point-point MPI as another
data type.  This violates the spirit of layering collective
communication on point-to-point, and has the same storage management
problem.

The three routines proposed above were the cleanest interface
I could think of.

----- Topic #2: Global Group ID -----

>   There is
>   still a "group ID" that is globally unique, but it has only a
>   secondary role and can be ignored by most applications.  
>         ...
>     global_id = MPI_GetGlobalGroupID (grouphandle)

Given that we are now able (and required) to pass around copies of
group handles, it is not clear to me that MPI really needs special
support for the concept of a global group ID.  On the other hand,
it's easy to provide, since we have to construct one or more
globally unique context values for each group anyway.  So just
use the first such context value as the global ID.  This gives
something unique that all processes can agree on.  

But note that knowing just the global group ID does not let you
get other information about the group -- you have to hold a group
handle for that.

(We could add a routine that would accept the global group ID and
return a handle for that group, presuming that the process held
one.  This would be cheap to do, since group handles are managed
by MPI anyway, and I can vaguely imagine that it might help some
applications.  On the other hand, there are no similar "handle
lookup" facilities provided elsewhere in MPI, and I'm reluctant
to set that kind of precedent without clear need.)

----- Topic #3: Group Formation -----

> . A new group formation routine is introduced, that is less
>   synchronous and more general than MPI_PARTGROUP.
>        ...
> 1. Arbitrary group formation:
> 
>     newgrp_handle = MPI_FORMGROUP (grouptag,groupsize,knownmembers)
> 
>     where
>      grouptag     is a user-provided integer tag, sufficiently unique
>                   to disambiguate overlapping groups that might be
>                   formed simultaneously, say by multiple threads.
> 
>      groupsize    is the number of members that will compose the group.
> 
>      knownmembers is a set of pid's of some or all members of the group.
>                   Each member of the group must provide the same
>                   set of knownmembers.
> 
>      newgrp_handle     is a group handle for the newly formed group
> 
>     This new routine must be called synchronously, but only by those
>     processes forming the group.  

The draft proposal distributed by Al Geist says that

>>> A group is identified by a group name that is supplied by the user.

A group name by itself is not enough to allow implementing groups
as a layer on top of point-to-point, unless we impose
restrictions that I think would be not acceptable.

The problem is: how does a group-forming routine know whom it
should send messages to, in order to form the group?

MPI_PARTGROUP does not have a problem with this, because it has
to be called synchronously by all members of the group.  Since
each current member of the group holds a handle (descriptor) for
that group, it is easy for each member to figure out who talks to
whom.

Unfortunately, there are some important application designs that
I do not see how to implement with just MPI_PARTGROUP.

For example, I am now doing an application that uses a
master-slaves strategy to asynchronously parcel out chunks of
work, with each chunk being done by several processes working
collaboratively.  Collective communication between those
processes is required, so it seems natural to organize them into
MPI groups.  Using a synchronous group partitioning routine
would introduce a risk of load imbalance, because the varying
chunk size implies that groups can finish their work at
different times, and synchronous partitioning would delay their
reassignment.

Applications like this could benefit from a group formation
routine that is called synchronously, but only by those
processes forming the group -- hence MPI_FORMGROUP.

This type of routine does have the problem of identifying its
collaborators, and the only solution I can think of is to
tell it.  That's what the knownmembers argument is for.

I have specified knownmembers in terms of pid's because I assume
that point-to-point communication based on pid's is always fast
and unrestricted.  If knownmembers were based on (group,rank)
pairs, then per the discussion above, all processes making this
call would have to hold handles (descriptors) for the referenced
groups.  This seems to me to be more trouble than it's worth, but
others may disagree.

Another comment about efficiency...  The size of the knownmembers
set affects the efficiency of group formation.  At one extreme,
only one member is required to be known.  This is scalable in a
memory sense, but not in a time sense, because it implies O(P)
group formation time for a group of P processes.  At the other
extreme, all members can be specified.  This is not scalable in a
memory sense, but allows guaranteed O(log P) formation time.
Other tradeoffs are possible, such as O(sqrt P) knownmembers and
O(sqrt P) formation time.  The interface as specified allows
each application to choose the type of scalability it wants.

----- Topic #4: Group Names -----

>   ...  The
>   "group name" is entirely removed from MPI-1.  (Group names are
>   still anticipated in MPI-2, but upward-compatibility is
>   maintained in a different way from the draft proposal.)

The draft distributed by Al Geist states:

>>> To allow for future extensibility of the group concept
>>> the present draft specifies that groups be named. 

Requiring names has the drawback that 1) it burdens the user with
at least the appearance of having to create unique names, in
order to be upward-compatible with dynamic groups, even though 2)
in a layered MPI-1, there is no way in general to check global
uniqueness, and thus programs can work fine with non-unique names.

This combination strikes me as actually impeding upward-
compatibility.  The tendency will be for programmers to use
non-unique names because it works and it's easy.  But such programs
would break when MPI-2 came along and started actually using
the names for something.  I don't like encouraging people to
write programs that are going to break.

I do support upward compatibility.  However, rather than requiring
names in MPI-1, I propose that they be deferred entirely to
MPI-2, at which point they can be supported either just through
MPI_JOINGROUP (as an alternative to MPI_FORMGROUP) or via
additional routines to attach globally unique names to groups
that have already been formed via MPI_JOINGROUP.

----- Topic #5: Cacheing -----

> 5. Cacheing group-specific process-local information:
> 
>     The following routines get and free keys for use with group
>     cacheing.
> 
>       key = MPI_GetAttributeKey ()
>       MPI_FreeAttributeKey ()
> 
>     The following routines cache and retrieve information.
> 
>       MPI_SetGroupAttribute  (grouphandle,key,value,destructor_routine)
>       MPI_TestGroupAttribute (grouphandle,key,&value)
> 
>     where
>       key         must be unique within the group
>       value       is anything the size of a pointer
>       destructor_routine   is an application-provided routine that
>                            is called by MPI_LVGROUP, with arguments
>                            being the group handle, cached key and value.
> 
>     Cached information is stripped from the new group handle
>     returned by MPI_SendGroupHandle.
> 
>     In a conforming implementation, MPI_TestGroupAttribute must
>     be no slower than a point-to-point communication call.

This feature is purely for efficiency, but I think it's so valuable,
cheap, and clean that something like it has to go in.

One feature of collective communication is that the fastest
algorithm for any particular job usually depends on the machine
topology, which processes belong to the group, and the amount of
data being manipulated.  For example, global combine of L data
elements across P = RC processes on a 2-D RxC mesh can be done in
O(L log(P)) time using a fanin/fanout algorithm, or in O(L + sqrt(P))
time using a nested rings algorithm.  The former is better for
small L, the latter for big L, and using the wrong one can easily
cost a factor of 3 in execution time.

So, there is strong motivation to write collective communication
routines that are adaptive in the sense of figuring out which
algorithm is best.  The problem is that it can take quite a lot
of time to make the decision, starting from a scratch position
of not even knowing which processes belong to the group.  It's
going to take lots of calls to the inquiry routines to get that
information, and then some more cycles to make the proper decisions.

Obviously it would be profitable to cache the information and/or
decisions.  The question is, where?  

It is tempting to say that the collective communication routine
could or should keep its own cache, indexed by group handle
and/or global group ID.  The problem is, groups are dynamic in
the sense of being formed and disbanded, so that unless group IDs
can get very large, eventually they will have to be reused.  Now,
it wouldn't do to have a collective communication routine use
stale cached information, so if the collective communication
routine is keeping its own cache, then it needs to b