Minutes of the Message Passing Interface Standard Meeting
Dallas, January 6-8, 1993

The MPI Standards Committee met in Dallas on January 6-8, 1993, at the
Bristol Suites Hotel in North Dallas.  This was the third meeting of the
MPI committee, but the first following the format used by the High
Performance Fortran Forum.  There were both general meetings of the
committee as a whole and meetings of several of the subcommittees.
Because interest in the Point-to-Point communications and the Collective
communications was so general, these met as committees of the whole.

No formal decisions were taken at this meeting, but a number of straw
votes were taken in the subcommittees.  These are included as part of
the reports on the work of the subcommittees.

These minutes were taken by Rusty Lusk (lusk@mcs.anl.gov) with some
additions by Bob Knighten.  Marc Snir's notes on the point-to-point
subcommittee meetings are included here as well.

These minutes are quite long.  If you want to see the important topics,
you can search for --- and this will quickly lead to each topic (and a
few other things).

January 6
---------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

The meeting was called to order by Jack Dongarra at 1:30.  Jack Dongarra
presented the rules and procedures that had been circulated in the
mailing list.  In general, they say that we intend to operate in a very
open fashion, following the example set by the High-Performance Fortran
Committee.  He also described the subcommittee structure.  For details,
see the mailing list.

A tentative schedule for future meetings was presented, which was
amended on the last day (see there).  All meetings will be in Dallas at
the Bristol Suites.

Steve Otto will coordinate the production of the document.  He will
obtain a set of LaTeX macros from the HPF Committee and distribute them
to the subcommittee heads.
It was suggested by Bob Knighten that the Executive Director arrange for
copies of all pertinent documents to be provided at the meetings.
Dennis Weeks, who is somewhat local (Convex), volunteered to help with
the relevant copying.

The attendees were:

  Ed Anderson          Cray Research             eca@cray.com
  James Cownie         Meiko                     jim@meiko.co.uk
  Jack Dongarra        UT/ORNL                   dongarra@cs.utk.edu
  Jim Feeney           IBM-Endicott              feeneyj@gdlvm6.vnet.ibm.com
  Jon Flower           ParaSoft                  jwf@parasoft.com
  Daniel Frye          IBM-Kingston              danielf@kgnvma.vnet.ibm.com
  Al Geist             ORNL                      gst@ornl.gov
  Ian Glendinning      Univ. of Southampton      igl@ecs.soton.ac.uk
  Adam Greenberg       TMC                       moose@think.com
  Bill Gropp           ANL                       gropp@mcs.anl.gov
  Robert Harrison      PNL                       rj_harrison@pnl.gov
  Leslie Hart          NOAA/FSL                  hart@fsl.noaa.gov
  Tom Haupt            Syracuse U.               haupt@npac.syr.edu
  Rolf Hempel          GMD                       hempel@gmd.de
  Tom Henderson        NOAA/FSL                  hender@fsl.noaa.gov
  C. T. Howard Ho      IBM Almaden               ho@almaden.ibm.com
  Steven Huss-Lederman SRC                       lederman@super.org
  John Kapenga         Western Michigan Univ.    john@cs.wmich.edu
  Bob Knighten         Intel SSD                 knighten@ssd.intel.com
  Bob Leary            SDSC                      leary@sdsc.edu
  Rik Littlefield      PNL                       rj_littlefield@pnl.gov
  Rusty Lusk           ANL                       lusk@mcs.anl.gov
  Barney Maccabe       Sandia                    abmacca@cs.sandia.gov
  Phil McKinley        Michigan State            mckinlehy@cps.msu.edu
  Chuck Mosher         ARCO                      ccm@arco.com
  Dan Nessett          LLNL                      nessett@llnl.gov
  Steve Otto           Oregon Graduate Institute otto@cse.ogi.edu
  Paul Pierce          Intel                     prp@ssd.intel.com
  Peter Rigsbee        Cray Research             par@cray.com
  Ambuj Singh          UC Santa Barbara          ambuj@cs.ucsb.edu
  Marc Snir            IBM                       snir@watson.ibm.com
  Robert G. Voigt      NSF                       rvoigt@nsf.gov
  David Walker         ORNL                      walker@msr.epm.ornl.gov
  Dennis Weeks         Convex                    weeks@convex.com
  Stephen Wheat        Sandia NL                 srwheat@cs.sandia.gov

-------------------------------------------------------------------------------
Point-to-point subcommittee
-------------------------------------------------------------------------------

Marc Snir called the meeting to order at 1:40 p.m.  It adjourned at
4:10 p.m.  It resumed the following morning at 9:10 a.m.
and adjourned at 4:15 p.m.

Marc Snir began by summarizing the decisions that we have to make:

  * which operations?
      send receive
      channels?  sendreceive?
      info arguments
      operations on queues
      probe?
  * operation modes
      sync async
      local and/or global termination
      interrupt-driven?
  * message types (data types)
      structure of data in core
      buffer packing
  * send-receive matching
      type (We later decided to call this "tag".)
      sender?
  * correctness criteria (See Marc Snir's paper in handouts)
  * heterogeneous operations
  * name space
      how processes are addressed
      flat?  structured?  implicit/explicit
  * error handling
  * interaction with threads, interrupt handlers, remote signalling
  * special operations for high performance
      ready receiver?
  * process startup
  * syntax/style (The plan is to postpone this for this meeting.)

We will prioritize this list and then go through them one by one.  (The
priorities assigned were more or less in the order listed above.)

Two preliminary questions were then discussed:

A. Must we worry about multithreaded environments?  James Cownie
   pointed out that threads were coming, in almost all new systems.
   Most systems have threads now.  It was proposed that a process,
   which could send and receive messages, should be an address space,
   so that individual threads would not be (MPI-) addressable.

B. What about signals?  Paul Pierce suggested that we discuss signals
   first: do we want to support send/receive from interrupt handlers?

These two questions were then discussed at length.  Dealing with
threads argues against the notion of "last message", since that implies
state is maintained by the system.  There was general agreement that
"state" was a bad thing, but arguments in favor of state are:

  Sometimes one doesn't want all of the information available after an
  operation, so it shouldn't be returned.

  Having lots of arguments to calls is bad, especially inout arguments.
Ways to avoid state are:

  Structures could be returned

  Return individual arguments

  Return a tag to do queries on (but then one needs to free it)

  Additional out arguments (OK in Fortran 90, but not in C or f77)

  User passes in storage to be used (so he knows the address), and MPI
  provides inquiry functions

[For more details, see Jim Cownie's mail message of January 4, 1993
entitled: Multifarious]

There was general agreement that system state decreases portability and
manageability, and we should decrease it when we can.  James Cownie
said that we need a reentrant style, and Marc Snir suggested that we
try to make all function calls reentrant.  When queried, no one in the
group objected to trying to make all the functions that are introduced
in MPI reentrant.

Now we began going through the above-mentioned major topics.

Which Operations?
----- ----------

We have send and receive.  How about send-receive (also called shift)?
It can be efficiently implemented, and the buffer can be reused.  There
was a discussion of the "two-body" send-receive (exchange) and the
"three-body" version (ring-shift).  Variations include those in which
the send-buffer is required to be the same as the receive-buffer and
those in which it is required to be disjoint from the receive-buffer.

Al Geist: We should focus on *required* operations.  Steve Otto replied
that send-receive *is* a required operation.  Using "exchange" can help
avoid deadlock.  It was agreed that there was no consensus on these
issues and it was decided to defer this to the collective communication
subcommittee.

Operation Modes
--------- -----

The next topic that Marc Snir raised for discussion was when send and
receive return.  Marc described several options:

For send:
  1) return as soon as possible
  2) return when send-buffer is clear
  3) return when the corresponding receive has completed

For receive:
  1) return as soon as possible
  2) return when the receive-buffer is full

"Receive has completed" means "when the user knows".
In other words, when the sender has returned from send, the receiver
has returned from receive.

There was a general discussion about whether 3) was necessary?
dangerous?  Robert Harrison said he believed that 3) was the minimal
version that was truly portable.  Steve Otto pointed out that 3) is
CSP-like.  Rusty Lusk said that 3) would be easier to prove things
about than the others.  Adam Greenberg and Paul Pierce pointed out that
neither TMC nor Intel has implemented an operation depending on the
behavior of the receiver.  A straw vote was taken and the vote was 17-3
in favor of having 3) as an option.

Marc Snir pointed out that in his original proposal send returns a
handle and the status of the handle is then tested for completion of
the send operation, and asked if this is desirable.  There was general
agreement that something of this sort was desirable, but a variety of
alternatives were mentioned.  It was pointed out that sometimes one
wants to wait on multiple outstanding operations.  Al Geist prefers
separating "wait" into "sendwait" and "receivewait" for code
readability.  Bill Gropp suggested that instead of using handles, one
could supply a routine to be called when an operation completes.  James
Cownie: "This gets really hairy in Fortran".

There was a discussion of probing multiple outstanding receives.  If
the receives return handles,

  h1 = recv( ... )
  h2 = recv( ... )
  wait ( h1 or h2 ) ?

wait ( h1 and h2 ) is not needed.  James Cownie proposed that we supply
an operation to *wait* on a vector of handles, which would return one
of those that have succeeded.  It would return the handle, not the
status.  A straw vote was taken on this proposal, which passed 17-0.
So we have:

  status (handle)
  wait (array of handles)

The send specifies what completion of send means.  Handles need to be
freed.  It was pointed out that only the existence of such an operation
has been decided; the semantics are yet unspecified - e.g.
issues such as fairness or what wait returns when several complete are
not yet specified.

There was a long discussion of cancellation of send and receive.  It
was observed that there are serious implementation problems because of
race conditions, freeing resources, etc.  A straw poll was taken on
including cancel in the initial MPI.  It failed 7-19.

This was the end of the Wednesday afternoon point-to-point meeting.

January 7
---------

The point-to-point subcommittee (now a Committee of the Whole) resumed
at 9:15 a.m. on Thursday morning.  Marc Snir opened the meeting and
summarized the progress so far:

  3 ways in which send can terminate
  sendreceive postponed
  no cancel of incomplete send operation
  status and wait (successful status accomplishes same as wait)

We did not get to:

  channels (the idea of trying to bind as soon as possible as many
    parameters as possible, so that they can be reused)
  probe
  readyreceive

Marc noted that channels and readyrecv address similar issues.
Probably we want only one of these.  Do we want either?

Rolf Hempel observed that we don't need channels - we can depend on the
operating system to cache the connection information when doing
synchronous communication.  Adam Greenberg replied: NO!  We want to be
able to do this all at user level without a "smart" OS.

Channel creation and use might look like:

  handle = send_init( ... )
  start(handle)
  wait(handle)
  free(handle)

This is an intermediate point between bundled send/receive and full
named channels.  Indeed there are many intermediate points based on
various early bindings.  Is there enough experience to justify a
standard?  Bob Knighten observed that there has been substantial
experience with channels on the iWarp system.

There was next a discussion of the ready-receiver semantics proposed by
Lusk and Gropp in the handouts.  Steve Huss-Lederman said that such
operations could make a difference of as much as 25% for matrix
multiplication on the Delta.
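The channel life cycle sketched above (send_init / start / wait / free)
can be mocked up as a toy Python class to show the early-binding idea:
the destination and tag are bound once, then reused on every start.
Everything here is an illustrative stand-in for whatever MPI eventually
adopts; a plain list plays the role of the transport.

```python
# Toy model of persistent communication handles ("channels"):
# parameters are bound once at init time, then reused on every start().
# This is an illustrative sketch, not real MPI.

class SendChannel:
    def __init__(self, network, dest, tag):
        # Early binding: destination and tag are fixed once, up front.
        self.network, self.dest, self.tag = network, dest, tag
        self.active = False

    def start(self, data):
        # Reuse the pre-bound parameters for each message.
        self.network.append((self.dest, self.tag, data))
        self.active = True

    def wait(self):
        # In this toy model the "send" completes immediately.
        self.active = False

    def free(self):
        self.network = None  # release the bound resources


network = []
ch = SendChannel(network, dest=3, tag=7)   # handle = send_init( ... )
for value in ("a", "b"):
    ch.start(value)                        # start(handle)
    ch.wait()                              # wait(handle)
ch.free()                                  # free(handle)
print(network)   # two messages, both carrying dest=3, tag=7
```

The point of the exercise is that per-message argument processing
disappears: after init, each start() touches only the payload.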
Some doubt was expressed about the universality of this optimization.
The question of use of readyrecv by naive users came up again.  Cownie
mentioned experience again.  Greenberg: facilities for efficiency
should not make it difficult to write correct programs.  Wheat: Don't
penalize users who do understand and can take advantage of efficient
procedures.  General back and forth discussion.

Two straw votes were taken:

  Ready-receiver operations passed 13-10
  Channels passed 19-2 (Marc Snir will write up a detailed proposal)

The next topic discussed was the probe operation.  Do we want such an
operation, and if so, what should be its semantics?  Probing must
"lock" the message that it finds, else the information returned by the
probe may be unreliable.  (Consider the multithreaded environment.)
Bill Gropp pointed out that probe is often used to find out the length
of a message so that a buffer of the appropriate size can be allocated.
Marc Snir pointed out that this is a problem with the November
document, that we need to know the length of a message ahead of time.
Jon Flower suggested the need for a blocking probe.  What is needed is
to probe and then to receive the message found via the probe:

  handle = probe(params)
  . . .
  recv(handle)
  release(handle)

Marc Snir pointed out that the handle serves as a lock on the message.

James Cownie pointed out that while we agreed not to have a cancel for
a send, we do need to be able to cancel receives, since an outstanding
receive is permission for the system to write in the user's address
space, which is a permission the user may want to revoke.

A straw vote was taken on the existence of some form of probe, and it
passed 25 to 0.

Send-Receive Matching
------------ --------

The next topic is the matching of send and receive.  Currently we have
to discuss matching on:

  tag
  sender
  group id
  context id

We will also need to discuss the name space issue for messages.
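The probe-then-receive sequence above can be mocked up in Python to show
why the handle must act as a lock: between the probe and the receive, no
other receive may claim the message, and the length can be inspected so
a buffer of the right size can be allocated.  All names here are
hypothetical, and this is a single-threaded toy, not real MPI.

```python
# Toy model of probe-with-lock: probe() finds a matching message and
# locks it so its length can be inspected before it is received.
# Illustrative sketch only; names are hypothetical.

class Mailbox:
    def __init__(self):
        self.messages = []          # list of (tag, payload)
        self.locked = set()         # indices locked by an outstanding probe

    def deliver(self, tag, payload):
        self.messages.append((tag, payload))

    def probe(self, tag):
        # Return a handle (an index) to a matching, unlocked message, or None.
        for i, (t, _) in enumerate(self.messages):
            if t == tag and i not in self.locked:
                self.locked.add(i)   # the handle acts as a lock
                return i
        return None

    def length(self, handle):
        # Safe to query: the locked message cannot be taken by anyone else.
        return len(self.messages[handle][1])

    def recv(self, handle):
        # Only the holder of the lock may receive this message.
        self.locked.discard(handle)
        return self.messages.pop(handle)[1]

box = Mailbox()
box.deliver(tag=5, payload=b"hello")
h = box.probe(tag=5)          # handle = probe(params)
n = box.length(h)             # e.g. to allocate a buffer of size n
data = box.recv(h)            # recv(handle)
print(n, data)
```

(The toy works for a single pending message; a real implementation would
use stable message identifiers rather than list indices.)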
Here are three proposals for the predicate that determines whether a
message matches a particular receive:

  1) simple matching on fields
  2) more general, with mask, etc.
  3) user-defined function

Adam Greenberg said that at TMC a user-defined function is used by the
system whenever a message is received by a node to decide if it is
actually to be received by the application.  The parameters to the
user-defined receive predicate are tag and group.

Issue: If most information is encoded in the tag, then the tag protocol
must be understood by all users involved in writing a particular
application.  True, but not a serious problem.  Best to identify a
small class of specific matching parameters (e.g. group) and use the
tag for everything else.

James Cownie pointed out that the matching function, if not too
complicated, can be (and is, on many systems) done by special
communications processors.  There was further discussion of the
difficulties of having the system call user code for screening
messages.  Paul Pierce pointed out that receipt of a message by the
hardware is a crucial point for performance.  There was general
discussion of alternative approaches to getting at least some of this.
The question of need for this generality was also raised.  TMC has a
user who wants and uses his own predicate function.

Possibilities:
  (a) select on mask for fields (including a don't care)
  (b) simple static logical operations on fields
  (c) user defined

(b) might be

  match = AND (( message(i) = pattern(i) ) OR mask(i))
          fields

A straw vote was taken on whether to pursue allowing user-defined
predicates.  It was decided 26-1 not to allow user-defined functions
for this purpose.  (b) was deferred until a proposal is available.

Marc Snir summarized that matching by tag is generally agreed on and
that this is not the only item for selection.  After some discussion,
matching by sender was also generally agreed on.  So now, how do we
identify a sender?
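Option (b) above, a static don't-care mask ANDed over the matching
fields, can be written out in a few lines of Python.  The field order
(tag, sender, context) is an assumption for illustration only, and the
example deliberately never sets the mask on the context field, matching
the later agreement that context admits no don't care.

```python
# Sketch of option (b): static matching on fields with a per-field
# "don't care" mask.  mask[i] == True means field i always matches.
# match = AND over fields of ((message(i) = pattern(i)) OR mask(i))

def matches(message, pattern, mask):
    return all(m == p or dont_care
               for m, p, dont_care in zip(message, pattern, mask))

msg = (42, 3, 1)                                        # (tag, sender, context)
print(matches(msg, (42, 0, 1), (False, True, False)))   # any sender: True
print(matches(msg, (42, 3, 2), (False, False, False)))  # wrong context: False
```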
Rusty Lusk spoke in favor of a flat name space, so that processes could
be addressed independently of group, etc.  There ensued a general
discussion of groups, contexts, and the name space.  It was pointed out
that the name space expected by send could be flat and groups could be
implemented by a function that converted any structured name into a
flat integer id.  Other proposals were to have name=(rank,gid) with the
restriction that this name be usable only within the given group (gid)
and the sender must be a member of this group.  By default the group
would be ALL.  Other alternatives mentioned were name=(rank,ALL)=pid
and name=(pid,context).

This led to a general discussion of context and the relation to
groups.  Marc Snir pointed out that we could have

  pid
  pid,context

in which context did not change the meaning of pid.  Paul Pierce said
that tags and contexts should be separated since they need to be
handled in different ways.  Marc Snir pointed out that there should be
no "don't care" on context.  There was a discussion of servers that can
process "any" message.  This also led into a discussion of flat name
space vs. hierarchical name space, where we would have a
pid(group, rank) function.  We can use context to define groups, but
there are other uses as well.

Why groups as well as context?  What is the difference between context
and groups?  Cownie: Context is just another integer used in the same
manner as tag.  Not quite - it is reserved, but what is the meaning of
"reserved"?  Greenberg was concerned about connecting send/receive
behavior with groups.  Snir: Suppose a user wants to have two
independently written subroutines that use the usual rank notation.
Wheat: Similarly, we want to use rank notation when partitioning the
machine.  Snir: Both contexts and groups are nice, but do we need both?
Gropp: The problem with mixing two applications both of which use
0-based indexing is that they will need a larger common name space when
they need to communicate.
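Snir's scenario of two independently written subroutines can be made
concrete with a toy Python model: both libraries use tag 0, but distinct
context values keep their traffic apart.  The context values here are
just hand-picked integers; how they would really be allocated was left
open at the meeting, and all names are illustrative.

```python
# Sketch of why a separate context field helps: two independently
# written libraries both use tag 0, but with distinct context values
# their messages cannot be matched against each other.
# Illustrative only; context allocation is hand-waved here.

inbox = []  # shared message queue: (context, tag, payload)

def send(context, tag, payload):
    inbox.append((context, tag, payload))

def recv(context, tag):
    # Matching requires equality on context; a "don't care" on context
    # is deliberately not offered.
    for i, (c, t, p) in enumerate(inbox):
        if c == context and t == tag:
            return inbox.pop(i)[2]
    return None

LIB_A, LIB_B = 100, 101   # contexts handed out by some allocator

send(LIB_A, 0, "a-data")
send(LIB_B, 0, "b-data")
# Library B receives only its own tag-0 message, despite A's being first:
print(recv(LIB_B, 0))   # "b-data"
```

Without the context field, B's receive on tag 0 would have matched A's
message, which is exactly the safety problem being discussed.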
There was a general discussion of the cost of contexts.  Cownie
observed that context is cheap if only used to distinguish code -
obtain a unique context id for the code by means of the "one-dollar
random number generator": each author obtains a one-dollar bill, copies
the serial number, and then burns the bill.  But in general context is
not cheaper than groups.

Someone asked about spawning additional processes while the program is
running.  Various people raised the question: If we use
name=(pid,context), does context change the meaning of the pid (i.e. is
pid context {or gid or ???} relative)?

There was some discussion of message registration.  Paul Pierce
observed that tag vs. context is only a matter of registration.  He
wants to divorce tag and context for safety.  This implies that one
cannot use a wild card for selecting on context.  Various people noted
difficulties with mixing tag and context.

Adam Greenberg offered:

  Proposal - always separate tag and context.  Have a context, NONE, so
  that pid with context NONE is unmodified, but with other contexts the
  pid may be relative.  [NONE, GLOBAL, BASE]
  tag, context - must match on context

Several people noted that there are two very different uses of context
- identification of distinct code and identification of a group of
processors.  There is state, even distributed state, associated with
remapping of processors with groups.

POSSIBLE FIELDS FOR SEND/RECEIVE:

  tag
  context id                        group
  - no wild card                    - set of processes
  - registration management         - receive only from group
  - managed by system

Marc Snir asked whether we could agree on what would be carried with a
message:

  tag
  context (like tag, except no wild card; management to be determined)

Two straw votes were taken:

  Having contexts passed unanimously.
  Having the context *not* modify the process id passed unanimously.

Groups
------

Three alternatives:

  no groups (use send(pid(group, rank), ...)
instead)
  group as explicit parameter in operations
  use contexts to implement groups

The basic difference is: do we want to be able to select on group?

Straw vote: yes: 10  no: 11 on the capability of selecting by group.

(Thursday lunch occurred here)

Message Data Types
------- ---- -----

WHAT IS A BUFFER?  (Language bindings are going to be important here.)
There are many options to consider:

  a) contiguous bytes (non-controversial)
     General agreement that 0-length messages should be allowed.
  b) contiguous buffers of other (implementation specific?) units?
  c) stride?  (parameters: base-address, element-length, stride,
     number-of-elements)
  d) basic data types?
  e) arbitrary structures?
  f) How will we specify data to be communicated in a heterogeneous
     environment?
  g) iovec structures (array of pointers and lengths, as in un*x iovec
     for reads and writes)

Marc Snir pointed out that one possibility is to have separate
pack/unpack routines and then just send contiguous buffers.  Rusty
Lusk pointed out that this requires a copy that may be unnecessary on
certain machines.  Two choices - pack the scattered buffer and send it,
OR send the scattered buffer.  If the second, then we may need a pack
that produces the descriptor of a scattered buffer to be used by the
send-scattered-buffer operation.

Straw poll: Use IOVEC type send.  Passed 18-1.

Basic data types were deferred.

Marc Snir observed that up to this point, a message is a set of bytes
in storage, but now we are about to consider more meanings:

  message = sequence of *values*

Should we use the same calls for homogeneous and heterogeneous
communication?  Can we have a fast homogeneous implementation of the
heterogeneous calls?  Bill Gropp pointed out that the current testbed
implementation does this.

SEND vs. SENDCHAR, SENDREAL, . . .

To be compliant with F77 we need to have at least SENDCHAR for
correctness (and this is a real issue, e.g. on VAX).  Strictly we need
to have a different call for each basic data type (but in practice this
is not an issue).
But for other than CHARACTER there is also an efficiency issue.

  1. F77 conformance
  2. Special problem of CHARACTER
  3. Performance
  4. Heterogeneity (?)

Postpone to language binding discussion.

This led into the issue of the general problem of converting types
between languages and machines!  This in turn led to a discussion of
XDR (and mention of other systems such as NDR, ...).  XDR supports the
basic types (INT, REAL, COMPLEX, CHAR, etc.), array constructors,
pack/unpack routines, etc.

Do we use the same calls for homogeneous and heterogeneous systems?
Can we have a fast implementation of heterogeneous procedures for a
homogeneous system?  What about a "message envelope" that specifies the
environmental aspects of messages (e.g. heterogeneity features such as
XDR)?  When we talk about heterogeneity, do we expect MPI libraries
from different vendors on different machines to cooperate?  Should we
include a general SEND as SENDBYTES?  It was agreed that we do not want
SEND in a homogeneous environment to require the type information
needed for a heterogeneous environment.

There was a discussion of whether we have to pick an interchange
format, for example XDR.  There seemed to be some agreement that we do
(as MPI implementations from different vendors have to be able to
communicate with one another), but no vote was taken.

Error Handling
----- --------

The main issue here is whether an error detected by an MPI routine
should result in the calling of an error handler or the return of a
return code.  Other issues are how much of error handling should be
standardized as opposed to implementation-dependent, and how much user
control there should be over error handling.

There are two types of error environments - soft (recoverable) and
hard (unrecoverable).  In a soft error environment there is the
opportunity for cleanup on the part of both the "application" and the
system, while in a hard error environment the system will clean up and
terminate the application.
Choices:

  An MPI routine always returns (though it may return with an error
  code).

  An MPI routine may call an exception handler.  There may be a default
  exception handler, and there could be a user-installable one as well.

Library writers may want to handle errors differently from how a user
program wants to handle them (or have them handled by the system).

Robert Harrison described the error modes used in TCGMSG and p4: A
process has a user-settable state that determines whether an error
should result in a (hopefully) graceful takedown of the program or in
an error return code.  Paul Pierce described the Intel method, which
uses two syntactically distinct classes of functions.  For one class an
error results in a message being printed and the process in which the
error occurred terminating.  For the other class an error code is set.
There was some discussion of the problem of maintaining state in a
multithreaded environment.

Two straw votes were taken:

  Do we want a version of MPI that calls an exception handler:
    yes: 23  no: 0
  Do we want a version with return codes:
    yes: 19  no: 1

Specific discussion of modes or "shadow" routines was deferred.

Correctness Criteria
----------- --------

This concerns defining what is a correct implementation of MPI.  An
assumption that had to be restated several times during the meeting is
that MPI assumes a reliable underlying communication system, i.e. MPI
does NOT address what happens if that fails.  Two specific topics are
order of messages and resource bounds.

There was discussion about whether order preservation is required; that
is, for messages from one process to another, messages are received in
the order they are sent.  Maintaining message ordering is troublesome,
but seems essential for conveniently writing reliable portable
programs.  But then comes the question of what exactly this means,
particularly with multithreaded processes!  What is the effect of probe
on the ordering of messages?
A straw vote was taken in favor of requiring order preservation:
  yes: 23  no: 4

On the issue of correctness with regard to resource exhaustion, Marc
Snir suggested the following example:

  Process 1      Process 2
  ---------      ---------
  send to 2      send to 1
  recv           recv

What should an implementation be required to do with this program?  On
the CM-5 this will always deadlock.  On Intel and Meiko machines this
will "usually" work (but how does one specify exactly when it will
work?).  Exchange is an even nastier case.

------------------------------------------------------------------------------
Summary of both Wednesday and Thursday point-to-point subgroup meetings
by Marc Snir

1. Multithreaded systems and signal handlers.

Should these be of concern to us?  No vote was taken, but the general
feeling was that we should try to define the various communication
calls so that they do not rule out the case where the communicating
process is multithreaded.  The implication seems to be that all calls
should be made reentrant, and the communication subsystem is, from the
viewpoint of the application code, stateless.  (With one obvious
exception, namely the state due to posted receive or send buffers, and
perhaps additional exceptions having to do with global "modes", like
the error handling mode.)

2. Small or large library?

No vote taken.  The general feeling is that we should provide many
options for the advanced programmer that wants to optimize code
(otherwise, all "interesting" codes will use non-portable features),
but set the syntax so that the user that uses only the "core" functions
need not be burdened by the availability of advanced options.

3. What functions?

Clearly, SEND and RECEIVE.  There was general sentiment that a combined
send-receive would be nice ("most used function on CM"), but discussion
was postponed until we have a proposed definition:

  Do we want an exchange (source=dest), or a 3-body function
  (source != dest), or allow for both?
  Do we want send_buffer identical to receive_buffer, or disjoint from
  receive_buffer, or allow arbitrary overlap between the two?

  What attributes are shared by the sent message and the received
  message, if at all?

WAIT, STATUS and PROBE functions, and persistent handles are discussed
later.

4. What modes?

We want blocking and nonblocking sends and receives (blocking --
returns when the operation has terminated; nonblocking -- returns as
soon as possible, and a different call is needed to terminate the
operation).  We want synchronous and asynchronous modes (synchronous --
the operation is terminated when terminated at all participating nodes;
asynchronous -- the operation is terminated when terminated at the
calling node; e.g. a send terminates asynchronously when the sender
buffer can be reused.  Please let me know if you dislike this
terminology and prefer something like "local" and "global".)

The vote went 17-2 toward having a synchronous SEND (completes when
RECEIVE has completed, i.e. when the corresponding WAIT has returned,
or STATUS has returned successfully).

We did not discuss whether we want all 4 combinations of
blocking-nonblocking and synchronous-asynchronous, or just 3 (blocking
synchronous, blocking asynchronous and nonblocking asynchronous).  We
did not discuss explicitly, but "kind of assumed", that any SEND mode
can match any RECEIVE mode.

5. How does one complete a nonblocking operation?

The SEND and RECEIVE nonblocking operations return a handle that can be
used to query for completion.  WAIT(handle) blocks until the operation
has completed; STATUS(handle) returns as soon as possible, and returns
an indication of successful completion.  In addition, these operations
return information on completed RECEIVEs: tag, message length, etc. for
the received message.  The information is returned in a structure
provided by the caller.
After the return of a WAIT or the successful return of a STATUS, the
operation handle is freed; the system has no more information on the
completed operation, and has freed all associated resources.

A more complex WAIT is needed, one that waits for the completion of one
out of several pending operations.  The proposed syntax is
WAIT(array_of_handles), which returns information on which operation
succeeded and its parameters (voted 17 to 0).

No CANCEL operation -- once a SEND or RECEIVE is posted, it must
complete.  (Voted 19 to 7.  Some people asked to reconsider at least
canceling posted RECEIVEs, even if posted SENDs must complete.)

6. Additional operations

"Ready-receive" SEND: a SEND with a promise that a matching RECEIVE is
already posted.  (A program where such a SEND occurs with no preceding
matching RECEIVE is erroneous and, hopefully, the implementation
detects this error.)  The justification is "it exists on some machines"
and "it can improve performance by 25% on the Delta".  Accepted by 13
against 10.

Persistent handles: created by SEND_INIT(params) (resp.
RECV_INIT(params)).  A handle can now be repeatedly used to
send/receive messages with these parameters, and is then explicitly
destroyed.  Supported by 19 against 2.

PROBE: allows probing for messages available to receive.  The
justification is that it "provides a mechanism to allocate memory to a
message of unknown length, before it is received".  The proposed
mechanism is PROBE(params), which returns a lock to a matching message
if there is a matching message that can be received.  This message is
now locked and can only be received using this lock.  This was voted
25 to 0.  There was some level of uncertainty whether we should also
allow unlocking without receiving (why should one want to do this?).

7. What is the buffer argument in SENDs and RECEIVEs?

A message is a sequence of values; as a particular case, which is of
most interest for homogeneous systems and for which the syntax ought to
be simpler, a message is a sequence of bytes.
There are various ways of specifying this sequence of bytes:

  a. Contiguous message: starting address and length.

  b. Regular stride message: starting address, number of blocks, length
     of blocks, stride.  Voted with no opposition.

  c. IOVEC: a list of entries, each of which describes a type a or
     type b message.  Voted 18 against 1.

There was no discussion of a concrete proposal for typed messages,
short of agreement that there should be such.  The standard is not
going to propose a concrete encoding of typed messages, nor a concrete
mechanism for message exchange in heterogeneous systems.

8. Matching of SENDs and RECEIVEs.

A SEND operation associates with a message the following attributes:

  a. Sender id
  b. Tag
  c. Context

The idea of associating a group id, too, was rejected 11 to 10.

The RECEIVE criterion is a Boolean predicate on these attributes of the
form

  (SENDER_ID = param1) and (TAG = param2) and (CONTEXT = param3).

Don't cares are allowed for sender_id and tag, but not for context.
Sender_id is determined by the system, in the obvious manner, and is
absolute (not relative to a group or a context).  Tag is under sender
control.  Context is under sender control, but a yet-to-be-determined
mechanism is used to allocate valid context values to processes so as
to prevent conflicts.  All this was approved with no opposition.

The idea of allowing the user to provide their own Boolean function as
a receive predicate was rejected 26 to 1.  (Reason: "hard to do if the
matching is done by a communication coprocessor".)

9. Error handling

  a. We need a version of MPI where errors generate exceptions (the
     user program halts when an error is detected in an MPI call, or a
     specific exception handling mechanism is invoked).  Voted 19 to 1.

  b. We need to provide a version of MPI where calls return error
     codes, and do not cause exceptions, whenever possible.  Voted
     23 to 0.

10. Ordering of messages

Messages sent from the same source to the same destination "arrive in
the order they were sent".
Voted 23 to 0. The exact implications in terms of the order in which RECEIVEs can occur have to be worked out. It was pointed out that this condition may be somewhat hard to define in a multithreaded environment.

End of Marc Snir's summary

---------------------------------------------------------------------------
Collective Communication Subcommittee
---------------------------------------------------------------------------

The Collective Communication Subcommittee was called to order by Al Geist at 4:30 p.m. on Wednesday. It continued until 6:40 p.m., when there was a break for dinner. The meeting resumed at 8:25 p.m. and finally adjourned at 10:10 p.m.

Al Geist introduced this as the first meeting, since no real discussion of groups and collective communication took place in Minneapolis. One goal of this committee is to maintain consistency with the point-to-point operations. Any discussion of groups necessarily involves this subcommittee. Collective communication operations can be constructed out of the point-to-point primitives, but are desired because they can be implemented efficiently and they are convenient for programmers.

The committee then went through the set of collective communication primitives that had been proposed by Al Geist during the email discussions.

Broadcast: info = MPI_BCAST(buf,bytes,tag,gid,root)

On return, the contents of buf for root are in buf for all processes. Al Geist pointed out that the group id here is explicit. Root has to be a member of the group. It was at this point that the committee decided that it would use the word "tag" for message type from now on, to distinguish it from "type", which will now always mean type of data. Marc Snir pointed out that, for consistency with point-to-point operations, there should be both local-termination (the operation returns when the local process has done its part) and global-termination (the operation returns when all processes have finished their participation) versions.
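As a rough illustration of the broadcast semantics just described (on return, the root's buffer contents are in every group member's buffer), here is a minimal sketch. The function name, argument order, and error behavior are only stand-ins, since the actual MPI_BCAST binding was still under discussion at this meeting.

```python
# Toy model of the proposed broadcast semantics.  "buffers" stands in for
# each process's buf argument; all names and conventions are hypothetical.

def bcast(buffers, group, root):
    if root not in group:
        raise ValueError("root must be a member of the group")
    data = buffers[root]
    for rank in group:          # on return, every member holds root's data
        buffers[rank] = data
    return buffers

bufs = {0: "payload", 1: None, 2: None, 3: "other"}
bcast(bufs, group=[0, 1, 2], root=0)   # bufs[1] and bufs[2] become "payload"
```

Note that this copies data but implies no synchronization: as Al Geist observed later in the session, broadcast is not a barrier.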
There followed a discussion of the fact that the point-to-point committee seems to be adopting many different versions of send and receive, and that total compatibility will require many different versions of broadcast.

There was a discussion of the reason for the tag parameter in the call. It is needed to disentangle multiple broadcasts occurring at approximately the same time. Paul Pierce described how the system can do this by generating sequence numbers. Others argued that the tag was useful for the programmer in any case, particularly for verifying program correctness.

Marc Snir argued that there is a problem (because of the intuition that bcast provides a barrier):

    1            2            3
    send(3)      bcast        rec(don't care)
    bcast        send(3)      bcast
                              rec(don't care)

Note that 3 may receive from 2 before 1, i.e. there is no barrier. Al Geist replied that we need a barrier, but broadcast is NOT a barrier.

James Cownie initiated a general discussion of whether broadcasts could be received by general receives. This would make it simpler to inherit some of the point-to-point semantics. Al Geist said that broadcast should be consistent with the other collective operations, all of which are symmetric.

Paul Pierce suggested we specify collective communication routines in terms of a model point-to-point implementation. This has consequences in terms of what options can be supported. Marc Snir pointed out that one can't actually specify collective communication in terms of point-to-point operations, because they need dynamically allocated additional space.

It was decided to postpone a straw vote on whether all processes participating in a broadcast should do "broadcast", or only the root should "broadcast" and the others should "receive", because of concern about remaining issues, e.g. different varieties of receives. The discussion of "error code" was deferred until the issue is settled in the Point-to-point Communication Subcommittee.
MPI_GATHER: (see mail archives for details)

It was proposed to have a version in which each participant contributes a different amount of information (a general "concatenate" function). Issues raised: How to handle the situation where the number of bytes on each processor is different? How to specify the type of data? For example, one needs to know the size of the data type for various purposes, e.g. when doing recursive bisection.

MPI_GLOBAL_OP: (see archives for definition)

This does not include the data types. There was a discussion of how the forwarding processors know where to break buffers if the data type is not specified. Paul Pierce suggested that we should separate the case of user-defined combining operations from the system ones, which could be optimized. Robert Harrison suggested that the buffer be specified as (#items, length), at least for the user-defined operations. (Tag would be retained.) Someone noted that "bytes" would be different on each processor in the heterogeneous case.

Back to GATHER. Many agreed that the interface should be changed, but no proposal was offered. Straw vote on having a separate general concatenation, to go along with the gather operation: yes: 18, no: 0.

MPI_SYNCH

There was general agreement that "BARRIER" would be a better name. James Cownie suggested that a tag argument would be helpful for debugging. There was also some discussion of failure of such a barrier, e.g. because some node fails. It was agreed that this was not a problem peculiar to this particular function. One individual nonetheless argued strongly for some kind of timeout for the barrier.

Groups
------

gid = MPI_MKGRP (list of processes)

There was much discussion of the format of the process list. As defined, MKGRP defines a group as a subset of a pre-existing group. One alternative would be to allow creating a group consisting of processes from a number of other groups. (NB Identification of processes is unspecified.
This is a task for the Point-to-point Communication Subcommittee.)

MKGRP provides an implicit barrier among the processes joining the group. There are a number of problems with making sure that the gid is uniform and known across the system. This is an efficiency issue.

Should it be possible to SEND to a (gid, rank) pair? Marc argued that one should do point-to-point communication only within a group, not between groups. Note that groups are constant: one cannot add or delete members from a group. Also, group creation is a barrier for the processes that are part of the group. This raises the question of how the processes joining the group know that they are joining. What is the utility of groups? Certainly at present the only commonly used group is ALL.

MPI_FREEGRP(gid)
MPI_GRPSIZE
MPI_MYRANK

There was a general discussion of how group id's would be generated. There was also a discussion of the mapping information: how to map back from my_rank and gid to rank in ALL? (In order to actually do a SEND.)

----- At this point the group broke for dinner -----

The continuation after dinner was an informal general discussion. There were some general questions about experience, from Al Geist to Paul Pierce. Adam Greenberg expressed interest in discussing channels. Channels are seen as an early binding (currying) of various of the SEND/RECV functions, which offers a number of gains in efficiency.

There was a discussion of Fortran language bindings (F77, F90, HPF) of MPI. It was agreed by those knowledgeable in the area that there are no special issues in regard to HPF.

Steve Wheat discussed the Sandia implementation of channels on the Ncube. It sounds very similar to iWarp channels, except that they are dynamic in creation.

Jim Cownie noted that global ops are going to result in non-determinism in numeric routines. Jim also elaborated on Meiko's BAD experience with the ready_receive function -- lots of user problems. Commonly users try it on small problems, and it works and speeds up.
But then on large problems things erratically break, and the user complains bitterly. Paul Pierce noted that this is essentially Intel's force type, and the Intel experience has not been so bad. In particular, it is harder to use and does not generally work easily on small problems.

Cownie: In general, what to do when a ready_receive fails? There is no reasonable way to raise an error. Response: Use a signal. Cownie: GAACK! That is implementation-specific and not viable on all systems.

John Kapenga listed six collective communication issues that he considers particularly important. [Missed the list]

Other desirable collective communication features that were mentioned: global exchange; all-to-all communication. What are the criteria for inclusion? Proposal: difficulty of implementation; frequency of use; efficiency gain.

John Kapenga asked about 2-D and 3-D mesh operations, e.g. shifts. Adam Greenberg said this should be left to compilers. John: No way! Adam argued that the compiler can recognize the opportunity to avoid memory copies; unless that same facility is available to the user, the compiler can do much better.

The group adjourned at 10:10 p.m.

---------------------------------------------------------------------------
Topologies Subcommittee
---------------------------------------------------------------------------

The Topologies Subcommittee was called to order by Rolf Hempel at 4:00 on Wednesday. It lasted until dinner.

---------------------------------------------------------------------------
Other Subcommittees
---------------------------------------------------------------------------

The other subcommittees (Introduction, Formal Semantics, Environmental Enquiry, Language Binding) met informally after dinner on Wednesday.
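The group machinery discussed before dinner in the collective-communication session -- an ordered, constant subset of ALL, with MPI_GRPSIZE/MPI_MYRANK-style inquiry and some way to map a rank back to ALL in order to do a SEND -- can be illustrated with a toy model. The class and method names below are invented for illustration only; none of the actual calls or their semantics had been settled.

```python
# Toy model of the group ideas from the collective-communication session.
# A group is an ordered, constant subset of the ALL group.  All names here
# are hypothetical stand-ins, not proposed MPI bindings.

class Group:
    def __init__(self, members):            # in the spirit of MPI_MKGRP
        self._members = tuple(members)      # constant: no add/delete later

    def size(self):                         # in the spirit of MPI_GRPSIZE
        return len(self._members)

    def rank_of(self, process):             # MPI_MYRANK-style inquiry
        return self._members.index(process)

    def to_all(self, rank):                 # map (gid, rank) back to ALL,
        return self._members[rank]          # e.g. to actually do a SEND

g = Group([4, 9, 2])   # an ordered subset of ALL; process 9 has rank 1
```

The `to_all` translation is exactly the mapping-information question raised in the discussion: a SEND ultimately needs a process id in ALL, not a group-relative rank.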
---------------------------------------------------------------------------
Meeting of the Whole Committee
---------------------------------------------------------------------------

Thursday, January 7, 4:30

The agenda for the rest of the meeting was presented:

 Introduction subgroup report
 Collective-communications subgroup report
 Process Topology subgroup report
 Environmental Inquiry subgroup report
 Formal Language subgroup report
 Language Binding subgroup report
 Profiling (Jim Cownie)
 Dates for future meetings

Report of the Introduction Subcommittee:
------ -- --- ------------ ------------

Jack Dongarra presented the results of the subcommittee meeting that took place Wednesday night. This is essentially the draft that has been available from netlib for the last six weeks. There was some on-the-fly editing by the group at large.

The goal of the Message Passing Interface, simply stated, is to develop a *de facto* standard for writing message-passing programs. As such, the interface should establish a practical, portable, efficient, and flexible standard for message passing.

Goals
-----

Design an application programming interface (not necessarily for compilers or a system implementation library).

Allow efficient communication: avoid memory-to-memory copying, allow overlap of computation and communication, and offload to a communication coprocessor, where available.

Allow (but not mandate) extensions for use in heterogeneous environments.

Allow convenient C, Fortran 77, Fortran 90, and C++ bindings for the interface.

Provide a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.

Define an interface that is not too different from current practice, such as PVM, Express, P4, etc.

Define an interface that can be quickly implemented on many vendors' platforms, with no significant changes in the underlying communication and system software.
The interface should not contain more functions than are really necessary. (Based on the latest count of send/receive variants, this drew a large laugh from the crowd.)

Focus on a proposal that can be agreed upon in 6 months.

Added: The semantics of MPI should be programming-language independent.

Who Should Use This Standard?
--- ------ --- ---- ---------

This standard is intended for use by all those who want to write portable message-passing programs in Fortran 77 and/or C. This includes individual application programmers, developers of software designed to run on parallel machines, and creators of higher-level programming languages, environments, and tools. In order to be attractive to this wide audience, the standard must provide a simple, easy-to-use interface for the basic user while not semantically precluding the high-performance message-passing operations available on advanced machines.

What Platforms Are Targets For Implementation?
---- --------- --- ------- --- ---------------

The attractiveness of the message-passing paradigm at least partially stems from its wide portability. Programs expressed this way can run on distributed-memory multiprocessors, networks of workstations, and combinations of all of these. In addition, shared-memory implementations are possible. The paradigm will not be made obsolete by architectures combining the shared- and distributed-memory views, or by increases in network speeds. It thus should be both possible and useful to implement this standard on a great variety of machines, including "machines" consisting of collections of other machines, parallel or not, connected by a communication network.

It was agreed that explicit remarks that MPI is intended to be usable with multithreaded processes and with MIMD (not just SPMD) programs should be added somewhere.

What Is Included In The Standard?
---- -- -------- -- --- ---------

The standard includes:

 Point-to-point communication in a variety of modes, including modes that allow fast communication and heterogeneous communication
 Collective operations
 Process groups
 Communication contexts
 A simple way to create processes for the SPMD model
 Bindings for both Fortran and C

In addition, a model implementation and a formal specification will be provided. It was proposed that explanation and rationale for the standard would also be provided, as would sample programs and a validation suite. This is getting very ambitious.

Jim Cownie also wants wrappers available for use by, for example, profiling. The suggestion is to provide a "name shift", e.g. __MPI_SEND, etc., so that the profiler can have MPI_SEND call __MPI_SEND after doing whatever is useful for profiling.

What Is Not Included In The Standard?
---- -- --- -------- -- --- ---------

The standard does not specify:

 Explicit shared-memory operations
 Operations that require more operating system support than is currently standard; for example, interrupt-driven receives, remote execution, or active messages
 Program construction tools
 Debugging facilities
 Tracing facilities

Features that are not included can always be offered as extensions by specific implementations.

Report of the Collective Communication Subcommittee:
------ -- --- ---------- ------------- ------------

Al Geist summarized the meeting that took place Wednesday afternoon (described above). Global functions beyond those discussed by the subcommittee, such as all2all or total_exchange, await written proposals. The (whole) committee added that Fortran 90 and HPF would be a good place to look for more combining functions (other than max, min, sum, etc.). It was agreed that a way to supply user-defined functions would be useful.

Issues mentioned include: What is a group? How are groups formed? Are group elements addressable, and if so, how? Are groups ordered (e.g. for prefix/suffix operations)?
Is a group always an ordered subset of the ALL group? Partitioning? Connection with virtual topologies? This will be discussed when the topology group reports.

Friday, January 8
------ ------- -

Jack Dongarra called the meeting to order at 9:00.

Report of the Process Topologies Subcommittee:
------ -- --- ------- ---------- ------------

Rolf Hempel reported on the meeting held Wednesday afternoon.

Motivation:

 Applications have structures of processes
 Most natural way to address processes
 Processor topology is valuable to the user
 Creation of subgroups is a natural way to implement topologies

A draft proposal for MPI functions in support of process topologies (by Rolf Hempel) is in the handout bundle. The subcommittee made some changes to the draft.

What functions should MPI contain?

 specification of logical process structure
 lookup functions for process id's
 clean interface to other parts of MPI (process groups)

What should it not contain?

 any reference to particular hardware architectures
 algorithms for mapping of processes to processors

If it does this, the user program will be portable, but will contain full information for process mapping at the logical level. Claim: the use of process topologies is not an obstruction to quick implementation of MPI, since a first implementation can make random assignments.

A process topology is assigned to a process group. Copying groups can be used to overlay different topologies on the same processes. All processes in a group call the topology definition function. Inquiry functions provide the translation of logical process location to process id.

Supported topologies:

General graph structure: for each process, define its complete set of neighbors. In principle this is sufficient, as it covers all topologies. But it is not scalable, since all processes must have knowledge of all others; we should investigate a scalable version.
However, important special cases should be treated explicitly, because:

 regular structures can be specified in a scalable way
 it is easier to implement the mapping
 they cover a large number of applications

A special case: Cartesian structures

 grids/tori
 hypercube is a special case

Support for the creation of subgroups for regular structures will be useful. Special treatment for trees? Deferred. User-defined topology definition functions? Deferred.

It will be necessary for the inquiry functions to provide information on the hardware topology, so that a user can provide his own mapping function.

Marc Snir: We need to consider consistency of mapping alignments, for example an octree for image processing with a grid structure.

Al Geist: What is the connection between group and topology? Recall that a group is a linear *ordered* array, which is a kind of topology.

There was a general discussion of copying topologies and groups. The proposal is to have at most one topology per group, so that the group id can be used as a name for the topology. This is the reason there must be a group copy.

David Walker: We need closer coordination between the collective communication subcommittee and the topology subcommittee, since groups are central to both.

Report of the Environmental Enquiry Subcommittee:
------ -- --- ------------- ------- ------------

Bill Gropp reported that the Environmental Enquiry subcommittee needs to wait and get a better picture of what MPI will contain. Jon Flower again asked for cpu_time. This was discussed, and we were reminded that such functions were more-or-less rejected at the Minneapolis meeting as not being part of MPI; standardization should come from POSIX.

Marc Snir: Part of the subcommittee's job should be to decide *what* can be enquired about, as well as how it will be done. There was general discussion about inquiring about both MPI parameters and implementation parameters, and also about whether parameter *setting* as well as enquiry should be supported (buffer pool sizes, for example).
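The Cartesian structures favored in the topology report lend themselves to the kind of inquiry functions mentioned there, translating a logical grid position to a process rank and back. The following sketch uses invented function names and a simple row-major convention, purely for illustration; the subcommittee's draft functions were still being revised.

```python
# Sketch of rank <-> coordinate translation for a Cartesian process grid.
# Function names and the row-major convention are illustrative assumptions,
# not part of the draft proposal.

def coords_to_rank(coords, dims):
    """Row-major mapping of grid coordinates to a linear process rank."""
    rank = 0
    for c, d in zip(coords, dims):
        assert 0 <= c < d, "coordinate out of range"
        rank = rank * d + c
    return rank

def rank_to_coords(rank, dims):
    """Inverse mapping: recover grid coordinates from a rank."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return tuple(reversed(coords))

# a 3 x 4 logical grid of 12 processes
assert coords_to_rank((2, 1), (3, 4)) == 9
assert rank_to_coords(9, (3, 4)) == (2, 1)
```

A first implementation could place ranks on processors arbitrarily, as the report notes; the point is that the user program only ever deals with the logical coordinates.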
Jon Flower also asked about system hints. He suggested it should be possible to tell the system about implementation-specific tuning in a system-independent way.

Report of the Formal Specification Subcommittee:
------ -- --- ------ ------------- ------------

Rusty Lusk reported that the committee was without its chairman, Steven Zenith, but that it viewed its mission as trying to formalize what the other subcommittees decide on. It will probably use CSP, for lack of experience with any other formal specification language. Bob Knighten suggested that the subcommittee look into the LIS (Language Independent Specification) that POSIX defined in order to separate semantics from language bindings.

Report on MPI -1 (minus one)
------ -- ------ -----------

James Cownie presented an MPI anti-specification. Ya hadda be there, but in case you weren't, or just want to be reminded, here is a transcription of Jim's slides.

MPI -1 (Jim Cownie)
In the spirit of LPF (Low Performance Fortran)

* Bindings ONLY for Mathematica, Occam, ML
* No function takes arguments or returns a result
* Point to Pointless communication
* 1024 different sends, NO receives
* Full support for 0-dimensional topologies
* User data in a message limited to 1 byte (of 6 data types), BUT 1 KByte of TAG, CONTEXT
* Informal semantics - Formal Syntax
* All groups are contexts
* All contexts are groups
* Non-blocking wait
* Non-blocking barrier
* All user programs are unsafe & erroneous; they therefore do all their work in the exception handler.

---------------------------------------------------------------------------

A Profile/Instrumentation subgroup was formed with Jim Cownie as chairman.

Steve Otto, as general editor, will contact subgroup chairmen to begin discussion of editing concerns.

Discussion of meeting format. The following was proposed as a format for subsequent meetings, based on the experience with this meeting:

 Wed. afternoon: point-to-point
 Wed. night: all subcommittees other than pt-to-pt and collective comm.
 Thurs. morning: collective communication
 Thurs. afternoon: subcommittee reports
 Fri. afternoon: subcommittee reports

Meeting Dates: It was decided to move the next two meetings up a week from when they were tentatively scheduled. The next meeting will be Feb 17-19; the one after that will be Mar 31-Apr 2. The currently scheduled May 19-21 and June 30-July 2 meetings may also be moved up. Note that July 2 will be a holiday in the United States.