Minutes of the Message Passing Interface Standard Meeting
Dallas, Texas, February 17-19, 1993

The MPI Standards Committee met in Dallas on February 17-19, 1993, at the Bristol Suites Hotel in North Dallas. This was the fourth meeting of the MPI committee and the second of the now regular meetings in Dallas. There were both general meetings of the committee as a whole and meetings of several of the subcommittees. Because interest in point-to-point communications and collective communications was so general, these subcommittees met as committees of the whole. No formal decisions were taken at this meeting, but a number of straw votes were taken in the subcommittees. These are included as part of the reports on the work of the subcommittees.

These minutes were taken by Rusty Lusk (lusk@mcs.anl.gov) and Bob Knighten (knighten@ssd.intel.com). These minutes are quite long. If you want to see the important topics, you can search for --- and this will quickly lead you to each topic (and a few other things).

Wednesday, February 17
--------- -----------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

Jack Dongarra called the meeting to order at 1:30. There was a discussion of the agenda. Other topics included the possibility of some DARPA funding and a tutorial for Supercomputing '93. The next meeting will be March 31-April 2 at the same place (the Bristol Suites in Dallas). The following meetings are tentatively scheduled for May 12-14 and June 23-25. Bob Knighten proposed that we set a definite schedule, particularly if we are going to follow the example of the HPF committee. This was discussed more fully on Friday. (Search for "Schedule" below.)

Attendees:
---------

Joe Baron                IBM Austin                 jbaron@vnet.ibm.com
Harry Scott Berryman     Yale Univ.                 berryman@cs.yale.edu
Lyndon Clarke            EPCC, U. Edinburgh         lyndon@epcc.ed.ac.uk
James Cownie             Meiko                      jim@meiko.co.uk
Jack Dongarra            UT/ORNL                    dongarra@cs.utk.edu
Vince Fernando           NAG & UC Berkeley          fernando@jaguar.berkeley.com
Jon Flower               ParaSoft                   jwf@parasoft.com
Daniel Frye              IBM-Kingston               danielf@kgnvma.vnet.ibm.com
Al Geist                 ORNL                       gst@ornl.gov
Adam Greenberg           TMC                        moose@think.com
Bill Gropp               ANL                        gropp@mcs.anl.gov
Leslie Hart              NOAA/FSL                   hart@fsl.noaa.gov
Tom Haupt                Syracuse U.                haupt@npac.syr.edu
Don Heller               Shell Development          heller@shell.com
Rolf Hempel              GMD                        hempel@gmd.de
Tom Henderson            NOAA/FSL                   hender@fsl.noaa.gov
Steven Huss-Lederman     SRC                        lederman@super.org
John Kapenga             Western Michigan U.        john@cs.wmich.edu
Bob Knighten             Intel SSD                  knighten@ssd.intel.com
Rik Littlefield          PNL                        rj_littlefield@pnl.gov
Rusty Lusk               ANL                        lusk@mcs.anl.gov
Peter Madams             nCube                      pmadams@ncube.com
Alan Mainwaring          TMC                        amm@think.com
Oliver McBryan           U. Colorado                mcbryan@cs.colorado.edu
Barney Maccabe           Sandia                     abmacca@cs.sandia.gov
Dan Nessett              LLNL                       nessett@llnl.gov
Steve Otto               Oregon Graduate Institute  otto@cse.ogi.edu
Peter Pacheco            U. of San Francisco        peter@sun.math.usfca.edu
Howard Palmer            nCube                      hep@ncube.com
Paul Pierce              Intel                      prp@ssd.intel.com
Sanjay Ranka             Syracuse U.                ranka@top.cis.syr.edu
Peter Rigsbee            Cray Research              par@cray.com
Mark Sears               Sandia                     mpsears@cs.sandia.gov
Anthony Skjellum         Mississippi State U.       tony@cs.msstate.edu
Marc Snir                IBM, T.J. Watson           snir@watson.ibm.com
Alan Sussman             U. of Maryland             als@cs.umd.edu
Bob Tomlinson            LANL                       bob@lanl.gov
Dennis Weeks             Convex                     weeks@convex.com
Stephen Wheat            Sandia NL                  srwheat@cs.sandia.gov
Stephen Ericsson Zenith  Kuck & Associates          zenith@kai.com

The group then became a committee of the whole to meet as the Point-to-Point Communications Subcommittee.
-------------------------------------------------------------------------------
Point-to-Point Subcommittee
-------------------------------------------------------------------------------

Marc Snir opened the point-to-point subcommittee meeting and asked for discussion of his draft ("Point-to-Point Communication" by Marc Snir, Feb 8, 1993; this is also included in the overall MPI draft dated February 16, 1993). He asked about additions to his draft. Cancel was mentioned and was discussed later.

Alignment of "sequence of bytes" buffers:
--------- -- --------- -- ------ -------

Marc began discussion of the draft by asking about alignment. There followed a discussion of whether messages of type "sequence of bytes" should be restricted to be of length a multiple of 4 or 8, or should be aligned. Jim Cownie proposed that we also vote on requiring that all data types start on their natural boundaries. We decided that this was too restrictive, given that some Fortran compilers do not deliver this.

Straw vote: A string of bytes can start on any byte address and can be of any
----------  length (including 0).
            Yes: 34  No: 0

After a question from Bob Knighten, it was agreed that bytes have 8 bits. (Bob Knighten reminded us that few other standards require that.)

Named Constants for Options:
----- --------- --- -------

Discussion of whether we should use named constants or specific values for various options. Fortran 77 does not specify an "include" facility.

Straw vote: Use named constants?
----------  Yes: 27  No: 2

Scott Berryman spoke in favor of using fixed constants, because of existing Fortran practice. Joe Baron also spoke in favor. This was deferred to the language binding committee.

Structure of Buffer:
--------- -- ------

Note that the message is a sequence of bytes (at this point). There is no requirement that the structures on the send and receive sides match. (One can move a contiguous area into a scattered area, etc.)
Paul Pierce raised the issue that we need to be clear about whether we want to use *real* iovecs or not. General discussion of how general the datavec should be. Tony Skjellum spoke in favor of generality (he mentioned the BLAS, which don't have the most general striding, so people invent their own). Jim Cownie suggested that we include the data type in the descriptor vectors (that is, there would always be a data type, which might be "byte"). Bill Gropp suggested that the data descriptor vector be an opaque data type, both to help the Fortran binding, to allow taking advantage of real iovecs where they apply, and to allow extensibility. Joe Baron mentioned the favorite Fortran way of specifying these vectors, with a[b[i]]. This was postponed to the language binding committee. (Note: a concrete proposal by Gropp and Lusk on how data description vectors might be handled was made later, and is described below.)

Receive Criteria:
------- --------

(See proposal.) Selection is by tag, by source, and by context. Tony Skjellum proposed that an AND mask be available to deal with ranges of bits in the types and sources. Scott Berryman objected to "bit-twiddling" in standards.

Truncation of buffer:
---------- -- ------

Rolf Hempel spoke in favor of allowing the buffer to be longer than the message received, although some thought this should be an error. There was general agreement that "too short messages" should not be erroneous. Al Geist said that truncation (messages too long) should be an error. Paul Pierce said that experience was on this side. Jim Cownie said that we should be clear about whether the first part of a truncated message appears in the buffer. Marc Snir observed that standards seldom specify the behavior of an erroneous program.

Straw vote: Matched messages must fit in the buffer; otherwise it is an error.
----------  Yes: 26  No: 0

Send and Receive:
---- --- -------

Marc Snir noted that all the point-to-point operations can be defined in terms of the four low-level operations: INIT, START, COMPLETE, FREE.
See Section 1.5 (Communication Handles) of Snir's document. Discussion of how restrictive the init routines should be (a difference between the Snir and Gropp-Lusk proposals). Much discussion of the efficiency of handle creation and modification, whether the handle would be in system space or user space, and whether there should be default values. Some discussion of whether we should have this Level 1 (of the Gropp-Lusk multilevel proposal) at all. (General agreement that we should.)

Handles: Do we have handles? Do handles have default values? Are
-------  handles malleable?

Adam Greenberg argued that modifiable handles make channels much harder. They are hard to start with. [NOTE: If so, then there must be a single "create handle with attributes" operation.]

Straw vote: Unmodifiable handles?
----------  Yes: 1  No: the rest

Bill Gropp suggested that handles could be created, then repeatedly modified, and finally "committed to", after which time they cannot be modified without recommitment before use. This allows creation to proceed by modifying defaults, and some sort of compilation to take place on the commit operation. Jim Cownie proposed a "dup" function for handles.

Straw vote: This section should be rewritten to allow modification, followed by
----------  a commit operation. (One can even modify and reuse repeatedly.)
            Yes: 28  No: 0

What about defaults? One argument against defaults is that one wants to be able to catch unset fields. Gropp and Lusk went with defaults because extensibility and required setting of handle attributes conflict, since new attributes break old programs. They also did not want it to be possible to create handles that cannot be used.

Straw vote: Have defaults?
----------  Yes: 20  No: 5

Parameters of a handle: buffer, start, mode.

Some more discussion of ready-receive semantics. Marc says he added ready-send semantics more for symmetry, but also to allow for the "pull" model of communication, to go along with the currently prevalent "push" model.
General discussion followed about whether it provides a way to write erroneous programs. It was pointed out that the call always works; on machines which don't support a special protocol it merely provides no performance improvement. But on some machines the ready-receiver semantics of send provides a performance win. Ready-receiver (send a message with foreknowledge that the receiver has posted the receive) passed narrowly last time, so another vote was taken.

Straw vote: Have ready-receive?
----------  Yes: 20  No: 12

When does an operation complete? (1.5.4 of Snir proposal)
---- ---- -- --------- --------

Adam Greenberg began by asking whether we need synchronous mode (send does not complete until the receive has completed at the other end) if we are assuming reliable communication. General discussion of whether we need synchronous mode. Lusk retracted his previous arguments in favor of synchronous mode. Greenberg asked why. The only argument in favor is that it can use no buffering. Jim Cownie pointed out that it is a way of forcing the effect of "no system buffering". Paul Pierce argued that there should be a global method of specifying that no system buffering is to be used.

Straw vote: Have synchronous mode?
----------  Yes: 10  No: 15

As a consequence, section 1.5.4 goes away.

Extracting information from handle on completion: (Section 1.5.6)
---------- ----------- ---- ------ -- ----------

Separate COMPLETE-RECEIVE and COMPLETE-SEND? Al Geist: separate functions improve clarity. Jim Cownie: a single function allows waiting on a set of completions of mixed sends and receives. Marc Snir: the completes have different parameters. Paul Pierce: we could have a query between complete and free, so that complete could have the same parameters. Jim Cownie spoke in favor of this: it is similar to modifying the attributes of a handle and then committing it. But he proposed that the user pass to the complete routine an area (in user space) where all the parameters are stashed.
Marc said that the parameters then become different again, since these structures are not the same. Jim: it still could be. Rehash of commit and handles in general. Should the commit return a system handle?

Straw vote: Have a separate query function?
----------  Yes: 21  No: 3

Straw vote: Have a single complete function?
----------  Yes: 23  No: 0

Rik Littlefield noted that the phrase at the top of page 12, "Note that it is correct, but inefficient, to implement MPI_CHECK via a call to MPI_COMPLETE, in which case, MPI_CHECK always returns true," is wrong. This will change to remove the comment at the top of p. 12 that says MPI_CHECK might block.

Higher-Level Operations: (Section 1.6)
------------ ----------

Now that synchronous operations have been discarded, there are only 6 operations for each of send and receive, as opposed to 12. Jim Cownie objected to in/out arguments. Discussion: Jim Cownie reminded us of "solution number 5" (for more information see the minutes of the last meeting):

  1: handle + inquiry.
  2: pass in arguments where unwanted information is
  5: pass in a structure where things are stashed (in user space);
     out arguments are replaced by one pointer to an opaque structure.
     This is also thread-safe.

Three proposals: in/out arguments, multiple parameters, opaque structures.

Straw vote: Do not use in/out arguments?
----------  Yes: 32  No: 2

Straw vote: Package the arguments in a structure rather than using a list of
----------  arguments?
            Yes: 24  No: 8

The Point-to-Point Communication Subcommittee meeting ended at this point. It continued on Thursday.

------------------------------------------------------------------------------
General Meeting
------------------------------------------------------------------------------

There was a brief general meeting with Marc Snir presiding. Short discussion of the next meeting dates: the next meeting is March 31 - April 2 as scheduled. Six weeks later is May 12-14; this was approved, and June 23-25 was also approved.
The subcommittees meeting Wednesday night were: Collective, Topology, Context, Introduction, Environment, and Profiling.

______________________________________________________________________
Some subcommittee meetings took place Wednesday evening. Reports on those meetings are part of the General Meeting minutes for Friday.
______________________________________________________________________

Thursday, February 18:
-------- -------- --

Marc reminded us that we need to move quickly toward readings and approval.

---------------------------------------------------------------------------
Collective Communication Subcommittee
---------------------------------------------------------------------------

Al Geist opened the collective communication discussion at 9:15. He reviewed where we were at the end of the last meeting and urged people to send in written proposals.

Broadcast:
---------

Discussion of the syntax of broadcast: do both the sender and receivers of a broadcast call mpi_bcast, or do the receivers call mpi_receive? If the receive must handle broadcasts, it puts an extra burden on it. Suggestion that there are applications, such as discrete-event simulation, where it would be convenient if broadcasts were received by normal receives. Marc Snir: Do we have the same data types as for point-to-point messages? Al Geist: Yes. Discussion of whether the broadcast should be synchronous or not.

Straw vote: Broadcast not required to be synchronous?
----------  Yes: 11  No: 5  (There were 37 people in the room.)

Straw vote: Have a non-blocking broadcast?
----------  Yes: 21  No: 7

A non-blocking broadcast returns as soon as possible, but buffers are invalid until the operation is complete (as verified by some inquiry routine). Buffer options will be like point-to-point.

Straw vote: Broadcast is received by broadcast, not receive.
----------  Yes: unanimous

Barrier:
-------

Discussion of a tag on barrier. Jim Cownie suggested a non-blocking barrier, so that one could initiate a test for whether processes have reached a certain point, and test later. Non-blocking barrier: each process entering the barrier posts that fact. There is an inquiry function to check that everyone in the group has entered the barrier. Non-blocking barriers would then *require* tags, since one could be participating in multiple barriers. Scott Berryman said that he has had to implement this. Adam Greenberg noted that the CM-5 implements this in hardware (with limitations); users would like it with fewer limitations.

Straw vote: Should we have a non-blocking barrier?
----------  Yes:   No: 1

Straw vote: Should we have a tag for both blocking and non-blocking barriers?
----------  Yes: 31  No: 1

gather/concat:
-------------

There is a need for a gather-then-broadcast. Discussion of whether there should be in and out buffers on the gather. Observation that buf is an IN/OUT parameter and for the root is actually used both IN and OUT. Proposal to separate the IN and OUT buffers. General discussion: how much buffer space is needed, and where.

GENERAL_GATHER(in_buf, out_buf, bytes, out_bytes, tag, context, gid)
    out_bytes = 0 indicating if the result is to be delivered in out_buf

Dennis Weeks: Note that bytes must be the same in every call. If non-blocking, one call may return without any error indication even though the call is erroneous. Weeks: Proposal to specify out_bytes, with it being 0 if the result buffer fills in out_buf. In any case we should have in_buf and out_buf, because the root is using buf for two very different purposes. Greenberg and Flower: Agree, but for different reasons.

Alternative proposal:

GENERAL_GATHER(in_buf, out_buf, bytes, flag, tag, context, gid)
    flag indicates if the result is to be delivered in out_buf

Oliver McBryan: Question of whether gen_gather should replace concat.
Straw vote: Use two buffers (in_buf, out_buf), not a single in/out buf, on
----------  gather?
            Yes: 33  No: 0

There is a problem of when it is known where the flag is set. A straw vote was proposed, but Adam Greenberg argued that we need further discussion before voting, and the vote was postponed.

Straw vote: Have a gather function to a single "root node"?
----------  Yes: 32  No: 0

Straw vote: Have an all-to-all gather?
----------  Yes: 26  No: 0

Straw vote: Should we have any further discussion of including flag?
----------  Yes: 9  No: 15
[So absent a new proposal, flag is out.]

Straw vote: Have non-blocking versions of gather and all-to-all?
----------  Yes: 18  No: 5

Straw vote: Have a gen_gather and all-to-all, with different buffer sizes on
----------  each process?
            Yes: 29  No: 0

Creating and Freeing Groups:
-------- --- ------- ------

Al Geist opened discussion on creating and freeing groups. He pointed out that this is still tangled up with process ids and contexts.

mkgroup (list of processes, specified somehow) returns a group id; called by all processes.

Cownie and McBryan proposed that all processes in a group call this, each passing a flag to say whether it wants to join the new group or not. Rolf said that we don't want to have to have topologies in order to have groups. Marc pointed out that the "flag" version subsumes the list version, given that everyone calls it, and pointed out that some sort of global synchronization is desirable in order to have system-global gids. Discussion of whether gids are globally known (i.e., known to all processes in a group) and valid or not. gids could be valid:

  only in one process
  only among members of a group
  system-wide

Discussion of only making subgroups vs. creating unions of groups. Discussion of whether we want to deal with dynamic process creation or not. In a dynamic situation, there is no common ancestor for group creation. Tony suggested that people are going to want dynamic groups. MPI must be competitive with PVM.
Paul Pierce called for a concrete proposal, since dynamic processes go considerably beyond what we have seriously considered so far. Oliver McBryan proposed a special join operation for forming unions of groups. Marc proposed that we vote at least on an operation for partitioning an existing group by key.

Straw vote: Should MPI have an operation to partition an existing group?
----------  Yes: 32  No: 0

Should we throw out the list version?

Straw vote: Should MPI only provide partitions of an existing group?
----------  Yes: 10  No: 17
[The alternative is to keep the list form of mkgroup.]

This was the end of the Collective Communications Subcommittee meeting.

-------------------------------------------------------------------------------
Point-to-Point Subcommittee
-------------------------------------------------------------------------------

Marc Snir called the Point-to-Point Subcommittee to order at 1:30 p.m. He started with a review of the previous day and reminded us that we were now discussing Level II rather than Level I. (Page numbers from here on in the minutes refer to the "Draft Document for a Standard Message-Passing Interface" of February 16, 1993 [prepared and distributed by Steve Otto], whereas previous page numbers referred to the draft "Point-to-Point Communication" by Marc Snir, Feb 8, 1993.)

Discussion of how to get information about the completion of calls: opaque structures are used for both blocking and non-blocking operations. Discussion of the requirement that if there is a posted send on one process and a matching posted receive on another node, then the operations will eventually complete. Jim Cownie proposed that there be only one type of handle, and then only one kind of wait, wait(handle), together with query functions. Discussion of having a uniform wait routine. We don't have concrete proposals for what the query functions would look like:

  wait(handle, opaque_return)
  query(opaque_return, ...)

or we could have layered higher-level specialized query functions.
Tony Skjellum pointed out that you will first have to query the opaque_return to find out what the type of the handle is, in order to determine which query function to use on it. (Applicable to wait_any.)

Straw vote: wait_send, wait_recv with different parameters?
----------  Yes: 2  No: 29
(The preferred alternative is to have a uniform wait, handles, opaque_return, and query function(s).)

Steven Zenith asked that the word "alternation" be replaced by "choice"; this was accepted without further discussion.

Waiting on Set of Events (wait_any): (pages 20-21)
------- -- --- -- ------ --------

The two functions discussed are wait_any and wait_all. wait_any completes a single operation, and the handle is freed. There was a discussion of which handle is selected and the issue of fairness. It was observed that it would be the responsibility of the programmer to always pass a list of valid handles. Another possibility would be to have wait_any modify the list of handles on return (it would change the matched handle to a magic "null" handle that would match nothing but always be accepted). What would happen if only these null values were passed? Rik Littlefield suggested that the in/out argument problem doesn't apply here, so it would be better to have the handle list modified: the specific freed handle is replaced by NULL, say.

Straw vote: wait_any(list, index, opaque_return) returning index and
----------  opaque_return, with index identifying the handle returned?
            Yes: 28  No: 1

Straw vote: "null" handles (for deletion from the list)?
----------  Yes: 19  No: 3

Straw vote: wait_any to set the handle matched to null?
----------  Yes: 21  No: 6
(Bob Knighten pointed out that this simplifies handling of a shared list when there are multiple threads.)

Is all-null an error or is it a no-op? Postponed.

Straw vote: Should we have a wait_all?
----------  Yes: 15  No: 9

wait_all(list-of-handles, list-of-opaque-returns) was thus approved.
Bill Gropp suggested that we should worry about an error during this operation. Should we have a wait_all for *all* operations? Jim Cownie suggested that if we know what contexts are, we might want to have a wait_all for all events in a context. Other questions: What happens with multiple wait_alls with overlapping lists? What happens in a multithreaded environment?

Probe:
-----

We want to receive a message without knowing its length, for example. (probe returns the envelope of the message and locks the message.) Then you can receive it. We should also have an unlock operation. This is a different sort of handle than at Level 1, since at Level 1 the buffer has been associated with the handle, but here you learn about the buffer. Therefore this section (middle of p. 21) should have "handle" replaced by, say, "lock". This requires a separate receive in order to receive on a lock. Also, the in/out parameters should become input parameters plus an opaque return object. The revised functions are:

  mpi_probe(source, tag, context, opaque_return, lock)
  mpi_precvx(lock, ...)
  mpi_unlock(lock)

Peter Rigsbee suggested that there be a wait version and a status version (the wait version doesn't return until it returns with a lock). Discussion of blocking and non-blocking versions of precvx. Jim Cownie argued against unlock, since probing is sort of a contract to receive the message. There was a counter-argument that it is often desirable to decide NOT to read a message but rather to toss it back into the pool. What to do about message order after unlocking? Several possibilities: head of queue, original location, tail of queue, unspecified. PROBE clearly perturbs the order of receipt of messages. Skjellum and Lusk argued that the only reason for this is other problems; we ought to fix those problems instead. (The problems here are the specification of fixed-size buffers all over the place, rather than just providing the needed buffers to the extent possible.)
Steve Wheat proposed that we do away with the lock and the new receive: have PROBE(source, tag, context, info) and then use a blocking receive to get the particular message the probe located. Thread safety can be dealt with by using critical regions. Note that you cannot use this to look at the entire message queue. Paul Pierce proposed that "you get the buffer and give me the address" be one of the buffer types.

To summarize, the alternatives are:
  1. lock, unlock, precv.
  2. probe gives you info, then you receive from that tag and source.
  3. get rid of probes, and MPI gets the buffer for you.

Buffer Descriptors:
------ -----------

Bill Gropp described a buffer-descriptor proposal, in which there is a function create_bd(...) returning a buffer descriptor. It is then possible to append different kinds of descriptors. The idea is to be able to build arbitrary structures for mixed data types and gather-scatter operations. The append operations might look like:

  bd_contig(bd, address, datatype, numitems)
      (so one can build mixed-type heterogeneous messages)
  bd_stride(bd, address, datatype, stride, numitems {, itemlen?})
      for strided data
  bd_abi(bd, address, index_array, datatype, numitems)
      for indirect address vectors {abi stands for A[B[I]]}
  free_bd(bd)
      or have send free it. Note that one probably wants to reuse
      descriptors, so free should be explicit.

Al Geist noted that this is for sophisticated users, so most can likely get by with just bd_stride. Data types? Gropp's current view is that these are only primitive data types, not derived data types, e.g., no structures. One can build a buffer descriptor for a structure by multiple calls. It would be nice to have a program/function that would do that for the programmer; that would not be hard. Note that this is very different from existing practice. There should certainly be a level close to existing practice. After a discussion of this, there was a

Straw vote: Get a fleshed-out proposal for this?
----------  Yes: 29  No: 2

Straw vote: Should there also be a simple version for contiguous messages of
----------  fixed types (to better conform to existing practice)?
            Yes: 28  No: 3

This was the end of the data-type discussion, and Marc Snir took the floor again.

Cancel:
------

Someone brought up that all the operations for which we can wait perhaps require a cancel. Jim Cownie argued (again) that the worrisome case is the wait for an outstanding non-blocking receive. MPI_cancel(handle) guarantees that the buffer will not be written on. Here is a sample program that illustrates the usefulness of cancel:

    IRECV(...)
    REPEAT(...)
      WAIT(HANDLE)
      . . .
      IRECV(...)
    UNTIL(converge)
    CANCEL

Berryman: this could be handled with an appropriate use of tags. Snir: Cancel of a send could be done with free_handle. Pierce: That should be an error. Marc withdrew the suggestion. This led us into a discussion of whether MPI will require a fixed number of processes. Note that the introduction does not discuss the fixed-number-of-processes requirement.

Correctness: (p. 21)
-----------

What is a correct MPI program? What is done with erroneous MPI programs? Review of message-order preservation. In the case of threads, there may not be an order to the messages. So *if* there is an order on messages, it is preserved. Note that this is for *matching* receives. What about a receive-any with two messages from the same source? They should be received in order. There is also a fairness question. Paul Pierce said that the important order is the order in which the receives are posted, not the order of the receives themselves, so that messages land in the correct buffers. General agreement. Jim Cownie brought up the fairness issue with respect to receive-any.

Progress and Fairness:
-------- --- --------

This brings up the resources issue. Discussion of minimum resource requirements: number of handles, etc. Bob Knighten proposed that the bounds be implicit in the test suite.
It was proposed that there be an appendix to describe implementation profiles, which will be an agreement on what an implementation will try to support. Marc Snir's current document attempts to specify the weakest possible requirement: no system buffering is explicitly required. Rather, there is just the requirement that if a matching send and receive have been posted, then the operation will complete. Marc noted that this can be extended to collective communications. Oliver McBryan suggested that the user could supply some space to the system that it could use for MPI, even on machines that supply no buffering. Rik Littlefield suggested that the user could provide the buffer space to the system, and declare its requirements. This discussion was deferred until there is a concrete proposal from the Environment Subcommittee (which has renamed itself from the Environmental Subcommittee to the Environmental Management Subcommittee).

Error Handling:
----- --------

Snir: There are two communities: application programmers and system programmers. System programmers are not likely to write in Fortran. A single mechanism may not be suitable across languages. One approach, especially appropriate for Fortran application programmers: an error should bring the system down and maybe help you debug. The other, for system programmers writing in C: test return codes for errors. Marc suggested that the default be to blow up. Jim Cownie suggested the solution of having an alternate set of routines, and Rik Littlefield pointed out that this is the only possible thread-safe mechanism. Paul Pierce suggested having the syntactic alternatives, as in NX. It was suggested that the error-handling mode should be attached to the levels. It was proposed that C routines return negative values on error, while Fortran routines are used to having an extra out parameter. Marc summarized the alternatives:

  both F77 and C always return an error code
  only C returns an error code
  2 different libraries
  can select to signal when an error occurs
      (At what granularity? Per job? Per context?)

Straw vote: F77 code should {always/never} return an error code?
----------  Always: 22  Never: 3

Straw vote: Should there be alternate libraries to select between these
----------  alternatives?
            Yes: 6  No: 18
Without debate it was assumed that the same result would prevail in a vote for C.

Straw vote: Can one select (in some way) what happens when an error occurs?
----------  In F77: Yes: 21  No: 6
Without debate it was assumed that the same vote would prevail in C.

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

Friday, February 19:
------ -------- --

The meeting started at 9:00 with:

Report of the Context Subcommittee:
------ -- --- ------ ------------

Tony Skjellum reported on the meeting of the previous night (see notes above). There is a Contexts Draft 1.0 that is available. Contexts are a partition of the tag space for matching; there are no wild cards on context. Contexts chosen by users, tags by users. MPI_NEW_CONTEXT is executed by one process to obtain a context. It is then broadcast to those who need it. If there is a group ALL, there is an associated context, to deal with the bootstrap problem. Pairwise message ordering is preserved within a context. The context subcommittee will try to come next time with proposals that are consistent and resolve the circular interaction among groups, contexts, and process identifiers. Rik Littlefield noted that there is no mechanism for statically created contexts. He prefers a static name-server model. Discussion of whether groups can be used to replace contexts. Marc Snir pointed out that the real question is that of another parameter one sends, and whether we call it gid or context is irrelevant.
Rolf Hempel pointed out that the intended use is quite different, so we
should have both concepts, just as we have selectivity on source even
though it could be encoded in the tag. Stephen Wheat said that his
users understand contexts quite clearly as types you can't wildcard,
while groups are more confusing. Jim Cownie suggested that there is a
definite bootstrap problem with getting the context to the processes
that need it. Tony Skjellum said that in Zipcode contexts are
associated with groups, so obtaining a context is a synchronized group
operation. John Kapenga spoke in favor of a way for a group to obtain
an associated context. Marc Snir spoke in favor of the name-server
approach for libraries. Rik Littlefield said that it would be nice to
have guidelines on how to write an MPI-safe library.

Marc Snir summarized that the extra match field can be:
    separate
    in the tag (tag range registration)
    in the pid (send to ports instead of processes)

Paul Pierce summarized issues that need to be addressed in concrete
proposals:
    are groups local?
    are contexts global or local?
    how to implement the service?

There was general agreement that the context subcommittee will produce
a new white paper clarifying these issues.

Report of the Process Topologies Subcommittee:
------ -- --- ------- ---------- ------------

Rolf Hempel reported on the work of the Process Topologies
Subcommittee. (See the proposal in the "Draft Document for a Standard
Message-Passing Interface" of February 16, 1993.) The Process
Topologies and Collective Communication Subcommittees met jointly, as
both deal with groups.
Division of Responsibilities:
-------- -- ----------------

    Process Topologies                Collective Communications
    topology group creation           basic group creation
    group partitioning along          group partitioning by key
      coordinate lines

Topology Functions:
-------- ---------

A topology definition function always creates a new group.
Advantage: ranks in the parent group do not change; ranks in the new
group are aligned with the topology.

Supported topologies: agreed on
    cartesian structures (grids, tori)
    arbitrary graphs
This is all we should aim at for MPI-1. We can do trees later.
{McBryan: What is the relation of order to topology? There should be
translation functions. What is the order of a tree?}

Standard case: MPI decides which process in the group gets which
position in the topology. {Geist: Does this mean the topology must be
encoded in the gid? Hempel: No.} Additional option: the user assigns a
topology position to each process explicitly. (Marc Snir will write a
proposal.)

Mapping:
-------

The MPI implementation may try to place processes efficiently.
Option: the user can explicitly ask for a random mapping. This would be
more efficient. It could also be used to explicitly request a random
pattern of communication. This might be redundant, since it could be a
user-requested mapping. {Random or arbitrary? Random. Why random?
Because it is sometimes the correct answer. It may reduce contention
relative to any systematic placement.}

Indexing in MPI:
-------- -- ---

There is a general problem in MPI: how are n objects numbered?
    0, 1, ..., n-1    C style
    1, 2, ..., n      Fortran style
For inter-language compatibility we should pick one, but this issue was
deferred to the Language-Binding Subcommittee.

Applications:
------------
    rank in group
    node numbers in graph structures
    MPI_WAITANY

There was a possibility of making this selectable as an alternative,
but Tony pointed out that this breaks libraries.

Straw vote: MPI to number objects using the C convention (0, ..., n-1)?
----------
    Yes: 32    No: 1

(So (0,0) is the first element of a two-dimensional array.)

Another issue: row-major vs. column-major in arrays. This is visible in
ranks in groups and in order in buffering. Alignment of groups and
subgroups. General discussion: is (0,1) the second element, or is (1,0)
the second element? There was discussion of the usefulness of elaborate
mappings, and whether vendors will offer support for this. Otherwise,
topologies are lightweight (a function mesh(i,j) returns a process id).

Snir: General comments on the proposal. One purpose is to renumber
processes so that they can be placed more efficiently. But will this
actually be used by any vendor? Hempel: This is not part of the
standard, and creating a new group solves other problems. Snir/McBryan:
An advantage of topology is the ability to express communication in
terms of the topology. A minor advantage. Geist: What is the relation
of collective communication to topology? Do we want SHIFT to be
relative to the topology? But the argument is a gid, so how is the
topology encoded? Topologies are implemented on top of (augmented)
groups, i.e., the group associated with a topology can "know" the
process order for "shift left". Marc Snir suggested that topology
functions be local (who is my left neighbor, my right neighbor).
McBryan: Sounds like this should be in a library, not in the language.
Hempel: MPI is not a language. Snir: We need an interesting example of
a system where this will actually be used; otherwise this is only a
convenience feature. Cownie: This is based on experience with Parmacs
on machines where this placement is important. Have vendors moved on to
machines where this is no longer relevant? Hempel: It could be a
serious mistake to believe that the situation with the newest machines
shows that placement is not relevant to future very large machines.
Skjellum: With very large machines, the entire model will have to
change. Topology is not the right way to provide information. It is not
relevant to a program running several distinct kernels. Snir: Global
communication vs.
local communication. Berryman: I do not accept the idea that topology
is not relevant. The user needs to be aware of machine topology and
able to use this in a program. Jim Cownie pointed out that on machines
where the processes are on the nodes of the switch network, there is an
important performance benefit to mapping correctly. But new machines
are getting away from this, and process placement may become less of an
issue. Machines are becoming "flatter". End of Rolf's discussion of
topologies.

MPI Tutorial at Supercomputing '93:
--- -------- -- ------------------

Rusty Lusk asked those interested in participating in a tutorial on MPI
at Supercomputing '93 to contact Jack Dongarra or himself.

Validation of MPI:
---------- -- ---

Oliver McBryan suggested that we have an effort to write, port,
provide, and share application programs for MPI. Rusty Lusk noted the
ongoing implementation by Bill Gropp and himself that will allow people
to test applications soon. There will be an effort to port some HPF
programs to MPI.

Schedule:
--------

Bob Knighten asked about the schedule. Marc said that we would have a
reading of the language-independent material, with the C and Fortran
bindings read separately. The point-to-point proposals will be ready
for the next meeting. Profiling should be ready. The introduction
should be ready. The others should be ready for a first reading at the
following meeting.

Report on Environmental Management Subcommittee (Bill Gropp):
------ -- ------------- ---------- ------------

There are three classes of routines:
    MPI
    parallel-related
    non-MPI: useful, but not specifically MPI, like high-resolution
      timers

Three routines were agreed on (no syntax):
    mytid
    numtid
    validtags

Management Hints:
---------- -----

The user provides a requested value; the implementation returns the
actual value, for implementation limits and characteristics. The exact
choice of items that can be managed has not been determined yet. We
aren't doing error handling.
Report of the Language Binding Subcommittee:
------ -- --- -------- ------- ------------

Scott Berryman reported on the Language Binding Subcommittee. This
subcommittee has been biding its time. There will be a "Thou shalt not"
list on the network soon. A proposal for standards used and exceptions
allowed will be presented at the next meeting. For F77 the basic
proposal will be F77 plus long names plus underscores plus include. We
will vote on these at the next meeting.

John Kapenga asked whether we want to say anything about I/O. Jon
Flower pointed out that we need some minimal requirement, driven by the
need to write a test suite. Bill Gropp pointed out that minimally we
should be able to run this program: if (master) printf("hello"). Jim
Cownie and Rusty Lusk said that all one really needs is a requirement
that at least one node be able to do stdio. Marc Snir pointed out that
there needs to be an enquiry function to find out which node can do
I/O.

------------------------------------------------------------------------------
MPI -1.1
------------------------------------------------------------------------------

Jim Cownie presented MPI -1.1. A revision from MPI -1.0 was required
since many of the concepts humorously presented there have now been
adopted into MPI (e.g., the non-blocking barrier).

Design approach: macho (the opposite of Occam!)
    "Entia sunt multiplicanda."
Objective: To be as complex as possible, with no coherent subsets.

Developments since MPI -1.0:
    Non-blocking barrier removed to MPI-1
    Handles added - needed for opening doors
    NPROCS redefined - now guaranteed never to return the same answer
      twice
    Number of collective routines increased - need more to keep all
      procs busy
    Another two versions of all functions: "probably erroneous" &
      "guaranteed erroneous"
    All errors are opaque (following industry practice)
    Non-blocking exit added {But can it be canceled?}

Preserved from MPI -1:
    All groups are contexts
    All contexts are groups

Environmental management:
    I require the hardware to be ...
    I require the vendor to be ...

------------------------------------------------------------------------------
------------------------------------------------------------------------------

Thursday night there were meetings of several subcommittees. Notes from
the meeting of the Communication Contexts Subcommittee are included
below.

------------------------------------------------------------------------------

Communication Contexts
February 18, 1993

Anthony Skjellum opened the discussion. [Started with 12; another 10 or
so came in about 9:20.]

Contexts are to make it feasible to build scalable software that can be
mixed.

TAGS - unstructured bits (at least 32)
    1. definition of tags
    2. matching of tags
CONTEXT - unstructured integer, system assigned
    NEW_CONTEXTS(number_of_contexts, array_of_contexts)
        number specified, array returned
    FREE_CONTEXTS()
    CONTEXTS_AVAILABLE()
    minimum in system >= 16K
    contexts are gotten by one process and then distributed to others

Contexts
--------
1) Avoid crossing messages between libraries and user code
   Select: by source, by tag (and mask), by context - "safe"
Relation to groups and global operations.
Use of contexts in a 3-D grid model: a context for each full
two-dimensional array section => big numbers.
Alternative needs for multiple name spaces apart from groups:
    Wheat: A server with multiple contributors to a package of data.
    Also, context as a means of separating stages in a software
    pipeline without needing a barrier.
Cownie: Proposes using many fewer contexts and a very fast
implementation, using an array of queues with indexing via the context.
An alternative to contexts is using copies of group ids, but then
point-to-point routines need to be aware of groups, i.e., able to
select on gids. The purpose of contexts is to keep separate parts of a
program from interfering, without preventing users from doing anything
they want with tags. Context as an "endpoint of communication" (Cownie:
as a "queue index"). Connection with groups, i.e., collective
communications.

Lusk: What are the costs/complications of instead using tag
registration, i.e., the user requests a range of tags? A change in
"don't care". Rik Littlefield agreed with the need for a large number
of contexts - there are situations where the intersections of groups
are complicated. Paul Pierce argued that the semantics of using part of
the tag as a context are such that there is likely only a syntactic
distinction. At Sandia, the users' group had no problem with the idea
of contexts; groups caused them a great deal of confusion and
disinterest. Some situations, e.g. separating software libraries, need
only a very small number of contexts, but using contexts to separate
intersecting groups can lead to situations where many contexts are
needed. Skjellum asked Cownie about the performance impact. What about
using an index for small numbers and switching for large numbers?
Well . . . Paul Pierce: An "IF" in performance-critical code is always
a problem.

There followed a discussion of the relationship between contexts and
the problem of implementing collective operations using point-to-point
operations. Paul Pierce pointed out that some implementations will want
to do this! [Intel will do this; Meiko will not.] Rik Littlefield
proposed to have a context associated with a "code package" in some
fashion; then structured use of tags within the code is sufficient. For
this to work we need some essentially static assignment of contexts,
but only a modest number of contexts is needed. Groups vs. contexts
again. What is a pid? Etc. Zipcode bases everything on groups.
Proposal: Make groups and contexts the same.
[Of course this is part of MPI -1, rejected last time] ------------------------------------------------------------------------------ ==============================================================================