The MPI Forum met March 31 - April 2, 1993, at the Bristol Suites Hotel in
North Dallas.  This was the fifth meeting of the MPIF and the third of the
now regular meetings in Dallas.  There were both general meetings of the
committee as a whole and meetings of several of the subcommittees.  For
the first time a number of formal votes were taken at this meeting.  All
of these are recorded in these minutes (and can be found by searching for
VOTE) and have also been published (to the mpi-core mailing list) in a
summary of all the formal votes and all of the straw votes for the
committee as a whole.

The notes for these minutes were taken by Bob Knighten
(knighten@ssd.intel.com) and Rusty Lusk (lusk@mcs.anl.gov).

These minutes are quite long.  If you want to see the important topics you
can search for --- and this will quickly lead to each topic (and a few
other things.)

Attendees:
---------

  Joe Baron              IBM Austin                 jbaron@vnet.ibm.com
  Eric Barszcz           NASA Ames                  barszcz@nas.nasa.gov
  Harry Scott Berryman   Yale Univ.                 berryman@cs.yale.edu
  Rob Bjornson           SCA                        bjornson@sca.com
  Lyndon Clarke          EPCC, U. Edinburgh         lyndon@epcc.ed.ac.uk
  James Cownie           Meiko                      jim@meiko.co.uk
  Jack Dongarra          UT/ORNL                    dongarra@cs.utk.edu
  Anne C. Elster         Cornell U.                 elster@cs.cornell.edu
  Sam Fineberg           NASA Ames                  fineberg@nas.nasa.gov
  Jon Flower             ParaSoft                   jwf@parasoft.com
  Ian Glendinning        U. of Southampton          igl@ecs.soton.ac.uk
  Adam Greenberg         TMC                        moose@think.com
  Bill Gropp             ANL                        gropp@mcs.anl.gov
  Leslie Hart            NOAA/FSL                   hart@fsl.noaa.gov
  Tom Haupt              Syracuse U.                haupt@npac.syr.edu
  Rolf Hempel            GMD                        hempel@gmd.de
  Tom Henderson          NOAA/FSL                   hender@fsl.noaa.gov
  C. T. Howard Ho        IBM Almaden                ho@almaden.ibm.com
  Steven Huss-Lederman   SRC                        lederman@super.org
  Rusty Lusk             ANL                        lusk@mcs.anl.gov
  John Kapenga           Western Michigan U.        john@cs.wmich.edu
  Bob Knighten           Intel SSD                  knighten@ssd.intel.com
  Rik Littlefield        PNL                        rj_littlefield@pnl.gov
  Peter Madams           nCube                      pmadams@ncube.com
  Arthur B. Maccabe      U. of New Mexico           maccabe@cs.unm.edu
  Oliver McBryan         U. Colorado                mcbryan@cs.colorado.edu
  Dan Nessett            LLNL                       nessett@llnl.gov
  Steve Otto             Oregon Graduate Institute  otto@cse.ogi.edu
  Peter Pacheco          U. of San Francisco        peter@sun.math.usfca.edu
  Paul Pierce            Intel                      prp@ssd.intel.com
  Sanjay Ranka           Syracuse U.                ranka@top.cis.syr.edu
  Arch Robison           Shell Development          robison@shell.com
  Mark Sears             Sandia                     mpsears@cs.sandia.gov
  Anthony Skjellum       Mississippi State U.       tony@cs.msstate.edu
  Marc Snir              IBM, T.J. Watson           snir@watson.ibm.com
  Alan Sussman           U. of Maryland             als@cs.umd.edu
  David Walker           ORNL                       walker@msr.epm.ornl.gov
  Dennis Weeks           Convex                     weeks@convex.com
  Stephen Wheat          Sandia NL                  srwheat@cs.sandia.gov

Wednesday, March 31
---------  --------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

Jack Dongarra called the meeting to order at 1:30.  The first topic for
discussion was the agenda.  David Walker had mailed out the following:

  Provisional Agenda for MPI Meeting, March 31-April 2, 1993

  Wednesday
    1:30-6:00    Discussion of Snir, Gropp, Lusk point-to-point proposal
                 (everyone)  (Snir)
    6:00-7:30    Unofficial dinner break
    7:30-10:30   Break up for subcommittee meetings

  Thursday
    9:00-12:00   Discussion of Snir & Geist collective communication
                 proposal  (everyone)  (Otto?)
    12:00-1:30   Lunch (provided)
    1:30-3:00    Full group meeting for presentation of alternate
                 approaches to groups and contexts, dynamic vs. static
                 process models, and other issues  (Volunteer?)
    3:00-6:00    Full group meeting for presentation of process topology
                 subcommittee ideas and proposals.  (Hempel)
    6:00-8:00    Dinner (attendees pay, but hotel provides transport to
                 area restaurant)
    8:00-10:00   Continued informal subcommittee meetings if necessary

  Friday
    9:00-11:00   Full group meeting with the intent of taking binding
                 votes on point-to-point and collective communication
                 proposals, or sending proposals back to subcommittees for
                 revision.  (Snir?)
    11:00-12:00  Full group meeting for defining timetable for producing
                 MPI (or subset) by deadline in July.  (Dongarra)

Following the discussion on the mpi-core mailing list, the question was
raised of moving the discussion of the timetable for producing MPI to the
beginning of the meeting.  After a brief discussion it was decided to
proceed first with reports from the Communication Context and
Point-to-Point Communication subcommittees, in order to have a basis for
discussing the schedule.  (The schedule was discussed on Thursday
afternoon, following the completion of the Point-to-Point subcommittee
report.)  The Context subcommittee was allotted two hours, from 1:30 to
3:00, with the Point-to-Point Communication subcommittee scheduled from
3:00 to 6:00 and on Thursday morning.

-------------------------------------------------------------------------------
Report From the Communication Context Subcommittee
-------------------------------------------------------------------------------

Tony Skjellum presided.

There was a large volume of activity on the mpi-context mailing list
before this meeting, and so there were five proposals available for
consideration, labeled:

  I     (Marc Snir)
  III   (Tony Skjellum)
  VII   (Lyndon Clarke/Rik Littlefield)
  VIII  (Mark Sears)
  X     (Tony Skjellum/Lyndon Clarke)

Twenty-five minutes was allotted for presentation of each of these, in the
order I (Marc Snir), VII+X (Lyndon Clarke), VIII (Mark Sears), III (Tony
Skjellum), X, followed by general discussion.  Tonight there will be a
subcommittee meeting to produce a single proposal.

Proposal I (Marc Snir)
---------- -----------

{Marc used overhead projector slides and these notes are largely a
transcription of those slides.}

Group=Context Proposal

Goals:
  + Keep it simple (and keep MPI small)
  + Keep it efficient

Minimal needs:
  + Protection mechanism
  + Local name space

Group = Context = Ordered set of processes.
-- Method for protecting communication between e.g. libraries. --

  + All point-to-point communication is WITHIN a group and uses a
    (group,rank) address.
  + All collective communication is by a group (which is a context).

OPERATIONS:
  + Group copy
  + Group partition
  + Group creation by list
  + Group deletion
  + ALL group preexists

Group handle has only local (i.e. within group) use and meaning.
-- There is no reason to pass the handle of a group outside the group -
it has no use. --

1. Impact on "current practice":  Need additional argument ALL in all p-p
   calls.
2. Overhead for p-p:
     send    - no impact when ALL is used.  One lookup for other groups.
     receive - context id match
   Overhead at creation.
   Loosely synchronized collective communication within a group is
   affected.
   Storage: member table (good protection).
3. Compatibility with dynamic process creation and deletion:
   Process creation/deletion requires the same for the group.
   {What is the ALL group after process creation or deletion?}
4. Interaction with topology:
   A group has no topology information (but it can be used as a peg for
   such information.)
5. Inter-group communication (e.g. client-server models):
   +--------------------------------------+        +---+
   |  *                                   |  --->  | * |
   |  *                                   |  (---) | * |
   +--------------------------------------+        +---+

   Do the communication within an encompassing group.
    + Encompassing group needed for protection
    + May not be convenient for naming (e.g. send(server[5], ...))
    + Inconvenience does not warrant a change in the p-p layer
    + Can be handled by creating and explicitly passing arrays of ranks:
        MPI_LIST_RANKS(list, subgroup, group)
      returns the list of the rank of each subgroup member within group.

Discussion:

How to have both subgroup and group available?  There must be an
encompassing group which has full knowledge, e.g. a server that is a
member of both the group and the subgroup.

This proposal is orthogonal to the question of attaching additional
information (e.g. topology information, caching, etc.)

It can't deal with the situation of contacting an independent pre-existing
server.  Marc's approach is that dynamically adding processes requires
dynamically creating a group containing all of the processes.

Opacity vs. accessibility of the mechanism.

=============================================================================

Lyndon - VII and X

VII: Context is a higher level mechanism than a group.  It is basically a
unique identifier together with a reference to a group.  This means that
as a group changes, all contexts that reference that group change as well.
Same ability to hang on facilities (e.g. topology, caching) as the others.

Relation to p-p:  Three forms: "closed form", "null form", "open form".
The open form is to allow communication between different groups.
Experience is that creating encompassing groups is difficult - disagrees
with Snir's claim.  Addressing is via (context, rank).  Need a "context
allocation" mechanism - this implies global communication.

Relation to c-c:  Works very cleanly for all-all using the closed form.
Two-group communications - for MPI-2.

Discussion:

How to establish communication between groups?  Can send a context.
What is opaque?  Lyndon - not important.
Startup/bootstrap - everyone starts in the ALL group.  Can use a common
ancestor or a name registry.
Power relative to I?  Lyndon claims that this is more convenient once
communication is set up.  The basic idea is to be able to communicate via
(my_context, remote_context, rank).

ORTHOGONAL ISSUES - caching, tag selection, transfer of ????

X: Attempt to synthesize III & VII.

A CONTEXT is a space of tags.  A GROUP is a set of process references.
[What does this mean?]  The idea is to give a method for combining groups
and contexts for the purpose of communication.

COMMUNICATORS (see pp. 3-4).  Silly names, but a serious proposal.
  Floopy - arbitrary communication between processes, allowing wild card
           on tag.
  Bongo  - basically like Marc's proposal - communication within a group
           using rank naming ("closed").
  Bingo  - communication between groups.

Question: Why do we want ANY of these proposals?  Performance, and the
ability to build large scale software safely.  We need examples for all of
these proposals!

Collective -> group; expressing collective in terms of p-p implies a need
to discriminate messages -> context.  But there are reasons to have groups
and contexts that have nothing to do with collective communication.

Argument that in Marc's proposal the need for a context means having to
have an additional group - but is this a problem?  Lyndon argues that
there are good reasons to separate group and context.  Static vs. dynamic
groups.  Ability to move context.
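{Since examples were asked for, here is a minimal sketch of the safety
argument common to all of these proposals, in C with entirely hypothetical
names - Context, ctx_dup, send_ctx and recv_ctx are placeholders, not
proposed bindings.  A library that communicates in a private context (a
"bongo" in X's terms, its own group in I's) can never have its internal
messages intercepted by a wildcard receive posted in user code.}

    /* Hypothetical handles and calls -- placeholders, not MPI names. */
    typedef int Context;
    Context ctx_dup(Context parent);   /* collective: private context  */
    void send_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void recv_ctx(Context c, int rank, int tag, void *buf, int nbytes);

    static Context lib_ctx;            /* the library's private context */

    /* Called once by every process.  Afterwards the library's internal
     * traffic cannot match any receive posted on the caller's context,
     * whatever tags or DONTCAREs the user employs. */
    void lib_init(Context user_ctx)
    {
        lib_ctx = ctx_dup(user_ctx);
    }

    void lib_exchange(int partner, double *work, int n)
    {
        send_ctx(lib_ctx, partner, 0, work, n * (int)sizeof(double));
        recv_ctx(lib_ctx, partner, 0, work, n * (int)sizeof(double));
    }

{Proposal I expresses the same isolation by giving the library its own
group; VII and X express it by a context that references the group.}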
Proposal VIII (Mark Sears) - Group and Context Proposal
-------- ---- ------------

Contexts and groups are orthogonal:
  + Orthogonal purpose
  + Orthogonal functionality
  + Orthogonal implementation

Contexts
--------

Purpose - promote software modularity by allowing construction of
independent tag spaces.

Definition:  A CONTEXT is an integer-valued extension to the tag component
of the message envelope, and must match exactly between sender and
receiver.

Model:
  + Contexts are global.
  + No concept of a process belonging to a context.
  + Contexts are scarce resources (e.g. 16).
  + Context allocation is a rare event.
  + MPI p-p requires no reference to groups.

Context allocation/deallocation:

  ALLOCATION:     int MPI_getcontext()
    + called synchronously by all (EVERY SINGLE ONE) processes
    + signals to MPI that use of the context is now allowed
  DEALLOCATION:   void MPI_free_context(value)

  MPI_DEFAULT_CONTEXT
    + Preallocated; can't be freed.
    + Solves initialization.
    + Free-for-all.

But Sears believes that allocation/deallocation is not truly needed - one
could have an entirely static system.

GROUPS
------

Purpose - provide tools for organizing subsets of processes in a parallel
task (i.e. an MPI program.)

Definition:  A group is a 1-1 mapping from (0..n-1) to another set of
integers.  A group is a collection of processes only in so far as the
elements are process addresses.  Groups have no associated context or
tags, default or otherwise.

Group implementation:
  + local to each process, based on the information needed to construct
    the mapping
  + the group type is local and opaque
  + groups can be sent in a message only by sending the information
    needed to construct the group
  + groups are objects in the OOP sense

Usage:
  MPI_SEND(n,buf,process,tag,context)
  MPI_BROADCAST(n,buf,group,tag,context)

The group identifier is a local opaque type, thought of as a pointer to
one of many possible group structures.

  MPI_SEND(..., element(group,rank), tag, context)

GROUP FUNCTIONS:
  int order(group)
  int range(group)
  int element(group, int rank)
  int iselement(group, int element)
  int rank(group, int element)

CLASSES:  identity, permutation, linear, list, bilinear, composition,
cartesian

CONSTRUCTORS:  group makelineargroup(order,start,delta)

Two kinds of 3rd party code:
  1. Code that inherits context and tag space from the caller.
     Example: MPI collective communication.
  2. Code which allocates and manages its own context and tag space.
MPI should allow both of these.

Topology:

  Global topology - mapping of processes to processors.  Provide an
  inquiry function returning a string describing this mapping:
      char * MPI_global_topology()
  Examples of output:
      "N 564"      - random network of 564 processes
      "H 5"        - 5 dimensional hypercube
      "R 2 16 13"  - 2D mesh, 16x13

  Local topology - implicit within a group; no additional functions
  needed.

ADVANTAGES:
  + Ease of implementation
  + Close to hardware
  + Good use of resources
  + Flexibility in implementation of higher level concepts
  + MPI p-p requires no reference to groups
  + MPI c-c can be layered on top of MPI p-p

Discussion - Serious problem with global communication: this destroys the
software modularity.  How to do global operations using groups?  The
responsibility is on the code to ensure there is disambiguation,
synchronization, etc.
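{Proposal VIII's group functions are concrete enough to sketch.  Here is
the linear class in C; the names order, element, iselement, rank and
makelineargroup are from the slides, but the struct layout is invented
here and delta is assumed positive.}

    #include <stdlib.h>

    /* A linear group: member i maps to process address start+i*delta. */
    typedef struct {
        int order;           /* number of members               */
        int start, delta;    /* assume delta > 0 for simplicity */
    } group;

    group *makelineargroup(int order, int start, int delta)
    {
        group *g = malloc(sizeof *g);
        g->order = order;
        g->start = start;
        g->delta = delta;
        return g;
    }

    int order(group *g)              { return g->order; }
    int element(group *g, int rank)  { return g->start + rank * g->delta; }

    int iselement(group *g, int e)
    {
        int off = e - g->start;
        return off >= 0 && off % g->delta == 0
                        && off / g->delta < g->order;
    }

    int rank(group *g, int e)        { return (e - g->start) / g->delta; }

{A p-p send to member 3 of such a group is then just
MPI_SEND(..., element(g,3), tag, context), exactly as in the usage shown
above.}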
Tony - III

  + Tag/context partition the message space for "safe" software.
  + Groups encapsulate the scope of operations for 1) notation
    2) optimization 3) performance.

  {Slide showed a sketch of GROUPS and CONTEXTS as two overlapping
  spaces, with the question "lower, higher?"}

Relation between groups and contexts?  Groups can be orthogonal except
for group creation.

Forms of communication:
  + collective communication ON groups (compatible with I)
  + p-p   A] (group,rank,tag)  - analogous to I
          B] (context,pid,tag)

Models:
  1) Contexts created/destroyed.
  2) Contexts can be published - a dynamic server is implied, or a shared
     address area.

Contexts & groups interrelate when creating new groups, not necessarily
from the LCA.  {{{WHAT DOES THIS MEAN?}}}

Dispute/reply regarding optimization.  Argument that a group can be used
to provide information about special situations (e.g. shared memory) that
can be used for optimization.

Dynamic groups are more feasible using III or VII.

Can contexts be sent in III?  Yes.

VII is more complex than III because it offers more layers.

Tony believes that dynamic groups are essential for the heterogeneous
case, and so believes that I is inappropriate.

John K. notes that the other proposals can be built on top of VIII.

Proposal to defer the straw vote until tomorrow, to give people time to
ponder.

Is global synchronization an essential part of VIII?  Sears - no, there
are various possibilities.

Why is it important that contexts are global?  Because a context is not
associated with a group of processes.  This loses much of the safety.  It
also loses the local addressing within a group.  Sears argues that this
complicates p-p, but Snir says that something like his proposal has been
implemented and is not complicated or expensive.

What level of protection is needed/desirable?  Picture of using context
for safety - at startup send a context to each library used, which it
then uses for internal safety.  Problem - libraries using libraries using
libraries, etc.

Importance of receiving on wild cards and relation.

Host/Node model?  Must support this, and all do.  (What about loading a
program?  Not part of THIS discussion.)

The subcommittee will meet tonight and present a more unified proposal
tomorrow morning.

Rik asked for an example showing how to implement a safe barrier using
p-p, group, and context.

Adam - How can we evaluate these proposals if we have not agreed on what
we want from the context concept?

----- break 4-4:20 pm -----
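{One hedged answer to Rik's request: a dissemination barrier built purely
from point-to-point calls, made safe by running in a context private to
the barrier.  All names (Context, Handle, isend_ctx, recv_ctx, wait) are
placeholders, not proposed bindings; the immediate send is used so the
barrier cannot deadlock even when sends are unbuffered.}

    /* Placeholders -- not proposed MPI names. */
    typedef int Context;
    typedef int Handle;
    Handle isend_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void   recv_ctx(Context c, int rank, int tag, void *buf, int nbytes);
    void   wait(Handle h);

    /* Dissemination barrier: ceil(log2(nprocs)) rounds; in round k each
     * process signals the process 2^k ahead and waits for the one 2^k
     * behind.  The private context keeps these token messages from ever
     * matching user receives; the round number doubles as the tag. */
    void safe_barrier(Context barrier_ctx, int myrank, int nprocs)
    {
        char out = 0, in;
        for (int dist = 1; dist < nprocs; dist <<= 1) {
            int to   = (myrank + dist) % nprocs;
            int from = (myrank - dist + nprocs) % nprocs;
            Handle h = isend_ctx(barrier_ctx, to, dist, &out, 1);
            recv_ctx(barrier_ctx, from, dist, &in, 1);
            wait(h);
        }
    }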
-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee
-------------------------------------------------------------------------------

First reading of the p-p proposal - Marc Snir presiding.

Presentation with some minimal assumptions about the use of context.  No
language binding included.  Use of handles and opaque objects.  [Ignore
text comments on implementation - issues for the language experts.]

Discussion of ephemeral/persistent.  Greenberg asked about "error to free
a handle with pending operation" - change to "handle becomes ephemeral" or
some such.

How to deal with lists - included length, separate length, EOL marker.

States.  [Implementation again - not part of the current proposal.]

If we have only one send/receive, what is the type of the buffer and what
to do about this?  Marc's proposal is to accept breaking F77 in this case.

Skipping 2.4 (Contexts) for now (including error handling.)

2.5.1: What about C/Fortran compatibility for messages?  Also skipped for
now.

2.6: Greenberg wants at least an escape hatch to allow functions
(MPI_ADD_?) that add, e.g., other F90 objects.

Discussion of len in MPI_ADD_BLOCK as "number of elements" rather than as
bytes.  There are language and portability issues.

Marc mentioned the issues in the middle of p. 14 (negative displacements,
...) and noted that these must be settled.

Note that the delete/commit functions discussed last time are not in this
proposal.

2.6.1 (Data Conversion):  It just happens - this needs to be stated
clearly.

-- back to 2.2 - vote on each section --

[[NOTE: A proposal, in the form of a proposed chapter, is offered.  Votes
are on amendments and on accepting particular parts of the chapter.]]

How are handles allocated?  User or system.  There are efficiency
advantages to user allocation, e.g. on the stack.

PRP argues for a create for each data type rather than a generic
MPI_CREATE.  At the moment there is an admixture in the proposal.

====== Discussion of voting rules.  Organizations voting: 24 ======

VOTE: Separate mpi_create for each type
----
      Yes: 20  No: 0  Abstain: 4

Possible meanings of free:
  1.1  free can always be done
  1.2  free can be done only if the user will not use the handle -
       no pending operation after free
  2    if no pending operation, deallocate; otherwise free when done
1.2 = current proposal; 2 is the Greenberg proposal.

Greenberg explanation - a common thing done in their system is to have
handlers, with free only taking effect after the handler completes.
PRP - Has a primitive that does essentially 2 (which is not "mark
ephemeral").  It does not imply cancel.  Fire and forget.
Cownie: Two arguments - Paul's fire and forget; Adam's handler for
messages.
Snir: Is the buffer available after free?  PRP/Adam: No.
Why not just have another kind of free?  Multiplicity of functions.

VOTE: mpi_free is valid even if the handle is in use, but the effect is
----  to free the object when the operation completes.
      Yes: 7  No: 8  Abstain:

List of handles - change the name to array of handles.

VOTE: list_length explicit (rather than included in array or EOL)?
----
      Yes: 22  No: 0

Cownie wants a method to provide cheap allocation, e.g. on the user's
stack rather than in the system heap.  Need a concrete proposal.

RLK wants an explicit statement of what is erroneous, checked, etc.

VOTE: Accept 2.2
----
      Yes: 23  No: 1

2.3 & 2.4 skipped - they will be considered elsewhere.

2.5:  Cownie proposes that there be both CHAR and BYTE data types rather
than just the basic data types of the host language.  Another proposal is
to have an MPI_STRING data type.  But what length?  Null terminated?

VOTE: BYTE data type.
----
      Yes: 24  No: 0

Pending action of the Context Subcommittee is the content of the
envelope.

VOTE: Accept 2.5 (minus 2.5.2)
----
      Yes: unanimous

Proposals for units of bytes and for consistency:
  1. Units are elements.
  2. Bytes in C and elements in F77.  (No one favors this.  Against: 14)
  3. Bytes everywhere (and provide some way of getting the size of basic
     types).
     a) index stays as is, but bytes elsewhere
          Pro: 9   Con: 13   (VOTE 1)
     b) truly bytes everywhere, including indices
          Pro: 10  Con: 8    (VOTE 2)

Greenberg will bring forward an additional proposal for 2.6 at the next
meeting.

VOTE: In vector, stride may be negative?
----
      Yes: 10  No: 2

VOTE: Allow repetition?
----
      Yes: 13  No: 1

VOTE: Can multiple components overlap?
----
      vote not taken

PRP proposes that in vector, len is a count of the number of blocks.

VOTE: total length is an integer multiple of the block size
----
      Yes: 4  No: 10

Tony proposes adding a COMMIT operation.
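{To make the COMMIT proposal concrete, here is the descriptor life cycle
it implies, as a hedged C sketch.  Only MPI_ADD_BLOCK appears in the
draft; mpi_handle, MPI_CREATE_BUFFER (one create per kind of object, per
the vote above), MPI_COMMIT, MPI_SEND and MPI_FREE are placeholder names
and signatures.}

    /* Hypothetical declarations -- sketch only, not proposed bindings. */
    typedef void *mpi_handle;
    typedef enum { MPI_REAL, MPI_INTEGER } mpi_type;
    mpi_handle MPI_CREATE_BUFFER(void);
    void MPI_ADD_BLOCK(mpi_handle d, void *start, int len, mpi_type t);
    void MPI_COMMIT(mpi_handle d);
    void MPI_SEND(mpi_handle d, int dest, int tag, int context);
    void MPI_FREE(mpi_handle d);

    void example_send(int dest, int tag, int context)
    {
        float x[4];
        int   n = 4;

        mpi_handle d = MPI_CREATE_BUFFER();    /* per-type create      */
        MPI_ADD_BLOCK(d, x,  4, MPI_REAL);     /* 4 REAL elements      */
        MPI_ADD_BLOCK(d, &n, 1, MPI_INTEGER);  /* 1 INTEGER element    */
        MPI_COMMIT(d);       /* descriptor now frozen: no modification
                                while it is in use                     */
        MPI_SEND(d, dest, tag, context);
        MPI_FREE(d);         /* per the free vote above: only after the
                                operation completes                    */
    }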
Note: The current proposal does not clearly state it, but a handle cannot
be modified after it is in use.  There is nothing that explicitly
specifies that the modifications are complete.

VOTE: Commit?
----
      Yes: 11  No: 6

VOTE: Accept 2.6
----
      Yes: 12  No: 4

----- Break for dinner - 6:15 p.m. -----

Subcommittees tonight:
  Here:  Vendor caucus
  Rm 1:  Process topology
  Rm 2:  Collective Communications Subcommittee
  Rm 3:  Context Subcommittee
  BAR:   formal, language binding, profile - meet at 8 p.m.

-----------------------------------------------------------------------------

Revised agenda:
  Thursday
    9      p-p (cont)
    12     lunch
    1:30   (registration)
    1:30   collective communication
    3      process topology
    6-8    dinner

Future: May 12-14; June 23-25; August ?

=============================================================================

Thursday, April 1, 1993    9:10 a.m. -

-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee (continued)
-------------------------------------------------------------------------------

Point-to-Point Communication - First Reading (continued):
-------------- ------------- ----- -------

Marc Snir presiding.

2.7 (Receive Criteria) & 2.8 (Communication Mode)
-------------------------------------------------

Suggestion that we need more than one DONTCARE: a SOURCE_DONTCARE and a
TAG_DONTCARE.

Lyndon Clarke proposed a "secure" communication mode, where the send
returns once the system can guarantee that the receive will actually
complete.  (During the discussion of this there was a suggestion that the
word REGULAR on p. 16 of Marc Snir's draft be changed to STANDARD, so the
first can always be used.  Marc remarked that he welcomes all manner of
stylistic improvements, but asked that they be sent to him, not voted on
here.)

Why such a secure communication mode?  To provide a portable manner to
write programs that are guaranteed to be safe, even without buffering.
This is similar to, but weaker than, the synchronous functions that were
rejected in a straw vote last time.  The unease with a proliferation of
functions was again mentioned.  Adam Greenberg suggested that one could
be against this and still specify equivalent function by requiring no
buffering throughout the system.

Rik Littlefield (as a pseudo Tony Skjellum) proposed receive criteria
based on an intag and mask.  Variations on this have been discussed in
the past.

VOTE: Typed DONTCARE?
----
      Yes: 19  No: 1

After the vote, a count showed 25 organizations present.

Proposal: Receive selection based on (tag & mask) = (intag & mask).
Why do this if we have context?  Discussion of efficiency.

PRP proposes sending tag (exact match by system) and extra-info (for
recognition of message category by the application) in the envelope.
Cownie is unhappy because of the effect on the latency of small messages.

Alternatives being considered:
  (1) RECEIVE(..., tag, info, ...)   no DONTCARE for tag
  (2) RECEIVE(..., tag, mask, ...)   no DONTCARE for tag
  (3) RECEIVE(..., tag, ...)         DONTCARE for tag

Does this imply small tags?  PRP: it is reasonable for an implementation
to limit the size of tag, category, and context.  Greenberg argues against
the PRP proposal because of existing practice and because it could force
the user to duplicate some system function.  Eric argues in favor of
wildcarding because it often allows reuse of a buffer and so fewer
resources.

VOTE: (1) fails for lack of a second.
----  (2) Yes: 6  No: 12
      so (3) remains.
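{For reference, the selection rule of alternative (2), written out as C;
the helper name is ours, not a proposed binding.}

    /* A message with tag `tag` matches a receive posted with
     * (intag, mask) iff the bits selected by mask agree. */
    int tag_matches(int tag, int intag, int mask)
    {
        return (tag & mask) == (intag & mask);
    }
    /* mask == ~0 gives exact matching; mask == 0 accepts any tag,
     * recovering DONTCARE without a distinguished tag value. */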
Proposal: Secure communication mode.

Rusty - P4 experience argues against this.
Cownie - this is useful for reasoning about program correctness.

An alternative is to have a "secure mode".  But what happens if one is
using secure mode, but some library may have been written without secure
mode?

VOTE: Do we want a "secure" communication mode?
----
      Yes: 7  No: 6

VOTE: Delete send/ready_receive?
----
      Yes: 10  No: 10  [Amendment tabled]

VOTE: Accept amended versions of 2.7 & 2.8
----
      Yes: 25  No: 1

Discussion of 2.9 (Communication Objects):
-----------------------------------------

Marc Snir began with an overview of the four subsections.  He asked if
there should be handles for user space objects - this is not in the
current proposal.  He noted that the choice between using STATUS_ALL and
using STATUS in a loop is one of convenience.

Jim Cownie argued: WAIT_ANY should be told where to start the scan,
because of fairness concerns.  [Discussion - this is a strong
implementation requirement.  MPI requires "fairness", though the
requirement is so weak as to be untestable.  WAIT_ANY is a small part of
the fairness problem.  It is easier for the user to pass in this
information than for the system to maintain it.  The user can always
guarantee fairness - but at some cost.  Postpone until we discuss
correctness and fairness in general.]

The current draft has MPI_RETURN_STAT providing the free space in the
buffer.  Current practice appears to be to return the number of bytes
received.

Issues: byte count of data received; where do handles come from;
partially specified handles; make explicit that a message can be sent to
oneself.

VOTE: The return status for a receive operation should be the number of
----  bytes received
      Yes: 17  No: 3  Abstain: 7

The return status handle should be allocated by the user:
--------------------------------------------------------

Typically the user knows exactly what needs to be done, and the lifetime
is often short.  Thread safety says one can't use global storage, so ...
Suggestion that there be an overall proposal to deal with handles in user
space.  But this is not a general handle - this is a special situation.
Postponed for general consideration.

Partial handles
---------------

return_status_handle
  + part of communication handle
  + separate user space object
  + separate system object

VOTE: Accept 2.9 (with amendments) excluding 2.9.1 and WAIT_ANY
----
      Yes: 19  No: 2  Abstain: 3

----- break 10:48 - 11:08 -----

Discussion of sections 2.10-2.12
--------------------------------

Marc presented the following table for discussion:

                  |general|contig|vector|contig|
                  |buffer |byte  |byte  |type  |
    --------------+-------+------+------+------+
    blocking      |       |      |      |      |
    send          |   *   |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+
    blocking      |       |      |      |      |
    ready-receive |   *   |      |      |      |
    send          |       |      |      |      |
    --------------+-------+------+------+------+
    immediate     |       |      |      |      |
    send          |   *   |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+
    immediate     |       |      |      |      |
    ready-receive |   *   |      |      |      |
    send          |       |      |      |      |
    --------------+-------+------+------+------+
    secure        |       |      |      |      |
    send          |       |      |      |      |
    receive       |       |      |      |      |
    --------------+-------+------+------+------+

The same buffer types are used in Collective Communication.  Discuss
probe.

Issues: Counting units?  Vector type?  Contig type includes contig byte
because we have a byte type?

Lyndon - There should be a secure-receive (for optimization of the
protocol.)  Marc - Is the system or the user responsible for ensuring
this works?  Lyndon - User.  Various - What is the value of
secure-receive?
Gropp - Because these are different protocols, we lose performance if we
don't have different functions, as the general function must always deal
with the worst case.

Marc: error?
    ssend(2) ----------------------------  recv(1)

The proposal is that it is erroneous to attempt to receive a message sent
by a secure-send by anything other than a secure-receive.  Rusty argues
that this should not be erroneous because ???

Possibilities:
  1) secure-send {can, cannot} be received by receive
  2) Enforced by MPI, or user responsibility

Adam argues that this is confusing the secure-send features with
pre-acknowledgement, and these are independent issues.  The purpose of
secure-receive is entirely performance - it does not have any semantic
content.  secure-send/secure-receive can be implemented on top of regular
p-p.

Note that the proposal is to add 2 secure-sends and 2 secure-receives.

VOTE: Add both blocking and immediate secure-send/secure-receive, with
----  failure of the program to match being erroneous.
      Yes: 10  No: 8  Abstain: 9

VOTE: 2.10 as amended
----
      Yes: 20  No: 4  Abstain: 3

VOTE: 2.11
----
      Yes: 26  No: 0  Abstain: 1

2.12
----

Are blocks just blocks of bytes, or are they typed?  How do we count the
size of blocks?

Lusk - Wants to have blocks of typed data for use in a heterogeneous
environment.

PRP - Offers a precise proposal: use the same parameters as in
MPI_ADD_BLOCK.

VOTE: Use exactly the same parameters as in MPI_ADD_BLOCK, as amended
----
      Yes: 26  No: 0  Abstain: 1

Adam - Proposal to have functions for strided messages using blocks.
Rusty - Argues against, because of the problem of proliferation at this
level.  [The argument was clearly a matter of taste.  This is syntactic
sugar, as the most general low level routines certainly can be used for
this.]

VOTE: Have strided block message functions.
----
      Yes: 5  No: 9  Abstain: 12

----- lunch 12:00 - 1:35 -----

Continuation of the discussion of Chapter 2.  Proposal that more time for
discussing 2.10-2.14 is needed, so the p-p subcommittee will meet after
dinner tonight.

Discussion of Schedule
----------------------

Future meetings:
  4. May 12-14         set
  5. June 23-25        set
  6. August 11-13      tentative
  7. September 22-24   tentative

Draft to be available: November 15-19, SC '93, Portland.

Reading Schedule
----------------
  P-P                  Snir        April & May
  Collective           Otto/Geist  April & May
  Profiling            Cownie      April & May
  Process Topology     Hempel      May & June
  Environmental Mgmt   Gropp       May & June
  Lang. Bind. gen.     Berryman    May & June
  Context              Skjellum    May & June
  Formal Spec          Zenith      June & August ???

Specific language bindings will follow the general material by one
meeting.  Where is the language binding material?  There will be a
general principles chapter and also the actual bindings.  These will be
separate votes.
  General language       Berryman    June
  Specific lang. bind.   Berryman

Is anything coming in the Formal Spec?  Does anyone care?  Rusty - one
participant and two observers.  Zenith told Rusty that he was working on
something, but nothing has appeared and he is not here.

What about public comment?  Discussion of the HPFF model of two
opportunities for comment.  Proposal to have only one round of public
comment, with the draft released to the public at Supercomputing '93.
Reference implementation - Gropp/Lusk effort.
Test suite - Greenberg and Haupt will lead an effort.

Subset - "Implementation Order Recommendation" - Huss

Goals:
  + Define a reasonable subset of MPI that is recommended for initial
    implementation.
  + Only a minimum - implementors are welcome to implement more.
  + Allow MPI to begin to show up in a timely fashion, while still
    consistent across vendors.
  + Consistent with the complete standard.

Method:
  + Create a new "subset committee".
  + Write an Annex (like HPF).
  + Present a fleshed out proposal/annex at the next meeting.
  + Hope the "other" committee will create an initial test suite for a
    minimal implementation :-)  Might motivate implementors.

First shot list - NOT IN SUBSET:
  1. No persistent handles
  2. No multicomponent buffer descriptors {only one item described}
  3. No indexed component
  4. No waitany or waitall
  5. No name server model

Subcommittee members wanted; discussion will begin via e-mail.  If
interested, send mail to Huss.

-------------------------------------------------------------------------------
Report From the Collective Communication Subcommittee
-------------------------------------------------------------------------------

Collective Communications - First Reading
-----------------------------------------

Steve Otto.

A subcommittee of three met last night (Otto, Ho, El...).

Propose a discussion about the safety, semantics and function of
collective communication.
  NOT: contexts/groups
  NOT: detailed questions of data types, lang. bind., p-p

Semantic Warm-up
----------------

Barrier implies a time synchronization of processes, but NOT an emptying
of all message buffers.  I.e. p-p messages may span (in time) a barrier:

       1                       2
    post-send(2)            post-receive(1)
    barrier                 barrier
    complete-send(2)

A barrier does not imply that message queues were emptied.  Object to the
statement on p. 8.  {{{QUOTE}}}

If we want a collective function that does ensure all message queues were
emptied, let's invent it:  wait-for-all-global.

Example 3, p. 46.  Is this safe?

       0                       1
    bcast       /---------- receive(0)
    send(1) ---/             bcast

Does this deadlock?  It depends on the implementation of bcast:
  If bcast is implemented with buffered context-unique messages, it
  probably won't deadlock.
  If bcast synchronizes strongly, it probably will deadlock.

Otto is unhappy about defining the semantics of the collective
communication routines in terms of operational p-p routines, because this
depends on side effects such as emptying message queues.
Cownie/Snir argue that this is not true because ... {{{MISSED}}}

Conservative Proposal:  Instead of mandating that the example above is
"safe", we propose that no messages are allowed to be "in the air" upon
entry to a collective communication call.  So: we require that c-c
routines be used AS IF they implied barrier synchronization, but the user
cannot assume that they actually provide barrier synchronization.

Confusion about what is actually intended.  Unhappiness about the phrase
"in the air".  The example above is unsafe, but is it erroneous?  The
claim is that it is unsafe, but an implementation that allows it is
compliant.  Now what is the situation of the example on p. 46?  It is
there because Marc wants people to be aware that the behavior of a valid
(if unsafe) program may be surprising.

Related point: May want to have a "barrier mode" for c-c, so that they do
behave as barriers when this is on.  Useful - can detect many erroneous
programs.
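{Example 3 from p. 46, rendered as code to make the two outcomes explicit.
The names (bcast, send, recv) are pseudo-MPI shorthand, not bindings; only
the ordering matters.}

    /* Shorthand declarations -- not proposed bindings. */
    void bcast(void *buf, int n, int root, int group);
    void send(int to, int tag, void *msg, int len);
    void recv(int from, int tag, void *msg, int len);

    void example3(int myrank, int group, double *buf, int n,
                  char *msg, int len, int tag)
    {
        if (myrank == 0) {
            bcast(buf, n, /*root=*/0, group);  /* 0 enters bcast first */
            send(1, tag, msg, len);
        } else if (myrank == 1) {
            recv(0, tag, msg, len);       /* posted before 1's bcast   */
            bcast(buf, n, /*root=*/0, group);
        }
        /* If bcast buffers (context-unique messages), rank 0's bcast
         * returns, its send matches rank 1's receive, and rank 1
         * reaches the bcast: no deadlock.  If bcast synchronizes
         * strongly, rank 0 waits in bcast for rank 1 while rank 1
         * waits in recv for rank 0's send: deadlock.  The conservative
         * proposal simply declares such programs unsafe. */
    }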
Keeping multiple c-c's separated
--------------------------------

      /group 1      /collective operation 2
     [         (         ]         )
                \collective operation 1  \group 2
            ^
            |  can messages in here go to the wrong destination?

Intersecting spanning trees ... can they catch each other's messages?
It seems that good implementations can be constructed so that they don't.
==> NOT dependent on user-tags.

TAGS ARE SUPERFLUOUS FOR C-C
  + keep for consistency with p-p?
  + but, what is the implementor supposed to do with them?
  + ignore them?
  + what do we tell users about tags in c-c?
We propose - NO TAGS in c-c!

Why have them?  For debugging.  Anything else?  Does the lack of tags
affect non-blocking c-c?  Yes.

Steve Huss: Wants to make sure that the system must ensure that
successive broadcasts will never result in confusion of messages.

Marc: What is the exact semantics of c-c?  In particular, when is the
burden on the user of MPI to ensure no ambiguity vs. when is this
guaranteed by the system?  This is particularly important if we
generalize to allow non-blocking c-c.

PRP: Might want to guarantee that ordering is preserved between c-c.
Snir: Certainly want to guarantee that if the parameters in c-c are
different, then there is no ambiguity.  Further, we may guarantee this
even with the same parameters.

Proposal - no tags and no ambiguity (so matching is via parameters and
sequencing.)  [System responsibility vs. user responsibility]

Steve: Wants MPI to guarantee preservation of order.
Kapenga/Pierce: Order may not be meaningful.
Gropp: We already have a similar issue for p-p.
Snir: So what is the behavior when order is not meaningful?
Cownie: Kapenga made the important point that multithreaded libraries
cannot use c-c without context information.
Various: Might be able to do this by using duplicated groups.

Again, the issue is what responsibility falls on the system and what on
the user.  One possible solution is to require the user to provide an
ordering of c-c.

Functions
---------
  bcast
  barrier
  gather
  global reductions
  cshift/eoshift
  scans
  all-to-all bcast
  index

3.1 Introduction
----------------

Note that non-blocking c-c is not included.

VOTE: Include non-blocking c-c?
----
      Yes: 0  No: uncounted  Abstain: 9

Note that groups carry no topology information.

Weeks proposes adding a perfect shuffle c-c function.  Ranka proposes
adding a permutation c-c function.  Mention of variations on this.

Rik observes that the arbitrary buffer descriptor versions will be
extremely complex in implementation.  He also objected that this cannot
be implemented using p-p.  Marc responded that it is possible using p-p,
because one can send a message to oneself.  Discuss this in detail when
we get to the reduction section.

3.2 (Group Functions) & 3.3
---------------------------

LATER

3.4 Synchronization
-------------------

Tag is removed, so the examples are now wrong and must be replaced.  Jon
wants the examples removed, as they contain semantics that are not
guaranteed.  Marc suggests moving them to an appendix, with the hope that
this would eventually contain a full specification of c-c in terms of
p-p.  Agreed.

VOTE: Remove tag in all c-c?
----
      Yes: 15  No: 5  Abstain: 5

VOTE: Accept just the semantics of 3.4
----
      Yes: 23  No: 0  Abstain: 2

--- break 3:15 p.m. - 3:35 p.m. ---

Ho has a paper on a collective-communication library.  He will make it
generally available.

3.5 (Data move functions)
-------------------------

Otto: max size to shift?  NO

Issues:
  + What if inbuf is outbuf?
  + Periodicity in topology?  [Hempel]
  + Tie topology to cshift  [Hempel]
  + cshift as sendreceive(source, destination, ...)  [Flower]

Allowing inbuf and outbuf to be the same violates F77.
Rik - user-level double buffering is a pain.
Various: back and forth on the responsibility of user vs. system.

Steve Huss: Proposes a cshift with only 1 buffer, disallowing partially
overlapping buffers.
Note that this new cshift will have an INOUT argument.  Does this same
argument apply to other reduction functions?  Have to check one by one.

VOTE: Allow a 2nd cshift with only one buffer.  No partial overlap on the
----  original cshift.
      Yes: 21  No: 0  Abstain: 6

Note that the earlier encompassing vote implies that we will have types
in cshift.  Marc remarks that the change to measuring size in bytes
affects the statement in the text that buffers have the same number of
units.

eoshift
-------

Second single buffer form?  Zero filling - another argument?

VOTE: Accept the cshift and eoshift proposals, with the amendment of a
----  2nd form of each.
      Yes: 20  No: 2  Abstain: 4

bcast
-----

VOTE: Accept bcast
----
      Yes: 22  No: 0  Abstain: 2

gather
------

Note - len is the number of bytes, not what is written.

Long, general, rambling discussion of gather.
Proposal - separate IN argument to gather of sizes; separate functions to
find sizes.
Proposal - a version of gather with a list of outbufs.
Proposal [Flower] - all-to-all gather(???)

Cownie moved to direct gather and scatter back to the subcommittee for
further consideration.  Accepted.

3.6 (Global Compute Operations)
-------------------------------

Issues:
  + inbuf=outbuf problem?  2nd version
  + have types, so don't need (R,I)MAX, etc.
  + return value to all?  3rd version
  + maxloc (etc.) return location and value
  + restriction on user defined functions
  + vectorized user function

Returned to the subcommittee for further work.

scan - nothing to say for today.
correctness - nothing to say for today.

Finished with collective communication for today.

-------------------------------------------------------------------------------
Report From the Process Topology Subcommittee
-------------------------------------------------------------------------------

Rolf Hempel presided.  No presentation, but need to discuss the direction
to go.

First question - is topology going to be part of group management at all?

Rolf remarks that the vast majority of applications have a natural
topology.  There are implementation efficiencies - e.g. avoiding tables
for mapping to processors.

Marc: A user visible mapping of processes to processors is not likely to
be valuable.  The trend in hardware is to hide the hardware topology.

What is the advantage of topology?  Convenience in writing programs when
the topology is natural to the problem.  This information may be useful
for implementation on particular systems.

What do vendors have to say about the utility of such information?
Cownie: The point is to be as flat as possible.
Snir: Topology can certainly be built as a superstructure.  What is the
value of integration?  Convenience, safety.  Safety does not appear
important, and he does not see sufficient value in the convenience.

The major issue is the relation of topologies and groups.  One can store
topology as part of the group (current proposal) or as a superstructure
(Marc's preference.)  It is certainly convenient to have a standardized
method to build e.g. a row group, column group, etc.  This is what is in
the current draft, but the issue is one of integration.  But is it more
efficient, or even substantially more convenient?

Three possibilities: not in MPI; in MPI but not integrated; integrated
into the group mechanism in MPI.  Various repetitive arguments for each
of these positions.

Straw vote: Topologies in MPI?  If yes, integrated with groups (e.g.
----------  eoshift), OR separate library, environmental inquiry.
      In: 25           Out: 4         Abstain: 5
      Integrated: 4    Separate: 23   Abstain: 7

Back to discussion in the subcommittee.
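{What the "separate library" outcome leaves to the user is routine but
fiddly arithmetic.  A self-contained illustration of deriving mesh
coordinates and periodic shift partners from a flat rank - the
computation a standardized row-group/column-group facility would
package; the mesh dimensions here are made up.}

    #include <stdio.h>

    int main(void)
    {
        int nrows = 4, ncols = 5;          /* a 4x5 process mesh      */
        int rank = 7;                      /* e.g. this process       */
        int row = rank / ncols, col = rank % ncols;

        /* east/west partners for a periodic cshift along the row */
        int east = row * ncols + (col + 1) % ncols;
        int west = row * ncols + (col + ncols - 1) % ncols;

        printf("rank %d = (%d,%d): east %d, west %d\n",
               rank, row, col, east, west);
        return 0;
    }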
Meetings tonight:
  p-p   room 1
  c-c   room 2

=============================================================================

Friday, April 2, 1993

-------------------------------------------------------------------------------
Report From the Profiling Subcommittee
-------------------------------------------------------------------------------

Profiling - Jim Cownie
----------------------

Based on draft section ?? of the document.

This is a single level approach and is basically static, i.e. based on
selecting the actual functions at link time.

Questions:
  1) Is one level OK? - the other option is a chain of function pointers
  2) Debugging support?
      - Dump all message envelopes
      - Status of active handles

A single level has no extra cost, but it limits, as one might want to use
the single level for another purpose, e.g. as a network intermediary.
Yes, that limit could be significant.

Provide a multi-level interface in the same manner (i.e. publishing
alternative names for each level)?  [The problem is setting up the
multiple interfaces.]  Provide environmental facilities for exporting a
multi-level dynamic approach?

General agreement that a single level is better than extra cost for
everyone.  Note that the single level actually can support a full dynamic
approach, as the MPI routines can be replaced by functions that do
function pointer swizzling.

Debugging support?
------------------

A small amount of discussion last night.  What is needed and what can be
provided?  Besides the items mentioned above, one needs to be able to
decode the opaque objects in MPI.  It would be useful to have a recording
mechanism.

-------------------------------------------------------------------------------
Report From the Language Binding Subcommittee
-------------------------------------------------------------------------------

Scott Berryman

Gross assumptions:
  + No Fortran 90 binding
  + C++ binding = C binding
  + Specification says nothing about language interoperability

{The confusion bomb went off!}  [sender deals with native message; XDR in
general; underspecified buffer descriptors - incl. lengths or incl. lang.
spec.; general vs. limited translation; know transl. in hetero - don't
know in homo; this is not just a language issue - it is a language
implementation issue]  [Need concrete proposals]

-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee (continued)
-------------------------------------------------------------------------------

Marc Snir

Byte vs. Element Count
  + Need ADD_VEC, ADD_INDEX with byte displacement
  + Most usage will be element displacement

    [R] [III] [R] [III] [R] [III] [R] [III]    (array of records)
    so odd displacement

Have two different components:
  + BLOCK   start
            len - number of elements
            data type
  + VEC     start
            len - total number of elements
            stride - number of elements between blocks
            lenblk - number of elements per block
            data type
  + HVEC    start
            len - total number of elements
            stride - number of BYTES between blocks
            lenblk - number of elements per block
            data type

    [///] [////] [   ] [///] [////] [   ] [///] [////] [   ]

    VEC(..., 5, 3, 2, REAL)
    HVEC(..., 5, 3*size_of_real, 2, REAL)

  + INDEX   start
            array_of_indices - element index (start has index zero)
            type
  + HINDEX  start
            array_of_indices - element displacement in bytes
            type

HVEC and HINDEX are more general, but less convenient and more error
prone.
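{A worked instance of why the byte-stride forms exist, using the
[R][III] record layout sketched above.  The struct is ours, and the HVEC
call in the comment is hypothetical syntax following the slide's
argument order.}

    #include <stdio.h>

    /* The [R][III] record from the slide: one REAL, three INTEGERs. */
    struct rec { float r; int i[3]; };

    int main(void)
    {
        struct rec a[4];
        /* The REAL fields recur every sizeof(struct rec) BYTES, which
         * is not in general a whole number of REALs, so VEC (element
         * stride) cannot describe them.  HVEC (byte stride) can:
         *   HVEC(&a[0].r, 4, sizeof(struct rec), 1, REAL)
         *        start,  len, stride in bytes,  lenblk, type         */
        printf("byte stride between REAL fields: %d\n",
               (int)((char *)&a[1].r - (char *)&a[0].r));
        printf("sizeof(float) = %d\n", (int)sizeof(float));
        return 0;
    }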
Returned length of received messages
------------------------------------

The number of elements IS always meaningful (even for a message
containing multiple types).  And the number of elements sent is always
the same as the number of elements received, while the number of bytes
sent/received need not be the same.  Moreover the number of elements is
not too hard to compute.

Proposal: Use the element count, except for displacements in the Hxx
buffer components.

Organizations present: 22

VOTE: proposal as above
----
      Yes: 19  No: 0  Abstain: 3

Probe
-----

  1) Use to decide where to receive a message (allocate memory).
  2) Use to debug.
Propose to support 1.

Do we need probe?  Why not receive into a system buffer and return a
pointer?  Problems - the system buffer is untyped; memory management.

Proposal:
  MPI_PROBE(source, tag, context, flag, return_status_handle)
  MPI_RETURN_STATUS(handle, len, source, tag)

Assuming no other concurrent receive (single thread), an MPI_RECV
executed with the source/tag returned by PROBE and the same context will
return the message found by the probe.  Multithreaded programs need a
suitable critical region.

What is returned in the len field?  It should be the number of elements,
but unfortunately we don't know this without a buffer descriptor.  So
return the number of bytes?  This may not be useful for deciding the size
of the buffer in a heterogeneous environment.  So the possibilities:
  1) len = -1, and provide DECODE(buff_desc, msg_status_handle)
  2) Number of bytes ("on the wire")
  3) Number of elements (cost of including this info in the envelope)

Could have 2) and also provide DECODE.  The only inconvenience is the
difference in units between DECODE and RECEIVE.  Another alternative is
to provide a data type as part of the PROBE rather than a
buffer_descriptor - this would cover the most common case of uniform
buffer_descriptors.  Cost of rebuilding the buffer_descriptor for the
receive, because one has to add in the pointer for the actual storage
space.  {{???Is this right}}

Oliver objected, so another proposal:
  4) PROBE(source, tag, context, type, return_status_handle) returning
     the number of elements.

Straw Vote: Probe?
----------
      Yes: 23  No: 2

Straw Vote:
----------
      1) probe - no length            8
      2) probe - simple type length  17
      3) probe - byte count          11
      4) probe - element count        6
      5) decode function             22
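{Use 1) in action: a hedged sketch of probe followed by a matching
receive, using the argument lists proposed above.  The byte-count len
(option 2) is assumed for sizing the buffer, and MPI_RECV's signature is
illustrative only.}

    #include <stdlib.h>

    /* Hypothetical declarations following the proposal's argument
     * lists -- a sketch, not a binding. */
    typedef void *mpi_handle;
    extern const int SOURCE_DONTCARE, TAG_DONTCARE;
    void MPI_PROBE(int source, int tag, int context,
                   int *flag, mpi_handle *status);
    void MPI_RETURN_STATUS(mpi_handle status,
                           int *len, int *source, int *tag);
    void MPI_RECV(void *buf, int len, int source, int tag, int context);

    void receive_unknown_size(int context)
    {
        int flag, len, source, tag;
        mpi_handle status;

        /* typed DONTCAREs per the earlier vote */
        MPI_PROBE(SOURCE_DONTCARE, TAG_DONTCARE, context, &flag, &status);
        if (flag) {
            MPI_RETURN_STATUS(status, &len, &source, &tag);
            char *buf = malloc(len);   /* len as byte count (option 2) */
            /* single-threaded, same source/tag/context: this receive
             * returns the message the probe found */
            MPI_RECV(buf, len, source, tag, context);
            /* ... use buf ... */
            free(buf);
        }
    }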
Cancel
------

MPI_CANCEL(handle)

Either the communication succeeds or the CANCEL succeeds, but not both
nor neither.  (This is to cancel a non-blocking send.)

  + Need CANCEL to recover committed resources.
  + Implementation is not trivial.

Is it valid for CANCEL to always fail?  No; if there is a send with no
corresponding posted receive, then CANCEL must succeed.

Recognition that CANCEL may be a very expensive operation.  For example,
it may require an interrupt driven mechanism.  What is the effect (cost)
on normal communication?

Straw Vote: CANCEL?
----------
      Yes: 14  No: 6

Type Mismatch
-------------

Suppose we send 4 bytes and receive an integer, or conversely.  Is this
unsafe or erroneous?  Various proposals: type mismatch is always
erroneous; type mismatch is never erroneous; the BYTE type is an escape
hatch.  Another proposal is to allow type conversion as well.

Straw Vote:
----------
      1. type mismatch always erroneous          2
      2. type mismatch erroneous except BYTE    12
      3. never erroneous                         9

-------------------------------------------------------------------------------
Report on MPI -1.2
-------------------------------------------------------------------------------

Jim Cownie presented.

Procedure clarification: All official votes require a majority of
ABSTENTIONS.

Added Features:
  + Insecure send, for users who lack confidence.
  + Tags are opaque objects.
  + Message data is opaque.
  + All lengths are in bits, measured as floats (for sufficient
    precision).
  + Messages are not passed in envelopes (they are too small), but in
    packing crates.
  + Context proposal: a number server - returns a 64 bit integer
    (expected to last at least 1 week).
  + Group simplification: the maximum number of elements in a group is 1.
  + All communications occur in a group.