Dear Mr. Strohmeier,
dear Professor Meuer,


I am now back in the office and was able to take care of the
"shortening exercise" myself.

As agreed in the telephone conversation with you, Mr. Strohmeier,
I am now sending you our paper, shortened once more.
Compared with the initial twenty pages, we have now arrived at 12 3/4 pages,
which will hopefully not run past page 13 with your formatting
(we have removed or reworked a further two figures
and a good deal of text!).

Many thanks for your accommodation, and best wishes for a weekend
that is not too laden with work,

Wolfgang Nagel

________________________________________________
Wolfgang E. Nagel  
                       
Forschungszentrum Juelich (KFA)
Zentralinstitut fuer Angewandte Mathematik (ZAM)
D-52425 Juelich

Phone:      (+49)-2461-61-6146
Fax:        (+49)-2461-61-6656

e-mail: w.nagel@kfa-juelich.de
www: http://www.kfa-juelich.de/zam/PT/PT.html       

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
\documentclass[a4paper,12pt]{article}
\usepackage[dvips]{epsfig}
\setlength{\parindent}{0pt}
\addtolength{\textheight}{2cm}
\addtolength{\textwidth}{2cm}
\addtolength{\voffset}{-1cm}
\addtolength{\hoffset}{-1cm}
\begin{document}
\title{VAMPIR: Visualization and Analysis\\
of MPI Resources}
\author{ W.E.~Nagel, A.~Arnold, M.~Weber\\
  Central Institute for Applied Mathematics\\
  Research Centre J\"ulich (KFA)\\
  D-52425 J\"ulich, Germany\\
  (\{w.nagel,a.arnold,m.weber\}@kfa-juelich.de)\\
\\
H.-Ch.~Hoppe, K.~Solchenbach\\
PALLAS GmbH\\
Herm\"ulheimer Str. 10\\
D-50321 Br\"uhl, Germany\\
(\{hch,karls\}@pallas.de)}
\date{}
\maketitle

\newif\ifcolor
\colorfalse

\begin{abstract}
  \noindent Performance analysis is most often based on detailed
  knowledge of program behavior. One way to obtain this information is
  tracing. Based on the research tool {\it PARvis}, the visualization 
  environment {\it VAMPIR} was developed at KFA which now supports
  the new message passing standard {\it MPI}. {\it VAMPIR} translates 
  a given trace file into a variety of graphical views, e.g., 
  state diagrams, activity charts, time-line
  displays, and statistics.  Moreover, it supports an animation mode
  that can help to locate performance bottlenecks, and it provides
  flexible filter operations to reduce the amount of information
  displayed. The most interesting part of {\it VAMPIR} is the powerful 
  zooming feature, which allows problems to be identified at any level of detail.
\end{abstract}
\section{Introduction}

On massively parallel computer systems, performance analysis and
debugging can become an extremely complicated process. Over the years,
experience has shown that user-friendly tools supporting this process
are extremely helpful and can drastically shorten time-to-solution
for a given problem. 
The complications arise because traditional methods used on
sequential computers, such as profiling or step-by-step execution in a debugger,
either deliver too little information or are too intrusive.  A method
that has proven useful to a certain degree is {\em tracing}.  The
structure of a typical tracing system is shown in Fig.~\ref{Tracing}.

\begin{figure}[tbp]
  \begin{center}
    \leavevmode \epsfig{file=color/tracing.eps, width=10cm}
    \caption{Principle of tracing}
    \label{Tracing}
  \end{center}
\end{figure}

Tracing is based on {\em instrumenting} a program before it is executed. 
Instrumentation extends user-specified parts of the program so
that data records are written to a log whenever these parts are executed. 
The records usually contain a time stamp, the number of the processor that 
generated the record, an event type identifier, and a list of additional
parameters depending on the event's type.  Events that can be
instrumented include subprogram entries and exits and the sending or
receiving of a message.  Intrusion is reduced by writing the data records
into a buffer located in the processor's local memory.  I/O activity only
takes place when this buffer overflows and has to be flushed to disk.  After
the program ends, the individual record streams are merged into a single stream
that is sorted chronologically.  Analysis can then be done off-line.
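
As a minimal illustration of this scheme (not VAMPIR's actual trace
format), the following Python sketch buffers records per processor, flushes
the buffer only when it is full, and merges the per-processor streams
chronologically. The names {\tt TraceRecord}, {\tt TraceBuffer} and
{\tt merge\_streams}, and the in-memory list that stands in for the trace
file on disk, are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Iterable, List, Tuple


@dataclass(order=True)
class TraceRecord:
    """One record: time stamp, processor number, event type, parameters."""
    timestamp: float
    processor: int
    event_type: str = field(compare=False)
    params: Tuple[Any, ...] = field(default=(), compare=False)


class TraceBuffer:
    """Per-processor buffer; records reach 'disk' only on overflow or flush."""

    def __init__(self, processor: int, capacity: int) -> None:
        self.processor = processor
        self.capacity = capacity
        self.buffer: List[TraceRecord] = []
        self.disk: List[TraceRecord] = []  # stands in for the local trace file
        self.flushes = 0

    def record(self, timestamp: float, event_type: str, *params: Any) -> None:
        if len(self.buffer) >= self.capacity:
            self.flush()  # I/O takes place only when the buffer is full
        self.buffer.append(
            TraceRecord(timestamp, self.processor, event_type, params))

    def flush(self) -> None:
        if self.buffer:
            self.disk.extend(self.buffer)
            self.buffer.clear()
            self.flushes += 1


def merge_streams(streams: Iterable[List[TraceRecord]]) -> List[TraceRecord]:
    """Merge per-processor streams (each already time-ordered) into one."""
    return list(heapq.merge(*streams))
```

Because every per-processor stream is already sorted by time stamp, a
k-way merge suffices and no global sort is needed.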

The main problem with tracing is the large amount of data it usually generates. 
In particular, when a program is traced for the first time, it is not known
which parts of the program will be of interest; most people enable all 
tracing options, which quite often results in very large trace files.  
Therefore, there is a need for a flexible and powerful tool that enables 
the programmer to quickly get
an overview of the program run without giving up analysis at the level of
single events.

This paper describes the X Window based visualization environment 
{\it VAMPIR} which has been developed at KFA J\"ulich to support 
performance analysis of parallel programs. 
Like most other performance analysis tools available for parallel
computers (e.g., {\it Paragraph} \cite{Int93} or {\it Pablo} \cite{Ree92}),
{\it VAMPIR} is used on a post-mortem basis, and it translates a given
trace file into a variety of graphical system views which provide a
reasonable basis for system understanding and program optimization.
{\it VAMPIR} is based on the visualization environment {\it PARvis} 
\cite{Arn93,NaAr93,NaAr94,Mue95} running on a large variety of
workstation platforms.  It has been extended to 
support additional panels and filter functions for the new 
message passing standard {\it MPI}. 

\section{The Message Passing Interface (MPI)}

The growing interest in parallel computing, and notably in the
message--passing programming model, pushes the demand for a
standardized application programming interface supported by all major
parallel system vendors. Starting in 1993, a group of computer
vendors, library writers and application programmers from the US and
Europe collaborated to design a standard portable message--passing
interface called MPI. The final specification of this interface was
published in May 1994 and updated in June 1995 \cite{MPI95};
\cite{using-mpi} gives a good introduction from the application
programmer's point of view.

A number of portable and vendor--specific MPI implementations have
since been developed, showing that MPI can indeed be implemented
efficiently on the currently available parallel computer
platforms. There are three public--domain implementations
of MPI, and most parallel system vendors have announced MPI
implementations of their own.

MPI draws from a number of other message--passing interfaces,
including IBM's EUI, PVM, Intel's NX, and PARMACS, adding some
advanced features:

\begin{itemize}
  \item Communication modes: MPI supports a multitude of
        point--to--point communication modes, some of which allow
        communication and computation to overlap.
	
  \item Data types: in a message, MPI transmits objects of a specified
        datatype ranging from predefined elementary types to complicated,
        non--contiguous user--defined datatypes. Thus, programs do not
	need to know about datatype sizes, and automatic type conversions
	can be done on heterogeneous systems.
	
  \item Communicators: to isolate different communication spaces, the
        concept of a {\em communicator} was introduced into
        MPI. Messages must be sent and received within the same
        communicator, which allows the communication done by a parallel
        library to be encapsulated and kept separate from that of the application.
  
  \item Collective communication: MPI contains a complete selection of
        global communication operations including broadcast,
	reduction, gather and scatter on arbitrary datatypes.
\end{itemize}

Therefore, MPI enables portable programs and libraries to be
written. Of course, mere functional portability is not sufficient in
practice: efficiency of an application or library must be the second
focus of interest. In spite of using a standardized interface,
parallel programs will not show equal performance on different
hardware platforms, just as sequential programs do not. Thus, careful
adjustments (performance tuning) are necessary to optimize a
parallel application for a given parallel system.

For parallel programs on massively parallel systems, performance
tuning is much more complicated than in the sequential case, because
additional system parameters like the ratio of computational power to
communication speed come into play, and currently no automatic tools
analogous to optimizing compilers are available. To reap the maximum
benefit from MPI, powerful and easy-to-use performance analysis and
visualization tools are of increasing importance.

Users working with different message-passing libraries on several
parallel systems will not have the time to fully understand and tune 
their message-passing codes for every platform. 
With the dissemination of {\it MPI} this will probably change.
The powerful features of {\it MPI} offer a degree of flexibility 
that makes it possible to get the maximum performance from any kind of
parallel hardware supporting message passing. To achieve this 
conveniently, users will ask for tools that can display the communication
structure of their programs at almost every time scale.
{\it VAMPIR} will have the functionality to satisfy these demands.


\section{The VAMPIR Environment}

Performance analysis and program optimization are often based on
different categories of system views (Fig.~\ref{systemview},
\cite{Mue95}):

\begin{figure}[!bp]
  \begin{center}
    \leavevmode
    \epsfig{file=color/systemview.eps, width=10cm}
    \caption{The VAMPIR visualization options}
    \label{systemview}
  \end{center}
\end{figure}

\begin{itemize}
\item single time system snapshots: panels that show system activities
at a particular point of time;
\item animation: option to look at a sequence of system snapshots to
investigate the dynamic behavior;
\item statistics: the component that summarizes system behavior for the
time under investigation;
\item time-line system view: detailed view of system activities, which
are visualized on a time axis.
\end{itemize}

Each category is supported by the VAMPIR environment; the current
prototype can generate traces on the Intel iPSC/860, Intel Paragon and
CRAY T3D systems, whereas the product version will work for any
standard--compliant MPI implementation.

For user convenience, VAMPIR provides a configuration file where user
preferences (color, layout, fonts, etc.) are stored between runs. This
file enables the tool to start up with exactly the settings of a
previous session, and different configurations may be saved and loaded
at will. A detailed description of all VAMPIR features can be found in
\cite{Arn93,ArRo95,Mue95}.

\par {\it VAMPIR} is implemented in ANSI C
and uses the OSF/Motif libraries. The current implementation already
supports a variety of different hardware platforms (IBM RS/6000, Sun
Sparc, DEC MIPS computers (Ultrix), DEC Alpha, HP, and Silicon
Graphics).

\section{Program Instrumentation}

The MPI standard \cite{MPI95} specifies a profiling interface
that every standard--compliant MPI implementation must provide. Using this
interface, wrapper routines can be interposed that trace the execution
of every MPI routine.
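
The interposition idea can be sketched in Python as a loose analogy; the
real mechanism relies on the name-shifted {\tt PMPI\_} entry points defined
by the MPI standard, not on Python decorators. The function
{\tt pmpi\_send} below is a hypothetical stand-in for a library routine,
and {\tt traced}, {\tt trace\_log} and the record layout are likewise
assumptions made for illustration.

```python
import functools
import time
from typing import Any, Callable, List, Tuple

# (time stamp, event type, routine name); layout chosen for illustration only.
trace_log: List[Tuple[float, str, str]] = []


def traced(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Build a wrapper that logs entry and exit around the wrapped routine."""
    def decorate(real_routine: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(real_routine)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            trace_log.append((time.perf_counter(), "enter", name))
            try:
                return real_routine(*args, **kwargs)
            finally:
                # The exit record is written even if the routine raises.
                trace_log.append((time.perf_counter(), "exit", name))
        return wrapper
    return decorate


def pmpi_send(payload: bytes, dest: int) -> int:
    """Hypothetical stand-in for the library's real send routine."""
    return len(payload)


# The application keeps calling 'mpi_send'; the call now reaches the
# wrapper first, which forwards it to the real routine.
mpi_send = traced("MPI_Send")(pmpi_send)
```

The application code itself is unchanged; only the name it calls is bound
to the tracing wrapper, which is what makes this approach independent of
the underlying implementation.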

VAMPIR provides a TraceGenerator library on top of this profiling
interface to generate traces of MPI communicator, point--to--point and
collective communication routines. This part of VAMPIR can work with
each standard--compliant MPI implementation, and supports both C and
Fortran~77 applications.

To trace additional information like subroutine entry/exit, the {\em
PARvis.inst} instrumentation tool for Fortran~77 has been developed at
KFA J\"ulich based on the {\em Paff} \cite{Ber89} preprocessor. The
command

\begin{center}
\tt PARvis.inst [options] file\_name [file\_name]
\end{center}

automatically instruments the Fortran~77 programs specified on the
command line. Flexible options are provided to generate wrapper
routines for system and application routines, and tracing of a particular
routine can be switched off by just marking that routine as
non--traceable. Control directives are supported to start and stop the
trace gathering, and in addition an upper bound on the tracefile
length can be specified by the user. All directives start with the
prefix \verb|CKFA$ TRACE|.
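
The effect of such start/stop controls and of an upper bound on the trace
length can be sketched as follows. This is an illustrative Python model, not
the PARvis.inst directive syntax: the class {\tt BoundedTracer} and its
method names are hypothetical, and silently dropping records once the bound
is reached is an assumption, since the behavior at the limit is not
specified here.

```python
from typing import List, Tuple


class BoundedTracer:
    """Trace gathering that can be switched on and off and is size-limited."""

    def __init__(self, max_records: int) -> None:
        self.max_records = max_records          # upper bound on trace length
        self.active = False                     # tracing is off until started
        self.records: List[Tuple[float, str]] = []
        self.dropped = 0                        # records lost to the bound

    def start(self) -> None:
        self.active = True

    def stop(self) -> None:
        self.active = False

    def log(self, timestamp: float, event: str) -> bool:
        """Store one record; return True only if it was actually kept."""
        if not self.active:
            return False
        if len(self.records) >= self.max_records:
            self.dropped += 1                   # assumption: drop, do not fail
            return False
        self.records.append((timestamp, event))
        return True
```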

For C applications, a library interface to the TraceGenerator will be
supplied that allows instrumentation instructions to be inserted either
manually or with the help of the C preprocessor.


\section{Visualization of System Activities}

The system activities can be visualized in the 
{\it Global\_Display/Node Style} display 
(Fig.~\ref{sysact}). 
Here, every processor is displayed as a box. The size and arrangement of the
boxes depend on the number of processors and the geometry of the 
window and are automatically calculated by {\it VAMPIR}. Each box is
partitioned into a lower and an upper part. The lower part describes
the current activity on the nodes, whereas the upper part (called {\it
  statistics field}) shows the time portion (in percent) spent on a
particular activity for the period under investigation (here: {\it
  Calculation}).  For monitoring purposes, the background color
reflects the current value displayed, and the corresponding
percentage values are listed on the right.

\begin{figure}[htbp]
  \begin{center}
    \leavevmode
    \ifcolor
     \epsfig{file=realcolor/sysact.eps, width=15cm}
    \else
     \epsfig{file=color/sysact.eps, width=15cm}    
    \fi
    \caption{System activity snapshot at a single point of time}
    \label{sysact}
  \end{center}
\end{figure}


Fig.~\ref{sysact} represents one example of an actual system snapshot
at a particular point in time. The {\it Step} button in the {\it VAMPIR}
{\it Move Control} area can be used to step through the changes in system activity.
Typically, the number of events to be displayed is rather large, so
the animation mode can be used to animate the sequence of system
snapshots. The step width for the animation mode can be either an
event or a given time period; the time difference between two
movements (i.e., the animation speed) and the number of movements
after which the animation should stop can be adjusted in the panel
{\it Settings/Steppings}. This animation feature can be used to
analyze the program behavior in time, to identify critical program
sections, and to find the hot spots of the run.
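
The two step widths can be sketched as a small Python generator that yields
the successive snapshot times. The function name {\tt animation\_steps} and
its parameters are hypothetical, and the pause between two movements (the
animation speed) is left out; this only illustrates the stepping logic
under the assumption that the event time stamps are available as a sorted
list.

```python
from bisect import bisect_right
from typing import Iterator, List, Optional


def animation_steps(event_times: List[float], start: float, *,
                    time_step: Optional[float] = None,
                    max_steps: Optional[int] = None) -> Iterator[float]:
    """Yield successive snapshot times.

    With time_step=None the animation advances event by event; otherwise
    it advances by a fixed time period. event_times must be sorted.
    """
    current = start
    end = event_times[-1] if event_times else start
    steps = 0
    while max_steps is None or steps < max_steps:
        if time_step is None:
            i = bisect_right(event_times, current)  # first later event
            if i == len(event_times):
                return                              # no events left
            current = event_times[i]
        else:
            if current >= end:
                return                              # end of trace reached
            current = min(current + time_step, end)
        steps += 1
        yield current
```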


\section{Statistics}
\par The Node display mode already contains a small statistics field,
but due to its limited size only the time portion of {\it one} state
can be monitored. Quite often, one would like to get a more detailed
idea of how the time is spent on each of the nodes. To analyze the
complete state distribution, it is possible to switch the display mode
to the {\it statistics display}. Press F6 or select the menu option
{\it Global\_Display/Chart Style}, and another window will come
up (Fig.~\ref{stat123}), which shows
statistics for the complete trace file in pie chart style. The colors
chosen for the individual states are the same as those
used as the background color for the state field in the CPU display.
The most important activities can be identified for all nodes, and
differences in node behavior become apparent immediately. 
As can be seen from that panel, most time is
spent in {\it Calculation} on all nodes, and significant portions of
time are also spent at a barrier.
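
How such a state distribution can be derived from a trace is sketched
below in Python. The input format, a list of
{\tt (node, state, start, end)} intervals, and the function name
{\tt state\_percentages} are assumptions for illustration; the source does
not describe how VAMPIR stores state information internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Interval = Tuple[int, str, float, float]  # (node, state, start, end)


def state_percentages(intervals: Iterable[Interval]) -> Dict[int, Dict[str, float]]:
    """Return, per node, the percentage of traced time spent in each state."""
    totals: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for node, state, start, end in intervals:
        totals[node][state] += end - start

    result: Dict[int, Dict[str, float]] = {}
    for node, per_state in totals.items():
        node_total = sum(per_state.values())
        result[node] = {
            state: (100.0 * spent / node_total if node_total > 0 else 0.0)
            for state, spent in per_state.items()
        }
    return result
```

Each inner dictionary corresponds to one pie chart of the display; the
single value shown in the statistics field of the node display is one entry
of it.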

\begin{figure}[t]
  \begin{center}
    \leavevmode
    \epsfig{file=color/stat123.eps, width=13cm}    
    \caption{Time distribution statistics for the program run}
    \label{stat123}
  \end{center}
\end{figure}

\ \\
When many CPUs are involved in a parallel system, the individual
statistics in this display can become very small and uninformative. To
remedy this, {\it VAMPIR} can open additional windows
containing statistics for only {\it one} CPU. To select the CPUs you
want statistics for, simply click on them with the left mouse button,
and their frame color will be inverted. You can also drag over
several CPUs to select them in one action. The example
in Fig.~\ref{stat123} shows, for node 2, the time distribution within
user subroutines ({\it Calculation}, pie chart in the left sub-window:
most time is spent in subroutine VELO), node
communication ({\it Communication}, histogram in the upper right
sub-window), and Paragon emulation ({\it Paragon}, histogram in the
lower right sub-window). The user can toggle
between table, pie chart, and histogram in all chart windows. The
histograms may be linear or logarithmic, and zooming is supported.


\section{Time-line Displays}

Based on the data visualization options presented above, we now
concentrate on the interaction of parallel activities and possible
bottlenecks. At this point, the user is interested in seeing a
sequence of activities on all nodes, and the interdependences between
these different program parts.


\par The problem with most other visualization tools like
{\it Paragraph} \cite{Int93} or
{\it Pablo} \cite{Ree92} is that they
are based on the {\it Replay Technique}:
whenever the user wants different information about a particular
part of the program, the whole trace file is analyzed once again, even
if the file contains several hundred Mbytes (see
Fig.~\ref{zoom1}). The magnifying glass has
to scan the whole trace file again whenever the user would like
to see different information or just another time frame.\par 

\begin{figure}[tbp]
  \begin{center}
    \leavevmode
    \epsfig{file=color/zoom1.eps, width=9cm}    
    \caption{Zooming and the {\it Replay Technique}}
    \label{zoom1}
  \end{center}
\end{figure}

\par This is different in the
{\it VAMPIR} environment: here, the user
can specify the size of the magnifying glass, and all details within
it can be seen without any further I/O activity
(Fig.~\ref{zoom3}). For example, statistics for
all activities inside the chosen time window can be generated within
milliseconds. Moreover,  the user can use a powerful zooming feature to
analyze the program behavior on any level of detail; each zoom-operation
also takes only a few milliseconds, even if several Mbytes of tracing
information are under investigation. Of course, a hierarchical
{\it unzoom}-operation is provided for user
convenience.\par 
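
One plausible way to obtain such behavior is to keep the time-sorted
events in memory and to locate a time window by binary search, so that
zooming never touches the trace file again. The Python sketch below
illustrates this idea only; the class {\tt TraceWindow} and its methods are
hypothetical, and the source does not state which data structures VAMPIR
actually uses.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
from typing import List, Tuple

Event = Tuple[float, int, str]  # (time stamp, node, event type)


class TraceWindow:
    """In-memory trace with zoom, hierarchical unzoom and window statistics."""

    def __init__(self, events: List[Event]) -> None:
        self.events = sorted(events)            # read and sorted only once
        self.times = [e[0] for e in self.events]
        self.window = ((self.times[0], self.times[-1])
                       if self.times else (0.0, 0.0))
        self.zoom_stack: List[Tuple[float, float]] = []

    def zoom(self, t0: float, t1: float) -> None:
        self.zoom_stack.append(self.window)     # remembered for unzoom
        self.window = (t0, t1)

    def unzoom(self) -> None:
        if self.zoom_stack:
            self.window = self.zoom_stack.pop()

    def visible(self) -> List[Event]:
        """Events inside the current window, found by binary search."""
        lo = bisect_left(self.times, self.window[0])
        hi = bisect_right(self.times, self.window[1])
        return self.events[lo:hi]

    def statistics(self) -> Counter:
        """Number of events per event type inside the current window."""
        return Counter(e[2] for e in self.visible())
```

Locating a window costs two binary searches, which is logarithmic in the
number of events, whereas a replay-based tool has to read the whole trace
again for every new window.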

\begin{figure}[!tbh]
  \begin{center}
    \leavevmode
    \epsfig{file=color/zoom3.eps, width=9cm}    
    \caption{VAMPIR realization: Make zooming as easy as possible}
    \label{zoom3}
  \end{center}
\end{figure}

\begin{figure}[p]
  \begin{center}
    \leavevmode
    \epsfig{file=color/timelall.eps, height=20cm}    
    \caption{Time-line zooming and message identification}
    \label{timel4y}
  \end{center}
\end{figure}


\par In {\it VAMPIR}, the
{\it Global\_Display/Timeline} panel is used
to display this type of information. As can be seen from the upper part
of Fig.~\ref{timel4y}, colors are used to
represent different kinds of activities, and it is possible to show
system activities over time on each of the nodes. In this example, the
program is running in phases where the subroutine VELO is executed
several times. The black parts are hundreds of messages (represented by
one line each) which are sent between the nodes. Based on the
information displayed in this window, it is quite easy to identify
critical program sections where problems may have occurred.\par 

\par The zooming feature can now be used to go into detail. As
shown in the middle part of Fig.~\ref{timel4y},
the period of interest\footnote{The time offset is specified in
the lower left corner of the panel.} (400--560 ms)
was zoomed in on by just specifying the time frame with the mouse.
Here, one of the time-step iterations can be seen, and the load
imbalance causes long synchronization times at the barrier called
GSYNC.\par 

\par The zooming feature can also be used to go deeper and deeper into
the analysis process, to understand program behavior, and finally to
identify problems. The lower part of Fig.~\ref{timel4y} shows a data
exchange phase of the program (at about 525 ms) where different
communication patterns inform the user about the program's communication activities.
In the message passing programming model, communication and data exchange
are solely based upon the sending and receiving of messages. Regardless of
the network's topology (which is hidden from the application programmer in
most cases), it is obvious that the visualization of message transfers and
patterns plays an important role in the performance analysis and debugging
of parallel programs. Therefore, {\it VAMPIR} includes means to display and
inquire information about message-passing transfers. These tools are not
isolated from the other part of {\it VAMPIR}: message events are read
through the same trace file interface into {\it VAMPIR}, and the message
visualization tools work hand-in-hand with the features described so far.
Clicking on a message with the mouse pops up another panel showing all
information related to that message, including the transfer rate in MByte/s
(here, about 20 MByte/s). The information for this message comes from
the wrapper of the {\em MPI\_SEND/MPI\_RECV} communication routine, and the
overhead involved is quite low.  Depending on the instrumentation used, it
is also possible to visualize the communication patterns that higher-level
communication routines like reductions internally use.  The ability to look
into the implementation in this way is a key feature to understand why
programs that use a standardized message-passing library like {\em MPI}
behave differently on different machines. \par

\par Moreover, detailed information about the activities on one node or
a selection of nodes can be obtained. The lower left part of
Fig.~\ref{timel4y} documents that even calls to
{\it gdhigh} (a few microseconds inside the
communication library routine) can easily be identified. A case study on
the Intel Paragon \cite{WiNa94} describes a situation where the
{\it VAMPIR} environment was extremely
helpful in identifying performance bottlenecks in the communication
library; based on the optimization process, the output performance
({\it hippi-output}) was increased by a
factor of more than five within a few hours.\par 

In addition, the zooming operation can be used to identify typical
communication patterns. It is obvious that the visualization of such
communication patterns gives knowledge about implementation aspects of
the system and of your own program, and it is very helpful to
understand synchronization delays and related side effects which
sometimes significantly influence the performance of real
applications.

To evaluate the overall message traffic that took place over a period of
time, a communication matrix can be opened (Fig.~\ref{MP_Stat}) that
shows various statistics for the messages that were passed between
each pair of sender and receiver.  Specifically, the following parameters
can be shown:

\begin{figure}[htbp]
  \begin{center}
    \leavevmode
    \ifcolor
     \epsfig{file=realcolor/mp_stat.eps, width=10cm}
    \else 
     \epsfig{file=color/mp_stat.eps, width=10cm}
    \fi
    \caption{Statistics of the message passing communication rate}
    \label{MP_Stat}
  \end{center}
\end{figure}
\begin{itemize}
\item the total number of messages passed between the processors;
\item the total number of bytes passed between the processors;
\item the maximum, minimum and average length of messages;
\item the maximum, minimum and average data rate that was reached.
\end{itemize}


This display simplifies the detection of unbalanced communication and
of performance losses caused by too many short messages, which usually
result in a low average data rate.
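
The entries of such a matrix can be computed as in the following Python
sketch. The message tuple layout, the names {\tt PairStats} and
{\tt communication\_matrix}, the convention of 1 MByte as $10^6$ bytes, and
the choice to average the per-message rates (rather than dividing total
bytes by total time) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# (sender, receiver, length in bytes, send time in s, receive time in s)
Message = Tuple[int, int, int, float, float]


@dataclass
class PairStats:
    count: int
    total_bytes: int
    min_len: int
    max_len: int
    avg_len: float
    min_rate: float  # MByte/s, with 1 MByte taken as 10**6 bytes
    max_rate: float
    avg_rate: float  # mean of the per-message rates


def communication_matrix(
        messages: Iterable[Message]) -> Dict[Tuple[int, int], PairStats]:
    """Collect message statistics for every (sender, receiver) pair."""
    lengths: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    rates: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for sender, receiver, nbytes, t_send, t_recv in messages:
        if t_recv <= t_send:
            raise ValueError("receive time must be later than send time")
        pair = (sender, receiver)
        lengths[pair].append(nbytes)
        rates[pair].append(nbytes / (t_recv - t_send) / 1e6)

    return {
        pair: PairStats(
            count=len(sizes),
            total_bytes=sum(sizes),
            min_len=min(sizes),
            max_len=max(sizes),
            avg_len=sum(sizes) / len(sizes),
            min_rate=min(rates[pair]),
            max_rate=max(rates[pair]),
            avg_rate=sum(rates[pair]) / len(rates[pair]),
        )
        for pair, sizes in lengths.items()
    }
```

A pair with a high message count but a small average length and a low
average rate is exactly the pattern of many short messages described above.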


\section{Additional Features}

\par {\it VAMPIR} accesses several external tools to perform some of its
tasks. These tools must be located in a directory included in your
{\it PATH} environment variable:

\begin{itemize}
\item {\it lpr} or {\em lp}, the standard UNIX printing facilities, to 
  print lists and window snapshots.
\item {\it import}, a screen snapshot utility from the {\it
    ImageMagick} package to export or print window contents.  {\it
    ImageMagick} is delivered as part of VAMPIR; it can also be downloaded 
    from {\it ftp.zam.kfa-juelich.de}, directory {\it pub/graphics/ImageMagick}.
\item If you have trace files compressed with {\it gzip} or {\it
    compress}, {\it VAMPIR} can extract them automatically if their
  counterparts {\it gunzip} or {\it uncompress}, respectively, are
  available.
\end{itemize}

There are quite a few other enhanced features that cannot be described
in detail in this paper; the most important ones are mentioned
below:
\begin{itemize}
\item filter functions: {\it VAMPIR} can
display up to 512 nodes simultaneously. Typically, this number is
much too large to be handled meaningfully; therefore, powerful filter
functions are available to reduce the number of nodes, either
automatically or manually by the user.
%%\item network activity: communication messages are sometimes moving over
%%the same hardware connections, leading to hot spots on the network. This
%%component is able to display communication patterns on the underlying
%%network.
\item movie support: after each animation step, control is optionally
given back to a user command (e.g., a shell script). This makes it possible to
generate movies unattended, just by specifying a single
command in a sub-panel.
\end{itemize}

\section{Summary and Conclusions}

This paper describes the {\it VAMPIR} environment, which provides
powerful features for exploring parallel program behavior on several
parallel systems such as the Intel Paragon and CRAY T3D. Experience has
shown that, for debugging as well as for performance optimization,
the supported time-line displays in combination with the
statistics features are the strength of the system. With the extremely
flexible zooming function in the time-line displays, analysis
operations are supported which can drastically improve the
understanding of observed performance problems.

VAMPIR is available as a commercial product from PALLAS GmbH; for
further information, see the WWW page \verb|http://www.pallas.de| or
send mail to \verb|info@pallas.de|.


\begin{thebibliography}{MMMMM}

\bibitem[Arn93]{Arn93} A.~Arnold, {\it PARvis: Eine X-basierte
    Umgebung zur Visualisierung von parallelen Programmen in
    Multiprozessorsystemen}, J\"ul-2848, Forschungszentrum J\"ulich
  (KFA), 1993.

\bibitem[ADN95]{ADN95} A.~Arnold, U.~Detert, and W.E.~Nagel,  {\it Performance
optimization of parallel programs: tracing, zooming, understanding}, 
In:~Proc.~Spring 1995 Cray Users Group Meeting, pp.~252--258.

\bibitem[ArRo95]{ArRo95} A.~Arnold and M.~R\"oth, {\it PARvis - an
    X-based visualization environment for parallel programs (User{'}s
    guide)}, Forschungszentrum J\"ulich (KFA), to be published.

\bibitem[Ber89]{Ber89} R.~Berrendorf, {\it Der
    FORTRAN-Parser PAFF als wiederverwendbares Modul f\"ur
    Programmier-Tools}, J\"ul-Spez-537, Forschungszentrum J\"ulich
  (KFA), 1989.

\bibitem[Int93]{Int93} {\it Paragon application tools
    user{'}s guide}, Intel Corporation, 1993.

\bibitem[MPI95]{MPI95} Message-Passing Interface Forum, {\it MPI: A Message--Passing 
                     Interface Standard}, Version 1.1, University of Tennessee, 
                     Knoxville, Tennessee, 1995; Available as {\tt
                     ftp://ftp.mcs.anl.gov/pub/mpi/mpi-1.jun95/mpi-report.ps}.

\bibitem[GLS94]{using-mpi} W. Gropp, E. Lusk and A. Skjellum, 
  {\em Using MPI}. MIT Press, Cambridge, Massachusetts, 1994.

\bibitem[Mue95]{Mue95} Ch.~M\"ullender, {\it
    Visualisierung der Speicheraktivit\"aten von parallelen Programmen
    in Systemen mit virtuell gemeinsamem Speicher}, J\"ul-2911,
  Forschungszentrum J\"ulich (KFA), 1994.

\bibitem[NaAr93]{NaAr93} W.E.~Nagel and A.~Arnold, {\it
    PARvis: Ein Werkzeug zur Visualisierung von parallelen Prozessen
    auf Mehrprozessorsystemen}, Proc. 7. ITG/GI Fachtagung MMB{'}93
  (Kurzberichte und Werkzeugvorstellung) pp.~178--187, 1993.

\bibitem[NaAr94]{NaAr94} W.E.~Nagel and A.~Arnold, {\it
    Performance visualization of parallel programs: The PARvis
    environment}, In:~Proc.~1994 Intel Supercomputer Users Group
  Conference (ISUG{'}94), pp.~24--31.

\bibitem[Ree92]{Ree92} D.A.~Reed, R.A.~Aydt,
  T.M.~Madhyastha, R.J.~Noe, K.A.~Shields, and B.W.~Schwartz, {\it An
    overview of the Pablo performance analysis environment}, Technical
  Report, Dept. of Computer Science, University of Illinois,
  Urbana-Champaign, 1992.

\bibitem[WiNa94]{WiNa94} R.~Williams and W.~E.~Nagel,
  {\it Optimization of output bandwidth from a Paragon}, Technical
  Report CCSF-44, Caltech Concurrent Supercomputing Facilities,
  Pasadena, CA, 1994 (also on WWW:
  \verb|http://www.ccsf.caltech.edu/~roy/hippipap/paper.html|).
\end{thebibliography}


\end{document}