.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m'
.ds TD "\h'-.65m'\v'-.40m'~\v'.40m'
.ds CT "\(mu\h'-.66m'\(ci
.ds NI "\(mo\h'-.65m'/
.nr EP 1
.nr VS 14
.EQ
gsize 11
delim @@
... define | '^\(br^'
... define || '\(br ^ \(br '
define notin '\(mo back 30 / '
define times '\(mu'
define ctimes '\(mu back 80 \(ci ^ '
define hhat ' up 5 hat down 5 '
define bl '\0'
.EN
.TL
Solving Banded Systems on a Parallel Processor *
.AU
.ps 11
.in 0
J. J. Dongarra and L. Johnsson
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'   
Argonne National Laboratory\h'.20i'   
9700 South Cass Avenue\h'.20i'
Argonne, Illinois 60439-4844\h'.20i'
and
Computer Science Department 
P.O. Box 2158
Yale University
New Haven, Connecticut 06520
.nr PS 11
.ps 11
.nr VS 16
.vs 16
.nr PD 0.5v
.FS
* Work supported in part by the Applied Mathematical Sciences
subprogram of the Office of Energy Research, 
U.S. Department of Energy under Contracts W-31-109-Eng-38
and in part by the 
Office of Naval Research under Contract N00014-84-K-0043.
.FE
.ps 10
.AB
.B
Abstract
.R
.PP
In this paper we examine ways of solving dense, banded systems on different 
parallel processors.  We start with some considerations for processors with 
vector instructions, then discuss various algorithms for the solution of 
large dense banded systems on a parallel processor.  We analyze the behavior 
of the parallel algorithms on distributed storage architectures configured as 
rings, two-dimensional meshes with end-around connections (tori), boolean 
n-cube configured architectures, and bus-based and switch-based machines with
shared storage.  We also present measurements for two bus-based architectures
with shared storage, namely, the Alliant FX/8 and the Sequent Balance 21000.

.NH
Introduction
.PP
The solution of banded systems of equations is important in many areas of 
scientific computing. We restrict our attention here to matrices that are 
symmetric positive definite or diagonally dominant.  With the coming of 
parallel processors, algorithms for the solution of these problems have been 
the subject of many studies [].  In this paper we look at an implementation 
that uses a partitioning scheme essentially equivalent to incomplete nested
dissection.  This approach divides the work into sections, reduces the 
sections independently in parallel, performs nearest neighbor communication 
among processors to compute part of the solution, and carries out independent
operations to determine the complete solution to the original problem.
.NH
Sequential Case
.PP
LINPACK [] includes routines to solve banded systems of linear equations.
The matrices can be of a general nature or symmetric positive definite.
The algorithms for the banded problem in LINPACK are based on the vector
operation from the BLAS [] called a SAXPY, (@ y ~ <-~ y ~+~ alpha x @).
This operation forms the computational kernel in the form of a rank-1 update 
of a submatrix of size @m sup 2@ for a banded matrix @A@ of order @N@ and half 
bandwidth @m@ during the @k sup th@ step of the reduction.
.PP
Since much data is updated at each step, this algorithm suffers in terms of 
performance on most sequential computers.  This effect of poor performance 
is most pronounced on vector architectures that have vector registers.
On such machines the largest bottleneck is in memory traffic.
.PP
To substantially reduce memory accesses, the algorithm can be reorganized 
using matrix vector multiplication as the computational kernel, instead of 
rank-1 updates [dgk,de]. This technique reduces the number of 
storage accesses from @O(N m sup 2 )@ to @O(N m)@ through vectorization.
.PP
The table below shows the performance differences from this approach.

.ce
Performance for the CRAY X-MP-1
.ce
@ order ~ 5000@, symmetric positive definite matrix
.sp 2
.TS
center;
c c s s
c c c c
n n n n.
	MFLOPS
	LINPACK	Improved	Improved
Half bandwidth	CAL BLAS	Fortran MV	CAL MV
_

8	4	5	11
16	6	12	27
22	7	17	41
44	12	40	84
62	15	54	108
128	26	86	141
200	38	103	149
.TE

.PP
We can exploit a modest amount of parallelism by performing, say, the 
matrix-vector operations in terms of independent inner products. 
An additional amount of parallelism can be exploited by factorizing the 
matrix from both the top and the bottom simultaneously.   When the 
factorization reaches the middle, a small block matrix must be factored. 
The back-substitution then proceeds from the middle and works toward the 
top and bottom in parallel.  It is important to note that the number of 
floating-point operations is exactly the same in this approach as in the 
conventional factorization.  (This two-way factorization resurfaces from 
time to time. Jim Wilkinson describes this as an approach used in the early 
days of computing to reduce overhead associated with the looping [].
LINPACK uses this algorithm to solve symmetric positive definite
tridiagonal systems of equations. Evans and Hatzopolus [] describe
the algorithm as a folding technique.)
.PP
We consider this algorithm as the best we can do on a sequential computer.
The amount of work, in terms of operations performed, is 
@ m(m+1)(N - (2m+1)/3)@ additions and multiplications for the factorization of 
symmetric positive definite matrices and @ m sup 2 (2N-1) - (4m sup 2 -1)m /3 @
for an arbitrary banded matrix of size @N@ and half bandwidth @m@.  The number
of additions/subtractions and multiplications for each solve is 
@2m(2N - m - 1)@.
.NH
Parallel Algorithms
.PP
For computer architectures that provide a large number of processors, 
algorithms can be employed that exploit the independence of operations 
in eliminating single variables and the independence of operations in 
eliminating different variables [Wing-80, Johnsson-80c, Gentleman-82, 
Liu-85, Johnsson-85a, johnsson-85d, Worley-85, George-86].  For a narrow
banded matrix the potential concurrency is largely due to the 
independence of operations in eliminating different variables, whereas 
for a matrix with a half bandwidth equal to @sqrt n @ or larger the 
concurrency due to the independence of operations in the elimination 
of a single variable dominates.  
.PP
If only the complexity of the parallel arithmetic is considered, as would be 
reasonable for a shared-memory machine with very high communication bandwidth
which allow conflict-free access
to data, then the speedup from concurrent elimination of single
variables is linear.  The speedup is measured as the time for sequential
elimination divided by the time for concurrent elimination.  On the
other hand, the speedup from exploiting the independence of operations in 
eliminating different variables always is sublinear, since more
operations are required to find the solution.  The maximum
speedup in the latter case is in the range @ 1/2(N/m)/log(N/m) @ to 
@ (N/m)/log(N/m)@ for @N/m@ processors and a minimum efficiency of 
@O(1/log(N/m))@.  In the former case the maximum speedup is @m sup 2@ 
and the efficiency equal to @1@ [johnsson-85a].
.PP
The graph representation of a banded matrix constitutes a @ perfect 
~elimination~ graph @ [rose-73], and Gaussian elimination without 
pivoting constitutes a perfect elimination order.  Hence, there is no 
@ fill-in @.  Concurrent elimination of different variables, on the 
other hand, generally does not constitute a perfect elimination order; 
the total number of arithmetic operations performed and the storage 
required by the concurrent algorithm is higher than for the sequential 
algorithm. However, the time to solve the banded system should be 
greatly reduced for sufficiently many processors.  In the case of a 
tridiagonal system, odd-even cyclic reduction requires approximately 
@17N@ operations compared to @8N@ for Gaussian elimination, but the 
parallel arithmetic compleixty for cyclic reduction is @ 11 logN@ in 
@2logN@ steps (or @12 logN@ in @ logN@ steps [Hockney-81]).  
Classical Gaussian elimination is a sequential method and requires
@8N@ operations in @2N@ steps.  The two-way Gaussian elimination is 
also a perfect elimination order.  The parallel arithmetic complexity 
is @ 4N@ during @N@ steps.
.PP
Unfortunately, for many architectures, ignoring the cost of
communication and the effects of vector features and storage hierarchy
is not a good approximation of reality.  Communication bandwidth,
overhead in communication, routing conflicts, and bank conflicts, as 
well as overhead for vector instructions (should such be available), 
are important factors in choosing an optimum algorithm.  We will
focus on the 
communication issue.  We carry out a simple analysis for two algorithms 
exploiting the independence of operations in eliminating a single
variable, and one algorithm exploiting the independence of operations 
in eliminating different variables.  The communications overhead 
(startup) is @ tau  @ and the transmission time for a floating-point 
number is @ t sub c @.  We discuss ring, mesh, hypercube and 
shared-memory architectures.                                   

.SH 
4. Concurrent Elimination of a Single Variable
.SH
4.1 One-dimensional partitioning

.PP
A one-dimensional partitioning of the matrix in the column or row direction is 
feasible in the case of a ring of processors, as shown in Figure 2.  

.cs 1 24
.ps 8
.nf
            _____     _____    _____                     _____
            | M |     | M |    | M |                     | M |
            |...|     |...|    |...|                     |...|
         ___| P |_____| P |____| P |__  .  .  .  .  .  __| P |__
         |  |___|     |___|    |___|                     |___|  |
         |______________________________________________________|

.fi
.ps 12
.cs 1
     Figure 2.  A ring of processors with distributed storage

.PP
The partitioning can be done either @ cyclically @ or @ consecutively 
@ [johnsson-85c].  In cyclic partitioning by columns, column @ j @ 
is allocated to processor @ j mod P @, where @ P @ is the number of 
processors labeled from @ 0 @ to @ P-1@.  In consecutive partitioning,
column @ j @ is assigned to processor @ (jP/(m+1))mod P @ (assuming 
for simplicity that @m+1@ is a multiple of @ P @).
Cyclic and consecutive partitionings are illustrated in Figure 3.

.cs 1 24
.ps 8
.nf
      -------------                   -------------
      | | | | | | | \e                 |   |   |   | \e
      | | | | | | | | \e               |   |   |   |   \e
      |0|1|2|0|1|2|0|1|               | 0 | 1 | 2 | 0 |
      | | | | | | | |                 |   |   |   |   |
      \e | | | | | | |                 \e   |   |   |   |
        \e | | | | |                     \e |   |   |   |
          \e | | | |                       \e   |   |
            \e | | |                         \e |   |
              \e | |                           \e   |
                \e |                             \e |
                  \e                               \e

                                                        
          Cyclic                         Consecutive   

.fi
.ps 12
.cs 1
.ce
Figure 3.  Cyclic and Consecutive one-dimensional partitioning

.PP
Consecutive partitioning is akin to a block-oriented algorithm.  It
leads to a lower utilization of the array, but requires fewer 
communications in a packet-oriented mode of operation where a 
block corresponds to a packet.  
.PP
In cyclic partitioning a processor first computes the pivot column of 
the left factor @ L @, then performs its part of the rank-1 update of 
the @m@ by @m@ submatrix; in other words, it updates @(m+1)/P-1@ 
columns of length @m@.  After the processor storing the pivot column 
has computed a new column of @L@, it transmits the column to the processor
with the next higher index modulo @P@ (to the right).  The receiving 
processor forwards the column to the next processor, and performs a 
rank-1 update on @(m+1)/P@ columns of length @m@.  Assuming a time 
@t sub a@ for addition/subtraction and multiplication and a time @t sub d@
for division, and no vector features in the processors, the arithmetic 
complexity is approximately @(t sub d + (2(m+1)/P +1)mt sub a )N @ for a
nonsymmetric matrix.  The communication complexity is approximately 
@ (mt sub c + tau )N @.  We assume that the architecture is of the 
@ MIMD @ type [flynn-66] and that communication and arithmetic
operations are pipelined.  The speedup is approximately @ P @ if only 
the arithmetic complexity is considered, but may be poor if the 
communication complexity dominates, i.e., if @ (mt sub c + tau )N >> 
(t sub d + (2(m+1)/P +1)mt sub a )N @ then @P >> 
2 m sup 2 t sub a /(mt sub c + tau ) @.

.PP
In consecutive partitioning the processor holding the pivot block 
factors it and transmits the factor to the "next" processor in a way  
analogous to the cyclic partitioning.  The arithmetic complexity is 
approximately @(t sub d + (3(m+1)/P-P)(2m+1)/2Pt sub a )N @ for an 
unsymmetric matrix and the communication complexity 
@(mt sub c + (P/m) tau ) N @.
Pipelining of arithmetic and communication operations is assumed for 
these complexity estimates. The arithmetic complexity is approximately 
@ 1.5 @ times that of cyclic partitioning for @ P << m @, but the number
of startups in communication is reduced by a factor of @ P/m @.  

.SH
4.2 Two-dimensional partitioning

.PP
In the case of a two-dimensional mesh-configured set of processors, a 
two-dimensional partitioning of the matrix can be beneficial 
[johnsson-84d].  This partitioning can also be made cyclically or 
consecutively.   We assume for simplicity that the matrix has the same
number of super- and subdiagonals, and that the processor mesh has 
@ sqrt { P } @ nodes in each coordinate direction.  Figure 4  
illustrates the matrix partitionings.


.cs 1 24
.ps 8
.nf
-------------------                      ----------------------\e
|00|01|02|00|01|02| \e                    |      |      |      |  \e
|--|--|--|--|--|--|--\e                   |  00  |  01  |  02  | 00\e
|10|11|12|10|11|12|10| \e                 |      |      |      |    \e
|--|--|--|--|--|--|--|  \e                |------|------|------|------\e
|20|21|22|20|21|22|20|21|\e               |      |      |      |      | \e
|--|--|--|--|--|--|--|--|  \e             |  10  |  11  |  12  |  10  | 11\e
|00|01|02|00|01|02|00|01|02|\e            |      |      |      |      |     \e
|--|--|--|--|--|--|--|--|--|  \e          |------|------|------|------|------\e
|10|11|12|10|11|12|10|11|12|10|\e         |      |      |      |      |      | \e
|--|--|--|--|--|--|--|--|--|--| \e        |  20  |  21  |  22  |  20  |  21  | 22\e         
|20|21|22|20|21|22|20|21|22|20|21|\e      |      |      |      |      |      |     \e
|--|--|--|--|--|--|--|--|--|--|--| \e     |------|------|------|------|------|------\e
 \e|01|02|                                 \e    |      |
  \e-|--|                                   \e00 |  01  |
   \e12|                                     \e |      |
    \e-|                                      \e------|

                                                        
          Cyclic                                           Consecutive   
                                                          

.fi
.ps 12
.cs 1
.ce
Figure 4.  Cyclic and consecutive two-dimensional partitioning

.PP
In cyclic partitioning the processor holding the pivot element sends it in the
column direction (or row direction) to compute a set of elements of @ L @, 
then proceeds to compute the elements of a column of @ L @ that it stores 
(@ (m+1)/ sqrt { P } -1 @).  The processor that receives the pivot element 
forwards it and computes @ (m+1)/ sqrt { P } @ elements of the column of @ L @ 
that corresponds to the pivot element.  The communication is confined within 
columns.  As the elements of @ L @ are computed, they are sent to the "next" 
processor within the row.  The mesh of processors performs a rank-1 update;
each processor updates @(m+1) sup 2 /P @ elements.  The arithmetic complexity 
for the cyclic partitioning is approximately 
@ (t sub d + (2(m+1)/ sqrt { P } + 1)(m+1)/ sqrt { P }t sub a )N @ for an
unsymmetric matrix, and the communication complexity is
@ ((m+1)/ sqrt { P } t sub c + 2 tau )N @ assuming concurrent 
communication in the row and column directions.  If communication is 
restricted to one direction at a time,  then the data transfer time is 
doubled and there are a total of @ N @ additional startups.  
The arithmetic complexity is approximately the same as in the 
one-dimensional cyclic partitioning, but the communication complexity is 
different.  The number of elements transmitted over a channel between a pair 
of processors is reduced by a factor of @ sqrt { P } @, but the number of 
communication startups is doubled.
.PP
In the case of two-dimentional consecutive partitioning the processor holding 
the pivot block factors it and passes the left factor to the next processor 
(right) in the same row, and the right factor the next processor (down) in the 
same column.  The receiving processors forwards the factors and performs a 
forward solve.  Then a block rank-1 update, a rank @ (m+1)/ sqrt { P } @
update, is performed by all processors.  The arithmetic complexity of 
this block algorithm is approximately @ (t sub d +(11(m+1) sup 2 /P -9(m+1)/ sqrt { P } +1)/3 t sub a )N @ 
and the communication complexity 
@ ((3(m+1)/ sqrt { P } - 1)/2 t sub c + 2 tau sqrt { P } /(m+1) )N @.
The arithmetic complexity is approximately twice (@ 11/6 @) that of 
cyclic partitioning.  The number of startups is reduced by a factor 
of @(m+1)/ sqrt {P}  @ compared to the two-dimensional cyclic 
partitioning, and by a factor of @ sqrt {P} /2 @ compared to the
one-dimensional consecutive partitioning.  The time for transfer of 
matrix elements is approximately @1.5@ times that of the cyclic 
two-dimensional partitioning.

.SH
4.3 Boolean n-cube
.PP
In a boolean n-cube multiprocessor [seitz-85] the processing nodes from
the corners of a n-dimensional boolean cube.  Each node has a fanout of 
@ n @, the diameter is @ n @,  the average distance between nodes 
@ n/2 @,  the total number of edges @ n 2 sup (n-1) @, and the number 
of disjoint paths between any pair of nodes @n@ [saad-85a] 
[johnsson-85c].  A boolean cube can simulate a ring and two-dimensional 
mesh preserving the nearest neighbor connection property by using a 
binary-reflected Gray code [Reingold-77], as observed for example by 
[johnsson-85c], [McBryan-85], and [Fox-85b].  Of the partitioning 
strategies discussed above, consecutive two-dimentional partitioning is 
preferable if the startup times dominate the time for arithmetic, 
whereas cyclic two-dimentional partitioning is preferable if startup 
times can be ignored.  
.PP
In the n-cube there is also the potential for reducing the data transfer
time by using the additional communication paths between a pair of 
nodes [saad-85a] [johnsson-85b] [Ho-85k].  Moreover, a boolean n-cube 
can simulate many other graphs than rings and meshes with no or at 
most a constant slowdown.  In particular, complete binary trees 
[johnsson-85b] [bhatt-85a] [deshpande-86] can be embedded effectively, 
and algorithms based on the divide-and-conquer strategy, such as (block)
cyclic reduction and nested dissection based elimination schemes, are 
candidates for efficient boolean n-cube algorithms.   The partitioning 
algorithm described in the next section belongs in this category of algorithms.
.PP
Since the partitioning method leads to a reduced block tridiagonal
system, we now indicate how a (block) tridiagonal system can be solved 
effectively on a boolean n-cube.  The reader is referred to 
[johnsson-84c][johnsson-85e2] for details.  The mapping of the block 
rows makes use of the binary-reflected Gray code for the initial
assignment of partitions to processors.  A block cyclic reduction 
algorithm is used to exploit concurrency.  By the Gray code embedding 
and the first reduction step only implies nearest neighbor
communication.  It can be shown that the communication in subsequent 
steps is always between processors at distance @ 2 @ from each other.
This property can be used for an @ in-place @ algorithm.  It is also 
possible to divide the distance two communication into two nearest 
neighbor communications such that after an exchange step the 
equations participating in the next reduction step are in a subcube of 
half size.  Hence, during the reduction process subsets of equations 
are recursively assigned to subcubes such that any subset is mapped to 
a subcube in a binary-reflected Gray code order.  The recursive
assignment requires a simple exchange operation between certain 
adjacent processors.  Each processor can compute whether it will 
perform an exchange operation and with what processor from its address 
and the reduction (backsubstitution) step to be performed.  The same 
local information suffices to determine the communication for the reduction
(backsusbstitution) operations themselves.  The exchange algorithm moves even 
equations to even processors ( and odd equations to odd processors).  
The considered bit-field is reduced by one for each reduction step.
One step of the exchange algorithm is illustrated for a 3-cube in Figure 5.  


.cs 1 24
.ps 8
.nf
                      7---------- 6          6---------- 7
                     /|        /            /|        /
                    / |       / |          / |       / |
                 0 /__|_____1/  |       0 /__|_____1/  |
                  |  4|_____|___|5       |  4|_____|___|5 
                  |  /      |  /         |  /      |  /
                  | /       | /          | /       | /
                  |/________|/           |/________|/
                  3         2            2         3
                 

.fi
.cs 1
.ps 12
                              Exchange


.ce 2
Figure 5.  Recursively assigning equations to nodes through 
exchange operations preserving adjacency

.PP
In the n-cube algorithms based on binary-reflected Gray codes, full advantage
can be taken of truncated cyclic reduction.  Each reduction step is carried
out on all relevant equations during the same time step.  Hence, truncating the
reduction after @ r < n @ steps reduces the total time proportionally.
.PP
The block cyclic reduction algorithm as outlined above exploits the independence
of operations in the elimination of variables corresponding to different blocks.
It is also possible to exploit the independence of operations for the elimination 
of a single variable if additional processors are available.  A two-dimensional
mesh is suitable for the latter form of concurrency.  By assigning @ log 3m sup 2 @
dimensions of the boolean cube for the embedding of  @ 3m times m @ meshes and 
@ log P@ dimensions for the partitions both forms of concurrency can be maximally
exploited.
.PP 
We will analyze the complexity of the partitioning method in some detail in
section 5.  For further details see [johnsson-85a][johnsson-85e].

.SH
4.4 Global storage
.PP
For global storage architectures there are two common basic approaches: (1) the use of a
common bus for processor-to-storage communication, 
(Figure 6), and (2) switch-based architectures, (Figure 7).
.ps 8
.cs 1 24
.nf

             ___      ___      ___                       ___
            |   |    |   |    |   |                     |   |
            | P |    | P |    | P |  .  .  .  .  .      | P |
            |___|    |___|    |___|                     |___|  
              |________|________|_________________________|
                                     |
                            _________|_________
                            |                 |
                            |                 |
                            |                 | 
                            |     Storage     |
                            |                 |
                            |                 |
                            |_________________|

.fi
.ps 12
.cs 1
              Figure 6.  Global storage architecture with a bus.
                  
.ps 8
.cs 1 24
.nf

             ___      ___      ___                       ___
            |   |    |   |    |   |                     |   |
            | P |    | P |    | P |  .  .  .  .  .      | P |
            |___|    |___|    |___|                     |___|  
           ---|--------|--------|-----                 ---|---
           |                                                 |
           |                                                 |
           |                                                 |
           |                    Switch                       |
           |                                                 |
           |                                                 |
           |                                                 |
           ---|--------|--------|-----                 ---|---
             _|_      _|_      _|_                       _|_
            |   |    |   |    |   |                     |   |
            | M |    | M |    | M |  .  .  .  .  .      | M |
            |___|    |___|    |___|                     |___|  


.fi
.ps 12
.cs 1
              Figure 7.  Global storage architecture with a switch.

.PP
The bus type of global-storage architecture is typical when the number of processing
elements is few, as in CRAY's computers, the Sequent, the Encore, and the 
Alliant FX/8.  The switch type of global storage architectures is typical for architectures
conceived as highly parallel, as in the NYU Ultracomputer [Schwartz-80] [gottlieb-83] 
the RP3 by IBM [Pfister-85], and the BBN butterfly [Crowther-85].
.PP
In an architecture with global storage and limited local storage the computations for 
exploiting the independence of operations in the elimination of a single variable can
be organized such that the processors first share in the computation of a new column
of @ L @, then share the computation for the rank-1 update.  The arithmetic complexity
is @ (t sub d + (2m+1)m/P t sub a )N @.  The communication complexity for the
bus architecture is @(max( (m+1) sup 2 /P t sub cp, (m+1) sup 2 t sub cs ) + t sub cp +
(t sub cs / t sub cp P + 1) tau )2N @, where @t sub cp @ is the time to transfer one
floating-point number to or from a processor and @ t sub cs @ the same time for the
storage.   For the switch-based architecture the communication time is 
@((m+1) sup 2 /P t sub c + 2 tau )2N @. 
More accurate models of the Alliant FX series are presented by 
Jalby and Meier [Jalby-86], 
who also describes
efficient dense matrix routines for that architecture.  Sorensen [Sorensen-84] has devised 
effective matrix routines for the HEP [Smith-81], which is a switch-based architecture with 
local storage.
.SH
4.5 Summary of the complexity analysis for the concurrent elimination of a single variable
.PP
The complexity estimates for the concurent elimination of a single variable on different
architectures can be summarized as follows

.KS
.TS
center;
l|c c c.
\f3partitioning	rel. time for	rel. time for	rel. time for
	arithmetic	comm. startups	data transfer\f1

_

cyclic 1-d	1.0	@m/2 sqrt {P} @	@ sqrt {P}  @

cyclic 2-d	1.0	@m/ sqrt {P} @	@1 @

consecutive 1-d	1.5	@ sqrt {P} /2 @	@ sqrt {P}  @
      
consecutive 2-d	2	1	@ (3 - sqrt {P} /m)/2 @

global storage, bus	1.0	@ 2m/ sqrt {P} @	@ 2m sqrt {P}   @

global storage, switch	1.0	@ 2 @	@ 2m/ sqrt {P}   @
           
.TE
.KE

.PP
The conclusion from the analysis so far is that for the distributed storage
architectures the cyclic partitioning yields the lowest arithmetic complexity
because of less idle time.  The total number of arithmetic operations is
independent of the partitioning.  Cyclic two-dimensional partitioning has the
smallest time for data transfer because of the highest degree of pipelining.
However, cyclic partitioning suffers from a large number of communication
startups and with significant startup times the consecutive partitioning may
yield a lower total complexity.  

The algorithms have been described for a consistant use of the cyclic and the 
consecutive storage schemes.  For a given set of architectural parameters the
minimum time is likely to be achieved by a combination of the two strategies.


.SH
5. Concurrent Elimination of Variables
.PP
For the concurrent elimination of different variables the algorithm in [johnsson-85e]
partitions the system of equations into sets of consecutive equations 
(see Figure 8).  The computations 
in different sets can be performed independently to a very large extent.  The
algorithm is a variant of substructured elimination or incomplete nested dissection.
.PP
The algorithm proceeds in four phases:

.nf
.in .5i
Phase 1	- Factor each partitioned matrix.
Phase 2	- Apply factors to pieces of the matrix to decouple the solution.
Phase 3	- Form a reduced matrix and solve this matrix problem forming part
		   of the solution.
Phase 4	- Backsubstitute to determine the remaining parts of the solution.

.in -.5i
.fi
Each of these phases is discussed in detail in the following
subsections.

.cs 1 24
.nf
.ps 8


    -----|----------
    \e C  |           \e
      \ei |             \e
        \e|               \e
          \e       A        \e
            \e      i         \e
              \e             .  \e
                \e           .  A \e
                  \e         .   i12\e
                    \e .............  \e
                      \e  A  :  A  : B  \e
                        \e i21   i22  i   \e
                          \e--------------- \e
                            \e  C  |          \e
                              \e i+1            \e
                                \e                \e
.fi
.fi
.ps 12
.cs 1
.ce
Figure 8
.ce
Partitioning of the Matrix
.sp 2
.SH
5.1 Phase 1
.PP
In Phase 1, decomposition of each section of the matrix (block @A sub i sub 11 @)
is performed.  This phase can be carried out on each section totally independently 
of the other sections; there is no communication between sections.  At the end of 
this phase each section has the LU decomposition as part of the matrix.  The amount 
of work for each section is @ m(m+1)(k - (2m+1)/3)@ addition and multiplications for 
symmetric positive definite matrices and @m sup 2 (2k-1) - (4m sup 2 -1)m/3 @ for an 
arbitrary banded matrix of size @k@ and half bandwidth @m@, where @k@ is the order of 
the matrix @A sub i sub 11 @.
.SH
5.2 Phase 2
.PP
In Phase 2, the factors generated in Phase 1 are applied to the matrix.  This can be 
viewed as premultiplying each section of the matrix in Figure 8 by 
@diag ( {A sub i sub 11 } sup -1 , ~ I sub i sub 22 )@.  The resulting matrix has fill-in 
from @G sub i ~<-~ A sub i sub 11 C sub i@ and 
@F sub i ~<-~ { A sub i sub 11 } sup -1 A sub i sub 12 @.  The fill-in is diagrammed in 
Figure 9.

.cs 1 24
.ps 8
.nf

    -----|----------         -----
   |\e    |           \e      |     |
   |  \e  |             \e    |     |
   |    \e|               \e  |     |
   |  G  |\e       I        \e| F   |
   |   i |  \e      i         \e i  |
   |     |    \e             .  \e  |
   |     |      \e           .    \e|
   |     |        \e         .     |\e
    -----           \e .............  \e
                      \e  A  :  A  : B  \e
                        \e i21   i22  i   \e
                          \e -------------- \e          -----
                            \e  G  |          \e       |     |
                            | \e i+1            \e     |     |
                            |   \e |              \e   |     |
.fi
.ps 12
.cs 1
.ce
Figure 9
.ce
Matrix After Phase 2

Again in this stage there is no communication between sections.
The operation count for this phase is @2m(2k - m - 1)(2m+r)@ for @r@ right-hand sides.

.SH
5.3 Phase 3
.PP
Phase 3 has two parts. First, the system is decoupled and a reduced system is solved, 
forming part of the original solution.  This part of Phase 3 involves zeroing submatrix 
@A sub i sub 21 @.  Zeroing can be accomplished by simple block elimination with the block
above @A sub i sub 21 @.  This results in some fill-in below @G sub i @ of the form 
@G' sub i ~<-~ G' sub i ~-~ A sub i sub 21 G sub i @ and a modification of 
@ A sub i sub 22 @ such that 
@A sub i sub 22 ~<-~ A sub i sub 22 ~-~ A sub i sub 21 F sub i @.  These operations are 
independent for each @i@ and can proceed in parallel for all sections.  No communication 
is required.
.PP
The second part of Phase 3 involves zeroing @ B sub i @ using the block from below.
This results in a fill-in above @ F sub i+1@ of the form 
@F' sub i+1 ~<-~ F' sub i+1 ~-~ B sub i F sub i+1@ and modification of @A sub i sub 22 @
such that @ A sub i sub 22 ~<-~ A sub i sub 22 ~-~ B sub i G sub i+1 @.
Here interpartition communication must take place to perform the update of
block @A sub i sub 22 @ and forming of blocks @ F sub i+1 @.
.PP
At this point the matrix has the form shown in Figure 10.

.cs 1 24
.ps 8
.nf

    -----|----------         -----
   |\e    |           \e      |     |
   |  \e  |             \e    |     |
   |    \e|               \e  |     |
   |  G  |\e       I        \e| F   |
   |   i |  \e      i         \e i  |
   |     |    \e             .  \e  |
   |     |      \e           .    \e|
   |     |        \e         .     | 
    -----           \e .............                   -----
   |  G' |                  :  A  :                  |F'   |
   |   i |                  :   i22                  | i+1 |
    -----                   -------------- \e          -----
                            \e  G  |          \e       |     |
                            | \e i+1            \e     |     |
                            |   \e |              \e   |     |
.fi
.ps 12
.cs 1
.ce
Figure 10
.ce
Matrix after the first part of phase 3
.sp
.PP
The last set of @ m @ rows of each partition together form a system of @ mP @ equations
in the last @ m @ variables of each partition.  The matrix has been decoupled, and a 
reduced block tridiagonal matrix can be formed. This block tridiagonal matrix has the form
.sp 10
such that when this block tridiagonal system is solved, the partial solution to the 
original matrix problem at these positions will be formed.
.PP
The amount of work for this portion of the algorithm depends on the size of the block 
tridiagonal system and the method used to solve the system.  The blocks are of size 
@m times m @, and there will be @P@ block rows, where @P@ is the number of partitions 
made in the original matrix.  If we fix the order of the original matrix and look at 
what happens as we increase the number of partitions, we see that more effort will go 
into solving the reduced system.
.PP
The communication complexity and the parallel arithmetic complexity for the solution
of the block tridiagonal system depend on the architecture [johnsson-85e].  The number
of block row communications in sequence depends on the method chosen and the architecture.
It varies from @ alpha log P@, where @ alpha @ falls in the interval @1 - 8 @ for 
block cyclic reduction and suitable architectures such as binary tree, shuffle-exchange,
and boolean cube interconnected processors.  For shared-memory architectures it is
necessary that the bus/switch/storage bandwidth is proportional to @2P@ to realize 
@2 log P@ communication start-ups.  For two-way block Gaussian elimination the number 
of block row communications is @P@ for a linear array and @2P@ for a shared-memory 
architecture.  The parallel arithmetic complexity of block cyclic reduction is
@(26m sup 3 /3 + 8m sup 2 r)(logP-2) + 16m sup 3 /3 ~+~ lower~ order ~terms@ and for
two-way Gaussian elimination @(8m sup 3 /3 + 3m sup 2 r)(P-1) ~+~ lower ~order ~terms@, @r@ is the number of right hand sides.
The parallel arithmetic complexity of block cyclic reduction is always lower than that
of two-way block Gaussian elimination.  On most architectures the number of
communication startups in sequence is also less for block cyclic reduction.
.PP
Detailed complexity estimates for both block Gaussian elimination and block cyclic 
reduction are given in [johnsson-85e] for a variety of architectures.  

.SH
5.4 Phase 4
.PP
Given the solution to the block tridiagonal system, the complete solution to the original 
matrix can be found by a simple backsubstitution. Phase 4 can be done without 
communication of information from one section to another.  This phase requires @4 m k r@ 
operations, where r is the number of right-hand sides.

.SH
5.5 Complexity of concurrent elimination of different variables through partitioning
.PP
The parallel arithmetic complexity corresponding to the above partitioning strategy 
is given below:
.sp
.TS
center;
l l
l l
l l
l l.
Phase 1  	@m(m+1)(N/P-m-(2m+1)/3) @ for a symmetric matrix, 
	@m sup 2 (2(N/P-m)-1) - (4m sup 2 -1)m/3 @ otherwise.
Phase 2  	@m(2(N/P-m)-m-1)(3m+2r)@
Phase 3  	@2m(m+1)(m+r) ~+~ @block trid. solve for a symmetric matrix
	@2m(m+1)(2m+r) ~+~ @block trid. solve for an unsymmetric matrix
Phase 4  	@4m(N/P-m)@

where	@P@ is the number of partitions,
	@N@ is the order of the matrix, and
	@m@ is the half band width and 
	@r@ is the number of right hand sides
    @P < N/m @
.TE
                                                                         
.fi

.PP
For the above complexity estimates it is assumed that there is one processor per 
partition.  The complexity of phases 1, 2 and 4 has terms inversely proportional to 
the number of partitions whereas the complexity of phase 3 has terms that increase 
as a function of @ P @.  Hence, there exist a trade-off between time for the
solution of the reduced system and the time required in the substructured elimination.
The optimum number of partitions for for linear arrays and shared-memory systems
with a bandwidth independent of @ P @ is of order @ sqrt{ N/m}@ whereas for boolean
cubes, binary trees, shuffle-exchange networks and shared-memory systems with a
bandwidth proportional to @ P @ the optimum number of partitions is of order @N/m@
[johnsson-85e].

The parallel arithmetic complexities can be reduced further by
performing operations on blocks concurrently, i.e., by parallelizing
dense matrix factorization, solve and multiplication along the lines
described for banded systems in section 4.  Such parallelization leads
to increased communication complexity.

.SH
5.6 Algorithm properties
.PP
The partitioning strategy yields algorithms that have two significant properties:
.sp
(1) If the original matrix is symmetric, then the reduced block tridiagonal
matrix is symmetric [johnsson-85e].  This is not the case with the partitioning 
employed by Sameh et al. []. (Also if the original is diagonally dominant, so is the
reduced system.)
.sp
(2) If the matrix is symmetric, then the condition number of the partitioned matrices,
(@ A sub i sub 11 @), is never worse than that of the original matrix.
This follows from the eigenvalue interlacing properties of symmetric matrices.
.PP
The reduced system can be solved using block cyclic
reduction. The advantage is that this phase can easily be parallelized.
.PP 
The partitioning algorithm described here uses an elimination order that can be obtained
through incomplete nested dissection [George-73] [George-81].  A banded matrix with half 
bandwidth @m@ corresponds 
to a graph in which each node is connected to all its preceding @m@ nodes as well as to 
all its succeeding @m@ nodes, except for the first and last sets of @m@ nodes.  
Separators are of size @m@.  The block matrices @A sub i sub 22@ correspond to the
edges between the nodes of a separator and the adjacent left and top triangular blocks to 
edges to one set of @n/P-m@ nodes and the adjacent right and bottom triangular matrix to
a disjoint set of @n/P-m@ nodes.  Choosing separators as bisectors recursively yields
a complete binary tree as an elimination tree, (Figure 11).

.cs 1 24
.ps 8
.nf

                                         _____
                                         | m |
                                         |___|
                                           ^
                                          / \e
                                       /       \e
                                    /             \e                        
                                 /                   \e                       
                              /                         \e
                         __/__                           __\e__
                         | m |                           | m |
                         |___|                           |___|
                           ^                               ^
                         /   \e                           /   \e
                       /       \e                       /       \e
                     /           \e                   /           \e                   
                 __/__           __\e__           __/__           __\e__
                 | m |           | m |           | m |           | m |
                 |___|           |___|           |___|           |___|
                  / \e             / \e             / \e             / \e
                 /   \e           /   \e           /   \e           /   \e
                /     \e         /     \e         /     \e         /     \e
             __/__   __\e__   __/__   __\e__   __/__   __\e__   __/__   __\e__
             |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
             | k |   | k |   | k |   | k |   | k |   | k |   | k |   | k |
             |___|   |___|   |___|   |___|   |___|   |___|   |___|   |___|


.fi
.ps 12
.cs 1
.ce
Figure 11.  Elimination tree generated by recursive bisection 
.ce
(incomplete nested dissection)
.PP
The elimination order of the algorithm corresponds to the elimination of the leaf 
variables prior to the elimination of any variables of the internal nodes of the elimination
tree.  No particular order is implied for the elimination of variables in different
leaf nodes, but the variables within a leaf node of the elimination tree are elimination
in order of increasing (or decreasing) labels.  The elimination of the variables
corresponding to the internal nodes from the leafs towards the root corresponds to
the elimination order of block cyclic reduction, whereas an elimination in inorder
yields block Gaussian elimination [Johnsson-85e].

.SH
6. Comparison of the complexity of the different startegies for concurrent elimination
.PP
The number of communication startups for the partitioning method described in section 5
is potentially a factor of @ alpha m log P /(2N sqrt {P} )@ less than that of the 
consecutive two-dimensional partitioning, which has the fewest startups of the 
algorithms exploiting the independence of operations for the elimination of a single 
variable.  The data transfer may be a factor of @ alpha mlogP /(N sqrt { P } )@ times 
that of the consecutive two-dimensional partitioning.  To fully realize this potential 
it is necessary that the communication bandwidth of the architecture is proportional
to @P@, which is most easily acccomplished in a network architecture.
.PP
We conclude that with respect to the communication complexity,
algorithms exploiting the 
independence of operations for eliminating of different variables offers substantial
reductions compared to algorithms exploiting the independence of operations for
eliminating of single variables for most banded matrices.
.PP
The comparison with respect to the arithmetic complexity is not so favorable.  The
substructured elimination cause substantial fill-in.  
Indeed, the arithmetic complexity
of Phases 1 and 2 alone is approximately 
@ 4 @ times 
that of the sequential algorithm.
for the general case
(for @N/P-m@ equations), and 
@ 7 @ times
for the symmetric case.
Hence,
the efficiency (speedup / (number of processors)) is limited, and the number of 
processing elements required to achieve
any speedup is expected to be at least @ 5 @ to @ 10 @ if communication complexity
and the time for the reduced system solve are included.  A very modest speedup can
be expected on current global storage with bus architectures
because of the number of processors available.  The 
situation for distributed storage architectures is more favorable in that they offer
a larger number of processing elements.
.PP
Both kinds of concurrency can clearly be exploited simultaneously.  The arithmetic
complexity is reduced and the communication complexity increased.  Exploiting both
forms of concurrency may represent a problem with respect to available compilers
on some machines unless some other device is used to
gain access to the parallel features, [dongarra and sorensen]
(such as the Alliant).
.SH
7. Experiments
.PP
The partitioning algorithm exploiting the independence of operations 
during the elimination of different variables has been implemented on 
two shared memory systems, the Sequent Balance 21000 and an Alliant
FX/8, and for the case @m=1@ (tridiagonal systems) also on the Intel 
iPSC/d7 and the Connection Machine [Hillis-85]. The Sequent has 10 
processors and the Alliant 8 processors.  The Alliant also has vector capability
on each of the processors 
with a vector register length of 32.

.SH
7.1 Concurrent elimination of variables

.SH
7.1.1 Phase 1. Factorization

.PP
The factorization is implemented using LINPACK routines.  The peak performance of
the band algorithm from LINPACK on the Alliant FX/8 is about 3 Mflops [], which is also achieved
for some phases of the algorithm.  
The vector features and the cache of the
architecture have interesting effects on the performance as indicated
by the following data for one processor.

.ce 2
Alliant FX/8, one processor used
Factorization, time in sec

     m           P=1     P=2     P=4     P=8     P=16

     2 N=1000    0.18    0.18    0.20    0.18    0.18
       N=2000    0.40    0.38    0.38    0.40    0.40
       N=4000    0.77    0.77    0.77    0.78    0.80
       N=8000    1.53    1.55    1.55    1.55    1.60

     8 N=1000    0.42    0.42    0.40    0.40    0.37
       N=2000    0.83    0.83    0.85    0.83    0.83
       N=4000    1.67    1.68    1.70    1.70    1.75
       N=8000    3.37    3.40    3.43    3.47    3.63

    16 N=1000    0.80    0.80    0.77    0.68    0.53
       N=2000    1.62    1.58    1.57    1.53    1.45
       N=4000    3.20    3.18    3.22    3.23    3.30
       N=8000    6.47    6.47    6.50    6.63    6.97

    32 N=1000    1.72    1.63    1.40    0.98    0.33
       N=2000    3.47    3.42    3.17    2.70    1.90
       N=4000    7.05    6.85    6.77    6.30    5.38
       N=8000   14.55   14.68   14.65   14.92   15.55

.PP
For a small bandwidth the effect on the execution time due to an
increased number of partitions is small.  The total number of
arithmetic operations is approximatley @2 m sup 2 (N-(P-1)m)@.  The
overhead in managing loop iterations seems to dominate.  But,
for a half bandwidth of @ 32@, the number of arithmetic operations
is reduced significantly.  The reduced number of arithmetic operations
is part of the explanation for the reduced execution time for increasing
@P@ even if only one processor is used for the factorization.  However,
the reduction is much larger than predicted from the number of
arithemtic operations alone as seen from the table below.  The running
time is normalized to the time for @P=1@ for each @N@ and @ m @.  The
measured relative time is given first, the predicted @ (1- (P-1)m/N@
second.

.ce 3
Alliant FX/8, one processor used
Factorization
Relative execution time; measured:predicted

  m             P=2           P=4          P=8          P=16

  2 N=1000  (1   :1   )   (1.11:1   )   (1   :0.98)   (1   :0.97)
    N=2000  (0.95:1   )   (0.95:1   )   (1   :0.99)   (1   :0.98)
    N=4000  (1   :1   )   (1   :1   )   (1.01:1   )   (1.04:0.99)
    N=8000  (1.01:1   )   (1.01:1   )   (1.01:1   )   (1.05:1   )

  8 N=1000  (1   :0.99)   (0.95:0.98)   (0.95:0.94)   (0.88:0.88) 
    N=2000  (1   :1   )   (1.02:0.99)   (1   :0.97)   (1   :0.94)
    N=4000  (1.01:1   )   (1.02:0.99)   (1.02:0.99)   (1.05:0.97)
    N=8000  (1.01:1   )   (1.02:1   )   (1.03:0.99)   (1.08:0.99)

 16 N=1000  (1   :0.98)   (0.96:0.95)   (0.85:0.89)   (0.66:0.76)
    N=2000  (0.98:0.99)   (0.97:0.98)   (0.94:0.94)   (0.90:0.88)
    N=4000  (0.99:1   )   (1.01:0.99)   (1.01:0.97)   (1.03:0.94)
    N=8000  (1   :1   )   (1   :0.99)   (1.02:0.99)   (1.08:0.97)

 32 N=1000  (0.95:0.97)   (0.81:0.90)   (0.57:0.78)   (0.19:0.52)
    N=2000  (0.99:0.98)   (0.91:0.95)   (0.78:0.89)   (0.55:0.76)
    N=4000  (0.97:0.98)   (0.96:0.98)   (0.89:0.94)   (0.76:0.88)
    N=8000  (1.01:1   )   (1.01:0.99)   (1.03:0.97)   (1.07:0.94)

.PP
The effect on performance of an increased value of @P@ is most dramatic
for @m=32@, but quite significant even for @m=16@.  For large values
of @N@ (@N=8000 @ in our experiments) the preformance degrades with
an increased number of partitions, but for fewer than @N=4000@ equations
the improvement is markedly better than predicted from arithmetic work
alone.  Figure 12 illustrates this point.
 

Figure 12.  The relative time for factorization on 1 processor as a 
            function of @P@.

.PP
The execution rate, number of floating point operations per second is increasing with 
decreasing @N@ for all values of @P@ and @m@.  The largest dependence on
@N@ is measured for @m=32@ and @P=16@.  On the Alliant FX/8
using only Fortran the maximum observed
floating-point rate was @4 Mflops@ for 1 processor and the lowest
was @40 kflops@ a difference by a factor of @90@.  For
8 processors the maximum floating-point speed for factorization using
LINPACK varied from 193 kflops (@m=2@) to 16.06 Mflops
(@N=2000, m=32@).  The performance can be improved considerably by
careful data management and assembly coding [Jalby-86].

 
                         Alliant FX/8
                   Factorization,  1 Processor
        Floating-point operations per second (kflops/sec)

          m            P=1     P=2    P=4    P=8    P=16

          2  N=1000     44      44     40     44     43
             N=2000     40      42     42     40     39
             N=4000     42      42     42     41     40
             N=8000     42      41     41     41     40

          8  N=1000    303     301    313    305    314
             N=2000    308     307    298    301    294
             N=4000    306     304    299    298    286
             N=8000    304     301    298    293    279
 
         16  N=1000    633     626    635    686    794
             N=2000    629     641    638    639    643
             N=4000    638     640    629    620    593
             N=8000    632     631    627    611    575
 
         32  N=1000   1164    1202   1335   1725   4039
             N=2000   1168    1172   1236   1731   1779
             N=4000   1156    1183   1184   1244   1390
             N=8000   1123    1110   1106   1074   1008

.PP
The rate of floating-point operations is shown in Figure 13.


Figure 13.  The achieved floating-point speed on 1 proccesor during
            factorization.

.PP
The minimum execution time for factorization on 2, 4 and 8 procesors are
given in the next table.  The numbers within parenthesis gives the number
of partitions for which the minimum time was achieved.

	
                          Alliant FX/8
             Execution time for factorization sec

          m             2 proc.         4 proc.    8 proc.

          2  N=1000    0.08 (4)         0.05 (16)   0.033 (16)
             N=2000    0.18 (4)         0.13 (16)   0.083 (16)
             N=4000    0.38 (2,4,8,16)  0.25 (16)   0.17  (16)
             N=8000    0.77 (2,4)       0.52 (16)   0.33  (16)
          
          8  N=1000    0.20 (8,16)      0.10 (4,8,16)   0.067 (8,16)
             N=2000    0.42 (4,8)       0.22 (4,8)      0.12  (8,16)
             N=4000    0.85 (2,4,8)     0.45 (4,8)      0.25  (8)
             N=8000    1.70 (2)         0.92 (4,8)      0.52  (8)
 
         16  N=1000    0.26 (16)        0.13 (16)       0.083 (16)
             N=2000    0.75 (16)        0.38 (16)       0.25  (16)
             N=4000    1.62 (2)         0.88 (4,8)      0.57  (8,16)
             N=8000    3.28 (2)         1.80 (4)        1.18  (8)
 
         32  N=1000    0.18 (16)        0.10 (16)       0.083 (16)
             N=2000    1.00 (16)        0.57 (16)       0.38  (16)
             N=4000    2.80 (16)        1.65 (16)       1.18  (16)
             N=8000    7.80 (2)         4.72 (16)       3.47  (8)


.PP
The speedup is summarised in the following table.  Though the execution
times are repeatable the total execution times for @N@ and @m@ small is
the cause of some inprecision in the speedup figures.  The efficiency
for @P=4@ is in the range @75% - 90%@.  For @P=8@ the efficiency is
rather unsatisfactory, @55% - 85%@.


                         Alliant FX/8
                             Speedup

          m             2 proc.    4 proc.    8 proc.

          2  N=1000      2.17       3.40       5.46
             N=2000      2.11       2.92       4.58
             N=4000      2.03       3.08       4.53
             N=8000      1.98       2.94       4.64
          
          8  N=1000      2.10       4.20       6.27
             N=2000      1.98       3.77       6.92
             N=4000      1.96       3.71       6.68
             N=8000      1.98       3.66       6.48
 
         16  N=1000      2.04       4.08       6.39
             N=2000      1.93       3.82       5.80
             N=4000      1.96       3.61       5.58
             N=8000      1.97       3.59       5.48
 
         32  N=1000      1.83       3.30       3.98
             N=2000      1.90       3.33       5.00
             N=4000      1.92       3.26       4.56
             N=8000      1.87       3.08       4.19

.PP 
???????

.PP
For the Sequent Balance 21000 that does not have vector features the measured 
performance increase for phase 1 as a function of the half bandwidth was a factor 
of 6 for @m=2 @ to @ m=32 @.  The actual floating-point performance increased 
from about @5 ~kflops@ to about @30 ~kflops@ per processor.  The execution time was
proportional to the size of a partition.
.PP
The speedup as a function of the number of partitions for phase 1 was proportional
to @P@ for all values of @m@ and @N@ for the Alliant, except for @N=8k@ and @m=2@
which produced anomolous data, probably due to cache conflicts.  The speedup 
for the Sequent Balance 21000 was superlinear for @m=32@.  The speedup increased
with decreasing system size.  The speedup for @m=2@ was slightly sublinear for
@N=2k@ and strongly sublinear for smaller systems.  We cannot account for this 
phenomenon.
.SH
7.1.2 Phase 2. Local Solution.
.PP
The essential computation in phase 2 is a solve operation in each
partition.  Fill-in is generated and the total amount of work increases
with @P@ approximately as @6m sup 2 N(1-1/P)@.  The tables below shows
the measured running times for phase 2 with relative running times
normalized to @P=2@ given within parenthesis.


                              Alliant FX/8
                    Solve, 1 Processor, time in sec

    m            P=2(1)    P=4(1.5)       P=8(1.75)      P=16(1.875)

    2  N=1000   0.13 (1)    0.18 (1.38)     0.18 (1.38)     0.20 (1.54)
       N=2000   0.27 (1)    0.37 (1.37)     0.40 (1.48)     0.42 (1.56)
       N=4000   0.55 (1)    0.72 (1.31)     0.80 (1.45)     0.83 (1.51)
       N=8000   1.10 (1)    1.47 (1.34)     1.62 (1.47)     1.67 (1.52)

    8  N=1000   0.50 (1)    0.68 (1.36)     0.73 (1.46)     0.73 (1.46)
       N=2000   1.05 (1)    1.43 (1.36)     1.57 (1.50)     1.58 (1.50)
       N=4000   2.12 (1)    2.98 (1.41)     3.28 (1.55)     3.33 (1.57)
       N=8000   4.33 (1)    6.10 (1.41)     6.87 (1.59)     7.02 (1.62)

   16  N=1000   1.23 (1)    1.62 (1.32)     1.63 (1.33)     1.37 (1.11)
       N=2000   2.55 (1)    3.58 (1.40)     3.70 (1.45)     3.47 (1.36)
       N=4000   5.20 (1)    7.50 (1.44)     8.22 (1.58)     7.88 (1.52)
       N=8000  10.57 (1)   15.45 (1.46)    17.38 (1.64)    17.58 (1.66)

   32  N=1000   3.62 (1)    4.40 (1.22)     3.67 (1.01)     1.78 (0.49)
       N=2000   7.55 (1)   10.57 (1.40)    10.07 (1.33)     7.55 (1.00)
       N=4000  15.37 (1)   22.20 (1.44)    24.37 (1.59)    21.28 (1.38)
       N=8000  31.03 (1)   46.15 (1.49)    52.65 (1.70)    52.07 (1.68)
 

.PP
As seen from the table the growth rate predicted from the arithmetic
complexity alone is fairly accurate, except @N < 8000@ and @M>8@.  The 
approximate rate of floating-point operations varies as shown in the 
next table.

                        Alliant FX/8
                      Solve, 1 Processor
      Floating-point operations per second (kflops/sec)

             m            P=2    P=4    P=8    P=16

             2  N=1000    122    131    151    142
                N=2000    118    129    138    139
                N=4000    116    133    139    143
                N=8000    116    130    138    143

             8  N=1000    406    436    449    427
                N=2000    391    425    441    444
                N=4000    390    416    433    445
                N=8000    383    407    419    433
                
            16  N=1000    619    668    691    666
                N=2000    612    638    683    695
                N=4000    608    624    648    687
                N=8000    602    614    629    649
 
            32  N=1000    782    862    915    740
                N=2000    790    804    879    953
                N=4000    796    806    813    891
                N=8000    799    796    793    816

.PP
The floating-point rate is generally increasing with @m@ and @P@,
and decreasing as a function of @N@.  For @m=2@ the rate is about
@3@ times higher than the rate for factorization, approximately
@ 40% @ higher for @m=8 @, about the same for @m=16@, and only
@65%@ of the rate for factorization for @m=32@.  The rate varies
by a factor of @ 8 @ (compared to a variation by a factor @ 90 @
for the factorization).  The floating-point rate is shown in Figure 14.  


Figure 14.  The floating-point speed of for forward and backward solve
            on one processor.

.PP
The speed-up for the solve routine if the number of partitions matches
the number of processing elements is expected to be 1.5, 3, 7 and 15
respectively for @P=2@, @P=4@,@P=8@, and @P=16@ compared to the one processor
case (with the corresponding value of @P@).  Note that the total work 
increases with @P@.  The speed-up with @P@ being equal to the number of
partitions is @P/(3m)@ compared to the sequential algorithm, which 
amounts to a slow-down for the cases reported here.  The figures in 
parenthesis in the table for the execution time for the different number
of processors give the number of partitions for which the minimum
execution time was achieved.  The fact that @P=16@ seemingly always is
the best of the considered values is in part due to the fact that the
compiler failed to properly parallelize one loop.

                         Alliant FX/8
                   Execution time for solve, sec

          m             2 proc.         4 proc.        8 proc.

          2  N=1000    0.10 (16)       0.067(16)       0.033(16)
             N=2000    0.23 (8,16)     0.15 (16)       0.10 (16)
             N=4000    0.47 (16)       0.30 (16)       0.20 (16)
             N=8000    0.93 (16)       0.62 (16)       0.40 (16)
          
          8  N=1000    0.40 (16)       0.20 (16)       0.12 (16)
             N=2000    0.88 (16)       0.47 (16)       0.30 (8,16)
             N=4000    1.87 (16)       1.02 (16)       0.62 (8,16)
             N=8000    3.97 (16)       2.13 (16)       1.27 (8,16)
 
         16  N=1000    0.78 (16)       0.42 (16)       0.25 (16)
             N=2000    1.92 (16)       1.07 (16)       0.70 (16)
             N=4000    4.47 (16)       2.47 (16)       1.58 (16)
             N=8000    9.73 (16)       5.27 (16)       3.37 (16)
 
         32  N=1000    0.98 (16)       0.53 (16)       0.32 (16)
             N=2000    4.27 (16)       2.45 (16)       1.65 (16)
             N=4000   12.25 (16)       6.87 (16)       4.58 (16)
             N=8000   28.35 (16)      15.52 (16)      10.45 (16)


.PP
The speedup is summarised in the following table.


                         Alliant FX/8
                       Speedup of solve

          m             2 proc.    4 proc.    8 proc.

          2  N=1000     1.30       1.94      3.93
             N=2000     1.17       1.80      2.70
             N=4000     1.17       1.83      2.75
             N=8000     1.18       1.77      2.75
          
          8  N=1000     1.25       2.50      4.17
             N=2000     1.19       2.23      3.50
             N=4000     1.13       2.08      3.42
             N=8000     1.09       2.03      3.41
 
         16  N=1000     1.58       2.93      4.92
             N=2000     1.33       2.38      3.64
             N=4000     1.16       2.11      3.29
             N=8000     1.09       2.01      3.14
 
         32  N=1000     1.82       3.36      5.56
             N=2000     1.77       3.08      4.58
             N=4000     1.25       2.24      3.36
             N=8000     1.09       2.00      2.97

.PP
The speedup for 2 processors is @80% - 100%@ of the theoretical value,
for 4 processor it is @60% - 100%@, and for 8 processors @40% - 80%@. 
The ability of the compiler and the architecture to take a advantage of
the available concurrency is rather unsatisfactory.
.PP
Part of the reason for the descrepancy between the expected speed-up and
the measured speed-up for the solve phase is due to the chracteristics
of the compiler for the Alliant FX/8.  It parallelizes one loop level 
(outermost unless instructed otherwise).  It fails to parallelize the 
solve code for two partitions, since the code handles the fill-in
corresponding to a distinct set of solution variables in each 
iteration.  Writing the code such that each loop iteration only contains
the operation for one partition forces the introduction either of
conditionals to handle the first and last partition which are special
cases, or sequential code outside the loop for these partitions.
In either case the compiler either fails completely to parallelize,
or has to be "tricked" into parallelizing the code properly.  For 
sufficiently large values of @P@ this problem should be of minor
influence.

.PP
The efficiency of the partitioning method is critically dependent on
the time for the solve routine compared to the factorization routine for
small values of @P@.  Most of the time spent in the solve routine is
due to fill-in, i.e., the price paid for the partitioning strategy
(or the nested dissection elimination order).  For all partitions but 
the first and last the ratio of the arithmetic complexity for the solve 
to that of factorization is approximately @ 3 @.  For one processor the 
ratio is @3(1-1/P)@.

                        Alliant FX/8
                  Solve/Factorization time, 1 processor

          m          P=2 (1.5) P=4 (2.25)   P=8 (2.625)

          2  N=1000    1.25       1.34        1.00
             N=2000    1.28       1.15        1.20
             N=4000    1.24       1.20        1.18
             N=8000    1.21       1.19        1.21

          8  N=1000    2.00       2.00        1.79
             N=2000    2.10       2.14        2.50
             N=4000    2.20       2.27        2.48
             N=8000    2.34       2.32        2.44
 
         16  N=1000    3.00       3.23        3.01
             N=2000    2.56       2.82        2.80
             N=4000    2.76       2.81        2.77
             N=8000    2.97       2.93        2.86
 
         32  N=1000    5.44       5.30        3.86
             N=2000    4.27       4.30        4.34
             N=4000    4.38       4.16        3.88
	     N=8000    3.63       3.29        3.01

.PP
As can be seen from the table the relative solve time is lower than
expected for @m = 2@, higher than expected for @m=8@ and @P=2@, and
slightly lower than expected for @m=8, P>2@.  For @m>8@ the relative 
solve time is higher than expected, and significantly so for @m=32@.  
Consequently, the LINPACK based parallel algorithm performing concurrent
elimination of variables is less time consuming than predicted for small
bandwidths.
.PP
and for the Sequent Balance 21000

            
                       Sequent Balance 21000

                         phase 2/ phase 1

           m   N=2048    N=1024    N=512    N=256

           2    2.78      2.89      2.5      2.9

           8    4.13      4.3       4.0      4.1

          16    4.46      4.4       4.4      4.7

          32    4.61      4.6       4.6       -

With the exception of the @N = 8k@ case Phase 2 on the Alliant requires less
time than predicted, if overhead is ignored.  Note that the
ratio is relatively independent of @N@, as expected, and approximately also
independent of @m@.  It is also intersting to note that the ratio is approximately
the same for @m=2@ and @m=32@, but is higher for intermediate values (by 30 - 50 %).
For the Sequent Balance 21000 the ratio is generally higher than expected, and
increases as a function of @m@, with a relatively small increase for @m > 8@.
.SH
7.1.3 Phase 3.  Solution of the reduced block trdiagonal system
.PP
For phase 3 and block Gaussian elimination the expected solution time is
proportional to @P@ and @m sup 3@.  The measured performance is 
approximately linear in @P@, but doubling the bandwidth only quadrupled
the solution time on the Alliant.  Being inherently sequential, with the
exception for the possibility to perform two-way elimination, the
execution time should be independent of the number of processing
elements, since no parallelization of block matrix (LINPACK) operations
is performed.  The measurements concur with the predicted behavior.

                            Alliant FX/8
                Time for Block Gaussian elimination

                  m     P=2       P=4       P=8      P=16

                  2      0        0         0          0

                  8      0        0.03      0.06       0.13

                 16      0.01     0.08      0.22       0.50

                 32      0.03     0.32      0.90       2.00

.PP
The block cyclic reduction solver requires the same number of arithmetic
operations for @P=2@ and @P=4@.  The total number of arithmetic
opertions for @P=8@ is higher.  The block cyclic reduction solver
requires approximately @1.26@ times as many operations as the block
Gaussian elimination solver, but the parallel arithmetic complexity is
lower.  However, the Alliant compiler did not manage to parallelize our
cyclic reduction code.  The running time for @P=4@ was approximately
@1.15@ times that of the Gaussian elimination code, and for @P=8@ it
was approximately @1.52@ for 1 processor, @1.42@ for 2 processors,
and @1.40@ for 4 or more processors.
.PP
For the Sequent Balance 21000 the dependence is approximately linear in @P@.
The dependence on @m@ is approximately cubic.


                        Sequent Balance 21000

                 Time for Block Gaussian elimination

                        m     P=4        P=8      

                        2    0.005      0.075

                        8    0.583      1.25

                       16    3.60       7.65

                       32   25.25      53.63      


.SH
7.1.4. Phase 4.  Back susbtitution
.PP         
The time for phase 4 is small relative to the time for the other phases and
offers no particular insight.

.SH
7.1.5 Total time for the partitioning method

.PP
The total time for the solution of the different banded systems using
block Gaussian elimination for the reduced system is given
in the table below.

.ce 3
Alliant FX/8
P= # of processors, time in sec

          m          P=1      P=2       P=4       P=8

          2 N=1000  0.27      0.25      0.18      0.12
            N=2000  0.52      0.48      0.32      0.23
            N=4000  1.00      0.92      0.60      0.42
	    N=8000  1.98      1.83      1.20      0.80

          8 N=1000  0.50      0.75      0.42      0.30
            N=2000  1.00      1.47      0.82      0.52
            N=4000  2.00      2.97      1.62      0.98
            N=8000  4.02      5.98      3.22      1.93

         16 N=1000  0.93      1.68      0.98      0.75
            N=2000  1.87      3.45      1.92      1.35
            N=4000  3.67      6.98      3.81      2.55
            N=8000  7.38     13.98      7.55      4.97
	       
         32 N=1000  1.92      4.25      2.67      2.12
            N=2000  3.87      8.53      5.35      3.97
            N=4000  7.82     18.43     10.73      7.63
            N=8000 16.08     39.55     22.03     15.55

.PP
Our implementation of the partitioning strategy on the Alliant FX/8
yields a speedup for for @P>1, m=2 @, for @P>_ 4, m=8 @, and for
@P>= 8, m=16@.  For half bandwidth @ 32 @ there is a slight speedup for 
@N>= 4000 @, and a slight slow-down for smaller systems.  The total time
for the partitioning method on the Alliant is shown as a function of 
@P@ with @N@ as a parameter in Figures 15 and 16 for @m= 2@ and @m=32@ 
repsectively.  

Figure 15.  Total solution time on the Alliant FX/8 for @m=2@.

Figure 16.  Total solution time on the Alliant FX/8 for @m=32@.

.SH
7.2 One-dimensional partitioning
.PP
As described in section 4 an alternative to the concurrent elimination
of different variables a one-dimensional partitioning can be used for
the concurrent elimination of a single variable.  The following table
of solution times gives the result from performing the factorization
in parallel by performing column operations in parallel.


.ce 3
Alliant FX/8
One-dimensional partitioning
P= # of processors, time in sec

          m          P=1      P=2      P=4      P=8

          2 N=1000  0.28     0.25     0.23     0.23
            N=2000  0.55     0.48     0.45     0.43
            N=4000  1.07     0.97     0.85     0.87
            N=8000  2.12     1.87     1.70     1.70

          8 N=1000  0.58     0.40     0.31     0.32
            N=2000  1.13     0.80     0.63     0.60
            N=4000  2.23     1.55     1.25     1.15
            N=8000  4.42     3.12     2.48     2.30

         16 N=1000  1.08     0.67     0.48     0.42
            N=2000  2.08     1.30     0.93     0.80
            N=4000  4.17     2.62     1.88     1.58
            N=8000  8.37     5.22     3.73     3.13

         32 N=1000  2.07     1.18     0.78     0.62
            N=2000  4.10     2.35     1.55     1.22
            N=4000  8.75     5.00     3.23     2.48
            N=8000 19.77    11.45     7.80     5.97

.PP
For a small bandwidth the parallelism for the column partitioning
is very limited, and as seen for the case @m=2@ the execution time
doe snot decrease beyond the time for 2 processors.  For 
@m=8@ the time decreases for up to 4 processors.  The speedup is
by a factor of up to about 1.8.  For a half bandwidth of @m=32@ the
speedup is up to 3.4.
.PP
Comparing the execution times for our implementations of the 
partitioning method as described in section 5, with the 
one-dimensional partitioning we conclude that for a small
bandwidth the concurrent elimination of multiple variables
yields a lower execution time for any number of processors, 
whereas for a half bandwidth of @ 8 @ it has a lower execution
time first for @8@ processors, the maximum on the Alliant FX/8.
For a larger bandwidth the one-dimensional partitioning always
gave a lower execution time.  


Figures 17 and 18 shows the corresponding data for the Sequent 
Balance 21000.                           


Figure 17.  Total solution time on the Sequent for @N=256@ as a function of @p@ and @m@..

Figure 18.  Total solution time on the Sequent for @N=256@ asa function of @p@ and @m@..
.PP
The results of the experiments are listed below.
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m~=~ 32@
.TS
center;
c|c c c c
l|n n n n.
	1	2	4	8
_
Factor	113	55	28	13
Solve	535	269	129	59
Reduce	.5	4	11	25
Backsolve	3	5.3	2.8	2.4
Total	658	343	180	112
.TE
.sp 4
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m~=~ 16@
.TS
center;
c|c c c c
l|n n n n.
	1	2	4	8
_
Factor	30	15	8.1	4.7
Solve	147	75	37	19
Reduce	.06	.6	1.6	3.7
Backsolve	1.6	2.6	1.2	.6
Total	181	97	51	32
.TE
.sp 4
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m~=~ 8@
.TS
center;
c|c c c c
l|n n n n.
	1	2	4	8
_
Factor	10	5	2.7	1.3
Solve	42	22	1.3	8.5
Reduce	.02	.1	.27	.6
Backsolve	.8	1.4	.6	.3
Total	54	30	18	13
.TE
.SH
8. Conclusions
.PP
We have described and analysed algorithms for the concurrent solution of
banded systems of equations.  A concurrent algorithm based on a partitioning 
strategy equivalent to elimination by nested dissection has been 
implemented on two shared-memory machines with limited parallelism.  
The partitioning method offers a low communication complexity compared 
to concurrent elimination of single variables, and has numerical
advantages to some previously implemented partitioning methods.

The arithmetic complexity of the parallel algorithm is higher than that
of standard direct solvers for band matrix problems.  At least 5
processors are needed for the parallel arithmetic complexity to be lower
than that of standard Gaussian elimination.

Because the computation of the fill-in is relatively more efficient than
the factorization for small bandwidths the measured running time shows 
that for our LINPACK based implementation of the partitioning algorithm 
on the Alliant FX/8 it has a lower execution time than the sequential 
algorithm (using LINPACK) for any number of processors.  However, for
bandwidths approaching the length of the vector registers the
factorization routine uses the architecture more efficiently than the 
solve routine, and the computation of the fill-in is relatgively more 
expensive than predicted, moving the break-even point for the
partitioning method to a higher number of processors than predicted.
Indeed, for @m=32@ the break-even point occurs at 8 processors
in our implementation.  Part of the reason for the higher than predicted
break-even point is the fact that the compiler failed to parallelize 
some computations that can be done in parallel.  If several loop levels 
could be parallelized, then a higher processor utilization and lower 
execution time is likely.
(The current Alliant compiler only allows parallelization of one loop
level).

The one-dimensional partitioning yields a speed-up that is insignificant
for a halfbandwidth of @m=2@, but is 3.4 for 8 processors and a 
halfbandwidth of @m=32@.  Consequently, the concurrent elimination of
multiple variables yields a lower execution time for any number of
processors for a half bandwidth of 2, for 8 processors at a half
bandwidth of 8.  Since the Alliant is limitedto eight processors we
were unable to establish a break-even point for larger bandwidths.


.SH
9. Acknowledgment
.PP
We would like to thank Tom Hewitt from CRAY Research for his assistance
with the CRAY data.
.SH
10. References
.IP[]
Lawrie and Sameh,
.I
The computation and communication complexity of parallel banded system
solvers,
.R
ACM TOMS
.sp
.IP []
Dongarra and Sameh,
.I
On Some Parallel banded system solvers,
.R
Parallel Computing, vol 1 no 3, Dev 1984, pp 223-235.
.sp 
.IP []
U. Meier,
.I
A Parallel Partition Method for Solving Banded Systems of Linear
Equations,
.R
Parallel Computing, 2 (1985), pp 33-45.
.sp
.IP []
C. Ashcroft,
.I
Parallel Reduction Methods for the Solution of Banded Systems of Equations,
.R
??
.sp
.IP []
L. Johnsson,
.I
Solving Narrow Banded Systems on Ensemble Architectures,
.R
??
.sp
.IP []
J. Dongarra, J. Bunch, C. Moler, and G. Stewart,
.I
LINPACK Users' Guide,
.R
SIAM Pub, 1976.
.sp
.IP []
C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
.I
BLAS,
.R
ACM TOMS
.sp
.IP []
J. Dongarra, F. Gustavson, and A. Karp,
SIAM Review
.sp
.IP []
J. Dongarra and D. Sorensen,
.I
On Environment for Implementing Explicite Parallel Processing in Fortran,
.R
ANL MCS - TM 79, 1986.
.sp
.IP []
J. Dongarra and S. Eisenstat,
.I
Squeezing the most 
.R
ACM TOMS
.sp
.IP []
J. WIlkinson, private communication, 1976.
.sp
.IP []
D. Evans and M. Hatzopoulos,
.I
The solution of certain banded systems of linear equations using the folding
algorithm,
.R
Computer Journal,
19 pp 184-187, 1976.
.sp
.IP []
Bhatt S.N, Ipsen I.C.F.,
.I
How to Embed Trees in Hypercubes,
.R
Yale University, Dept. of Computer Science,
December,1985,YALEU/CSD/RR-443.
.sp
.IP []
Crowther W., Goodhue J., Starr E., Thomas R., Milliken W.,
Blackadar T.,
.I
Performance Measurements on a 128-node Butterfly Parallel Processor,
.R
Proceedings of the 1985 International Conference on Parallel
Processing,
IEEE Computer Society,
pp 531-540,
1985.
.sp
.IP []
Desphande S.R., Jenevin R.M.,
.I
Scaleability of a Binary Tree on a Hypercube,
.R
University of Texas at Austin,
(TR-86-01),
January,
1986.
.sp
.IP []
George A., Liu J. W.,
.I
Computer Solution of Large Sparse Positive Definite Systems,
.R
Prentice-Hall, Englewood Cliffs, NJ,
1981.
.sp
.IP []
George A.,
.I
Nested Dissection of a Regular Finite Element Mesh,
.R
SIAM J. on Numer. Anal.,
vol 10,
pp 345-363,
1973.
.sp
.IP []
Gottlieb A., Grishman R., Kruskal C.P., McAuliffe K.P., Rudolph L., Snir M.,
.I
The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel
Computer,
.R
IEEE Trans. Computers,
pp 175-189,
vol C-32,
1983.
.sp
.IP []
Ho C.-T., Johnsson S.L.,
.I
Tree Embeddings and Optimal Routing in Hypercubes,
.R
Yale University, Dept. of Computer Science,
In preparation,
1986.
.sp
.IP []
Jalby W., Meier U.,
.I
Optimizing Matrix Operations on a Parallel Multiprocessor with a Memory Hierarchy,
.R
Univ. of Illinois, Center for Supercomputer Research and Development,
1986.
.sp             
.IP []
Johnsson S.L.,
.I
Solving Tridiagonal Systems on Ensemble Architectures,
.R
SIAM J. Sci. Stat. Comp.,
1986,
Also available as Report YALEU/CSD/RR-436, November 1985,
.sp
.IP []
Johnsson S.L.,
.I
Solving Narrow Banded Systems on Ensemble Architectures,
.R
ACM TOMS,
11,
November,
1985,
(Also available as Report YALEU/CSD/RR-418, November 1984).
.sp
.IP []
Johnsson S.L.,
.I
Band Matrix Systems Solvers on Ensemble Architectures,
.R
University of Texas Press,
Algorithms, Architecture, and the Future of Scientific Computation,
Technical report YALEU/CSD/RR-388, Yale University, Dept. of 
Computer Science,
1985.
.sp
.IP []
Johnsson S.L.,
.I
Data Permutations and Basic Linear Algebra Computations on Ensemble 
Architectures,
.R
Yale University, Dept. of Computer Science,
YALEU/CSD/RR-367,
February,
1985.
.sp
.IP []
Johnsson S.L.,
.I
Communication Efficient Basic Linear Algebra Computations on 
Hypercube Architectures,
.R
Dept. of Computer Science, Yale University,
YALEU/CSD/RR-361,
January,
1985.
.sp
.IP []
Johnsson S.L.,
.I
Fast Banded Systems Solvers for Ensemble Architectures,
.R
Department of Computer Science, Yale University,
YALEU/CSD/RR-379,
March,
1985.
.sp
.IP []
Johnsson S.L.,
.I
Dense Matrix Operations on a Torus and a Boolean Cube,
.R
The National Computer Conference,
AFIPS,
July,
1985.
.sp
.IP []
Johnsson S.L.,
.I
Odd-Even Cyclic Reduction on Ensemble Architectures and 
the Solution Tridiagonal Systems of Equations,
.R
Department of Computer Science, Yale University,
YALE/CSD/RR-339,
October,
1984.
.sp
.IP []
McBryan O.A., Van de Velde E.F.,
.I
Hypercube Algorithms and Implementations,
.R
Courant Institute of Mathematical Sciences, New York University,
 ,
November,
1985.
.sp
.IP []
Reingold E.M., Nievergelt J., Deo N.,
.I
Combinatorial Algorithms,
.R
Prentice Hall,
1977.
.sp
.IP []
Read R.,
Rose D.J.,
.I
A Graph-Theoretic Study of the Numerical Solution
of Sparse Positive Definite Systems of Linear Equations,
.R
Graph Theory and Computations,
Academic Press,
pp 183-217,
1973.
.sp
.IP []
Pfister G.F., Brantley W.C., George D.A., Harvey S.L.,
Kleinfelder W.J., McAuliffe K.P., Melton E.A., Norton V.A., Weiss J.,
.I
The IBM Research Parallel Processor Prototype (RP3); Introduction
and Architecture,
.R
Proceedings of the 1985 International Conference on Parallel
Processing,
IEEE Computer Society,
pp 764-771,
1985.
.sp
.IP []
Saad Y., Schultz M.H.,
.I
Data Communication in Hypercubes,
.R
Dept. of Computer Science, Yale University,
RR YALEU/DCS/RR-428,
October,
1985.
.sp
.IP []
Schwartz J.T.,
.I
Ultracomputers,
.R
ACM Trans. on Programming Languages and Systems,
2,
pp 484-521,
1980.
.sp
.IP []
Smith B.J.,
.I
Architecture and Applications of the HEP Multiprocessor Computer System,
.R
Real-Time Signal Processing IV, Proc of SPIE,
pp 241-248,
1981.
.sp
.IP []
Sorensen D.C.,
.I
Buffering for Vector Performance on a pipelined MIMD machine,
.R
Parallel Computing,
.R
1,
pp 143-164,
1984.
.sp
.IP []
Johnsson S.L.
.I
"Gaussian Elimination on Sparse Matrices and Concurrency"
.R
Caltech Computer Science Department, 1980, 4087, TR:80.
.sp
.IP []
Gentleman M.W.,
.I
Implementing Nested Dissection,
.R
Dept. of Computer Science, Univ. of Waterloo,
1982,
March,
Research report CS-82-03).

.nf
@article(Wing-80,
    author="Wing O., Huang J.W.",                                
    key="Wing,Huang",
    title="A Computational Model of Parallel Solution of Linear Equations",
    journal="IEEE Trans. Computers",
    volume="C-29",
    number="7",
    pages="632-638",
    year="1980")

@techreport(Liu-85
    author="Liu J.W.H.",
    key="Liu",
    title="Computational Models and Task Scheduling for Parallel Sparse Cholesky
Factorization",
    institution="Dept. of Computer Science, York University, Downsview, Ontario",
    year="1985",
    number="Technical report CS-85-01",
    month="April")

@techreport(George-86,
    author="George A., Heath M., Liu J., Ng E.",
    key="George et al",
    title="Sparse Cholesky Factorization on a Local-Memory Multiprocessor",
    institution="Department of Computer Science, York University, Downsview, Ontario",
    year="1986",
    number="Technical report CS-86-01")

@techreport(Worley-85,
      author="Worley Patrick H., Schreiber Robert",
      key="Worley" ,
      title="Nested dissection on a Mesh-Connected Processor Array",
      institution="Stanford University, Cent. Large Scale Sci. Computation",
      number="CLaSSiC-85-08",
      year="1985")

@article(Flynn-66,
    author="Flynn M.J.",
    key="Flynn",
    title="Very High-Speed Computing Systems",
    journal="Proc. of the IEEE",
    year="1966",
    pages="1901-1909",
    volume="12")

@article(Seitz-85,
    author="Seitz C.L.",
    key="Seitz",
    title="The Cosmic Cube",
    journal="Communications of the ACM",
    year="1985",
    pages="22-33",
    volume="28",
    number="1")

@book(Hillis-85,
	author="Hillis W.D.",
	key="Hillis",
	title="The Connection Machine",
	publisher="MIT Press",
	year="1985")