.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ ... define | '^\(br^' ... define || '\(br ^ \(br ' define notin '\(mo back 30 / ' define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define hhat ' up 5 hat down 5 ' define bl '\0' .EN .TL Solving Banded Systems on a Parallel Processor * .AU .ps 11 .in 0 J. J. Dongarra and L. Johnsson .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439-4844\h'.20i' and Computer Science Department P.O. Box 2158 Yale University New Haven, Connecticut 06520 .nr PS 11 .ps 11 .nr VS 16 .vs 16 .nr PD 0.5v .FS * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contracts W-31-109-Eng-38 and in part by the Office of Naval Research under Contract N00014-84-K-0043. .FE .ps 10 .AB .B Abstract .R .PP In this paper we examine ways of solving dense, banded systems on different parallel processors. We start with some considerations for processors with vector instructions, then discuss various algorithms for the solution of large dense banded systems on a parallel processor. We analyze the behavior of the parallel algorithms on distributed storage architectures configured as rings, two-dimensional meshes with end-around connections (tori), boolean n-cube configured architectures, and bus-based and switch-based machines with shared storage. We also present measurements for two bus-based architectures with shared storage, namely, the Alliant FX/8 and the Sequent Balance 21000. .NH Introduction .PP The solution of banded systems of equations is important in many areas of scientific computing. We restrict our attention here to matrices that are symmetric positive definite or diagonally dominant. With the coming of parallel processors, algorithms for the solution of these problems have been the subject of many studies []. In this paper we look at an implementation that uses a partitioning scheme essentially equivalent to incomplete nested dissection. This approach divides the work into sections, reduces the sections independently in parallel, performs nearest neighbor communication among processors to compute part of the solution, and carries out independent operations to determine the complete solution to the original problem. .NH Sequential Case .PP LINPACK [] includes routines to solve banded systems of linear equations. The matrices can be of a general nature or symmetric positive definite. The algorithms for the banded problem in LINPACK are based on the vector operation from the BLAS [] called a SAXPY, (@ y ~ <-~ y ~+~ alpha x @). This operation forms the computational kernel in the form of a rank-1 update of a submatrix of size @m sup 2@ for a banded matrix @A@ of order @N@ and half bandwidth @m@ during the @k sup th@ step of the reduction. .PP Since much data is updated at each step, this algorithm suffers in terms of performance on most sequential computers. This effect of poor performance is most pronounced on vector architectures that have vector registers. On such machines the largest bottleneck is in memory traffic. .PP To substantially reduce memory accesses, the algorithm can be reorganized using matrix vector multiplication as the computational kernel, instead of rank-1 updates [dgk,de]. 
This technique reduces the number of storage accesses from @O(N m sup 2 )@ to @O(N m)@ through vectorization. .PP The table below shows the performance differences from this approach. .ce Performance for the CRAY X-MP-1 .ce @ order ~ 5000@, symmetric positive definite matrix .sp 2 .TS center; c c s s c c c c n n n n. MFLOPS LINPACK Improved Improved Half bandwidth CAL BLAS Fortran MV CAL MV _ 8 4 5 11 16 6 12 27 22 7 17 41 44 12 40 84 62 15 54 108 128 26 86 141 200 38 103 149 .TE .PP We can exploit a modest amount of parallelism by performing, say, the matrix-vector operations in terms of independent inner products. An additional amount of parallelism can be exploited by factorizing the matrix from both the top and the bottom simultaneously. When the factorization reaches the middle, a small block matrix must be factored. The back-substitution then proceeds from the middle and works toward the top and bottom in parallel. It is important to note that the number of floating-point operations is exactly the same in this approach as in the conventional factorization. (This two-way factorization resurfaces from time to time. Jim Wilkinson describes this as an approach used in the early days of computing to reduce overhead associated with the looping []. LINPACK uses this algorithm to solve symmetric positive definite tridiagonal systems of equations. Evans and Hatzopolus [] describe the algorithm as a folding technique.) .PP We consider this algorithm as the best we can do on a sequential computer. The amount of work, in terms of operations performed, is @ m(m+1)(N - (2m+1)/3)@ additions and multiplications for the factorization of symmetric positive definite matrices and @ m sup 2 (2N-1) - (4m sup 2 -1)m /3 @ for an arbitrary banded matrix of size @N@ and half bandwidth @m@. The number of additions/subtractions and multiplications for each solve is @2m(2N - m - 1)@. .NH Parallel Algorithms .PP For computer architectures that provide a large number of processors, algorithms can be employed that exploit the independence of operations in eliminating single variables and the independence of operations in eliminating different variables [Wing-80, Johnsson-80c, Gentleman-82, Liu-85, Johnsson-85a, johnsson-85d, Worley-85, George-86]. For a narrow banded matrix the potential concurrency is largely due to the independence of operations in eliminating different variables, whereas for a matrix with a half bandwidth equal to @sqrt n @ or larger the concurrency due to the independence of operations in the elimination of a single variable dominates. .PP If only the complexity of the parallel arithmetic is considered, as would be reasonable for a shared-memory machine with very high communication bandwidth which allow conflict-free access to data, then the speedup from concurrent elimination of single variables is linear. The speedup is measured as the time for sequential elimination divided by the time for concurrent elimination. On the other hand, the speedup from exploiting the independence of operations in eliminating different variables always is sublinear, since more operations are required to find the solution. The maximum speedup in the latter case is in the range @ 1/2(N/m)/log(N/m) @ to @ (N/m)/log(N/m)@ for @N/m@ processors and a minimum efficiency of @O(1/log(N/m))@. In the former case the maximum speedup is @m sup 2@ and the efficiency equal to @1@ [johnsson-85a]. 
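.PP As a concrete point of reference for the sequential baseline used in these comparisons, the following is a minimal sketch of the two-way ("folding") factorization of section 2 for a symmetric positive definite tridiagonal system: elimination proceeds from the top and the bottom simultaneously, and back-substitution works outward from the middle. The sketch is written in Python with NumPy under routine names of our own choosing; it is illustrative only and is not the LINPACK code. The two sweeps are independent and could be assigned to two processors.
.nf
import numpy as np

def solve_tridiag_two_way(d, e, b):
    # d: diagonal (length n), e: off-diagonal (length n-1), b: right-hand side
    n = len(d)
    d = d.astype(float).copy(); e = np.asarray(e, dtype=float); b = b.astype(float).copy()
    mid = n // 2
    for i in range(1, mid + 1):            # eliminate the subdiagonal, top half
        l = e[i - 1] / d[i - 1]
        d[i] -= l * e[i - 1]; b[i] -= l * b[i - 1]
    for i in range(n - 2, mid - 1, -1):    # eliminate the superdiagonal, bottom half
        u = e[i] / d[i + 1]
        d[i] -= u * e[i]; b[i] -= u * b[i + 1]
    x = np.zeros(n)
    x[mid] = b[mid] / d[mid]               # the two sweeps meet at row mid
    for i in range(mid - 1, -1, -1):       # back-substitute toward the top
        x[i] = (b[i] - e[i] * x[i + 1]) / d[i]
    for i in range(mid + 1, n):            # back-substitute toward the bottom
        x[i] = (b[i] - e[i - 1] * x[i - 1]) / d[i]
    return x
.fi
.PP The operation count is the same as for the conventional factorization; only the elimination order changes.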
.PP The graph representation of a banded matrix constitutes a @ perfect ~elimination~ graph @ [rose-73], and Gaussian elimination without pivoting constitutes a perfect elimination order. Hence, there is no @ fill-in @. Concurrent elimination of different variables, on the other hand, generally does not constitute a perfect elimination order; the total number of arithmetic operations performed and the storage required by the concurrent algorithm are higher than for the sequential algorithm. However, the time to solve the banded system should be greatly reduced for sufficiently many processors. In the case of a tridiagonal system, odd-even cyclic reduction requires approximately @17N@ operations compared to @8N@ for Gaussian elimination, but the parallel arithmetic complexity for cyclic reduction is @ 11 logN@ in @2logN@ steps (or @12 logN@ in @ logN@ steps [Hockney-81]). Classical Gaussian elimination is a sequential method and requires @8N@ operations in @2N@ steps. The two-way Gaussian elimination is also a perfect elimination order. The parallel arithmetic complexity is @ 4N@ during @N@ steps. .PP Unfortunately, for many architectures, ignoring the cost of communication and the effects of vector features and storage hierarchy is not a good approximation of reality. Communication bandwidth, overhead in communication, routing conflicts, and bank conflicts, as well as overhead for vector instructions (should such be available), are important factors in choosing an optimum algorithm. We will focus on the communication issue. We carry out a simple analysis for two algorithms exploiting the independence of operations in eliminating a single variable, and one algorithm exploiting the independence of operations in eliminating different variables. The communication overhead (startup) is @ tau @ and the transmission time for a floating-point number is @ t sub c @. We discuss ring, mesh, hypercube, and shared-memory architectures. .SH 4. Concurrent Elimination of a Single Variable .SH 4.1 One-dimensional partitioning .PP A one-dimensional partitioning of the matrix in the column or row direction is feasible in the case of a ring of processors, as shown in Figure 2. .cs 1 24 .ps 8 .nf _____ _____ _____ _____ | M | | M | | M | | M | |...| |...| |...| |...| ___| P |_____| P |____| P |__ . . . . . __| P |__ | |___| |___| |___| |___| | |______________________________________________________| .fi .ps 12 .cs 1 Figure 2. A ring of processors with distributed storage .PP The partitioning can be done either @ cyclically @ or @ consecutively @ [johnsson-85c]. In cyclic partitioning by columns, column @ j @ is allocated to processor @ j mod P @, where @ P @ is the number of processors labeled from @ 0 @ to @ P-1@. In consecutive partitioning, column @ j @ is assigned to processor @ (jP/(m+1)) mod P @ (assuming for simplicity that @m+1@ is a multiple of @ P @). Cyclic and consecutive partitionings are illustrated in Figure 3. .cs 1 24 .ps 8 .nf ------------- ------------- | | | | | | | \e | | | | \e | | | | | | | | \e | | | | \e |0|1|2|0|1|2|0|1| | 0 | 1 | 2 | 0 | | | | | | | | | | | | | | \e | | | | | | | \e | | | | \e | | | | | \e | | | | \e | | | | \e | | \e | | | \e | | \e | | \e | \e | \e | \e \e Cyclic Consecutive .fi .ps 12 .cs 1 .ce Figure 3. Cyclic and consecutive one-dimensional partitioning .PP Consecutive partitioning is akin to a block-oriented algorithm. 
It leads to a lower utilization of the array, but requires fewer communications in a packet-oriented mode of operation where a block corresponds to a packet. .PP In cyclic partitioning a processor first computes the pivot column of the left factor @ L @, then performs its part of the rank-1 update of the @m@ by @m@ submatrix; in other words, it updates @(m+1)/P-1@ columns of length @m@. After the processor storing the pivot column has computed a new column of @L@, it transmits the column to the processor with the next higher index modulo @P@ (to the right). The receiving processor forwards the column to the next processor, and performs a rank-1 update on @(m+1)/P@ columns of length @m@. Assuming a time @t sub a@ for addition/subtraction and multiplication and a time @t sub d@ for division, and no vector features in the processors, the arithmetic complexity is approximately @(t sub d + (2(m+1)/P +1)mt sub a )N @ for a nonsymmetric matrix. The communication complexity is approximately @ (mt sub c + tau )N @. We assume that the architecture is of the @ MIMD @ type [flynn-66] and that communication and arithmetic operations are pipelined. The speedup is approximately @ P @ if only the arithmetic complexity is considered, but may be poor if the communication complexity dominates, i.e., if @ (mt sub c + tau )N >> (t sub d + (2(m+1)/P +1)mt sub a )N @ then @P >> 2 m sup 2 t sub a /(mt sub c + tau ) @. .PP In consecutive partitioning the processor holding the pivot block factors it and transmits the factor to the "next" processor in a way analogous to the cyclic partitioning. The arithmetic complexity is approximately @(t sub d + (3(m+1)/P-P)(2m+1)/2Pt sub a )N @ for an unsymmetric matrix and the communication complexity @(mt sub c + (P/m) tau ) N @. Pipelining of arithmetic and communication operations is assumed for these complexity estimates. The arithmetic complexity is approximately @ 1.5 @ times that of cyclic partitioning for @ P << m @, but the number of startups in communication is reduced by a factor of @ P/m @. .SH 4.2 Two-dimensional partitioning .PP In the case of a two-dimensional mesh-configured set of processors, a two-dimensional partitioning of the matrix can be beneficial [johnsson-84d]. This partitioning can also be made cyclically or consecutively. We assume for simplicity that the matrix has the same number of super- and subdiagonals, and that the processor mesh has @ sqrt { P } @ nodes in each coordinate direction. Figure 4 illustrates the matrix partitionings. .cs 1 24 .ps 8 .nf ------------------- ----------------------\e |00|01|02|00|01|02| \e | | | | \e |--|--|--|--|--|--|--\e | 00 | 01 | 02 | 00\e |10|11|12|10|11|12|10| \e | | | | \e |--|--|--|--|--|--|--| \e |------|------|------|------\e |20|21|22|20|21|22|20|21|\e | | | | | \e |--|--|--|--|--|--|--|--| \e | 10 | 11 | 12 | 10 | 11\e |00|01|02|00|01|02|00|01|02|\e | | | | | \e |--|--|--|--|--|--|--|--|--| \e |------|------|------|------|------\e |10|11|12|10|11|12|10|11|12|10|\e | | | | | | \e |--|--|--|--|--|--|--|--|--|--| \e | 20 | 21 | 22 | 20 | 21 | 22\e |20|21|22|20|21|22|20|21|22|20|21|\e | | | | | | \e |--|--|--|--|--|--|--|--|--|--|--| \e |------|------|------|------|------|------\e \e|01|02| \e | | \e-|--| \e00 | 01 | \e12| \e | | \e-| \e------| Cyclic Consecutive .fi .ps 12 .cs 1 .ce Figure 4. 
Cyclic and consecutive two-dimensional partitioning .PP In cyclic partitioning the processor holding the pivot element sends it in the column direction (or row direction) to compute a set of elements of @ L @, then proceeds to compute the @ (m+1)/ sqrt { P } -1 @ elements of a column of @ L @ that it stores. The processor that receives the pivot element forwards it and computes @ (m+1)/ sqrt { P } @ elements of the column of @ L @ that corresponds to the pivot element. The communication is confined within columns. As the elements of @ L @ are computed, they are sent to the "next" processor within the row. The mesh of processors performs a rank-1 update; each processor updates @(m+1) sup 2 /P @ elements. The arithmetic complexity for the cyclic partitioning is approximately @ (t sub d + (2(m+1)/ sqrt { P } + 1)(m+1)/ sqrt { P }t sub a )N @ for an unsymmetric matrix, and the communication complexity is @ ((m+1)/ sqrt { P } t sub c + 2 tau )N @ assuming concurrent communication in the row and column directions. If communication is restricted to one direction at a time, then the data transfer time is doubled and there are a total of @ N @ additional startups. The arithmetic complexity is approximately the same as in the one-dimensional cyclic partitioning, but the communication complexity is different. The number of elements transmitted over a channel between a pair of processors is reduced by a factor of @ sqrt { P } @, but the number of communication startups is doubled. .PP In the case of two-dimensional consecutive partitioning, the processor holding the pivot block factors it and passes the left factor to the next processor (right) in the same row, and the right factor to the next processor (down) in the same column. The receiving processors forward the factors and perform a forward solve. Then a block rank-1 update, a rank @ (m+1)/ sqrt { P } @ update, is performed by all processors. The arithmetic complexity of this block algorithm is approximately @ (t sub d +(11(m+1) sup 2 /P -9(m+1)/ sqrt { P } +1)/3 t sub a )N @ and the communication complexity @ ((3(m+1)/ sqrt { P } - 1)/2 t sub c + 2 tau sqrt { P } /(m+1) )N @. The arithmetic complexity is approximately twice (@ 11/6 @) that of cyclic partitioning. The number of startups is reduced by a factor of @(m+1)/ sqrt {P} @ compared to the two-dimensional cyclic partitioning, and by a factor of @ sqrt {P} /2 @ compared to the one-dimensional consecutive partitioning. The time for transfer of matrix elements is approximately @1.5@ times that of the cyclic two-dimensional partitioning. .SH 4.3 Boolean n-cube .PP In a boolean n-cube multiprocessor [seitz-85] the processing nodes form the corners of an n-dimensional boolean cube. Each node has a fanout of @ n @, the diameter is @ n @, the average distance between nodes is @ n/2 @, the total number of edges is @ n 2 sup (n-1) @, and the number of disjoint paths between any pair of nodes is @n@ [saad-85a] [johnsson-85c]. A boolean cube can simulate a ring and a two-dimensional mesh, preserving the nearest neighbor connection property, by using a binary-reflected Gray code [Reingold-77], as observed for example in [johnsson-85c], [McBryan-85], and [Fox-85b]. Of the partitioning strategies discussed above, consecutive two-dimensional partitioning is preferable if the startup times dominate the time for arithmetic, whereas cyclic two-dimensional partitioning is preferable if startup times can be ignored. 
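.PP The binary-reflected Gray code embedding referred to above can be sketched as follows: the code of an integer differs from the code of its successor in exactly one bit, so mapping ring position @p@ to cube node @gray(p)@ makes ring neighbors cube neighbors, and a torus is handled by applying a Gray code along each coordinate. The fragment below is an illustrative Python sketch with helper names of our own choosing.
.nf
def gray(i):
    # binary-reflected Gray code: gray(i) and gray(i+1) differ in one bit
    return i ^ (i >> 1)

# embed an 8-node ring in a 3-cube: ring position p -> cube node gray(p)
P = 8
labels = [gray(p) for p in range(P)]
for p in range(P):
    a, b = labels[p], labels[(p + 1) % P]
    assert bin(a ^ b).count("1") == 1    # ring neighbors are cube neighbors

# a 4-by-4 torus: mesh node (r, c) -> cube node with gray(r) in the high bits
torus = {(r, c): (gray(r) << 2) | gray(c) for r in range(4) for c in range(4)}
.fi
.PP The wrap-around edges are also preserved, since for @P@ a power of two the codes of @0@ and @P-1@ likewise differ in a single bit.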
.PP In the n-cube there is also the potential for reducing the data transfer time by using the additional communication paths between a pair of nodes [saad-85a] [johnsson-85b] [Ho-85k]. Moreover, a boolean n-cube can simulate many graphs other than rings and meshes with no or at most a constant slowdown. In particular, complete binary trees [johnsson-85b] [bhatt-85a] [deshpande-86] can be embedded effectively, and algorithms based on the divide-and-conquer strategy, such as (block) cyclic reduction and nested dissection based elimination schemes, are candidates for efficient boolean n-cube algorithms. The partitioning algorithm described in the next section belongs in this category of algorithms. .PP Since the partitioning method leads to a reduced block tridiagonal system, we now indicate how a (block) tridiagonal system can be solved effectively on a boolean n-cube. The reader is referred to [johnsson-84c][johnsson-85e2] for details. The mapping of the block rows makes use of the binary-reflected Gray code for the initial assignment of partitions to processors. A block cyclic reduction algorithm is used to exploit concurrency. With the Gray code embedding, the first reduction step implies only nearest neighbor communication. It can be shown that the communication in subsequent steps is always between processors at distance @ 2 @ from each other. This property can be used for an @ in-place @ algorithm. It is also possible to divide the distance-two communication into two nearest neighbor communications such that after an exchange step the equations participating in the next reduction step are in a subcube of half size. Hence, during the reduction process subsets of equations are recursively assigned to subcubes such that any subset is mapped to a subcube in a binary-reflected Gray code order. The recursive assignment requires a simple exchange operation between certain adjacent processors. Each processor can compute whether it will perform an exchange operation and with which processor from its address and the reduction (backsubstitution) step to be performed. The same local information suffices to determine the communication for the reduction (backsubstitution) operations themselves. The exchange algorithm moves even equations to even processors (and odd equations to odd processors). The bit field considered is reduced by one for each reduction step. One step of the exchange algorithm is illustrated for a 3-cube in Figure 5. .cs 1 24 .ps 8 .nf 7---------- 6 6---------- 7 /| / /| / / | / | / | / | 0 /__|_____1/ | 0 /__|_____1/ | | 4|_____|___|5 | 4|_____|___|5 | / | / | / | / | / | / | / | / |/________|/ |/________|/ 3 2 2 3 .fi .cs 1 .ps 12 Exchange .ce 2 Figure 5. Recursively assigning equations to nodes through exchange operations preserving adjacency .PP In the n-cube algorithms based on binary-reflected Gray codes, full advantage can be taken of truncated cyclic reduction. Each reduction step is carried out on all relevant equations during the same time step. Hence, truncating the reduction after @ r < n @ steps reduces the total time proportionally. .PP The block cyclic reduction algorithm as outlined above exploits the independence of operations in the elimination of variables corresponding to different blocks. It is also possible to exploit the independence of operations for the elimination of a single variable if additional processors are available. A two-dimensional mesh is suitable for the latter form of concurrency. 
By assigning @ log 3m sup 2 @ dimensions of the boolean cube for the embedding of @ 3m times m @ meshes and @ log P@ dimensions for the partitions, both forms of concurrency can be maximally exploited. .PP We will analyze the complexity of the partitioning method in some detail in section 5. For further details see [johnsson-85a][johnsson-85e]. .SH 4.4 Global storage .PP For global storage architectures there are two common basic approaches: (1) the use of a common bus for processor-to-storage communication (Figure 6), and (2) switch-based architectures (Figure 7). .ps 8 .cs 1 24 .nf ___ ___ ___ ___ | | | | | | | | | P | | P | | P | . . . . . | P | |___| |___| |___| |___| |________|________|_________________________| | _________|_________ | | | | | | | Storage | | | | | |_________________| .fi .ps 12 .cs 1 Figure 6. Global storage architecture with a bus. .ps 8 .cs 1 24 .nf ___ ___ ___ ___ | | | | | | | | | P | | P | | P | . . . . . | P | |___| |___| |___| |___| ---|--------|--------|----- ---|--- | | | | | | | Switch | | | | | | | ---|--------|--------|----- ---|--- _|_ _|_ _|_ _|_ | | | | | | | | | M | | M | | M | . . . . . | M | |___| |___| |___| |___| .fi .ps 12 .cs 1 Figure 7. Global storage architecture with a switch. .PP The bus type of global-storage architecture is typical when the number of processing elements is small, as in CRAY's computers, the Sequent, the Encore, and the Alliant FX/8. The switch type of global-storage architecture is typical of architectures conceived as highly parallel, as in the NYU Ultracomputer [Schwartz-80] [gottlieb-83], the RP3 by IBM [Pfister-85], and the BBN butterfly [Crowther-85]. .PP In an architecture with global storage and limited local storage the computations for exploiting the independence of operations in the elimination of a single variable can be organized such that the processors first share in the computation of a new column of @ L @, then share the computation for the rank-1 update. The arithmetic complexity is @ (t sub d + (2m+1)m/P t sub a )N @. The communication complexity for the bus architecture is @(max( (m+1) sup 2 /P t sub cp, (m+1) sup 2 t sub cs ) + t sub cp + (t sub cs / t sub cp P + 1) tau )2N @, where @t sub cp @ is the time to transfer one floating-point number to or from a processor and @ t sub cs @ the same time for the storage. For the switch-based architecture the communication time is @((m+1) sup 2 /P t sub c + 2 tau )2N @. More accurate models of the Alliant FX series are presented by Jalby and Meier [Jalby-86], who also describe efficient dense matrix routines for that architecture. Sorensen [Sorensen-84] has devised effective matrix routines for the HEP [Smith-81], which is a switch-based architecture with local storage. .SH 4.5 Summary of the complexity analysis for the concurrent elimination of a single variable .PP The complexity estimates for the concurrent elimination of a single variable on different architectures can be summarized as follows: .KS .TS center; l|c c c. \f3partitioning rel. time for arithmetic rel. time for comm. startups rel. time for data transfer\f1 
_ cyclic 1-d 1.0 @m/2 sqrt {P} @ @ sqrt {P} @ cyclic 2-d 1.0 @m/ sqrt {P} @ @1 @ consecutive 1-d 1.5 @ sqrt {P} /2 @ @ sqrt {P} @ consecutive 2-d 2 1 @ (3 - sqrt {P} /m)/2 @ global storage, bus 1.0 @ 2m/ sqrt {P} @ @ 2m sqrt {P} @ global storage, switch 1.0 @ 2 @ @ 2m/ sqrt {P} @ .TE .KE .PP The conclusion from the analysis so far is that for the distributed storage architectures the cyclic partitioning yields the lowest arithmetic complexity because of less idle time. The total number of arithmetic operations is independent of the partitioning. Cyclic two-dimensional partitioning has the smallest time for data transfer because of the highest degree of pipelining. However, cyclic partitioning suffers from a large number of communication startups, and with significant startup times the consecutive partitioning may yield a lower total complexity. The algorithms have been described for a consistent use of the cyclic and the consecutive storage schemes. For a given set of architectural parameters the minimum time is likely to be achieved by a combination of the two strategies. .SH 5. Concurrent Elimination of Variables .PP For the concurrent elimination of different variables the algorithm in [johnsson-85e] partitions the system of equations into sets of consecutive equations (see Figure 8). The computations in different sets can be performed independently to a very large extent. The algorithm is a variant of substructured elimination or incomplete nested dissection. .PP The algorithm proceeds in four phases: .nf .in .5i Phase 1 - Factor each partitioned matrix. Phase 2 - Apply factors to pieces of the matrix to decouple the solution. Phase 3 - Form a reduced matrix and solve this matrix problem, forming part of the solution. Phase 4 - Backsubstitute to determine the remaining parts of the solution. .in -.5i .fi Each of these phases is discussed in detail in the following subsections. .cs 1 24 .nf .ps 8 -----|---------- \e C | \e \ei | \e \e| \e \e A \e \e i \e \e . \e \e . A \e \e . i12\e \e ............. \e \e A : A : B \e \e i21 i22 i \e \e--------------- \e \e C | \e \e i+1 \e \e \e .fi .ps 12 .cs 1 .ce Figure 8 .ce Partitioning of the Matrix .sp 2 .SH 5.1 Phase 1 .PP In Phase 1, decomposition of each section of the matrix (block @A sub i sub 11 @) is performed. This phase can be carried out on each section totally independently of the other sections; there is no communication between sections. At the end of this phase each section holds its LU decomposition as part of the matrix. The amount of work for each section is @ m(m+1)(k - (2m+1)/3)@ additions and multiplications for symmetric positive definite matrices and @m sup 2 (2k-1) - (4m sup 2 -1)m/3 @ for an arbitrary banded matrix of size @k@ and half bandwidth @m@, where @k@ is the order of the matrix @A sub i sub 11 @. .SH 5.2 Phase 2 .PP In Phase 2, the factors generated in Phase 1 are applied to the matrix. This can be viewed as premultiplying each section of the matrix in Figure 8 by @diag ( {A sub i sub 11 } sup -1 , ~ I sub i sub 22 )@. The resulting matrix has fill-in from @G sub i ~<-~ { A sub i sub 11 } sup -1 C sub i@ and @F sub i ~<-~ { A sub i sub 11 } sup -1 A sub i sub 12 @. The fill-in is diagrammed in Figure 9. .cs 1 24 .ps 8 .nf -----|---------- ----- |\e | \e | | | \e | \e | | | \e| \e | | | G |\e I \e| F | | i | \e i \e i | | | \e . \e | | | \e . \e| | | \e . |\e ----- \e ............. 
\e \e A : A : B \e \e i21 i22 i \e \e -------------- \e ----- \e G | \e | | | \e i+1 \e | | | \e | \e | | .fi .ps 12 .cs 1 .ce Figure 9 .ce Matrix After Phase 2 Again, in this stage there is no communication between sections. The operation count for this phase is @2m(2k - m - 1)(2m+r)@ for @r@ right-hand sides. .SH 5.3 Phase 3 .PP Phase 3 has two parts. First, the system is decoupled and a reduced system is solved, forming part of the original solution. This part of Phase 3 involves zeroing submatrix @A sub i sub 21 @. Zeroing can be accomplished by simple block elimination with the block above @A sub i sub 21 @. This results in some fill-in below @G sub i @ of the form @G' sub i ~<-~ G' sub i ~-~ A sub i sub 21 G sub i @ and a modification of @ A sub i sub 22 @ such that @A sub i sub 22 ~<-~ A sub i sub 22 ~-~ A sub i sub 21 F sub i @. These operations are independent for each @i@ and can proceed in parallel for all sections. No communication is required. .PP The second part of Phase 3 involves zeroing @ B sub i @ using the block from below. This results in a fill-in above @ F sub i+1@ of the form @F' sub i+1 ~<-~ F' sub i+1 ~-~ B sub i F sub i+1@ and modification of @A sub i sub 22 @ such that @ A sub i sub 22 ~<-~ A sub i sub 22 ~-~ B sub i G sub i+1 @. Here interpartition communication must take place to perform the update of block @A sub i sub 22 @ and the formation of the blocks @ F' sub i+1 @. .PP At this point the matrix has the form shown in Figure 10. .cs 1 24 .ps 8 .nf -----|---------- ----- |\e | \e | | | \e | \e | | | \e| \e | | | G |\e I \e| F | | i | \e i \e i | | | \e . \e | | | \e . \e| | | \e . | ----- \e ............. ----- | G' | : A : |F' | | i | : i22 | i+1 | ----- -------------- \e ----- \e G | \e | | | \e i+1 \e | | | \e | \e | | .fi .ps 12 .cs 1 .ce Figure 10 .ce Matrix after the first part of phase 3 .sp .PP The last set of @ m @ rows of each partition together form a system of @ mP @ equations in the last @ m @ variables of each partition. The matrix has been decoupled, and a reduced block tridiagonal matrix can be formed. This block tridiagonal matrix has the form .sp 10 such that when this block tridiagonal system is solved, the partial solution to the original matrix problem at these positions will be formed. .PP The amount of work for this portion of the algorithm depends on the size of the block tridiagonal system and the method used to solve the system. The blocks are of size @m times m @, and there will be @P@ block rows, where @P@ is the number of partitions made in the original matrix. If we fix the order of the original matrix and look at what happens as we increase the number of partitions, we see that more effort will go into solving the reduced system. .PP The communication complexity and the parallel arithmetic complexity for the solution of the block tridiagonal system depend on the architecture [johnsson-85e]. The number of block row communications in sequence depends on the method chosen and the architecture. It is @ alpha log P@, where @ alpha @ falls in the interval @1 - 8 @, for block cyclic reduction on suitable architectures such as binary tree, shuffle-exchange, and boolean cube interconnected processors. For shared-memory architectures it is necessary that the bus/switch/storage bandwidth is proportional to @2P@ to realize @2 log P@ communication start-ups. For two-way block Gaussian elimination the number of block row communications is @P@ for a linear array and @2P@ for a shared-memory architecture. 
The parallel arithmetic complexity of block cyclic reduction is @(26m sup 3 /3 + 8m sup 2 r)(logP-2) + 16m sup 3 /3 ~+~ lower~ order ~terms@ and for two-way Gaussian elimination @(8m sup 3 /3 + 3m sup 2 r)(P-1) ~+~ lower ~order ~terms@, where @r@ is the number of right-hand sides. The parallel arithmetic complexity of block cyclic reduction is always lower than that of two-way block Gaussian elimination. On most architectures the number of communication startups in sequence is also less for block cyclic reduction. .PP Detailed complexity estimates for both block Gaussian elimination and block cyclic reduction are given in [johnsson-85e] for a variety of architectures. .SH 5.4 Phase 4 .PP Given the solution to the block tridiagonal system, the complete solution to the original matrix can be found by a simple backsubstitution. Phase 4 can be done without communication of information from one section to another. This phase requires @4 m k r@ operations, where @r@ is the number of right-hand sides. .SH 5.5 Complexity of concurrent elimination of different variables through partitioning .PP The parallel arithmetic complexity corresponding to the above partitioning strategy is given below: .sp .TS center; l l l l l l l l. Phase 1 @m(m+1)(N/P-m-(2m+1)/3) @ for a symmetric matrix, @m sup 2 (2(N/P-m)-1) - (4m sup 2 -1)m/3 @ otherwise. Phase 2 @m(2(N/P-m)-m-1)(3m+2r)@ Phase 3 @2m(m+1)(m+r) ~+~ @block trid. solve for a symmetric matrix @2m(m+1)(2m+r) ~+~ @block trid. solve for an unsymmetric matrix Phase 4 @4m(N/P-m)@ where @P@ is the number of partitions (@P < N/m @), @N@ is the order of the matrix, @m@ is the half bandwidth, and @r@ is the number of right-hand sides. .TE .fi .PP For the above complexity estimates it is assumed that there is one processor per partition. The complexity of phases 1, 2, and 4 has terms inversely proportional to the number of partitions, whereas the complexity of phase 3 has terms that increase as a function of @ P @. Hence, there exists a trade-off between the time for the solution of the reduced system and the time required in the substructured elimination. The optimum number of partitions for linear arrays and shared-memory systems with a bandwidth independent of @ P @ is of order @ sqrt{ N/m}@, whereas for boolean cubes, binary trees, shuffle-exchange networks, and shared-memory systems with a bandwidth proportional to @ P @ the optimum number of partitions is of order @N/m@ [johnsson-85e]. The parallel arithmetic complexities can be reduced further by performing operations on blocks concurrently, i.e., by parallelizing dense matrix factorization, solve, and multiplication along the lines described for banded systems in section 4. Such parallelization leads to increased communication complexity. .SH 5.6 Algorithm properties .PP The partitioning strategy yields algorithms that have two significant properties: .sp (1) If the original matrix is symmetric, then the reduced block tridiagonal matrix is symmetric [johnsson-85e]. This is not the case with the partitioning employed by Sameh et al. []. (Also, if the original matrix is diagonally dominant, so is the reduced system.) .sp (2) If the matrix is symmetric, then the condition number of the partitioned matrices (@ A sub i sub 11 @) is never worse than that of the original matrix. This follows from the eigenvalue interlacing properties of symmetric matrices. .PP The reduced system can be solved using block cyclic reduction. The advantage is that this phase can easily be parallelized. 
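.PP For illustration, the following is a sketch of odd-even cyclic reduction for the reduced system in the scalar case @m=1@; for @m>1@ the scalars become @m times m@ blocks and the divisions become block solves. It is an illustrative Python sketch under our own naming, and it assumes that no pivoting is needed (as for the diagonally dominant or symmetric positive definite systems considered here).
.nf
import numpy as np

def cyclic_reduction(a, b, c, f):
    # sub-, main-, and superdiagonals a, b, c and right-hand side f;
    # a[0] and c[-1] are ignored
    n = len(b)
    if n == 1:
        return np.array([f[0] / b[0]])
    even = np.arange(0, n, 2)
    na = np.zeros(len(even)); nb = np.zeros(len(even))
    nc = np.zeros(len(even)); nf = np.zeros(len(even))
    for j, i in enumerate(even):
        nb[j], nf[j] = b[i], f[i]
        if i > 0:                        # fold the odd row above into row i
            alpha = a[i] / b[i - 1]
            nb[j] -= alpha * c[i - 1]; nf[j] -= alpha * f[i - 1]
            if i - 2 >= 0:
                na[j] = -alpha * a[i - 1]
        if i + 1 < n:                    # fold the odd row below into row i
            gamma = c[i] / b[i + 1]
            nb[j] -= gamma * a[i + 1]; nf[j] -= gamma * f[i + 1]
            if i + 2 < n:
                nc[j] = -gamma * c[i + 1]
    x = np.zeros(n)
    x[even] = cyclic_reduction(na, nb, nc, nf)   # recurse on the half-size system
    for i in range(1, n, 2):             # back-substitute the odd unknowns
        x[i] = f[i] - a[i] * x[i - 1]
        if i + 1 < n:
            x[i] -= c[i] * x[i + 1]
        x[i] /= b[i]
    return x
.fi
.PP Each level of the recursion operates on independent equations and can be performed concurrently, which is the source of the @ log P @ factor in the communication and arithmetic counts quoted above.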
.PP The partitioning algorithm described here uses an elimination order that can be obtained through incomplete nested dissection [George-73] [George-81]. A banded matrix with half bandwidth @m@ corresponds to a graph in which each node is connected to all its preceding @m@ nodes as well as to all its succeeding @m@ nodes, except for the first and last sets of @m@ nodes. Separators are of size @m@. The block matrices @A sub i sub 22@ correspond to the edges between the nodes of a separator, the adjacent left and top triangular blocks to edges to one set of @N/P-m@ nodes, and the adjacent right and bottom triangular blocks to edges to a disjoint set of @N/P-m@ nodes. Choosing separators as bisectors recursively yields a complete binary tree as an elimination tree (Figure 11). .cs 1 24 .ps 8 .nf _____ | m | |___| ^ / \e / \e / \e / \e / \e __/__ __\e__ | m | | m | |___| |___| ^ ^ / \e / \e / \e / \e / \e / \e __/__ __\e__ __/__ __\e__ | m | | m | | m | | m | |___| |___| |___| |___| / \e / \e / \e / \e / \e / \e / \e / \e / \e / \e / \e / \e __/__ __\e__ __/__ __\e__ __/__ __\e__ __/__ __\e__ | | | | | | | | | | | | | | | | | k | | k | | k | | k | | k | | k | | k | | k | |___| |___| |___| |___| |___| |___| |___| |___| .fi .ps 12 .cs 1 .ce Figure 11. Elimination tree generated by recursive bisection .ce (incomplete nested dissection) .PP The elimination order of the algorithm corresponds to the elimination of the leaf variables prior to the elimination of any variables of the internal nodes of the elimination tree. No particular order is implied for the elimination of variables in different leaf nodes, but the variables within a leaf node of the elimination tree are eliminated in order of increasing (or decreasing) labels. The elimination of the variables corresponding to the internal nodes from the leaves towards the root corresponds to the elimination order of block cyclic reduction, whereas an elimination in inorder yields block Gaussian elimination [Johnsson-85e]. .SH 6. Comparison of the complexity of the different strategies for concurrent elimination .PP The number of communication startups for the partitioning method described in section 5 is potentially a factor of @ alpha m log P /(2N sqrt {P} )@ less than that of the consecutive two-dimensional partitioning, which has the fewest startups of the algorithms exploiting the independence of operations for the elimination of a single variable. The data transfer may be a factor of @ alpha mlogP /(N sqrt { P } )@ times that of the consecutive two-dimensional partitioning. To fully realize this potential it is necessary that the communication bandwidth of the architecture is proportional to @P@, which is most easily accomplished in a network architecture. .PP We conclude that with respect to the communication complexity, algorithms exploiting the independence of operations for eliminating different variables offer substantial reductions compared to algorithms exploiting the independence of operations for eliminating single variables for most banded matrices. .PP The comparison with respect to the arithmetic complexity is not so favorable. The substructured elimination causes substantial fill-in. Indeed, the arithmetic complexity of Phases 1 and 2 alone is approximately @ 4 @ times that of the sequential algorithm for the general case (for @N/P-m@ equations), and @ 7 @ times for the symmetric case. 
Hence, the efficiency (speedup / (number of processors)) is limited, and the number of processing elements required to achieve any speedup is expected to be at least @ 5 @ to @ 10 @ if communication complexity and the time for the reduced system solve are included. A very modest speedup can be expected on current bus-based global storage architectures because of the small number of processors available. The situation for distributed storage architectures is more favorable in that they offer a larger number of processing elements. .PP Both kinds of concurrency can clearly be exploited simultaneously. The arithmetic complexity is reduced and the communication complexity increased. Exploiting both forms of concurrency may represent a problem with respect to available compilers on some machines (such as the Alliant) unless some other device is used to gain access to the parallel features [dongarra and sorensen]. .SH 7. Experiments .PP The partitioning algorithm exploiting the independence of operations during the elimination of different variables has been implemented on two shared-memory systems, the Sequent Balance 21000 and an Alliant FX/8, and for the case @m=1@ (tridiagonal systems) also on the Intel iPSC/d7 and the Connection Machine [Hillis-85]. The Sequent has 10 processors and the Alliant 8 processors. The Alliant also has vector capability on each of the processors with a vector register length of 32. .SH 7.1 Concurrent elimination of variables .SH 7.1.1 Phase 1. Factorization .PP The factorization is implemented using LINPACK routines. The peak performance of the band algorithm from LINPACK on the Alliant FX/8 is about 3 Mflops [], which is also achieved for some phases of the algorithm. The vector features and the cache of the architecture have interesting effects on the performance as indicated by the following data for one processor. .ce 2 Alliant FX/8, one processor used Factorization, time in sec m P=1 P=2 P=4 P=8 P=16 2 N=1000 0.18 0.18 0.20 0.18 0.18 N=2000 0.40 0.38 0.38 0.40 0.40 N=4000 0.77 0.77 0.77 0.78 0.80 N=8000 1.53 1.55 1.55 1.55 1.60 8 N=1000 0.42 0.42 0.40 0.40 0.37 N=2000 0.83 0.83 0.85 0.83 0.83 N=4000 1.67 1.68 1.70 1.70 1.75 N=8000 3.37 3.40 3.43 3.47 3.63 16 N=1000 0.80 0.80 0.77 0.68 0.53 N=2000 1.62 1.58 1.57 1.53 1.45 N=4000 3.20 3.18 3.22 3.23 3.30 N=8000 6.47 6.47 6.50 6.63 6.97 32 N=1000 1.72 1.63 1.40 0.98 0.33 N=2000 3.47 3.42 3.17 2.70 1.90 N=4000 7.05 6.85 6.77 6.30 5.38 N=8000 14.55 14.68 14.65 14.92 15.55 .PP For a small bandwidth the effect on the execution time due to an increased number of partitions is small. The total number of arithmetic operations is approximately @2 m sup 2 (N-(P-1)m)@. The overhead in managing loop iterations seems to dominate. But for a half bandwidth of @ 32@, the number of arithmetic operations is reduced significantly. The reduced number of arithmetic operations is part of the explanation for the reduced execution time for increasing @P@ even if only one processor is used for the factorization. However, the reduction is much larger than predicted from the number of arithmetic operations alone, as seen from the table below. The running time is normalized to the time for @P=1@ for each @N@ and @ m @. The measured relative time is given first, the predicted @ (1- (P-1)m/N)@ second. 
.ce 3 Alliant FX/8, one processor used Factorization Relative execution time; measured:predicted m P=2 P=4 P=8 P=16 2 N=1000 (1 :1 ) (1.11:1 ) (1 :0.98) (1 :0.97) N=2000 (0.95:1 ) (0.95:1 ) (1 :0.99) (1 :0.98) N=4000 (1 :1 ) (1 :1 ) (1.01:1 ) (1.04:0.99) N=8000 (1.01:1 ) (1.01:1 ) (1.01:1 ) (1.05:1 ) 8 N=1000 (1 :0.99) (0.95:0.98) (0.95:0.94) (0.88:0.88) N=2000 (1 :1 ) (1.02:0.99) (1 :0.97) (1 :0.94) N=4000 (1.01:1 ) (1.02:0.99) (1.02:0.99) (1.05:0.97) N=8000 (1.01:1 ) (1.02:1 ) (1.03:0.99) (1.08:0.99) 16 N=1000 (1 :0.98) (0.96:0.95) (0.85:0.89) (0.66:0.76) N=2000 (0.98:0.99) (0.97:0.98) (0.94:0.94) (0.90:0.88) N=4000 (0.99:1 ) (1.01:0.99) (1.01:0.97) (1.03:0.94) N=8000 (1 :1 ) (1 :0.99) (1.02:0.99) (1.08:0.97) 32 N=1000 (0.95:0.97) (0.81:0.90) (0.57:0.78) (0.19:0.52) N=2000 (0.99:0.98) (0.91:0.95) (0.78:0.89) (0.55:0.76) N=4000 (0.97:0.98) (0.96:0.98) (0.89:0.94) (0.76:0.88) N=8000 (1.01:1 ) (1.01:0.99) (1.03:0.97) (1.07:0.94) .PP The effect on performance of an increased value of @P@ is most dramatic for @m=32@, but quite significant even for @m=16@. For large values of @N@ (@N=8000 @ in our experiments) the performance degrades with an increased number of partitions, but for fewer than @N=4000@ equations the improvement is markedly better than predicted from arithmetic work alone. Figure 12 illustrates this point. Figure 12. The relative time for factorization on 1 processor as a function of @P@. .PP The execution rate (number of floating-point operations per second) increases with decreasing @N@ for all values of @P@ and @m@. The largest dependence on @N@ is measured for @m=32@ and @P=16@. On the Alliant FX/8 using only Fortran the maximum observed floating-point rate was @4 Mflops@ for 1 processor and the lowest was @40 kflops@, a difference by a factor of @90@. For 8 processors the maximum floating-point speed for factorization using LINPACK varied from 193 kflops (@m=2@) to 16.06 Mflops (@N=2000, m=32@). The performance can be improved considerably by careful data management and assembly coding [Jalby-86]. Alliant FX/8 Factorization, 1 Processor Floating-point operations per second (kflops/sec) m P=1 P=2 P=4 P=8 P=16 2 N=1000 44 44 40 44 43 N=2000 40 42 42 40 39 N=4000 42 42 42 41 40 N=8000 42 41 41 41 40 8 N=1000 303 301 313 305 314 N=2000 308 307 298 301 294 N=4000 306 304 299 298 286 N=8000 304 301 298 293 279 16 N=1000 633 626 635 686 794 N=2000 629 641 638 639 643 N=4000 638 640 629 620 593 N=8000 632 631 627 611 575 32 N=1000 1164 1202 1335 1725 4039 N=2000 1168 1172 1236 1731 1779 N=4000 1156 1183 1184 1244 1390 N=8000 1123 1110 1106 1074 1008 .PP The rate of floating-point operations is shown in Figure 13. Figure 13. The achieved floating-point speed on 1 processor during factorization. .PP The minimum execution times for factorization on 2, 4, and 8 processors are given in the next table. The numbers within parentheses give the number of partitions for which the minimum time was achieved. Alliant FX/8 Execution time for factorization, sec m 2 proc. 4 proc. 8 proc. 
2 N=1000 0.08 (4) 0.05 (16) 0.033 (16) N=2000 0.18 (4) 0.13 (16) 0.083 (16) N=4000 0.38 (2,4,8,16) 0.25 (16) 0.17 (16) N=8000 0.77 (2,4) 0.52 (16) 0.33 (16) 8 N=1000 0.20 (8,16) 0.10 (4,8,16) 0.067 (8,16) N=2000 0.42 (4,8) 0.22 (4,8) 0.12 (8,16) N=4000 0.85 (2,4,8) 0.45 (4,8) 0.25 (8) N=8000 1.70 (2) 0.92 (4,8) 0.52 (8) 16 N=1000 0.26 (16) 0.13 (16) 0.083 (16) N=2000 0.75 (16) 0.38 (16) 0.25 (16) N=4000 1.62 (2) 0.88 (4,8) 0.57 (8,16) N=8000 3.28 (2) 1.80 (4) 1.18 (8) 32 N=1000 0.18 (16) 0.10 (16) 0.083 (16) N=2000 1.00 (16) 0.57 (16) 0.38 (16) N=4000 2.80 (16) 1.65 (16) 1.18 (16) N=8000 7.80 (2) 4.72 (16) 3.47 (8) .PP The speedup is summarized in the following table. Though the execution times are repeatable, the small total execution times for small @N@ and @m@ are the cause of some imprecision in the speedup figures. The efficiency for @P=4@ is in the range @75% - 90%@. For @P=8@ the efficiency is rather unsatisfactory, @55% - 85%@. Alliant FX/8 Speedup m 2 proc. 4 proc. 8 proc. 2 N=1000 2.17 3.40 5.46 N=2000 2.11 2.92 4.58 N=4000 2.03 3.08 4.53 N=8000 1.98 2.94 4.64 8 N=1000 2.10 4.20 6.27 N=2000 1.98 3.77 6.92 N=4000 1.96 3.71 6.68 N=8000 1.98 3.66 6.48 16 N=1000 2.04 4.08 6.39 N=2000 1.93 3.82 5.80 N=4000 1.96 3.61 5.58 N=8000 1.97 3.59 5.48 32 N=1000 1.83 3.30 3.98 N=2000 1.90 3.33 5.00 N=4000 1.92 3.26 4.56 N=8000 1.87 3.08 4.19 .PP ??????? .PP For the Sequent Balance 21000, which does not have vector features, the measured performance increase for phase 1 as a function of the half bandwidth was a factor of 6 from @m=2 @ to @ m=32 @. The actual floating-point performance increased from about @5 ~kflops@ to about @30 ~kflops@ per processor. The execution time was proportional to the size of a partition. .PP The speedup as a function of the number of partitions for phase 1 was proportional to @P@ for all values of @m@ and @N@ for the Alliant, except for @N=8k@ and @m=2@, which produced anomalous data, probably due to cache conflicts. The speedup for the Sequent Balance 21000 was superlinear for @m=32@. The speedup increased with decreasing system size. The speedup for @m=2@ was slightly sublinear for @N=2k@ and strongly sublinear for smaller systems. We cannot account for this phenomenon. .SH 7.1.2 Phase 2. Local Solution. .PP The essential computation in phase 2 is a solve operation in each partition. Fill-in is generated and the total amount of work increases with @P@ approximately as @6m sup 2 N(1-1/P)@. The tables below show the measured running times for phase 2 with relative running times normalized to @P=2@ given within parentheses. 
Alliant FX/8 Solve, 1 Processor, time in sec m P=2(1) P=4(1.5) P=8(1.75) P=16(1.875) 2 N=1000 0.13 (1) 0.18 (1.38) 0.18 (1.38) 0.20 (1.54) N=2000 0.27 (1) 0.37 (1.37) 0.40 (1.48) 0.42 (1.56) N=4000 0.55 (1) 0.72 (1.31) 0.80 (1.45) 0.83 (1.51) N=8000 1.10 (1) 1.47 (1.34) 1.62 (1.47) 1.67 (1.52) 8 N=1000 0.50 (1) 0.68 (1.36) 0.73 (1.46) 0.73 (1.46) N=2000 1.05 (1) 1.43 (1.36) 1.57 (1.50) 1.58 (1.50) N=4000 2.12 (1) 2.98 (1.41) 3.28 (1.55) 3.33 (1.57) N=8000 4.33 (1) 6.10 (1.41) 6.87 (1.59) 7.02 (1.62) 16 N=1000 1.23 (1) 1.62 (1.32) 1.63 (1.33) 1.37 (1.11) N=2000 2.55 (1) 3.58 (1.40) 3.70 (1.45) 3.47 (1.36) N=4000 5.20 (1) 7.50 (1.44) 8.22 (1.58) 7.88 (1.52) N=8000 10.57 (1) 15.45 (1.46) 17.38 (1.64) 17.58 (1.66) 32 N=1000 3.62 (1) 4.40 (1.22) 3.67 (1.01) 1.78 (0.49) N=2000 7.55 (1) 10.57 (1.40) 10.07 (1.33) 7.55 (1.00) N=4000 15.37 (1) 22.20 (1.44) 24.37 (1.59) 21.28 (1.38) N=8000 31.03 (1) 46.15 (1.49) 52.65 (1.70) 52.07 (1.68) .PP As seen from the table, the growth rate predicted from the arithmetic complexity alone is fairly accurate, except for @N < 8000@ and @m>8@. The approximate rate of floating-point operations varies as shown in the next table. Alliant FX/8 Solve, 1 Processor Floating-point operations per second (kflops/sec) m P=2 P=4 P=8 P=16 2 N=1000 122 131 151 142 N=2000 118 129 138 139 N=4000 116 133 139 143 N=8000 116 130 138 143 8 N=1000 406 436 449 427 N=2000 391 425 441 444 N=4000 390 416 433 445 N=8000 383 407 419 433 16 N=1000 619 668 691 666 N=2000 612 638 683 695 N=4000 608 624 648 687 N=8000 602 614 629 649 32 N=1000 782 862 915 740 N=2000 790 804 879 953 N=4000 796 806 813 891 N=8000 799 796 793 816 .PP The floating-point rate is generally increasing with @m@ and @P@, and decreasing as a function of @N@. For @m=2@ the rate is about @3@ times higher than the rate for factorization, approximately @ 40% @ higher for @m=8 @, about the same for @m=16@, and only @65%@ of the rate for factorization for @m=32@. The rate varies by a factor of @ 8 @ (compared to a variation by a factor of @ 90 @ for the factorization). The floating-point rate is shown in Figure 14. Figure 14. The floating-point speed for forward and backward solve on one processor. .PP The speed-up for the solve routine if the number of partitions matches the number of processing elements is expected to be 1.5, 3, 7, and 15, respectively, for @P=2@, @P=4@, @P=8@, and @P=16@ compared to the one-processor case (with the corresponding value of @P@). Note that the total work increases with @P@. The speed-up with @P@ being equal to the number of partitions is @P/(3m)@ compared to the sequential algorithm, which amounts to a slow-down for the cases reported here. The figures in parentheses in the table of execution times for the different numbers of processors give the number of partitions for which the minimum execution time was achieved. The fact that @P=16@ is seemingly always the best of the considered values is in part due to the fact that the compiler failed to parallelize one loop properly. Alliant FX/8 Execution time for solve, sec m 2 proc. 4 proc. 8 proc. 
2 N=1000 0.10 (16) 0.067(16) 0.033(16) N=2000 0.23 (8,16) 0.15 (16) 0.10 (16) N=4000 0.47 (16) 0.30 (16) 0.20 (16) N=8000 0.93 (16) 0.62 (16) 0.40 (16) 8 N=1000 0.40 (16) 0.20 (16) 0.12 (16) N=2000 0.88 (16) 0.47 (16) 0.30 (8,16) N=4000 1.87 (16) 1.02 (16) 0.62 (8,16) N=8000 3.97 (16) 2.13 (16) 1.27 (8,16) 16 N=1000 0.78 (16) 0.42 (16) 0.25 (16) N=2000 1.92 (16) 1.07 (16) 0.70 (16) N=4000 4.47 (16) 2.47 (16) 1.58 (16) N=8000 9.73 (16) 5.27 (16) 3.37 (16) 32 N=1000 0.98 (16) 0.53 (16) 0.32 (16) N=2000 4.27 (16) 2.45 (16) 1.65 (16) N=4000 12.25 (16) 6.87 (16) 4.58 (16) N=8000 28.35 (16) 15.52 (16) 10.45 (16) .PP The speedup is summarized in the following table. Alliant FX/8 Speedup of solve m 2 proc. 4 proc. 8 proc. 2 N=1000 1.30 1.94 3.93 N=2000 1.17 1.80 2.70 N=4000 1.17 1.83 2.75 N=8000 1.18 1.77 2.75 8 N=1000 1.25 2.50 4.17 N=2000 1.19 2.23 3.50 N=4000 1.13 2.08 3.42 N=8000 1.09 2.03 3.41 16 N=1000 1.58 2.93 4.92 N=2000 1.33 2.38 3.64 N=4000 1.16 2.11 3.29 N=8000 1.09 2.01 3.14 32 N=1000 1.82 3.36 5.56 N=2000 1.77 3.08 4.58 N=4000 1.25 2.24 3.36 N=8000 1.09 2.00 2.97 .PP The speedup for 2 processors is @80% - 100%@ of the theoretical value, for 4 processors it is @60% - 100%@, and for 8 processors @40% - 80%@. The ability of the compiler and the architecture to take advantage of the available concurrency is rather unsatisfactory. .PP Part of the discrepancy between the expected speed-up and the measured speed-up for the solve phase is due to the characteristics of the compiler for the Alliant FX/8. It parallelizes one loop level (outermost unless instructed otherwise). It fails to parallelize the solve code for two partitions, since the code handles the fill-in corresponding to a distinct set of solution variables in each iteration. Writing the code such that each loop iteration only contains the operation for one partition forces the introduction either of conditionals to handle the first and last partitions, which are special cases, or of sequential code outside the loop for these partitions. In either case the compiler either fails completely to parallelize, or has to be "tricked" into parallelizing the code properly. For sufficiently large values of @P@ this problem should be of minor influence. .PP The efficiency of the partitioning method is critically dependent on the time for the solve routine compared to the factorization routine for small values of @P@. Most of the time spent in the solve routine is due to fill-in, i.e., the price paid for the partitioning strategy (or the nested dissection elimination order). For all partitions but the first and last the ratio of the arithmetic complexity for the solve to that of factorization is approximately @ 3 @. For one processor the ratio is @3(1-1/P)@. Alliant FX/8 Solve/Factorization time, 1 processor m P=2 (1.5) P=4 (2.25) P=8 (2.625) 2 N=1000 1.25 1.34 1.00 N=2000 1.28 1.15 1.20 N=4000 1.24 1.20 1.18 N=8000 1.21 1.19 1.21 8 N=1000 2.00 2.00 1.79 N=2000 2.10 2.14 2.50 N=4000 2.20 2.27 2.48 N=8000 2.34 2.32 2.44 16 N=1000 3.00 3.23 3.01 N=2000 2.56 2.82 2.80 N=4000 2.76 2.81 2.77 N=8000 2.97 2.93 2.86 32 N=1000 5.44 5.30 3.86 N=2000 4.27 4.30 4.34 N=4000 4.38 4.16 3.88 N=8000 3.63 3.29 3.01 .PP As can be seen from the table, the relative solve time is lower than expected for @m = 2@, higher than expected for @m=8@ and @P=2@, and slightly lower than expected for @m=8, P>2@. For @m>8@ the relative solve time is higher than expected, and significantly so for @m=32@. 
Consequently, the LINPACK-based parallel algorithm performing concurrent elimination of variables is less time consuming than predicted for small bandwidths. .PP The corresponding ratios for the Sequent Balance 21000 are given in the following table. Sequent Balance 21000 phase 2/ phase 1 m N=2048 N=1024 N=512 N=256 2 2.78 2.89 2.5 2.9 8 4.13 4.3 4.0 4.1 16 4.46 4.4 4.4 4.7 32 4.61 4.6 4.6 - With the exception of the @N = 8k@ case, Phase 2 on the Alliant requires less time than predicted, if overhead is ignored. Note that the ratio is relatively independent of @N@, as expected, and approximately also independent of @m@. It is also interesting to note that the ratio is approximately the same for @m=2@ and @m=32@, but is higher for intermediate values (by 30 - 50%). For the Sequent Balance 21000 the ratio is generally higher than expected, and increases as a function of @m@, with a relatively small increase for @m > 8@. .SH 7.1.3 Phase 3. Solution of the reduced block tridiagonal system .PP For phase 3 and block Gaussian elimination the expected solution time is proportional to @P@ and @m sup 3@. The measured performance is approximately linear in @P@, but doubling the bandwidth only quadrupled the solution time on the Alliant. Since the block elimination is inherently sequential, except for the possibility of performing two-way elimination, the execution time should be independent of the number of processing elements, since no parallelization of block matrix (LINPACK) operations is performed. The measurements concur with the predicted behavior. Alliant FX/8 Time for Block Gaussian elimination m P=2 P=4 P=8 P=16 2 0 0 0 0 8 0 0.03 0.06 0.13 16 0.01 0.08 0.22 0.50 32 0.03 0.32 0.90 2.00 .PP The block cyclic reduction solver requires the same number of arithmetic operations for @P=2@ and @P=4@. The total number of arithmetic operations for @P=8@ is higher. The block cyclic reduction solver requires approximately @1.26@ times as many operations as the block Gaussian elimination solver, but the parallel arithmetic complexity is lower. However, the Alliant compiler did not manage to parallelize our cyclic reduction code. The running time for @P=4@ was approximately @1.15@ times that of the Gaussian elimination code, and for @P=8@ it was approximately @1.52@ for 1 processor, @1.42@ for 2 processors, and @1.40@ for 4 or more processors. .PP For the Sequent Balance 21000 the dependence is approximately linear in @P@. The dependence on @m@ is approximately cubic. Sequent Balance 21000 Time for Block Gaussian elimination m P=4 P=8 2 0.005 0.075 8 0.583 1.25 16 3.60 7.65 32 25.25 53.63 .SH 7.1.4 Phase 4. Back substitution .PP The time for phase 4 is small relative to the time for the other phases and offers no particular insight. .SH 7.1.5 Total time for the partitioning method .PP The total time for the solution of the different banded systems using block Gaussian elimination for the reduced system is given in the table below. 
.ce 3 Alliant FX/8 P= # of processors, time in sec m P=1 P=2 P=4 P=8 2 N=1000 0.27 0.25 0.18 0.12 N=2000 0.52 0.48 0.32 0.23 N=4000 1.00 0.92 0.60 0.42 N=8000 1.98 1.83 1.20 0.80 8 N=1000 0.50 0.75 0.42 0.30 N=2000 1.00 1.47 0.82 0.52 N=4000 2.00 2.97 1.62 0.98 N=8000 4.02 5.98 3.22 1.93 16 N=1000 0.93 1.68 0.98 0.75 N=2000 1.87 3.45 1.92 1.35 N=4000 3.67 6.98 3.81 2.55 N=8000 7.38 13.98 7.55 4.97 32 N=1000 1.92 4.25 2.67 2.12 N=2000 3.87 8.53 5.35 3.97 N=4000 7.82 18.43 10.73 7.63 N=8000 16.08 39.55 22.03 15.55 .PP Our implementation of the partitioning strategy on the Alliant FX/8 yields a speedup for @P>1, m=2 @, for @P>=4, m=8 @, and for @P>= 8, m=16@. For half bandwidth @ 32 @ there is a slight speedup for @N>= 4000 @, and a slight slow-down for smaller systems. The total time for the partitioning method on the Alliant is shown as a function of @P@ with @N@ as a parameter in Figures 15 and 16 for @m= 2@ and @m=32@, respectively. Figure 15. Total solution time on the Alliant FX/8 for @m=2@. Figure 16. Total solution time on the Alliant FX/8 for @m=32@. .SH 7.2 One-dimensional partitioning .PP As described in section 4, as an alternative to the concurrent elimination of different variables, a one-dimensional partitioning can be used for the concurrent elimination of a single variable. The following table gives the solution times when the factorization is performed in parallel by carrying out column operations concurrently. .ce 3 Alliant FX/8 One-dimensional partitioning P= # of processors, time in sec m P=1 P=2 P=4 P=8 2 N=1000 0.28 0.25 0.23 0.23 N=2000 0.55 0.48 0.45 0.43 N=4000 1.07 0.97 0.85 0.87 N=8000 2.12 1.87 1.70 1.70 8 N=1000 0.58 0.40 0.31 0.32 N=2000 1.13 0.80 0.63 0.60 N=4000 2.23 1.55 1.25 1.15 N=8000 4.42 3.12 2.48 2.30 16 N=1000 1.08 0.67 0.48 0.42 N=2000 2.08 1.30 0.93 0.80 N=4000 4.17 2.62 1.88 1.58 N=8000 8.37 5.22 3.73 3.13 32 N=1000 2.07 1.18 0.78 0.62 N=2000 4.10 2.35 1.55 1.22 N=4000 8.75 5.00 3.23 2.48 N=8000 19.77 11.45 7.80 5.97 .PP For a small bandwidth the parallelism for the column partitioning is very limited, and as seen for the case @m=2@ the execution time does not decrease beyond the time for 2 processors. For @m=8@ the time decreases for up to 4 processors. The speedup is by a factor of up to about 1.8. For a half bandwidth of @m=32@ the speedup is up to 3.4. .PP Comparing the execution times of our implementation of the partitioning method described in section 5 with those of the one-dimensional partitioning, we conclude that for a small bandwidth the concurrent elimination of multiple variables yields a lower execution time for any number of processors, whereas for a half bandwidth of @ 8 @ it first has a lower execution time at @8@ processors, the maximum on the Alliant FX/8. For a larger bandwidth the one-dimensional partitioning always gave a lower execution time. Figures 17 and 18 show the corresponding data for the Sequent Balance 21000. Figure 17. Total solution time on the Sequent for @N=256@ as a function of @p@ and @m@. Figure 18. Total solution time on the Sequent for @N=256@ as a function of @p@ and @m@. .PP The results of the experiments are listed below. .ce Sequent Balance 21000 .ce @n ~=~ 2000@, @m~=~ 32@ .TS center; c|c c c c l|n n n n. 1 2 4 8 _ Factor 113 55 28 13 Solve 535 269 129 59 Reduce .5 4 11 25 Backsolve 3 5.3 2.8 2.4 Total 658 343 180 112 .TE .sp 4 .ce Sequent Balance 21000 .ce @n ~=~ 2000@, @m~=~ 16@ .TS center; c|c c c c l|n n n n. 
.PP
Figures 17 and 18 show the corresponding data for the Sequent Balance 21000.
.sp
.ce
Figure 17. Total solution time on the Sequent for @N = 256@ as a function of @p@ and @m@.
.sp
.ce
Figure 18. Total solution time on the Sequent for @N = 256@ as a function of @p@ and @m@.
.PP
The results of the experiments are listed below.
.sp
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m ~=~ 32@
.TS
center tab(:);
c|c s s s
c|c c c c
l|n n n n.
:Processors
:1:2:4:8
_
Factor:113:55:28:13
Solve:535:269:129:59
Reduce:.5:4:11:25
Backsolve:3:5.3:2.8:2.4
Total:658:343:180:112
.TE
.sp 4
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m ~=~ 16@
.TS
center tab(:);
c|c s s s
c|c c c c
l|n n n n.
:Processors
:1:2:4:8
_
Factor:30:15:8.1:4.7
Solve:147:75:37:19
Reduce:.06:.6:1.6:3.7
Backsolve:1.6:2.6:1.2:.6
Total:181:97:51:32
.TE
.sp 4
.ce
Sequent Balance 21000
.ce
@n ~=~ 2000@, @m ~=~ 8@
.TS
center tab(:);
c|c s s s
c|c c c c
l|n n n n.
:Processors
:1:2:4:8
_
Factor:10:5:2.7:1.3
Solve:42:22:1.3:8.5
Reduce:.02:.1:.27:.6
Backsolve:.8:1.4:.6:.3
Total:54:30:18:13
.TE
.SH
8. Conclusions
.PP
We have described and analyzed algorithms for the concurrent solution of banded systems of equations. A concurrent algorithm based on a partitioning strategy equivalent to elimination by nested dissection has been implemented on two shared-memory machines with limited parallelism. The partitioning method offers a low communication complexity compared to concurrent elimination of single variables, and it has numerical advantages over some previously implemented partitioning methods. The arithmetic complexity of the parallel algorithm is higher than that of standard direct solvers for band matrix problems; at least 5 processors are needed for the parallel arithmetic complexity to be lower than that of standard Gaussian elimination. Because the computation of the fill-in is relatively more efficient than the factorization for small bandwidths, the measured running times show that our LINPACK-based implementation of the partitioning algorithm on the Alliant FX/8 has a lower execution time than the sequential algorithm (using LINPACK) for any number of processors. However, for bandwidths approaching the length of the vector registers the factorization routine uses the architecture more efficiently than the solve routine, and the computation of the fill-in is relatively more expensive than predicted, moving the break-even point for the partitioning method to a higher number of processors than predicted. Indeed, for @m=32@ the break-even point occurs at 8 processors in our implementation. Part of the reason for the higher than predicted break-even point is that the compiler failed to parallelize some computations that can be done in parallel. If several loop levels could be parallelized, a higher processor utilization and a lower execution time would be likely. (The current Alliant compiler only allows parallelization of one loop level.) The one-dimensional partitioning yields a speedup that is insignificant for a half bandwidth of @m=2@, but reaches 3.4 for 8 processors and a half bandwidth of @m=32@. Consequently, the concurrent elimination of multiple variables yields a lower execution time for any number of processors at a half bandwidth of 2, and only for 8 processors at a half bandwidth of 8. Since the Alliant is limited to eight processors, we were unable to establish a break-even point for larger bandwidths.
.SH
9. Acknowledgment
.PP
We would like to thank Tom Hewitt of CRAY Research for his assistance with the CRAY data.
.SH
10. References
.IP []
Lawrie and Sameh,
.I
The computation and communication complexity of parallel banded system solvers,
.R
ACM TOMS.
.sp
.IP []
Dongarra and Sameh,
.I
On Some Parallel Banded System Solvers,
.R
Parallel Computing, vol 1, no 3, Dec 1984, pp 223-235.
.sp
.IP []
U. Meier,
.I
A Parallel Partition Method for Solving Banded Systems of Linear Equations,
.R
Parallel Computing, 2 (1985), pp 33-45.
.sp
.IP []
C. Ashcroft,
.I
Parallel Reduction Methods for the Solution of Banded Systems of Equations,
.R
??
.sp
.IP []
L. Johnsson,
.I
Solving Narrow Banded Systems on Ensemble Architectures,
.R
??
.sp
.IP []
J. Dongarra, J. Bunch, C. Moler, and G. Stewart,
.I
LINPACK Users' Guide,
.R
SIAM Pub, 1976.
.sp
.IP []
C. Lawson, R. Hanson, D. Kincaid, and F. Krogh,
.I
BLAS,
.R
ACM TOMS.
.sp
.IP []
J. Dongarra, F. Gustavson, and A. Karp,
SIAM Review.
.sp
.IP []
J. Dongarra and D. Sorensen,
.I
An Environment for Implementing Explicit Parallel Processing in Fortran,
.R
ANL MCS-TM-79, 1986.
.sp
.IP []
J. Dongarra and S. Eisenstat,
.I
Squeezing the Most,
.R
ACM TOMS.
.sp
.IP []
J. Wilkinson, private communication, 1976.
.sp
.IP []
D. Evans and M. Hatzopoulos,
.I
The solution of certain banded systems of linear equations using the folding algorithm,
.R
Computer Journal, 19, pp 184-187, 1976.
.sp
.IP []
Bhatt S.N., Ipsen I.C.F.,
.I
How to Embed Trees in Hypercubes,
.R
Yale University, Dept. of Computer Science, YALEU/CSD/RR-443, December 1985.
.sp
.IP []
Crowther W., Goodhue J., Starr E., Thomas R., Milliken W., Blackadar T.,
.I
Performance Measurements on a 128-node Butterfly Parallel Processor,
.R
Proceedings of the 1985 International Conference on Parallel Processing, IEEE Computer Society, pp 531-540, 1985.
.sp
.IP []
Deshpande S.R., Jenevein R.M.,
.I
Scalability of a Binary Tree on a Hypercube,
.R
University of Texas at Austin, TR-86-01, January 1986.
.sp
.IP []
George A., Liu J.W.,
.I
Computer Solution of Large Sparse Positive Definite Systems,
.R
Prentice-Hall, Englewood Cliffs, NJ, 1981.
.sp
.IP []
George A.,
.I
Nested Dissection of a Regular Finite Element Mesh,
.R
SIAM J. on Numer. Anal., vol 10, pp 345-363, 1973.
.sp
.IP []
Gottlieb A., Grishman R., Kruskal C.P., McAuliffe K.P., Rudolph L., Snir M.,
.I
The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer,
.R
IEEE Trans. Computers, vol C-32, pp 175-189, 1983.
.sp
.IP []
Ho C.-T., Johnsson S.L.,
.I
Tree Embeddings and Optimal Routing in Hypercubes,
.R
Yale University, Dept. of Computer Science, in preparation, 1986.
.sp
.IP []
Jalby W., Meier U.,
.I
Optimizing Matrix Operations on a Parallel Multiprocessor with a Memory Hierarchy,
.R
Univ. of Illinois, Center for Supercomputer Research and Development, 1986.
.sp
.IP []
Johnsson S.L.,
.I
Solving Tridiagonal Systems on Ensemble Architectures,
.R
SIAM J. Sci. Stat. Comp., 1986; also available as Report YALEU/CSD/RR-436, November 1985.
.sp
.IP []
Johnsson S.L.,
.I
Solving Narrow Banded Systems on Ensemble Architectures,
.R
ACM TOMS, 11, November 1985 (also available as Report YALEU/CSD/RR-418, November 1984).
.sp
.IP []
Johnsson S.L.,
.I
Band Matrix Systems Solvers on Ensemble Architectures,
.R
in Algorithms, Architecture, and the Future of Scientific Computation, University of Texas Press; also Technical Report YALEU/CSD/RR-388, Yale University, Dept. of Computer Science, 1985.
.sp
.IP []
Johnsson S.L.,
.I
Data Permutations and Basic Linear Algebra Computations on Ensemble Architectures,
.R
Yale University, Dept. of Computer Science, YALEU/CSD/RR-367, February 1985.
.sp
.IP []
Johnsson S.L.,
.I
Communication Efficient Basic Linear Algebra Computations on Hypercube Architectures,
.R
Yale University, Dept. of Computer Science, YALEU/CSD/RR-361, January 1985.
.sp
.IP []
Johnsson S.L.,
.I
Fast Banded Systems Solvers for Ensemble Architectures,
.R
Yale University, Dept. of Computer Science, YALEU/CSD/RR-379, March 1985.
.sp
.IP []
Johnsson S.L.,
.I
Dense Matrix Operations on a Torus and a Boolean Cube,
.R
The National Computer Conference, AFIPS, July 1985.
.sp
.IP []
Johnsson S.L.,
.I
Odd-Even Cyclic Reduction on Ensemble Architectures and the Solution of Tridiagonal Systems of Equations,
.R
Yale University, Dept. of Computer Science, YALEU/CSD/RR-339, October 1984.
.sp
.IP []
McBryan O.A., Van de Velde E.F.,
.I
Hypercube Algorithms and Implementations,
.R
Courant Institute of Mathematical Sciences, New York University, November 1985.
.sp
.IP []
Reingold E.M., Nievergelt J., Deo N.,
.I
Combinatorial Algorithms,
.R
Prentice Hall, 1977.
.sp
.IP []
Read R., Rose D.J.,
.I
A Graph-Theoretic Study of the Numerical Solution of Sparse Positive Definite Systems of Linear Equations,
.R
Graph Theory and Computing, Academic Press, pp 183-217, 1973.
.sp
.IP []
Pfister G.F., Brantley W.C., George D.A., Harvey S.L., Kleinfelder W.J., McAuliffe K.P., Melton E.A., Norton V.A., Weiss J.,
.I
The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture,
.R
Proceedings of the 1985 International Conference on Parallel Processing, IEEE Computer Society, pp 764-771, 1985.
.sp
.IP []
Saad Y., Schultz M.H.,
.I
Data Communication in Hypercubes,
.R
Yale University, Dept. of Computer Science, YALEU/DCS/RR-428, October 1985.
.sp
.IP []
Schwartz J.T.,
.I
Ultracomputers,
.R
ACM Trans. on Programming Languages and Systems, 2, pp 484-521, 1980.
.sp
.IP []
Smith B.J.,
.I
Architecture and Applications of the HEP Multiprocessor Computer System,
.R
Real-Time Signal Processing IV, Proc. of SPIE, pp 241-248, 1981.
.sp
.IP []
Sorensen D.C.,
.I
Buffering for Vector Performance on a Pipelined MIMD Machine,
.R
Parallel Computing, 1, pp 143-164, 1984.
.sp
.IP []
Johnsson S.L.,
.I
Gaussian Elimination on Sparse Matrices and Concurrency,
.R
Caltech Computer Science Department, 4087:TR:80, 1980.
.sp
.IP []
Gentleman M.W.,
.I
Implementing Nested Dissection,
.R
Dept. of Computer Science, Univ. of Waterloo, Research Report CS-82-03, March 1982.
.sp
.IP []
Wing O., Huang J.W.,
.I
A Computational Model of Parallel Solution of Linear Equations,
.R
IEEE Trans. Computers, vol C-29, no 7, pp 632-638, 1980.
.sp
.IP []
Liu J.W.H.,
.I
Computational Models and Task Scheduling for Parallel Sparse Cholesky Factorization,
.R
Dept. of Computer Science, York University, Downsview, Ontario, Technical Report CS-85-01, April 1985.
.sp
.IP []
George A., Heath M., Liu J., Ng E.,
.I
Sparse Cholesky Factorization on a Local-Memory Multiprocessor,
.R
Department of Computer Science, York University, Downsview, Ontario, Technical Report CS-86-01, 1986.
.sp
.IP []
Worley P.H., Schreiber R.,
.I
Nested Dissection on a Mesh-Connected Processor Array,
.R
Stanford University, Center for Large Scale Scientific Computation, CLaSSiC-85-08, 1985.
.sp
.IP []
Flynn M.J.,
.I
Very High-Speed Computing Systems,
.R
Proc. of the IEEE, 12, pp 1901-1909, 1966.
.sp
.IP []
Seitz C.L.,
.I
The Cosmic Cube,
.R
Communications of the ACM, vol 28, no 1, pp 22-33, 1985.
.sp
.IP []
Hillis W.D.,
.I
The Connection Machine,
.R
MIT Press, 1985.