The Intel Touchstone Delta computer system is a 16 × 32 mesh of i860 processors with a wormhole routing interconnection network, located at the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. The Delta's communication characteristics are described in .
In order to implement Algorithm 1, it was natural to rely on the ScaLAPACK 1.0 library (beta version). This choice allows us to exploit two key design features of this package. First, the ScaLAPACK library relies on the Parallel Block BLAS (PB-BLAS), which hide much of the interprocessor communication. This hiding of communication makes it possible to express most algorithms using only the PB-BLAS, avoiding explicit calls to communication routines. The PB-BLAS are implemented on top of calls to the BLAS and to the Basic Linear Algebra Communication Subroutines (BLACS). Second, ScaLAPACK assumes that the data is distributed according to the square block cyclic decomposition scheme, which allows the routines to achieve well-balanced computations and to minimize communication costs. ScaLAPACK includes subroutines for the LU, QR, and Cholesky factorizations, which we use as building blocks for our implementation. The PUMMA routines provide the required matrix multiplication.
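The square block cyclic scheme can be illustrated with a short sketch. The block size and process-grid shape below are illustrative only, not the parameters used in our experiments:

```python
# Sketch of the 2D square block cyclic mapping assumed by ScaLAPACK.
# A global matrix entry (i, j) is owned by process (pr, pc) in a
# P x Q process grid with square nb x nb blocks (illustrative values).

def owner(i, j, nb=4, P=2, Q=2):
    """Return the (row, col) grid coordinates of the process owning (i, j)."""
    pr = (i // nb) % P   # block row, wrapped cyclically over process rows
    pc = (j // nb) % Q   # block column, wrapped cyclically over process columns
    return pr, pc

# Entries within one nb x nb block share an owner; consecutive blocks wrap
# around the process grid, which keeps the load balanced as a factorization
# sweeps across the matrix.
print(owner(0, 0))   # -> (0, 0)
print(owner(4, 0))   # -> (1, 0): the next block row lands on the next process row
print(owner(8, 0))   # -> (0, 0): and then wraps back around
```

The cyclic wrapping is what balances the computation: in a factorization, the active trailing submatrix shrinks, but its blocks remain spread over the whole grid.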
Figure 2: Performance of ScaLAPACK 1.0 (beta version) subroutines on a 256-PE Intel Touchstone Delta system.
The matrix inversion is done in two steps. After the LU factorization A = LU has been computed, the upper triangular factor U is inverted, and the inverse A^{-1} = U^{-1}L^{-1} is obtained from U^{-1} by substitution with L. Using blocked operations leads to performance comparable to that obtained for the LU factorization.
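The two-step inversion can be sketched serially in a few lines. This is a NumPy/SciPy analogue of the blocked parallel routine, not the ScaLAPACK code itself:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))

# Step 1: LU factorization with partial pivoting, A = P @ L @ U.
P, L, U = lu(A)

# Step 2a: invert the upper triangular factor U by back substitution.
Uinv = solve_triangular(U, np.eye(n), lower=False)

# Step 2b: obtain A^{-1} by substitution with L.  Since
# A^{-1} = U^{-1} @ L^{-1} @ P.T, solve X @ L = U^{-1} via the
# transposed triangular system L.T @ X.T = Uinv.T, then undo the pivoting.
Ainv = solve_triangular(L.T, Uinv.T, lower=False).T @ P.T

assert np.allclose(Ainv @ A, np.eye(n))
```

The point of the two-step organization is that both the triangular inversion and the substitution phase can be cast as blocked (BLAS-3) operations, which is why the measured rate tracks that of the LU factorization itself.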
The implementation of the QR factorization, with or without column pivoting, is based on the parallel algorithm presented by Coleman and Plassmann. The QR factorization with column pivoting has a much larger sequential component: it processes one column at a time and must update the norms of the remaining column vectors at each step. This rules out blocked operations and induces high synchronization overheads. However, as we will see, the cost of this step remains negligible compared with the time spent in the Newton iteration. Unlike the QR factorization with pivoting, the QR factorization without pivoting and the post- and pre-multiplication by an orthogonal matrix do use blocked operations.
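The sequential character of the pivoted factorization is visible even in a textbook serial sketch (this is not the Coleman and Plassmann parallel code): each step must select a pivot from the downdated column norms before the next column can be eliminated, so the columns cannot be processed in blocks.

```python
import numpy as np

def qr_pivot(A):
    """Householder QR with column pivoting (serial textbook sketch).
    Returns R (upper triangular) and the pivot order."""
    A = A.astype(float).copy()
    m, n = A.shape
    piv = np.arange(n)
    norms = np.sum(A * A, axis=0)          # squared column norms
    for k in range(min(m, n)):
        # Pivot: bring the column with largest remaining norm to position k.
        j = k + int(np.argmax(norms[k:]))
        A[:, [k, j]] = A[:, [j, k]]
        piv[[k, j]], norms[[k, j]] = piv[[j, k]], norms[[j, k]]
        # Householder reflector annihilating A[k+1:, k].
        x = A[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        vn = np.linalg.norm(v)
        if vn > 0:
            v /= vn
            A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
        # Downdate the trailing column norms: this is the per-column
        # synchronization point that prevents blocking.
        norms[k + 1:] -= A[k, k + 1:] ** 2
    return np.triu(A), piv
```

Each trip through the loop depends on the norms produced by the previous trip, which is exactly the serialization (and, in parallel, the synchronization overhead) described above.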
Figure 2 plots the timing results obtained by the PUMMA package (using the BLACS) for general matrix multiplication, and by ScaLAPACK 1.0 (beta version) subroutines for matrix inversion and QR decomposition with and without column pivoting. Corresponding tabular data can be found in the Appendix.
To measure the efficiency of Algorithm 1, we generated random matrices of different sizes, all of whose entries are normally distributed with mean 0 and variance 1. All computations were performed in real double precision arithmetic. Table 1 lists the measured backward error, the number of Newton iterations, the total CPU time, and the megaflops rate; the second column of the table contains the backward errors, with the number of Newton iterations in parentheses. We note that the convergence rate depends on the problem data. From Table 1, we see that for a 4000-by-4000 matrix, the algorithm reached 7.19/23.12 = 31% efficiency with respect to PUMMA matrix multiplication, and 7.19/8.70 = 82% efficiency with respect to the underlying ScaLAPACK 1.0 (beta) matrix inversion subroutine. As our performance model, explained in Section 4, predicts, and as Tables 9, 10, 11, 12, and 14 confirm, efficiency continues to improve as the matrix size increases. Figure 3 shows the performance of Algorithm 1 on the Intel Delta system as a function of matrix size for different numbers of processors.
Table 1: Backward accuracy, timing in seconds and megaflops of Algorithm 1 on a 256 node Intel Touchstone Delta system.
Figure 3: Performance of Algorithm 1 on the Intel Delta system as a function of matrix size for different numbers of processors.
Table 2 gives a detailed breakdown of the total CPU time of the Newton iteration based algorithm summarized in Table 1. It is clear that the Newton iteration (the sign function computation) is the most expensive stage, taking about 90% of the total running time.
Table 2: Performance Profile on a 256 processor Intel Touchstone Delta system (time in seconds)
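The Newton iteration that dominates this profile computes the matrix sign function. A minimal serial sketch, without the scaling strategies a production code would use, makes the cost structure plain: every step requires one matrix inversion, which is why the inversion kernel governs the overall rate.

```python
import numpy as np

def matrix_sign(A, tol=1e-12, maxit=100):
    """Newton iteration X_{k+1} = (X_k + X_k^{-1}) / 2 for sign(A).
    Serial, unscaled sketch; assumes A has no eigenvalues on the
    imaginary axis."""
    X = A.astype(float)
    for k in range(maxit):
        Xnew = 0.5 * (X + np.linalg.inv(X))   # one inversion per step
        if np.linalg.norm(Xnew - X, 1) <= tol * np.linalg.norm(Xnew, 1):
            return Xnew, k + 1
        X = Xnew
    return X, maxit

# sign(A) has eigenvalue +1 on the invariant subspace with Re(lambda) > 0
# and -1 on the one with Re(lambda) < 0, so (I - sign(A)) / 2 is the
# spectral projector that splits the spectrum along the imaginary axis.
```

The quadratic convergence of the iteration explains the small, data-dependent iteration counts reported in Table 1.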
To compare with the standard sequential algorithm, we also ran the LAPACK driver routine DGEES for computing the Schur decomposition (with reordering of eigenvalues) on one i860 processor. It took 592 seconds for a matrix of order 600, or 9.1 megaflops/second. Assuming that the time scales like n^3, we can predict that for a matrix of order 4000, if the matrix were able to fit on a single node, DGEES would take about 175,000 seconds (48 hours) to compute the desired spectral decomposition. In contrast, Algorithm 1 takes only 1,436 seconds (24 minutes). This is about 120 times faster! However, we should note that DGEES computes a complete Schur decomposition with the necessary reordering of the spectrum, whereas Algorithm 1 only splits the spectrum along the pure imaginary axis. In some applications, this may be all the user wants. If a decomposition along a finer region, or a complete Schur decomposition, is desired, then the cost of the Newton iteration based algorithms increases, though it is likely that the first step just described will still take most of the time.
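The extrapolation above is just cubic scaling applied to the n = 600 measurement:

```python
# Predicted DGEES time at n = 4000 from the measured n = 600 run,
# assuming the time scales like n^3.
t600 = 592.0                        # measured seconds at n = 600
t4000 = t600 * (4000 / 600) ** 3    # ~175,000 s, about 48 hours
speedup = t4000 / 1436.0            # vs. Algorithm 1's 1,436 s
print(round(t4000), round(speedup)) # -> 175407 122
```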