ORNL Progress Report 95/96

At the Oak Ridge National Laboratory David W. Walker has investigated the parallel implementation and performance of a left-looking algorithm for performing the out-of-core LU factorization of dense matrices. Use of out-of-core storage adds an extra layer to the memory hierarchy. In order to manage flexible and efficient access to this extra layer of memory, an extra level of partitioning over matrix columns is introduced into the standard ScaLAPACK algorithm. This divides the matrix into panels of contiguous columns, or superblocks. The resulting hybrid algorithm is left-looking at the outermost loop level, but uses a right-looking algorithm to factor the individual superblocks. This permits the trade-offs between I/O cost, communication cost, and load imbalance overhead to be controlled at the application level by varying the parameters of the data distribution and the superblock width.
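The loop structure of the hybrid algorithm can be sketched in a few lines. The following is a simplified, serial, unpivoted model in Python/NumPy (the actual code pivots and distributes panels with ScaLAPACK/PBLAS; `read_panel` and `write_panel` are hypothetical stand-ins for the parallel I/O routines):

```python
import numpy as np

def panel_lu_inplace(A):
    """Right-looking, unpivoted LU of an in-core panel.
    (Assumption: pivoting is omitted here for brevity.)"""
    for k in range(min(A.shape)):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])

def out_of_core_lu(read_panel, write_panel, n, w):
    """Left-looking over superblocks of width w (assumes w divides n):
    each superblock is read, updated by all previously factored
    superblocks, factored in core right-looking, and written back."""
    for j in range(0, n, w):                 # outer loop: left-looking
        P = read_panel(j)                    # bring superblock j in core
        for k in range(0, j, w):             # updates from earlier panels
            F = read_panel(k)                # re-read factored panel k
            Lkk = np.tril(F[k:k+w, :], -1) + np.eye(w)
            P[k:k+w, :] = np.linalg.solve(Lkk, P[k:k+w, :])  # triangular solve
            P[k+w:, :] -= F[k+w:, :] @ P[k:k+w, :]           # GEMM update
        panel_lu_inplace(P[j:, :])           # right-looking inner factorization
        write_panel(j, P)
```

Here w plays the role of the superblock width: a larger w means each superblock is re-read fewer times, at the price of more in-core memory per panel.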

The out-of-core LU factorization algorithm has been implemented on an Intel Paragon parallel computer. The implementation makes use of a small library of parallel I/O routines called the BLAPIOS, together with ScaLAPACK and PBLAS routines. From a preliminary performance study the following observations were made.

  1. On the Paragon the version of the algorithm that stores the matrix in pivoted form is faster than the version that stores it in unpivoted form.
  2. On the Paragon the parallel I/O system cannot efficiently and reliably manage large numbers of open files if the volume of data being read is sufficiently large. Logically distributed files are, therefore, implemented using a single file partitioned among the processes.
  3. A broad qualitative understanding of the performance of the algorithm has been gained. Increasing the superblock width should decrease I/O costs, but this was found to be true only up to a point on the Paragon because when the volume of parallel I/O becomes too great, I/O performance starts to degrade. Thus, although it might be expected that the optimal approach would be to make the superblock as large as possible, this will not be fastest on all systems.
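The first half of observation 3 is easy to quantify: a left-looking out-of-core factorization re-reads the previously factored panels once per superblock, so total read traffic falls roughly as 1/w_sb. A minimal model (a simplification that counts reads only and ignores the degradation effect noted above):

```python
def io_volume_elements(M, w_sb):
    """Approximate elements read by a left-looking out-of-core LU of an
    M x M matrix with superblock width w_sb (simplified model)."""
    return 0.5 * M**3 / w_sb
```

Doubling the superblock width halves the predicted traffic, which is why a wider superblock is attractive until the I/O system itself becomes the bottleneck.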

Future work will follow two main directions. The out-of-core algorithm will be implemented on other platforms, such as the IBM SP-2, symmetric multiprocessors, and clusters of workstations. The use of the MPI-IO library will be considered as a means of providing portability for our code, rather than implementing the BLAPIOS directly on each machine. A more sophisticated analytical performance model will be developed, and used to interpret the timings. The IBM SP-2 will be of particular interest as each processor is attached to its own disk. Hence, unlike the Paragon implementation, it may prove appropriate on the IBM SP-2 to implement logically distributed matrices as physically distributed matrices.

As network bandwidths continue to improve, networks of workstations may prove to be a good environment for research groups needing to perform very large LU factorizations. Such a system is cost-effective compared with supercomputers such as the Intel Paragon, and is under the immediate control of the researchers using it. Moreover, disk storage is cheap and easy to install. Consider the system requirements if a 100,000 x 100,000 matrix is to be factored in 24 hours. In a balanced system a program might be expected to spend 8 hours computing, 8 hours communicating over the network, and 8 hours doing I/O. Such a computation would require about 6.7 x 10^14 floating-point operations, or 23 Gflop/s. If there are Np workstations arranged as a P x Q mesh and each has 128 Mbytes of memory, then the maximum superblock width is 80 Np elements. The I/O per workstation is then,

                         8 × (1/2) × M^3/(80Np) × (1/Np) bytes

or 50000/Np^2 Gbyte per workstation.

The total amount of data communicated between processes can be approximated by the communication volume of the matrix multiplication operations that asymptotically dominate. The total amount of communication is approximately (2/3)(M^3/w_sb) elements, where w_sb is the superblock width. Assuming again that the superblock width is w_sb = 80Np, the total amount of communication is approximately (1/120)(M^3/Np) elements. So for 16 workstations, each would need to compute at about 1.5 Gflop/s, and perform I/O at about 6.8 Mbyte/s. A network bandwidth of about 145 Mbyte/s would be required. Each workstation would require 5 Gbyte of disk storage. These requirements are close to the capabilities of current workstation networks.
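The arithmetic above can be checked with a short script (a sketch: the 8-hour budget per phase, the 80 Np superblock width, and the byte counts are taken directly from the estimate above; nothing here is measured):

```python
def lu_requirements(M=100_000, Np=16, hours_per_phase=8):
    """Per-workstation compute, I/O, network, and disk needs for
    factoring an M x M matrix, following the balanced-system estimate."""
    s = hours_per_phase * 3600.0
    flops = (2.0 / 3.0) * M**3                       # ~6.7e14 for M = 100,000
    gflops_per_ws = flops / s / Np / 1e9
    w_sb = 80 * Np                                   # maximum superblock width
    io_bytes_per_ws = 8 * 0.5 * M**3 / w_sb / Np     # formula from the text
    io_mb_s = io_bytes_per_ws / s / 1e6
    net_mb_s = (2.0 / 3.0) * M**3 / w_sb * 8 / s / 1e6   # aggregate network
    disk_gb_per_ws = 8.0 * M**2 / Np / 1e9           # matrix spread over disks
    return gflops_per_ws, io_mb_s, net_mb_s, disk_gb_per_ws
```

For Np = 16 this reproduces the figures in the text: roughly 1.5 Gflop/s of compute and 6.8 Mbyte/s of I/O per workstation, about 145 Mbyte/s of aggregate network bandwidth, and 5 Gbyte of disk per workstation.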

Closely related work by Eduardo D'Azevedo at ORNL has focused on out-of-core Cholesky factorization. The out-of-core LU factorization uses a block column (panel) arrangement to facilitate the row pivoting needed to maintain numerical stability. In symmetric Cholesky factorization, where pivoting is not required for stability, a large block-oriented arrangement may also yield good performance. A large in-core block size increases the ratio of work to I/O (O(N^3) work for O(N^2) data transfer). A block-oriented out-of-core Cholesky factorization of a 16800 x 16800 matrix, with an in-core block size of 8400 x 8400, on a 7 x 7 process grid on the Paragon XP/S 35 took 1461 seconds. The main components of the runtime were pdpotrf (4 calls, 85 s), pdtrsm (6 calls, 675 s), pdsyrk (6 calls, 283 s), and pdgemm (4 calls, 286 s). Assuming the Cholesky factorization requires about N^3/3 flops, this gives an overall performance of about 22 Mflop/s per processor. The data in the disk file was arranged as in-core blocks to minimize the number of expensive lseeks and to facilitate reading and writing large contiguous blocks of data.
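The block-oriented scheme can be sketched serially in Python/NumPy, with a dict standing in for the on-disk blocks and NumPy calls standing in for pdpotrf/pdtrsm/pdsyrk/pdgemm (an illustration of the approach, not ORNL's code):

```python
import numpy as np

def out_of_core_cholesky(blocks, nb):
    """Block-oriented out-of-core Cholesky, lower form (A = L L^T).
    blocks[i, j] (i >= j) plays the role of an on-disk block of A and
    is overwritten with the corresponding block of L."""
    for j in range(nb):
        Ajj = blocks[j, j]                        # read diagonal block
        for k in range(j):
            Ljk = blocks[j, k]
            Ajj -= Ljk @ Ljk.T                    # pdsyrk-style update
        blocks[j, j] = np.linalg.cholesky(Ajj)    # pdpotrf on the block
        for i in range(j + 1, nb):
            Aij = blocks[i, j]                    # read subdiagonal block
            for k in range(j):
                Aij -= blocks[i, k] @ blocks[j, k].T   # pdgemm update
            # pdtrsm-style solve: L_ij = A_ij * L_jj^{-T}
            blocks[i, j] = np.linalg.solve(blocks[j, j], Aij.T).T
```

Because each block is O(b^2) data but contributes O(b^3) work, a large in-core block size keeps the computation-to-I/O ratio high, as noted above.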