In this paper we have described a parallel left-looking algorithm for performing the out-of-core LU factorization of dense matrices. The use of out-of-core storage adds an extra layer to the memory hierarchy. To manage flexible and efficient access to this layer, an additional level of partitioning over matrix columns has been introduced into the standard ScaLAPACK algorithm; it is represented by the superblocks of the hybrid algorithm that we have described. The hybrid algorithm is left-looking at the outermost loop level, but uses a right-looking algorithm to factor the individual superblocks. This permits the trade-offs among I/O cost, communication cost, and load-imbalance overhead to be controlled at the application level by varying the parameters of the data distribution and the superblock width.
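To make the loop structure concrete, the following is a minimal serial sketch of the hybrid scheme: left-looking over superblocks stored on disk, right-looking within the superblock held in memory. It is illustrative only, not our implementation: pivoting, the two-dimensional block-cyclic distribution over processes, the PBLAS, and the BLAPIOS I/O layer are all omitted, and the hypothetical read_panel/write_panel callbacks merely stand in for out-of-core column-panel storage.

```python
import numpy as np
from scipy.linalg import solve_triangular

def out_of_core_lu(read_panel, write_panel, n, w):
    """Hybrid out-of-core LU (no pivoting): left-looking over superblocks of
    width w, right-looking inside each superblock. Panels live on "disk" and
    are accessed only through read_panel(col, width) / write_panel(col, S)."""
    for c0 in range(0, n, w):
        wc = min(w, n - c0)
        S = read_panel(c0, wc)                 # superblock: columns c0 .. c0+wc-1
        # Left-looking phase: apply every previously factored superblock.
        for p0 in range(0, c0, w):
            wp = min(w, c0 - p0)
            P = read_panel(p0, wp)             # holds its L and U factors in place
            # Compute the U rows of S owned by panel P (unit lower solve) ...
            S[p0:p0+wp, :] = solve_triangular(P[p0:p0+wp, :], S[p0:p0+wp, :],
                                              lower=True, unit_diagonal=True)
            # ... then the trailing update: the GEMM that dominates the cost.
            S[p0+wp:, :] -= P[p0+wp:, :] @ S[p0:p0+wp, :]
        # Right-looking (here unblocked) factorization of the superblock itself.
        for j in range(wc):
            g = c0 + j                         # global row of the diagonal entry
            S[g+1:, j] /= S[g, j]
            S[g+1:, j+1:] -= np.outer(S[g+1:, j], S[g, j+1:])
        write_panel(c0, S)

if __name__ == "__main__":
    # Tiny self-check with a dict standing in for disk storage.
    n, w = 8, 4
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant: no pivoting needed
    disk = {c: A[:, c:c+w].copy() for c in range(0, n, w)}
    out_of_core_lu(lambda c, wc: disk[c], disk.__setitem__, n, w)
    LU = np.hstack([disk[c] for c in range(0, n, w)])
    L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
    assert np.allclose(L @ U, A)
```

Varying w in this sketch corresponds to the trade-off described above: a wider superblock reduces the I/O spent rereading previously factored panels, at the price of more memory per process.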
We have implemented the out-of-core LU factorization algorithm on an Intel Paragon parallel computer. The implementation makes use of a small library of parallel I/O routines called the BLAPIOS, together with ScaLAPACK and PBLAS routines. From a preliminary performance study we have observed the following.
Future work will follow two main directions. First, we will seek to implement our out-of-core algorithm on other platforms, such as the IBM SP-2, symmetric multiprocessors, and clusters of workstations. The IBM SP-2 will be of particular interest because each processor is attached to its own disk; hence, unlike in our Paragon implementation, it may prove appropriate there to implement logically distributed matrices as physically distributed matrices. Rather than implementing the BLAPIOS directly on each machine, we will consider the MPI-IO library as a means of providing portability for our code. Second, we will develop a more sophisticated analytical performance model and use it to interpret our timings.
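On the MPI-IO point, the sketch below indicates one way a BLAPIOS-style panel read might be layered on MPI-IO (shown here through the mpi4py bindings for brevity). It is only a sketch of one possible mapping: the file layout, the routine name, and the simple round-robin split of the panel's columns are assumptions for illustration, not the BLAPIOS interface or the actual distribution.

```python
from mpi4py import MPI
import numpy as np

def read_panel(comm, filename, n, c0, w):
    """Hypothetical BLAPIOS-style read: each process reads its share of the
    n-by-w panel starting at global column c0, assuming the matrix is stored
    column-major in a shared file as 8-byte reals."""
    fh = MPI.File.Open(comm, filename, MPI.MODE_RDONLY)
    rank, size = comm.Get_rank(), comm.Get_size()
    my_cols = list(range(c0 + rank, c0 + w, size))   # plain round-robin split
    panel = np.empty((n, len(my_cols)), dtype=np.float64, order="F")
    for k, col in enumerate(my_cols):
        # Column `col` starts at byte offset col * n * 8 in the file.
        fh.Read_at(col * n * 8, panel[:, k])
    fh.Close()
    return panel
```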
As network bandwidths continue to improve,
networks of workstations may prove to be a good environment for research
groups needing to perform very large LU factorizations. Such a system
is cost-effective compared with supercomputers such as the Intel
Paragon, and is under the immediate control of the researchers using it.
Moreover, disk storage is cheap and easy to install. Consider the
system requirements if we want to factor a dense matrix of order N = 100,000 in
24 hours. In a balanced system we might expect to spend 8 hours
computing, 8 hours communicating over the network, and 8 hours doing I/O.
Such a computation would require about 2N^3/3 ≈ 6.7 × 10^14 floating-point
operations, or 23 Gflop/s.
If there are P workstations
and each has 128 Mbytes of memory (about 1.6 × 10^7 double-precision words),
then the maximum superblock width
is about W = 1.6 × 10^7 P/(2N) = 80P elements, allowing half of each
workstation's memory for the superblock being factored and half for the
previously factored panels read in to update it.
The I/O per workstation is then approximately N^3/(2WP) elements,
or about 5 × 10^4/P^2 Gbyte per workstation.
The total amount of data communicated between
processes can be approximated by the communication volume of
the matrix multiplication
operations that asymptotically dominate. The total amount of
communication is approximately 2N^3/(3W) elements,
where W is the
superblock width. Assuming again that the superblock width is
80P, the total amount of communication is approximately 8.3 × 10^12/P
elements.
So for 16 workstations,
each would need to compute
at about 1.5 Gflop/s, and perform I/O at about 6.8 Mbyte/s. A network
bandwidth of about 145 Mbyte/s would be required. Each workstation would
require 5 Gbyte of disk storage.
These requirements are close to the capabilities of current workstation
networks.
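This estimate can be reproduced with a few lines of arithmetic. The script below simply restates the assumptions used above (a matrix of order 100,000, 16 workstations with 128 Mbytes each, half of memory devoted to the superblock, and the N^3/(2WP) and 2N^3/(3W) approximations for I/O and communication volume); it reproduces, rather than measures, the figures quoted.

```python
# Back-of-envelope sizing: factor an N x N matrix in 24 hours, split as
# 8 h computing, 8 h network communication, 8 h disk I/O.
N     = 100_000              # matrix order
P     = 16                   # number of workstations
words = 128e6 / 8            # memory per workstation, in 8-byte words
t     = 8 * 3600.0           # seconds available for each phase

flops = 2 * N**3 / 3                    # LU operation count
W     = words * P / (2 * N)             # superblock width (half of memory for
                                        # the superblock, half for panels read in)
io    = N**3 / (2 * W * P) * 8          # bytes of I/O per workstation
comm  = 2 * N**3 / (3 * W) * 8          # bytes moved by the dominant GEMM updates
disk  = 8 * N**2 / P                    # bytes of disk per workstation

print(f"compute  : {flops/t/1e9:.0f} Gflop/s total, {flops/t/P/1e9:.2f} per workstation")
print(f"superblock width : {W:.0f} columns")
print(f"I/O      : {io/t/1e6:.1f} Mbyte/s per workstation")
print(f"network  : {comm/t/1e6:.0f} Mbyte/s aggregate")
print(f"disk     : {disk/1e9:.1f} Gbyte per workstation")
```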