**Banded Matrix-Vector Multiplication**

First, we consider the parallelization of the operation on a linear array of
processors when is a banded matrix with ,
upper and lower bandwidths, and we assume that matrices are
stored using a sparse scheme [Rice:85a]. For simplicity, we
describe the case . The proposed implementation is
based on a decomposition of matrix into an upper
(including the diagonal of ) and lower triangular
matrices, such as . Furthermore, we
assume that row and are stored
in processor **i**. Without loss of generality, we can assume and . The vector can then be expressed as . The products and are computed within and iterations, respectively.
The computation involved is described in Figure 8.3. In
order to compute the complexity of the above algorithm, we assume
without any loss of generality, that has **K** non-zero elements,
and . Then it can be shown that the time complexity is

and the memory space required for each subdomain is .

**Figure 8.3:** The Pseudo Code for Banded Matrix-Vector Multiplication

**Banded Matrix-Matrix Multiplication**

Second, we consider the implementation of , on a ring of processors when
, are banded matrices with upper,
and lower bandwidths, respectively. Again, we describe the
realization for . The case is
straightforward generalization. The processor **i** computes column
of matrix and holds one row of matrix (denoted by
) and a column of matrix (denoted by ).

The algorithm consists of two phases as in banded-matrix vector
multiplication. Without loss of generality, we can assume , and . In the first phase, each node starts by
calculating , then each node **i**
passes to node **i-1**, this phase is repeated times. In the second phase, each node restores
and passes it to node **i+1**. This phase is repeated times. The implementation proposed for this operation is
described in Figure 8.4.

**Figure 8.4:** The Pseudo Code for Banded Matrix-Matrix Multiplication

Without loss of generality, we assume that are the number of non-zero elements for the matrices , respectively, and denote by and . Then we can show that the parallel execution time is given by

The above realization has been implemented on the nCUBE-1 [Chrisochoides:90a]. Tables 8.1 and 8.2 indicate the performance of BLAS 2 computation for a block tridiagonal matrix where each block is dense. In these experiments, each processor has the same computation to perform. The results indicate very satisfactory performance for these type of data.

**Table 8.1:** Measured maximum total elapsed time (in seconds) for
multiplication of a block tridiagonal matrices with a vector.

**Table 8.2:** Measured maximum elapsed time (in seconds) for multiplication
of a block tridiagonal matrix by a block tridiagonal matrix.

Wed Mar 1 10:19:35 EST 1995