
Local Storage Scheme and Block-Cyclic Mapping

 

The block-cyclic distribution scheme is a mapping of a set of blocks onto the processes. The previous section informally described this mapping as well as some of its properties. To be complete, we must now explain how the blocks that are mapped to the same process are arranged and stored in the local process memory. In other words, we shall describe the precise mapping that associates to a matrix entry, identified by its global indices, the coordinates of the process that owns it and its local position within that process's memory.

Suppose we have an array of length N to be stored on P processes. By convention, the array entries are numbered 1 through N and the processes are numbered 0 through P-1. First, the array is divided into contiguous blocks of size NB. When NB does not divide N evenly, the last block of array elements will only contain $N - \lfloor N / NB \rfloor \, NB$ entries instead of NB. By convention, these blocks are numbered starting from zero and dealt out to the processes like a deck of cards. In other words, if we assume that process 0 receives the first block, the $b$th block is assigned to the process of coordinate $b \bmod P$. The blocks assigned to the same process are stored contiguously in memory. The mapping of an array entry globally indexed by I is defined by the following analytical equation:
\[ I \;\longmapsto\; \langle\, l,\; p,\; x \,\rangle, \]
where I is a global index in the array, l is the coordinate of the local block in which this entry resides, p is the coordinate of the process owning that block, and finally x is the coordinate within that block where the global array entry of index I is to be found. It is then fairly easy to establish the analytical relationship between these variables. One obtains:
\begin{equation}
l = \left\lfloor \frac{I-1}{NB \times P} \right\rfloor, \qquad
p = \left\lfloor \frac{I-1}{NB} \right\rfloor \bmod P, \qquad
x = \left( (I-1) \bmod NB \right) + 1. \tag{4.1}
\end{equation}
These equations allow one to determine the local information, i.e., the local index $l \, NB + x$ as well as the process coordinate p corresponding to a global entry identified by its global index I, and conversely. Table 4.3 illustrates this mapping for the block layout when P=2 and N=16, i.e., NB=8. At most one block is assigned to each process.
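As a concrete check of Equation 4.1, the following Python sketch (illustrative only; not part of ScaLAPACK, and the function names are invented here) computes the triple $\langle l, p, x \rangle$ from a global index I and inverts the mapping:

```python
def global_to_local(I, NB, P):
    """Map a 1-based global index I to (l, p, x) per Equation 4.1:
    l = local block coordinate, p = owning process coordinate,
    x = 1-based offset within the block."""
    l = (I - 1) // (NB * P)
    p = ((I - 1) // NB) % P
    x = (I - 1) % NB + 1
    return l, p, x

def local_to_global(l, p, x, NB, P):
    """Inverse mapping: I = (l * P + p) * NB + x."""
    return (l * P + p) * NB + x

# Block layout of Table 4.3: P = 2, N = 16, NB = 8.
# Entries 1..8 land on process 0, entries 9..16 on process 1.
print(global_to_local(1, 8, 2))        # (0, 0, 1)
print(global_to_local(9, 8, 2))        # (0, 1, 1)
print(local_to_global(0, 1, 1, 8, 2))  # 9
```

Running the round trip over all sixteen entries confirms that the two formulas are mutual inverses.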

Table 4.3: One-dimensional block mapping example for P=2 and N=16

This example of the one-dimensional block distribution mapping can be expressed in HPF  by using the following statements:

      REAL :: X( N )
!HPF$ PROCESSORS PROC( P )
!HPF$ DISTRIBUTE X( BLOCK( NB ) ) ONTO PROC

Table 4.4 illustrates Equation 4.1 for the cyclic layout, i.e., NB=1 when P=2 and N=16.

Table 4.4: One-dimensional cyclic mapping example for P=2 and N=16

This example of the one-dimensional cyclic distribution mapping can be expressed in HPF  by using the following statements:

      REAL :: X( N )
!HPF$ PROCESSORS PROC( P )
!HPF$ DISTRIBUTE X( CYCLIC ) ONTO PROC

Table 4.5 illustrates Equation 4.1 for the block-cyclic layout when P=2, NB=3 and N=16.

Table 4.5: One-dimensional block-cyclic mapping example for P=2, NB=3 and N=16
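The dealing of blocks for the parameters of Table 4.5 can be tabulated with a short Python sketch (illustrative only), which lists the global entries owned by each process:

```python
NB, P, N = 3, 2, 16

# Process owning global entry I (1-based), per Equation 4.1.
owner = lambda I: ((I - 1) // NB) % P

entries = {p: [I for I in range(1, N + 1) if owner(I) == p] for p in range(P)}
# Blocks 0, 2, 4 go to process 0 and blocks 1, 3, 5 to process 1;
# the last block holds only N - (N // NB) * NB = 1 entry.
print(entries[0])  # [1, 2, 3, 7, 8, 9, 13, 14, 15]
print(entries[1])  # [4, 5, 6, 10, 11, 12, 16]
```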

This example of the one-dimensional block-cyclic distribution mapping can be expressed in HPF by using the following statements:

      REAL :: X( N )
!HPF$ PROCESSORS PROC( P )
!HPF$ DISTRIBUTE X( CYCLIC( NB ) ) ONTO PROC

There is no compelling reason to always deal out the blocks starting with process 0. Indeed, it is sometimes useful to start the data distribution with the process of arbitrary coordinate SRC, in which case Equation 4.1 becomes:
\begin{equation}
l = \left\lfloor \frac{I-1}{NB \times P} \right\rfloor, \qquad
p = \left( SRC + \left\lfloor \frac{I-1}{NB} \right\rfloor \right) \bmod P, \qquad
x = \left( (I-1) \bmod NB \right) + 1. \tag{4.2}
\end{equation}

Table 4.6 illustrates Equation 4.2 for the block-cyclic layout when P=2, SRC=1, NB=3 and N=16.

Table 4.6: One-dimensional block-cyclic mapping example for P=2, SRC=1, NB=3 and N=16
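Equation 4.2 only shifts the owning process; l and x are unchanged. A Python sketch (illustrative only) for the parameters of Table 4.6:

```python
def global_to_local_src(I, NB, P, SRC):
    """Equation 4.2: same l and x as Equation 4.1; only the owning
    process coordinate is shifted by the starting coordinate SRC."""
    l = (I - 1) // (NB * P)
    p = (SRC + (I - 1) // NB) % P
    x = (I - 1) % NB + 1
    return l, p, x

# Parameters of Table 4.6: P = 2, SRC = 1, NB = 3, N = 16.
# The first block now lands on process 1 instead of process 0.
print(global_to_local_src(1, 3, 2, 1))  # (0, 1, 1)
print(global_to_local_src(4, 3, 2, 1))  # (0, 0, 1)
```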

This example of the one-dimensional block-cyclic distribution mapping can be expressed in HPF  by using the following statements:

      REAL :: X( N )
!HPF$ PROCESSORS PROC( P )
!HPF$ TEMPLATE T( N + P*NB )
!HPF$ DISTRIBUTE T( CYCLIC( NB ) ) ONTO PROC
!HPF$ ALIGN X( I ) WITH T( SRC*NB + I )

In the two-dimensional case, assuming the matrix is partitioned into $MB \times NB$ blocks and that the first block is given to the process of coordinates (RSRC, CSRC), the analytical formulas given above for the one-dimensional case are simply reused independently in each dimension of the $P_r \times P_c$ process grid. For example, the matrix entry (I,J) is thus to be found in the process of coordinates (p,q), within the local (l,m) block, at the position (x,y) given by:
\[
p = \left( RSRC + \left\lfloor \frac{I-1}{MB} \right\rfloor \right) \bmod P_r, \qquad
q = \left( CSRC + \left\lfloor \frac{J-1}{NB} \right\rfloor \right) \bmod P_c,
\]
\[
l = \left\lfloor \frac{I-1}{MB \times P_r} \right\rfloor, \qquad
m = \left\lfloor \frac{J-1}{NB \times P_c} \right\rfloor, \qquad
x = \left( (I-1) \bmod MB \right) + 1, \qquad
y = \left( (J-1) \bmod NB \right) + 1.
\]
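The two-dimensional formulas amount to applying the one-dimensional mapping once per axis. A Python sketch (illustrative only; the function name is invented here):

```python
def map_2d(I, J, MB, NB, Pr, Pc, RSRC=0, CSRC=0):
    """Apply the one-dimensional block-cyclic mapping independently to
    rows and columns: returns the process (p, q), the local block
    (l, m), and the position (x, y) within that block, for the global
    1-based matrix entry (I, J)."""
    p = (RSRC + (I - 1) // MB) % Pr
    q = (CSRC + (J - 1) // NB) % Pc
    l = (I - 1) // (MB * Pr)
    m = (J - 1) // (NB * Pc)
    x = (I - 1) % MB + 1
    y = (J - 1) % NB + 1
    return (p, q), (l, m), (x, y)

# Entry (3, 4) of a matrix in 2x2 blocks on a 2x2 grid lives on
# process (1, 1), in its first local block, at position (1, 2).
print(map_2d(3, 4, 2, 2, 2, 2))  # ((1, 1), (0, 0), (1, 2))
```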

These formulas specify how an $M \times N$ matrix A is mapped and stored on the process grid. It is first decomposed into $MB \times NB$ blocks starting at its upper left corner. These blocks are then uniformly distributed across the process grid in a cyclic manner.

Every process owns a collection of blocks, which are contiguously stored by column in a two-dimensional ``column major'' array.

This local storage convention allows the ScaLAPACK software to make efficient use of the local memory hierarchy by calling the BLAS on subarrays that may be larger than a single $MB \times NB$ block. Figure 4.6 presents the mapping of a $5 \times 5$ matrix partitioned into $2 \times 2$ blocks mapped onto a $2 \times 2$ process grid (i.e., $M = N = 5$, $MB = NB = 2$, and $P_r = P_c = 2$). The local entries of every matrix column are contiguously stored in the processes' memories.

Figure 4.6: A $5 \times 5$ matrix decomposed into $2 \times 2$ blocks mapped onto a $2 \times 2$ process grid

In Figure 4.6, the process of coordinates (0,0) owns four blocks. The matrix entries of the global columns 1, 2 and 5 are contiguously stored in that process's memory. Finally, these columns are themselves contiguously stored, forming a conventional two-dimensional local array. In that local array A, the entry A(2,3) contains the value of the global matrix entry $a_{2,5}$. This example would be expressed in HPF as:

      REAL :: A( 5, 5 )
!HPF$ PROCESSORS PROC( 2, 2 )
!HPF$ DISTRIBUTE A( CYCLIC( 2 ), CYCLIC( 2 ) ) ONTO PROC
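This $5 \times 5$ example can be checked numerically. The following Python sketch (illustrative only) recomputes which global rows and columns land on process (0,0) and confirms that its local entry A(2,3) holds the global entry $a_{2,5}$:

```python
M = N = 5; MB = NB = 2; Pr = Pc = 2

def owner(I, J):
    """Process coordinates owning global entry (I, J), RSRC = CSRC = 0."""
    return ((I - 1) // MB) % Pr, ((J - 1) // NB) % Pc

def local_pos(I, J):
    """1-based (row, column) position of global entry (I, J) inside
    the owning process's column-major local array."""
    li = ((I - 1) // (MB * Pr)) * MB + (I - 1) % MB + 1
    lj = ((J - 1) // (NB * Pc)) * NB + (J - 1) % NB + 1
    return li, lj

# Global rows and columns owned by process (0, 0): 1, 2 and 5.
rows0 = [I for I in range(1, M + 1) if ((I - 1) // MB) % Pr == 0]
cols0 = [J for J in range(1, N + 1) if ((J - 1) // NB) % Pc == 0]
print(rows0, cols0)   # [1, 2, 5] [1, 2, 5]
print(owner(2, 5))    # (0, 0)
print(local_pos(2, 5))  # (2, 3) -> local A(2,3) holds a_{2,5}
```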

Determining the number of rows or columns of a global dense matrix that a specific process receives is an essential task for the user. ScaLAPACK provides a tool routine, NUMROC, to perform this function. The notation LOCr() and LOCc() is used to denote these local quantities throughout the leading comments of the source code, and is reflected in the sample argument description in section 4.3.5. The values of LOCr() and LOCc() computed by NUMROC are exact.

However, if users want only a general idea of the size of a local array, they can perform the following ``back of the envelope'' calculation to obtain an upper bound on the quantity.

An upper bound on the value of LOCr(M) can be calculated as:
\[ \mathrm{LOC}_r(M) \le \left\lceil \frac{\lceil M / MB \rceil}{P_r} \right\rceil \times MB, \]
or equivalently as
\[ \mathrm{LOC}_r(M) \le \left\lceil \frac{M}{MB \times P_r} \right\rceil \times MB. \]

Similarly, an upper bound on the value of LOCc(N) can be calculated as
\[ \mathrm{LOC}_c(N) \le \left\lceil \frac{\lceil N / NB \rceil}{P_c} \right\rceil \times NB, \]
or equivalently as
\[ \mathrm{LOC}_c(N) \le \left\lceil \frac{N}{NB \times P_c} \right\rceil \times NB. \]

Note that this calculation can yield a gross overestimate of the amount of space actually required.
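The exact count can be computed by dealing out the blocks as described above. The sketch below is a Python transliteration of that logic (illustrative only; the real NUMROC is a Fortran routine in the ScaLAPACK TOOLS directory), compared against the upper bound:

```python
import math

def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of entries of an n-long dimension, split into nb-sized
    blocks dealt out starting from process isrcproc, that land on
    process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs  # distance from source
    nblocks = n // nb                              # number of full blocks
    num = (nblocks // nprocs) * nb                 # full rounds of dealing
    extrablks = nblocks % nprocs                   # leftover full blocks
    if mydist < extrablks:
        num += nb          # one extra full block
    elif mydist == extrablks:
        num += n % nb      # the (possibly partial) last block
    return num

# N = 16, NB = 3, P = 2, SRC = 0: process 0 holds 9 entries and
# process 1 holds 7, both within the bound ceil(ceil(16/3)/2) * 3 = 9.
bound = math.ceil(math.ceil(16 / 3) / 2) * 3
print(numroc(16, 3, 0, 0, 2), numroc(16, 3, 1, 0, 2), bound)  # 9 7 9
```

As the bound of 9 against an actual maximum of 9 (but a minimum of 7) suggests, the estimate is tight for some processes and an overestimate for others.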



Susan Blackford
Tue May 13 09:21:01 EDT 1997