|
ScaLAPACK
2.0.2
ScaLAPACK: Scalable Linear Algebra PACKage
|
00001 SUBROUTINE PCGBDCMV( LDBW, BWL, BWU, TRANS, N, A, JA, DESCA, NRHS, 00002 $ B, IB, DESCB, X, WORK, LWORK, INFO ) 00003 * 00004 * 00005 * 00006 * -- ScaLAPACK routine (version 1.7) -- 00007 * University of Tennessee, Knoxville, Oak Ridge National Laboratory, 00008 * and University of California, Berkeley. 00009 * November 15, 1997 00010 * 00011 * .. Scalar Arguments .. 00012 CHARACTER TRANS 00013 INTEGER BWL, BWU, IB, INFO, JA, LDBW, LWORK, N, NRHS 00014 * .. 00015 * .. Array Arguments .. 00016 INTEGER DESCA( * ), DESCB( * ) 00017 COMPLEX A( * ), B( * ), WORK( * ), X( * ) 00018 * .. 00019 * 00020 * 00021 * Purpose 00022 * ======= 00023 * 00024 * 00025 * ===================================================================== 00026 * 00027 * Arguments 00028 * ========= 00029 * 00030 * 00031 * TRANS (global input) CHARACTER 00032 * = 'N': Solve with A(1:N, JA:JA+N-1); 00033 * = 'C': Solve with conjugate_transpose( A(1:N, JA:JA+N-1) ); 00034 * 00035 * N (global input) INTEGER 00036 * The number of rows and columns to be operated on, i.e. the 00037 * order of the distributed submatrix A(1:N, JA:JA+N-1). N >= 0. 00038 * 00039 * BWL (global input) INTEGER 00040 * Number of subdiagonals. 0 <= BWL <= N-1 00041 * 00042 * BWU (global input) INTEGER 00043 * Number of superdiagonals. 0 <= BWU <= N-1 00044 * 00045 * A (local input/local output) COMPLEX pointer into 00046 * local memory to an array with first dimension 00047 * LLD_A >=(bwl+bwu+1) (stored in DESCA). 00048 * On entry, this array contains the local pieces of the 00049 * This local portion is stored in the packed banded format 00050 * used in LAPACK. Please see the Notes below and the 00051 * ScaLAPACK manual for more detail on the format of 00052 * distributed matrices. 00053 * 00054 * JA (global input) INTEGER 00055 * The index in the global array A that points to the start of 00056 * the matrix to be operated on (which may be either all of A 00057 * or a submatrix of A). 00058 * 00059 * DESCA (global and local input) INTEGER array of dimension DLEN. 00060 * if 1D type (DTYPE_A=501), DLEN >= 7; 00061 * if 2D type (DTYPE_A=1), DLEN >= 9 . 00062 * The array descriptor for the distributed matrix A. 00063 * Contains information of mapping of A to memory. Please 00064 * see NOTES below for full description and options. 00065 * 00066 * AF (local output) COMPLEX array, dimension LAF. 00067 * Auxiliary Fillin Space. 00068 * Fillin is created during the factorization routine 00069 * PCDBTRF and this is stored in AF. If a linear system 00070 * is to be solved using PCDBTRS after the factorization 00071 * routine, AF *must not be altered* after the factorization. 00072 * 00073 * LAF (local input) INTEGER 00074 * Size of user-input Auxiliary Fillin space AF. Must be >= 00075 * NB*(bwl+bwu)+6*max(bwl,bwu)*max(bwl,bwu) 00076 * If LAF is not large enough, an error code will be returned 00077 * and the minimum acceptable size will be returned in AF( 1 ) 00078 * 00079 * WORK (local workspace/local output) 00080 * COMPLEX temporary workspace. This space may 00081 * be overwritten in between calls to routines. WORK must be 00082 * the size given in LWORK. 00083 * On exit, WORK( 1 ) contains the minimal LWORK. 00084 * 00085 * LWORK (local input or global input) INTEGER 00086 * Size of user-input workspace WORK. 00087 * If LWORK is too small, the minimal acceptable size will be 00088 * returned in WORK(1) and an error code is returned. LWORK>= 00089 * 00090 * INFO (global output) INTEGER 00091 * = 0: successful exit 00092 * < 0: If the i-th argument is an array and the j-entry had 00093 * an illegal value, then INFO = -(i*100+j), if the i-th 00094 * argument is a scalar and had an illegal value, then 00095 * INFO = -i. 00096 * 00097 * ===================================================================== 00098 * 00099 * 00100 * Restrictions 00101 * ============ 00102 * 00103 * The following are restrictions on the input parameters. Some of these 00104 * are temporary and will be removed in future releases, while others 00105 * may reflect fundamental technical limitations. 00106 * 00107 * Non-cyclic restriction: VERY IMPORTANT! 00108 * P*NB>= mod(JA-1,NB)+N. 00109 * The mapping for matrices must be blocked, reflecting the nature 00110 * of the divide and conquer algorithm as a task-parallel algorithm. 00111 * This formula in words is: no processor may have more than one 00112 * chunk of the matrix. 00113 * 00114 * Blocksize cannot be too small: 00115 * If the matrix spans more than one processor, the following 00116 * restriction on NB, the size of each block on each processor, 00117 * must hold: 00118 * NB >= 2*MAX(BWL,BWU) 00119 * The bulk of parallel computation is done on the matrix of size 00120 * O(NB) on each processor. If this is too small, divide and conquer 00121 * is a poor choice of algorithm. 00122 * 00123 * Submatrix reference: 00124 * JA = IB 00125 * Alignment restriction that prevents unnecessary communication. 00126 * 00127 * 00128 * ===================================================================== 00129 * 00130 * 00131 * Notes 00132 * ===== 00133 * 00134 * If the factorization routine and the solve routine are to be called 00135 * separately (to solve various sets of righthand sides using the same 00136 * coefficient matrix), the auxiliary space AF *must not be altered* 00137 * between calls to the factorization routine and the solve routine. 00138 * 00139 * The best algorithm for solving banded and tridiagonal linear systems 00140 * depends on a variety of parameters, especially the bandwidth. 00141 * Currently, only algorithms designed for the case N/P >> bw are 00142 * implemented. These go by many names, including Divide and Conquer, 00143 * Partitioning, domain decomposition-type, etc. 00144 * 00145 * Algorithm description: Divide and Conquer 00146 * 00147 * The Divide and Conqer algorithm assumes the matrix is narrowly 00148 * banded compared with the number of equations. In this situation, 00149 * it is best to distribute the input matrix A one-dimensionally, 00150 * with columns atomic and rows divided amongst the processes. 00151 * The basic algorithm divides the banded matrix up into 00152 * P pieces with one stored on each processor, 00153 * and then proceeds in 2 phases for the factorization or 3 for the 00154 * solution of a linear system. 00155 * 1) Local Phase: 00156 * The individual pieces are factored independently and in 00157 * parallel. These factors are applied to the matrix creating 00158 * fillin, which is stored in a non-inspectable way in auxiliary 00159 * space AF. Mathematically, this is equivalent to reordering 00160 * the matrix A as P A P^T and then factoring the principal 00161 * leading submatrix of size equal to the sum of the sizes of 00162 * the matrices factored on each processor. The factors of 00163 * these submatrices overwrite the corresponding parts of A 00164 * in memory. 00165 * 2) Reduced System Phase: 00166 * A small (max(bwl,bwu)* (P-1)) system is formed representing 00167 * interaction of the larger blocks, and is stored (as are its 00168 * factors) in the space AF. A parallel Block Cyclic Reduction 00169 * algorithm is used. For a linear system, a parallel front solve 00170 * followed by an analagous backsolve, both using the structure 00171 * of the factored matrix, are performed. 00172 * 3) Backsubsitution Phase: 00173 * For a linear system, a local backsubstitution is performed on 00174 * each processor in parallel. 00175 * 00176 * 00177 * Descriptors 00178 * =========== 00179 * 00180 * Descriptors now have *types* and differ from ScaLAPACK 1.0. 00181 * 00182 * Note: banded codes can use either the old two dimensional 00183 * or new one-dimensional descriptors, though the processor grid in 00184 * both cases *must be one-dimensional*. We describe both types below. 00185 * 00186 * Each global data object is described by an associated description 00187 * vector. This vector stores the information required to establish 00188 * the mapping between an object element and its corresponding process 00189 * and memory location. 00190 * 00191 * Let A be a generic term for any 2D block cyclicly distributed array. 00192 * Such a global array has an associated description vector DESCA. 00193 * In the following comments, the character _ should be read as 00194 * "of the global array". 00195 * 00196 * NOTATION STORED IN EXPLANATION 00197 * --------------- -------------- -------------------------------------- 00198 * DTYPE_A(global) DESCA( DTYPE_ )The descriptor type. In this case, 00199 * DTYPE_A = 1. 00200 * CTXT_A (global) DESCA( CTXT_ ) The BLACS context handle, indicating 00201 * the BLACS process grid A is distribu- 00202 * ted over. The context itself is glo- 00203 * bal, but the handle (the integer 00204 * value) may vary. 00205 * M_A (global) DESCA( M_ ) The number of rows in the global 00206 * array A. 00207 * N_A (global) DESCA( N_ ) The number of columns in the global 00208 * array A. 00209 * MB_A (global) DESCA( MB_ ) The blocking factor used to distribute 00210 * the rows of the array. 00211 * NB_A (global) DESCA( NB_ ) The blocking factor used to distribute 00212 * the columns of the array. 00213 * RSRC_A (global) DESCA( RSRC_ ) The process row over which the first 00214 * row of the array A is distributed. 00215 * CSRC_A (global) DESCA( CSRC_ ) The process column over which the 00216 * first column of the array A is 00217 * distributed. 00218 * LLD_A (local) DESCA( LLD_ ) The leading dimension of the local 00219 * array. LLD_A >= MAX(1,LOCr(M_A)). 00220 * 00221 * Let K be the number of rows or columns of a distributed matrix, 00222 * and assume that its process grid has dimension p x q. 00223 * LOCr( K ) denotes the number of elements of K that a process 00224 * would receive if K were distributed over the p processes of its 00225 * process column. 00226 * Similarly, LOCc( K ) denotes the number of elements of K that a 00227 * process would receive if K were distributed over the q processes of 00228 * its process row. 00229 * The values of LOCr() and LOCc() may be determined via a call to the 00230 * ScaLAPACK tool function, NUMROC: 00231 * LOCr( M ) = NUMROC( M, MB_A, MYROW, RSRC_A, NPROW ), 00232 * LOCc( N ) = NUMROC( N, NB_A, MYCOL, CSRC_A, NPCOL ). 00233 * An upper bound for these quantities may be computed by: 00234 * LOCr( M ) <= ceil( ceil(M/MB_A)/NPROW )*MB_A 00235 * LOCc( N ) <= ceil( ceil(N/NB_A)/NPCOL )*NB_A 00236 * 00237 * 00238 * One-dimensional descriptors: 00239 * 00240 * One-dimensional descriptors are a new addition to ScaLAPACK since 00241 * version 1.0. They simplify and shorten the descriptor for 1D 00242 * arrays. 00243 * 00244 * Since ScaLAPACK supports two-dimensional arrays as the fundamental 00245 * object, we allow 1D arrays to be distributed either over the 00246 * first dimension of the array (as if the grid were P-by-1) or the 00247 * 2nd dimension (as if the grid were 1-by-P). This choice is 00248 * indicated by the descriptor type (501 or 502) 00249 * as described below. 00250 * 00251 * IMPORTANT NOTE: the actual BLACS grid represented by the 00252 * CTXT entry in the descriptor may be *either* P-by-1 or 1-by-P 00253 * irrespective of which one-dimensional descriptor type 00254 * (501 or 502) is input. 00255 * This routine will interpret the grid properly either way. 00256 * ScaLAPACK routines *do not support intercontext operations* so that 00257 * the grid passed to a single ScaLAPACK routine *must be the same* 00258 * for all array descriptors passed to that routine. 00259 * 00260 * NOTE: In all cases where 1D descriptors are used, 2D descriptors 00261 * may also be used, since a one-dimensional array is a special case 00262 * of a two-dimensional array with one dimension of size unity. 00263 * The two-dimensional array used in this case *must* be of the 00264 * proper orientation: 00265 * If the appropriate one-dimensional descriptor is DTYPEA=501 00266 * (1 by P type), then the two dimensional descriptor must 00267 * have a CTXT value that refers to a 1 by P BLACS grid; 00268 * If the appropriate one-dimensional descriptor is DTYPEA=502 00269 * (P by 1 type), then the two dimensional descriptor must 00270 * have a CTXT value that refers to a P by 1 BLACS grid. 00271 * 00272 * 00273 * Summary of allowed descriptors, types, and BLACS grids: 00274 * DTYPE 501 502 1 1 00275 * BLACS grid 1xP or Px1 1xP or Px1 1xP Px1 00276 * ----------------------------------------------------- 00277 * A OK NO OK NO 00278 * B NO OK NO OK 00279 * 00280 * Let A be a generic term for any 1D block cyclicly distributed array. 00281 * Such a global array has an associated description vector DESCA. 00282 * In the following comments, the character _ should be read as 00283 * "of the global array". 00284 * 00285 * NOTATION STORED IN EXPLANATION 00286 * --------------- ---------- ------------------------------------------ 00287 * DTYPE_A(global) DESCA( 1 ) The descriptor type. For 1D grids, 00288 * TYPE_A = 501: 1-by-P grid. 00289 * TYPE_A = 502: P-by-1 grid. 00290 * CTXT_A (global) DESCA( 2 ) The BLACS context handle, indicating 00291 * the BLACS process grid A is distribu- 00292 * ted over. The context itself is glo- 00293 * bal, but the handle (the integer 00294 * value) may vary. 00295 * N_A (global) DESCA( 3 ) The size of the array dimension being 00296 * distributed. 00297 * NB_A (global) DESCA( 4 ) The blocking factor used to distribute 00298 * the distributed dimension of the array. 00299 * SRC_A (global) DESCA( 5 ) The process row or column over which the 00300 * first row or column of the array 00301 * is distributed. 00302 * LLD_A (local) DESCA( 6 ) The leading dimension of the local array 00303 * storing the local blocks of the distri- 00304 * buted array A. Minimum value of LLD_A 00305 * depends on TYPE_A. 00306 * TYPE_A = 501: LLD_A >= 00307 * size of undistributed dimension, 1. 00308 * TYPE_A = 502: LLD_A >=NB_A, 1. 00309 * Reserved DESCA( 7 ) Reserved for future use. 00310 * 00311 * 00312 * 00313 * ===================================================================== 00314 * 00315 * Code Developer: Andrew J. Cleary, University of Tennessee. 00316 * Current address: Lawrence Livermore National Labs. 00317 * This version released: August, 2001. 00318 * 00319 * ===================================================================== 00320 * 00321 * .. 00322 * .. Parameters .. 00323 REAL ONE, ZERO 00324 PARAMETER ( ONE = 1.0E+0 ) 00325 PARAMETER ( ZERO = 0.0E+0 ) 00326 COMPLEX CONE, CZERO 00327 PARAMETER ( CONE = ( 1.0E+0, 0.0E+0 ) ) 00328 PARAMETER ( CZERO = ( 0.0E+0, 0.0E+0 ) ) 00329 INTEGER INT_ONE 00330 PARAMETER ( INT_ONE = 1 ) 00331 INTEGER DESCMULT, BIGNUM 00332 PARAMETER (DESCMULT = 100, BIGNUM = DESCMULT * DESCMULT) 00333 INTEGER BLOCK_CYCLIC_2D, CSRC_, CTXT_, DLEN_, DTYPE_, 00334 $ LLD_, MB_, M_, NB_, N_, RSRC_ 00335 PARAMETER ( BLOCK_CYCLIC_2D = 1, DLEN_ = 9, DTYPE_ = 1, 00336 $ CTXT_ = 2, M_ = 3, N_ = 4, MB_ = 5, NB_ = 6, 00337 $ RSRC_ = 7, CSRC_ = 8, LLD_ = 9 ) 00338 * .. 00339 * .. Local Scalars .. 00340 INTEGER CSRC, DL_N_M, DL_N_N, DL_P_M, DL_P_N, DU_N_M, 00341 $ DU_N_N, DU_P_M, DU_P_N, FIRST_PROC, I, ICTXT, 00342 $ ICTXT_NEW, ICTXT_SAVE, IDUM2, IDUM3, J, JA_NEW, 00343 $ LLDA, LLDB, MAX_BW, MYCOL, MYROW, MY_NUM_COLS, 00344 $ NB, NP, NPCOL, NPROW, NP_SAVE, ODD_SIZE, OFST, 00345 $ PART_OFFSET, PART_SIZE, STORE_M_B, STORE_N_A 00346 INTEGER NUMROC_SIZE 00347 * .. 00348 * .. Local Arrays .. 00349 INTEGER PARAM_CHECK( 17, 3 ) 00350 * .. 00351 * .. External Subroutines .. 00352 EXTERNAL BLACS_GRIDINFO, PXERBLA, RESHAPE 00353 * .. 00354 * .. External Functions .. 00355 LOGICAL LSAME 00356 INTEGER NUMROC 00357 EXTERNAL LSAME, NUMROC 00358 * .. 00359 * .. Intrinsic Functions .. 00360 INTRINSIC ICHAR, MIN, MOD 00361 * .. 00362 * .. Executable Statements .. 00363 * 00364 * Test the input parameters 00365 * 00366 INFO = 0 00367 * 00368 ICTXT = DESCA( CTXT_ ) 00369 CSRC = DESCA( CSRC_ ) 00370 NB = DESCA( NB_ ) 00371 LLDA = DESCA( LLD_ ) 00372 STORE_N_A = DESCA( N_ ) 00373 LLDB = DESCB( LLD_ ) 00374 STORE_M_B = DESCB( M_ ) 00375 * 00376 * 00377 * Size of separator blocks is maximum of bandwidths 00378 * 00379 MAX_BW = MAX(BWL,BWU) 00380 * 00381 CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) 00382 NP = NPROW * NPCOL 00383 * 00384 * 00385 * 00386 IF( LSAME( TRANS, 'N' ) ) THEN 00387 IDUM2 = ICHAR( 'N' ) 00388 ELSE IF ( LSAME( TRANS, 'C' ) ) THEN 00389 IDUM2 = ICHAR( 'C' ) 00390 ELSE 00391 INFO = -1 00392 END IF 00393 * 00394 IF( LWORK .LT. -1) THEN 00395 INFO = -15 00396 ELSE IF ( LWORK .EQ. -1 ) THEN 00397 IDUM3 = -1 00398 ELSE 00399 IDUM3 = 1 00400 ENDIF 00401 * 00402 IF( N .LT. 0 ) THEN 00403 INFO = -2 00404 ENDIF 00405 * 00406 IF( N+JA-1 .GT. STORE_N_A ) THEN 00407 INFO = -( 8*100 + 6 ) 00408 ENDIF 00409 * 00410 IF(( BWL .GT. N-1 ) .OR. 00411 $ ( BWL .LT. 0 ) ) THEN 00412 INFO = -3 00413 ENDIF 00414 * 00415 IF(( BWU .GT. N-1 ) .OR. 00416 $ ( BWU .LT. 0 ) ) THEN 00417 INFO = -4 00418 ENDIF 00419 * 00420 IF( LLDA .LT. (BWL+BWU+1) ) THEN 00421 INFO = -( 8*100 + 6 ) 00422 ENDIF 00423 * 00424 IF( NB .LE. 0 ) THEN 00425 INFO = -( 8*100 + 4 ) 00426 ENDIF 00427 * 00428 * Argument checking that is specific to Divide & Conquer routine 00429 * 00430 IF( NPROW .NE. 1 ) THEN 00431 INFO = -( 8*100+2 ) 00432 ENDIF 00433 * 00434 IF( N .GT. NP*NB-MOD( JA-1, NB )) THEN 00435 INFO = -( 2 ) 00436 CALL PXERBLA( ICTXT, 00437 $ 'PCDBDCMV, D&C alg.: only 1 block per proc', 00438 $ -INFO ) 00439 RETURN 00440 ENDIF 00441 * 00442 IF((JA+N-1.GT.NB) .AND. ( NB.LT.2*MAX(BWL,BWU) )) THEN 00443 INFO = -( 8*100+4 ) 00444 CALL PXERBLA( ICTXT, 00445 $ 'PCDBDCMV, D&C alg.: NB too small', 00446 $ -INFO ) 00447 RETURN 00448 ENDIF 00449 * 00450 * 00451 * Pack params and positions into arrays for global consistency check 00452 * 00453 PARAM_CHECK( 17, 1 ) = DESCB(5) 00454 PARAM_CHECK( 16, 1 ) = DESCB(4) 00455 PARAM_CHECK( 15, 1 ) = DESCB(3) 00456 PARAM_CHECK( 14, 1 ) = DESCB(2) 00457 PARAM_CHECK( 13, 1 ) = DESCB(1) 00458 PARAM_CHECK( 12, 1 ) = IB 00459 PARAM_CHECK( 11, 1 ) = DESCA(5) 00460 PARAM_CHECK( 10, 1 ) = DESCA(4) 00461 PARAM_CHECK( 9, 1 ) = DESCA(3) 00462 PARAM_CHECK( 8, 1 ) = DESCA(1) 00463 PARAM_CHECK( 7, 1 ) = JA 00464 PARAM_CHECK( 6, 1 ) = NRHS 00465 PARAM_CHECK( 5, 1 ) = BWU 00466 PARAM_CHECK( 4, 1 ) = BWL 00467 PARAM_CHECK( 3, 1 ) = N 00468 PARAM_CHECK( 2, 1 ) = IDUM3 00469 PARAM_CHECK( 1, 1 ) = IDUM2 00470 * 00471 PARAM_CHECK( 17, 2 ) = 1105 00472 PARAM_CHECK( 16, 2 ) = 1104 00473 PARAM_CHECK( 15, 2 ) = 1103 00474 PARAM_CHECK( 14, 2 ) = 1102 00475 PARAM_CHECK( 13, 2 ) = 1101 00476 PARAM_CHECK( 12, 2 ) = 10 00477 PARAM_CHECK( 11, 2 ) = 805 00478 PARAM_CHECK( 10, 2 ) = 804 00479 PARAM_CHECK( 9, 2 ) = 803 00480 PARAM_CHECK( 8, 2 ) = 801 00481 PARAM_CHECK( 7, 2 ) = 7 00482 PARAM_CHECK( 6, 2 ) = 5 00483 PARAM_CHECK( 5, 2 ) = 4 00484 PARAM_CHECK( 4, 2 ) = 3 00485 PARAM_CHECK( 3, 2 ) = 2 00486 PARAM_CHECK( 2, 2 ) = 15 00487 PARAM_CHECK( 1, 2 ) = 1 00488 * 00489 * Want to find errors with MIN( ), so if no error, set it to a big 00490 * number. If there already is an error, multiply by the the 00491 * descriptor multiplier. 00492 * 00493 IF( INFO.GE.0 ) THEN 00494 INFO = BIGNUM 00495 ELSE IF( INFO.LT.-DESCMULT ) THEN 00496 INFO = -INFO 00497 ELSE 00498 INFO = -INFO * DESCMULT 00499 END IF 00500 * 00501 * Check consistency across processors 00502 * 00503 CALL GLOBCHK( ICTXT, 17, PARAM_CHECK, 17, 00504 $ PARAM_CHECK( 1, 3 ), INFO ) 00505 * 00506 * Prepare output: set info = 0 if no error, and divide by DESCMULT 00507 * if error is not in a descriptor entry. 00508 * 00509 IF( INFO.EQ.BIGNUM ) THEN 00510 INFO = 0 00511 ELSE IF( MOD( INFO, DESCMULT ) .EQ. 0 ) THEN 00512 INFO = -INFO / DESCMULT 00513 ELSE 00514 INFO = -INFO 00515 END IF 00516 * 00517 IF( INFO.LT.0 ) THEN 00518 CALL PXERBLA( ICTXT, 'PCDBDCMV', -INFO ) 00519 RETURN 00520 END IF 00521 * 00522 * Quick return if possible 00523 * 00524 IF( N.EQ.0 ) 00525 $ RETURN 00526 * 00527 * 00528 * Adjust addressing into matrix space to properly get into 00529 * the beginning part of the relevant data 00530 * 00531 PART_OFFSET = NB*( (JA-1)/(NPCOL*NB) ) 00532 * 00533 IF ( (MYCOL-CSRC) .LT. (JA-PART_OFFSET-1)/NB ) THEN 00534 PART_OFFSET = PART_OFFSET + NB 00535 ENDIF 00536 * 00537 IF ( MYCOL .LT. CSRC ) THEN 00538 PART_OFFSET = PART_OFFSET - NB 00539 ENDIF 00540 * 00541 * Form a new BLACS grid (the "standard form" grid) with only procs 00542 * holding part of the matrix, of size 1xNP where NP is adjusted, 00543 * starting at csrc=0, with JA modified to reflect dropped procs. 00544 * 00545 * First processor to hold part of the matrix: 00546 * 00547 FIRST_PROC = MOD( ( JA-1 )/NB+CSRC, NPCOL ) 00548 * 00549 * Calculate new JA one while dropping off unused processors. 00550 * 00551 JA_NEW = MOD( JA-1, NB ) + 1 00552 * 00553 * Save and compute new value of NP 00554 * 00555 NP_SAVE = NP 00556 NP = ( JA_NEW+N-2 )/NB + 1 00557 * 00558 * Call utility routine that forms "standard-form" grid 00559 * 00560 CALL RESHAPE( ICTXT, INT_ONE, ICTXT_NEW, INT_ONE, 00561 $ FIRST_PROC, INT_ONE, NP ) 00562 * 00563 * Use new context from standard grid as context. 00564 * 00565 ICTXT_SAVE = ICTXT 00566 ICTXT = ICTXT_NEW 00567 * 00568 * Get information about new grid. 00569 * 00570 CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) 00571 * 00572 * Drop out processors that do not have part of the matrix. 00573 * 00574 IF( MYROW .LT. 0 ) THEN 00575 GOTO 1234 00576 ENDIF 00577 * 00578 * ******************************** 00579 * Values reused throughout routine 00580 * 00581 * User-input value of partition size 00582 * 00583 PART_SIZE = NB 00584 * 00585 * Number of columns in each processor 00586 * 00587 MY_NUM_COLS = NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) 00588 * 00589 * Offset in columns to beginning of main partition in each proc 00590 * 00591 IF ( MYCOL .EQ. 0 ) THEN 00592 PART_OFFSET = PART_OFFSET+MOD( JA_NEW-1, PART_SIZE ) 00593 MY_NUM_COLS = MY_NUM_COLS - MOD(JA_NEW-1, PART_SIZE ) 00594 ENDIF 00595 * 00596 * Offset in elements 00597 * 00598 OFST = PART_OFFSET*LLDA 00599 * 00600 * Size of main (or odd) partition in each processor 00601 * 00602 ODD_SIZE = MY_NUM_COLS 00603 IF ( MYCOL .LT. NP-1 ) THEN 00604 ODD_SIZE = ODD_SIZE - MAX_BW 00605 ENDIF 00606 * 00607 * 00608 * 00609 * Zero out solution to use to accumulate answer 00610 * 00611 NUMROC_SIZE = 00612 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL) 00613 * 00614 DO 2279 J=1,NRHS 00615 DO 4502 I=1,NUMROC_SIZE 00616 X( (J-1)*LLDB + I ) = CZERO 00617 4502 CONTINUE 00618 2279 CONTINUE 00619 * 00620 DO 5642 I=1, (MAX_BW+2)*MAX_BW 00621 WORK( I ) = CZERO 00622 5642 CONTINUE 00623 * 00624 * Begin main code 00625 * 00626 * 00627 ************************************** 00628 * 00629 IF ( LSAME( TRANS, 'N' ) ) THEN 00630 * 00631 * Sizes of the extra triangles communicated bewtween processors 00632 * 00633 IF( MYCOL .GT. 0 ) THEN 00634 * 00635 DL_P_M= MIN( BWL, 00636 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00637 DL_P_N= MIN( BWL, 00638 $ NUMROC( N, PART_SIZE, MYCOL-1, 0, NPCOL ) ) 00639 * 00640 DU_P_M= MIN( BWU, 00641 $ NUMROC( N, PART_SIZE, MYCOL-1, 0, NPCOL ) ) 00642 DU_P_N= MIN( BWU, 00643 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00644 ENDIF 00645 * 00646 IF( MYCOL .LT. NPCOL-1 ) THEN 00647 * 00648 DL_N_M= MIN( BWL, 00649 $ NUMROC( N, PART_SIZE, MYCOL+1, 0, NPCOL ) ) 00650 DL_N_N= MIN( BWL, 00651 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00652 * 00653 DU_N_M= MIN( BWU, 00654 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00655 DU_N_N= MIN( BWU, 00656 $ NUMROC( N, PART_SIZE, MYCOL+1, 0, NPCOL ) ) 00657 ENDIF 00658 * 00659 * 00660 * Use main partition in each processor to multiply locally 00661 * 00662 CALL CGBMV( TRANS, NUMROC_SIZE, NUMROC_SIZE, BWL, BWU, CONE, 00663 $ A( OFST+1 ), LLDA, B(PART_OFFSET+1), 1, CZERO, 00664 $ X( PART_OFFSET+1 ), 1 ) 00665 * 00666 * 00667 * 00668 IF ( MYCOL .LT. NPCOL-1 ) THEN 00669 * 00670 * Do the multiplication of the triangle in the lower half 00671 * 00672 CALL CCOPY( DL_N_N, 00673 $ B( NUMROC_SIZE-DL_N_N+1 ), 00674 $ 1, WORK( MAX_BW*MAX_BW+1+BWL-DL_N_N ), 1 ) 00675 * 00676 CALL CTRMV( 'U', 'N', 'N', BWL, 00677 $ A( LLDA*( NUMROC_SIZE-BWL )+1+BWU+BWL ), LLDA-1, 00678 $ WORK( MAX_BW*MAX_BW+1 ), 1) 00679 * 00680 * Zero out extraneous elements caused by TRMV if any 00681 * 00682 IF( DL_N_M .GT. DL_N_N ) THEN 00683 DO 10 I = DL_N_M-DL_N_N, DL_N_M 00684 WORK( MAX_BW*MAX_BW+I ) = 0 00685 10 CONTINUE 00686 ENDIF 00687 * 00688 * Send the result to the neighbor 00689 * 00690 CALL CGESD2D( ICTXT, BWL, 1, 00691 $ WORK( MAX_BW*MAX_BW+1 ), BWL, MYROW, MYCOL+1 ) 00692 * 00693 ENDIF 00694 * 00695 IF ( MYCOL .GT. 0 ) THEN 00696 * 00697 DO 20 I=1, MAX_BW*( MAX_BW+2 ) 00698 WORK( I ) = CZERO 00699 20 CONTINUE 00700 * 00701 * Do the multiplication of the triangle in the upper half 00702 * 00703 * Copy vector to workspace 00704 * 00705 CALL CCOPY( DU_P_N, B( 1 ), 1, 00706 $ WORK( MAX_BW*MAX_BW+1 ), 1) 00707 * 00708 CALL CTRMV( 00709 $ 'L', 00710 $ 'N', 00711 $ 'N', BWU, 00712 $ A( 1 ), LLDA-1, 00713 $ WORK( MAX_BW*MAX_BW+1 ), 1 ) 00714 * 00715 * Zero out extraneous results from TRMV if any 00716 * 00717 IF( DU_P_N .GT. DU_P_M ) THEN 00718 DO 30 I=1, DU_P_N-DU_P_M 00719 WORK( MAX_BW*MAX_BW+I ) = 0 00720 30 CONTINUE 00721 ENDIF 00722 * 00723 * Send result back 00724 * 00725 CALL CGESD2D( ICTXT, BWU, 1, WORK(MAX_BW*MAX_BW+1 ), 00726 $ BWU, MYROW, MYCOL-1 ) 00727 * 00728 * Receive vector result from neighboring processor 00729 * 00730 CALL CGERV2D( ICTXT, BWL, 1, WORK( MAX_BW*MAX_BW+1 ), 00731 $ BWL, MYROW, MYCOL-1 ) 00732 * 00733 * Do addition of received vector 00734 * 00735 CALL CAXPY( BWL, CONE, 00736 $ WORK( MAX_BW*MAX_BW+1 ), 1, 00737 $ X( 1 ), 1 ) 00738 * 00739 ENDIF 00740 * 00741 * 00742 * 00743 IF( MYCOL .LT. NPCOL-1 ) THEN 00744 * 00745 * Receive returned result 00746 * 00747 CALL CGERV2D( ICTXT, BWU, 1, WORK( MAX_BW*MAX_BW+1 ), 00748 $ BWU, MYROW, MYCOL+1 ) 00749 * 00750 * Do addition of received vector 00751 * 00752 CALL CAXPY( BWU, CONE, 00753 $ WORK( MAX_BW*MAX_BW+1 ), 1, 00754 $ X( NUMROC_SIZE-BWU+1 ), 1) 00755 * 00756 ENDIF 00757 * 00758 * 00759 ENDIF 00760 * 00761 * End of LSAME if 00762 * 00763 ************************************** 00764 * 00765 IF ( LSAME( TRANS, 'C' ) ) THEN 00766 * 00767 * Sizes of the extra triangles communicated bewtween processors 00768 * 00769 IF( MYCOL .GT. 0 ) THEN 00770 * 00771 DL_P_M= MIN( BWU, 00772 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00773 DL_P_N= MIN( BWU, 00774 $ NUMROC( N, PART_SIZE, MYCOL-1, 0, NPCOL ) ) 00775 * 00776 DU_P_M= MIN( BWL, 00777 $ NUMROC( N, PART_SIZE, MYCOL-1, 0, NPCOL ) ) 00778 DU_P_N= MIN( BWL, 00779 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00780 ENDIF 00781 * 00782 IF( MYCOL .LT. NPCOL-1 ) THEN 00783 * 00784 DL_N_M= MIN( BWU, 00785 $ NUMROC( N, PART_SIZE, MYCOL+1, 0, NPCOL ) ) 00786 DL_N_N= MIN( BWU, 00787 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00788 * 00789 DU_N_M= MIN( BWL, 00790 $ NUMROC( N, PART_SIZE, MYCOL, 0, NPCOL ) ) 00791 DU_N_N= MIN( BWL, 00792 $ NUMROC( N, PART_SIZE, MYCOL+1, 0, NPCOL ) ) 00793 ENDIF 00794 * 00795 * 00796 IF( MYCOL .GT. 0 ) THEN 00797 * ...must send triangle in lower half of matrix to left 00798 * 00799 * Transpose triangle in preparation for sending 00800 * 00801 CALL CLATCPY( 'L', BWU, BWU, A( OFST+1 ), 00802 $ LLDA-1, WORK( 1 ), MAX_BW ) 00803 * 00804 * Send the triangle to neighboring processor to left 00805 * 00806 CALL CTRSD2D(ICTXT, 'U', 'N', 00807 $ BWU, BWU, 00808 $ WORK( 1 ), 00809 $ MAX_BW, MYROW, MYCOL-1 ) 00810 * 00811 ENDIF 00812 * 00813 IF( MYCOL .LT. NPCOL-1 ) THEN 00814 * ...must send triangle in upper half of matrix to right 00815 * 00816 * Transpose triangle in preparation for sending 00817 * 00818 CALL CLATCPY( 'U', BWL, BWL, 00819 $ A( LLDA*( NUMROC_SIZE-BWL )+1+BWU+BWL ), 00820 $ LLDA-1, WORK( 1 ), MAX_BW ) 00821 * 00822 * Send the triangle to neighboring processor to right 00823 * 00824 CALL CTRSD2D(ICTXT, 'L', 'N', 00825 $ BWL, BWL, 00826 $ WORK( 1 ), 00827 $ MAX_BW, MYROW, MYCOL+1 ) 00828 * 00829 ENDIF 00830 * 00831 * Use main partition in each processor to multiply locally 00832 * 00833 CALL CGBMV( TRANS, NUMROC_SIZE, NUMROC_SIZE, BWL, BWU, CONE, 00834 $ A( OFST+1 ), LLDA, B(PART_OFFSET+1), 1, CZERO, 00835 $ X( PART_OFFSET+1 ), 1 ) 00836 * 00837 * 00838 * 00839 IF ( MYCOL .LT. NPCOL-1 ) THEN 00840 * 00841 * Do the multiplication of the triangle in the lower half 00842 * 00843 CALL CCOPY( DL_N_N, 00844 $ B( NUMROC_SIZE-DL_N_N+1 ), 00845 $ 1, WORK( MAX_BW*MAX_BW+1+BWU-DL_N_N ), 1 ) 00846 * 00847 * Receive the triangle prior to multiplying by it. 00848 * 00849 CALL CTRRV2D(ICTXT, 'U', 'N', 00850 $ BWU, BWU, 00851 $ WORK( 1 ), MAX_BW, MYROW, MYCOL+1 ) 00852 * 00853 CALL CTRMV( 'U', 'N', 'N', BWU, 00854 $ WORK( 1 ), MAX_BW, 00855 $ WORK( MAX_BW*MAX_BW+1 ), 1) 00856 * 00857 * Zero out extraneous elements caused by TRMV if any 00858 * 00859 IF( DL_N_M .GT. DL_N_N ) THEN 00860 DO 40 I = DL_N_M-DL_N_N, DL_N_M 00861 WORK( MAX_BW*MAX_BW+I ) = 0 00862 40 CONTINUE 00863 ENDIF 00864 * 00865 * Send the result to the neighbor 00866 * 00867 CALL CGESD2D( ICTXT, BWU, 1, 00868 $ WORK( MAX_BW*MAX_BW+1 ), BWU, MYROW, MYCOL+1 ) 00869 * 00870 ENDIF 00871 * 00872 IF ( MYCOL .GT. 0 ) THEN 00873 * 00874 DO 50 I=1, MAX_BW*( MAX_BW+2 ) 00875 WORK( I ) = CZERO 00876 50 CONTINUE 00877 * 00878 * Do the multiplication of the triangle in the upper half 00879 * 00880 * Copy vector to workspace 00881 * 00882 CALL CCOPY( DU_P_N, B( 1 ), 1, 00883 $ WORK( MAX_BW*MAX_BW+1 ), 1) 00884 * 00885 * Receive the triangle prior to multiplying by it. 00886 * 00887 CALL CTRRV2D(ICTXT, 'L', 'N', 00888 $ BWL, BWL, 00889 $ WORK( 1 ), MAX_BW, MYROW, MYCOL-1 ) 00890 * 00891 CALL CTRMV( 00892 $ 'L', 00893 $ 'N', 00894 $ 'N', BWL, 00895 $ WORK( 1 ), MAX_BW, 00896 $ WORK( MAX_BW*MAX_BW+1 ), 1 ) 00897 * 00898 * Zero out extraneous results from TRMV if any 00899 * 00900 IF( DU_P_N .GT. DU_P_M ) THEN 00901 DO 60 I=1, DU_P_N-DU_P_M 00902 WORK( MAX_BW*MAX_BW+I ) = 0 00903 60 CONTINUE 00904 ENDIF 00905 * 00906 * Send result back 00907 * 00908 CALL CGESD2D( ICTXT, BWL, 1, WORK(MAX_BW*MAX_BW+1 ), 00909 $ BWL, MYROW, MYCOL-1 ) 00910 * 00911 * Receive vector result from neighboring processor 00912 * 00913 CALL CGERV2D( ICTXT, BWU, 1, WORK( MAX_BW*MAX_BW+1 ), 00914 $ BWU, MYROW, MYCOL-1 ) 00915 * 00916 * Do addition of received vector 00917 * 00918 CALL CAXPY( BWU, CONE, 00919 $ WORK( MAX_BW*MAX_BW+1 ), 1, 00920 $ X( 1 ), 1 ) 00921 * 00922 ENDIF 00923 * 00924 * 00925 * 00926 IF( MYCOL .LT. NPCOL-1 ) THEN 00927 * 00928 * Receive returned result 00929 * 00930 CALL CGERV2D( ICTXT, BWL, 1, WORK( MAX_BW*MAX_BW+1 ), 00931 $ BWL, MYROW, MYCOL+1 ) 00932 * 00933 * Do addition of received vector 00934 * 00935 CALL CAXPY( BWL, CONE, 00936 $ WORK( MAX_BW*MAX_BW+1 ), 1, 00937 $ X( NUMROC_SIZE-BWL+1 ), 1) 00938 * 00939 ENDIF 00940 * 00941 * 00942 ENDIF 00943 * 00944 * End of LSAME if 00945 * 00946 * 00947 * Free BLACS space used to hold standard-form grid. 00948 * 00949 IF( ICTXT_SAVE .NE. ICTXT_NEW ) THEN 00950 CALL BLACS_GRIDEXIT( ICTXT_NEW ) 00951 ENDIF 00952 * 00953 1234 CONTINUE 00954 * 00955 * Restore saved input parameters 00956 * 00957 ICTXT = ICTXT_SAVE 00958 NP = NP_SAVE 00959 * 00960 * 00961 RETURN 00962 * 00963 * End of PCBhBMV1 00964 * 00965 END