*******************************************************************************
*******************************************************************************
*                               STATUS SECTION                                *
*******************************************************************************
*******************************************************************************

This file last modified on 03/05/2001.

Version 1.1 release of the BLACS on 5/01/97.  The tester for this release of
the BLACS is available in the gzipped tarfile blacstester.tgz.

This release of the BLACS has 4 versions:
   mpiblacs.tgz : BLACS using MPI
   mplblacs.tgz : BLACS using IBM SP's MPL library
   nxblacs.tgz  : BLACS using Intel's NX library
   pvmblacs.tgz : BLACS using PVM

It appears that the CMMDBLACS have become obsolete.  If you need this BLACS
version, send mail to blacs@cs.utk.edu.

BLACS errors found:
   MPI BLACS: errors #8, #9, #13, and #16
BLACS tester errors found: none.

There have been 3 patches for this release.  Patches are cumulative, so you
need only get the newest one, mpiblacs-patch03.tgz.

All g77 users should examine MPI ERROR #14 for instructions on making the
tester compile correctly with g77.  LAM-MPI users should examine MPI ERROR
#12.

The rest of this file is divided into 4 sections, one for each BLACS type.

*******************************************************************************
*******************************************************************************
*                              MPIBLACS SECTION                               *
*******************************************************************************
*******************************************************************************

Please note the MPICH version number in suspected MPICH bugs.  These may have
been fixed by subsequent releases.  Be sure you have applied the mpi patch
file mentioned above!

=============================== MPI ERROR #1 ==================================
WHERE:  MPICH1.0.13 with patch
WHAT:   Probable MPICH error
STATUS: fixed by MPICH1.1

If you apply the big patch to MPICH1.0.13 and run the BLACS tester, it will
fail in the double precision BS/BR tests of the standard input files.  It is
not actually an error in the double precision BS/BR tests: the same failure
happens if you run the integer tests 3 times, for instance.  This appears to
be some sort of resource exhaustion or memory overwrite associated with the
patch.  The problem could possibly be in the BLACS, but they work with
unpatched MPICH1.0.13, all previous MPICH releases, and IBM's MPI.

For right now, the best solution is probably not to apply the MPICH patch.
LINUX users will probably need to use MPICH1.0.12 (MPICH1.0.13 will not
compile on LINUX without the patch).

The error message caused by this error is:
>2 - MPI_p2_11509:  p4_error: : 3
>rm_l_2_11511: p4_error: interrupt SIGINT: 2
>TYPE_COMMIT : Invalid datatype argument
>[2]  Aborting program !
>[2] Aborting program!

=============================== MPI ERROR #2 ==================================
WHERE:  ALL MPICH
WHAT:   MPICH ERROR
STATUS: NOT FIXED

In MPICH1.0.13 and MPICH1.1, there appears to be an error in MPICH's
MPI_Abort.  It does not kill any other processes at all, but seems to behave
pretty much like calling a local exit().  This will cause the BLACS tester to
hang on the BLACS_ABORT test in the auxiliary tests.
Here is a straight MPI code demonstrating the error:

#include <stdio.h>
#include "mpi.h"

main(int narg, char **args)
{
   int i, Iam, Np;

   MPI_Init(&narg, &args);
   MPI_Comm_size(MPI_COMM_WORLD, &Np);
   MPI_Comm_rank(MPI_COMM_WORLD, &Iam);
   if (Iam == Np-1) MPI_Abort(MPI_COMM_WORLD, -2);
   while(1);   /* remaining processes spin; MPI_Abort should kill them */
   MPI_Finalize();
}

=============================== MPI ERROR #3 ==================================
WHERE:  SGI's MPI v2.0
WHAT:   SGI MPI ERROR
STATUS: FIXED v3.0

SGI's MPI v2.0 cannot handle repeated usage and freeing of data types.  This
error has been fixed in SGI MPI v3.0.  Included below is a small straight-MPI
program that demonstrates the error.  This program fails in the 984th k-loop
iteration, with the following message from MPI (you get this message if you
run the BLACS tester as well):
>Assertion failed: i < 1024, file type_util.c, line 69, pid 4965
>loop 983
>Assertion failed: i < 1024, file type_util.c, line 69, pid 4966

We have successfully used SGI MPI v3.0 and MPICH on this platform.

#include <stdio.h>
#include "mpi.h"

main(int narg, char **args)
{
   int i, Iam, Np, k, j;
   MPI_Datatype Dtype;
   MPI_Status stat;

   MPI_Init(&narg, &args);
   MPI_Comm_size(MPI_COMM_WORLD, &Np);
   MPI_Comm_rank(MPI_COMM_WORLD, &Iam);
   fprintf(stdout, "%d: starting test\n", Iam);
   for (k=0; k != 10000; k++)
   {
      i = j = 1;
      MPI_Type_vector(1, 1, 1, MPI_INT, &Dtype);
      MPI_Type_commit(&Dtype);
      if (Iam == 0)
      {
         MPI_Send(&Iam, 1, Dtype, 1, 0, MPI_COMM_WORLD);
      }
      else
      {
         MPI_Recv(&i, 1, Dtype, 0, 0, MPI_COMM_WORLD, &stat);
      }
      MPI_Type_free(&Dtype);
      fprintf(stdout, "loop %d\n", k);
   }
   fprintf(stdout, "MPI sanity test passed\n");
   MPI_Finalize();
}

=============================== MPI ERROR #4 ==================================
WHERE:  RS6000
WHAT:   COMPILER PROBLEM

You must use gcc, not xlc, to compile MPICH1.0.10 on the rs6000.  To
configure MPICH, you need to add -cc=gcc to the configure line (thus my
configure line was: 'configure -device=ch_p4 -arch=rs6000 -cc=gcc').

=============================== MPI ERROR #5 ==================================
WHERE:  SUN4
WHAT:   COMPILER MISMATCH
STATUS: FIXED BY COMPILER FLAG

We use gcc to compile the BLACS on the SUN4, and it seems to require that all
double precision data be aligned on an 8-byte boundary.  SUN's f77 defaults
to aligning local double precision scalars on 4-byte boundaries, potentially
causing bus errors.  Use the f77 flag -f when compiling all fortran code to
force 8-byte alignment.  Therefore, add -f to the NOPT macro in SLmake.inc
and to F77NO_OPTFLAGS in Bmake.inc.

=============================== MPI ERROR #6 ==================================
WHERE:  T3E
WHAT:   MPI ERROR
STATUS: FIXED in 1.2.0.0.6beta

CRAY MPI (MPT 1.1.0.2) has an error in handling 0-byte data types.  Here is
some legal MPI code that fails on the T3E:

#include <stdio.h>
#include <mpi.h>

main(int nargs, char **args)
{
   MPI_Datatype Dt;
   int ierr;

   MPI_Init(&nargs, &args);
   printf("If this routine does not complete, you should set SYSERRORS = -DZeroByteTypeBug.\n");
   ierr = MPI_Type_vector(0, 1, 1, MPI_INT, &Dt);
   if (ierr != MPI_SUCCESS)
      printf("MPI_Type_vector returned %d, set SYSERRORS = -DZeroByteTypeBug\n", ierr);
   else ierr = MPI_Type_commit(&Dt);
   if (ierr == MPI_SUCCESS) printf("Leave SYSERRORS blank for this system.\n");
   MPI_Finalize();
}

=============================== MPI ERROR #7 ==================================
WHERE:  T3E
WHAT:   MPI ERROR
STATUS: FIXED in 1.2.0.0.6beta

The CRAY MPI (MPT 1.1.0.2) has a strange error where it can't correctly
handle some data types if the communicator used to do the communication is
not MPI_COMM_WORLD.
Here is a small routine showing the error:

#include <stdio.h>
#include <mpi.h>

main(int nargs, char **args)
{
   MPI_Datatype Dt;
   MPI_Comm CMPI_COMM_WORLD;
   int Iam, Np, i, k, ierr;
   int ibuff[4];

   MPI_Init(&nargs, &args);
   MPI_Comm_rank(MPI_COMM_WORLD, &Iam);
   MPI_Comm_size(MPI_COMM_WORLD, &Np);
   MPI_Comm_dup(MPI_COMM_WORLD, &CMPI_COMM_WORLD);

   if (Iam) for (i=0; i != 4; i++) ibuff[i] = -9999;
   else for (i=0; i != 4; i++) ibuff[i] = i+1;

   ierr = MPI_Type_vector(2, 1, 2, MPI_INT, &Dt);
   if (ierr != MPI_SUCCESS) printf("MPI_Type_vector returned %d\n", ierr);
   else MPI_Type_commit(&Dt);

   MPI_Bcast(ibuff, 1, Dt, 0, CMPI_COMM_WORLD);
   MPI_Type_free(&Dt);

   for (k=0; k != Np; k++)
   {
      if (Iam == k)
      {
         fprintf(stdout, "%d: ibuff =", Iam);
         for (i=0; i != 4; i++) fprintf(stdout, " %d ", ibuff[i]);
         fprintf(stdout, "\n");
      }
      MPI_Barrier(CMPI_COMM_WORLD);
   }
   MPI_Finalize();
}

If CMPI_COMM_WORLD is set to MPI_COMM_WORLD, this routine produces the
correct answer:
>0: ibuff = 1  2  3  4
>1: ibuff = 1  -9999  3  -9999

Otherwise, you get:
>_T3EMPI_coll_send asked to deal with unknown datatype.
>0: ibuff = 1  2  3  4
>1: ibuff = 0  -9999  0  -9999

=============================== MPI ERROR #8 ==================================
WHERE:  SGI Origin 2000
WHAT:   MPIBLACS ERROR
STATUS: FIXED by patch

The BLACS were not freeing groups created by calls to MPI_COMM_GROUP, causing
some systems to run out of groups.

=============================== MPI ERROR #9 ==================================
WHERE:  T3E
WHAT:   MPI BLACS ERROR
STATUS: FIXED by patch

There were a couple of problems in the BLACS handling of CRAY's non-standard
F77 data types.  Also, you can't call F77's mpi_init from C on this platform.
These problems are fixed by the patch.

=============================== MPI ERROR #10 =================================
WHERE:  T3E
WHAT:   MPI ERROR
STATUS: workaround in patch 01, 02

mpt.1.2.0.0.6beta couldn't handle 0-length segments used with
MPI_Type_indexed.  To work around this problem, apply the patch and throw the
T3ETrError flag in your Bmake.inc (as shown in the example Bmake.T3E supplied
with the patch).

=============================== MPI ERROR #11 =================================
WHERE:  T3E
WHAT:   MPI ERROR
STATUS: workaround in patch 01, 02

mpt.1.2.0.0.6beta couldn't handle certain reductions where you mix types with
an MPI data type.  To work around this problem, apply the patch and throw the
T3EReductErr flag in your Bmake.inc (as shown in the example Bmake.T3E
supplied with the patch).

=============================== MPI ERROR #12 =================================
WHERE:  ALL platforms
WHAT:   new functionality
STATUS: in patch 01, 02

MPI-2 provides a standard way to translate communicators between C and
Fortran77.  If your MPI implements these routines, set TRANSCOMM to
-DUseMpi2.  We have reports that the newer versions of LAM-MPI use this
setting.

=============================== MPI ERROR #13 =================================
WHERE:  ALL platforms
WHAT:   BLACS ERROR
STATUS: in patch 02

Even after the first patch, there were still errors in freeing groups.  In
BLACS/SRC/MPI/INTERNAL/BI_TransUserComm.c, the group ugrp was not freed.  In
BLACS/SRC/MPI/INTERNAL/BI_MPI_F77_to_c_trans_comm.c, two groups were freed as
communicators, instead of groups.
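For reference, here is a small sketch of the handle hygiene that errors #8,
#12, and #13 concern.  This is illustrative code, not BLACS source, and it
assumes an MPI that provides the MPI-2 handle-translation routines: the
Fortran77 round-trip is what TRANSCOMM = -DUseMpi2 selects, and a group
gotten from MPI_Comm_group must be released with MPI_Group_free (not
MPI_Comm_free, and not leaked):

#include <stdio.h>
#include "mpi.h"

main(int narg, char **args)
{
   MPI_Comm comm;
   MPI_Group grp;
   MPI_Fint fhandle;

   MPI_Init(&narg, &args);
/*
 * MPI-2 communicator translation, as used when TRANSCOMM = -DUseMpi2:
 * a Fortran77 INTEGER handle converts to a C MPI_Comm and back
 */
   fhandle = MPI_Comm_c2f(MPI_COMM_WORLD);
   comm = MPI_Comm_f2c(fhandle);
/*
 * Groups must be freed as groups; freeing them as communicators (or not
 * at all) eventually exhausts the system's supply of groups
 */
   MPI_Comm_group(comm, &grp);
   MPI_Group_free(&grp);
   printf("handle translation and group free OK\n");
   MPI_Finalize();
}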
=============================== MPI ERROR #14 =================================
WHERE:  LINUX/g77
WHAT:   Compiler change
STATUS: fixed by flags

The BLACS tester uses a large array in order to simulate dynamic memory.  It
passes this array to routines that accept it as an array of differing data
types.  G77 has upgraded this, in some cases, from a warning to an error.
In order to tell g77 to allow this behavior, change line 39 of
BLACS/TESTING/Makefile from:
   $(F77) $(F77NO_OPTFLAGS) -c $*.f
to:
   $(F77) $(F77NO_OPTFLAGS) -fno-globals -fno-f90 -fugly-complex -w -c $*.f

=============================== MPI ERROR #15 =================================
WHERE:  ????
WHAT:   Compiler error/macro problem
STATUS: Not fixed

There is an undiagnosed problem that causes some users' dwalltime00 routine
to return bad values.  It appears likely that there is a problem with macro
name overruns, but errors in cpp or the code have not been ruled out.  If you
get bad return values from dwalltime00, overwrite
BLACS/SRC/MPI/dwalltime00_.c with:

#include "Bdef.h"
#if (INTFACE == C_CALL)
double Cdwalltime00(void)
#else
F_DOUBLE_FUNC dwalltime00_(void)
#endif
{
   return(MPI_Wtime());
}

=============================== MPI ERROR #16 =================================
WHERE:  mpich1.2.*
WHAT:   BLACS error
STATUS: Fixed by patch 03

If you get missing f77 argc and argv symbols when using the BLACS C init
routines, you are seeing this error.

*******************************************************************************
*******************************************************************************
*                              MPLBLACS SECTION                               *
*******************************************************************************
*******************************************************************************

=============================== MPL ERROR #1 ==================================
WHERE:  SP2
WHAT:   ERROR IN MPL
STATUS: NOT FIXED

It appears that MP_BRECV requires that messages be received in the order they
were sent, even if all messages have been successfully sent.  IBM has
reported that this is not an error, but rather perhaps an oversight in
documentation.  MPL does not support receiving messages in any order except
that in which they were sent.  Here is a small routine showing the problem:

      program tst
      integer k, iam, Np, ictxt, i, j

      call mpc_environ(Np, Iam)
      k = Iam + 100
      print*,'start'
      if (iam.eq.1) then
         call mp_send(Iam, 4, 0, 2, i)
         call mp_send(k, 4, 0, 3, j)
         print*,mp_status(i)
         print*,mp_status(j)
      else if (iam .eq. 0) then
         call mp_brecv(k, 4, 1, 3, j)
         call mp_brecv(k, 4, 1, 2, j)
      end if
      print*,'done'
      stop
      end

When this is run, the output is:
   xtst2 -procs 2
   start
   start
   4
   4
   done

So both sends complete, but the receives still hang.

*******************************************************************************
*******************************************************************************
*                               NXBLACS SECTION                               *
*******************************************************************************
*******************************************************************************

=============================== NX ERROR #1 ===================================
WHERE:  Some NX machines
WHAT:   ERROR IN NXBLACS
STATUS: NOT FIXED

The NXBLACS use a copy optimization which is, according to strict IEEE
arithmetic rules, illegal.  More precisely, doubles are sometimes used to
copy floats or integers.  At implementation time, the author tested all
available NX platforms and found no errors, so the optimization went in even
though it was known to be illegal.  Unfortunately, on more recent platforms
(i.e., ASCI red with the newest MPI) this causes problems.  So, if you get
mysterious errors in the tester, this may be what's happening.
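For reference, the optimization in question packs two 4-byte elements per
8-byte load/store through double pointers whenever the alignment test passes.
Below is a minimal sketch of the idea (illustrative names, not the actual
BLACS source, which calls a separate mvcopy8 helper); loading arbitrary
4-byte bit patterns as IEEE doubles can trap on some hardware, which is
consistent with the failures seen on newer platforms:

#include <stdio.h>

/*
 * Pack an m-by-n column-major array of 4-byte elements (leading
 * dimension lda) into a contiguous buffer.  When the base address is
 * 8-byte aligned and m and lda are even, pairs of 4-byte elements are
 * moved through double pointers -- the (illegal) speedup the NXBLACS use
 */
static void copy4(int m, int n, const int *A, int lda, int *buff)
{
   int i, j;

   if ( !((long)A % 8) && !(lda % 2) && !(m % 2) )
   {  /* treat each pair of 4-byte elements as one 8-byte double */
      const double *a = (const double *) A;
      double *b = (double *) buff;
      for (j=0; j < n; j++, a += lda/2)
         for (i=0; i < m/2; i++) *b++ = a[i];
   }
   else
   {  /* safe 4-byte packing */
      for (j=0; j < n; j++, A += lda)
         for (i=0; i < m; i++) *buff++ = A[i];
   }
}

int main(void)
{
   int A[4] = {1, 2, 3, 4}, buff[4];

   copy4(2, 2, A, 2, buff);
   printf("%d %d %d %d\n", buff[0], buff[1], buff[2], buff[3]);
   return(0);
}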
To prevent the BLACS from applying this illegal optimization, delete the
following lines in BLACS/SRC/NX/INTERNAL/mvcopy4.c:

   long iaddr;

   iaddr = (long) A;
/*
 * If address is on a 8 byte boundary, and lda and m are evenly divisible
 * by 2, can use double sized pointers for faster packing
 */
   if ( !(iaddr % 8) && !(lda % 2) && !(m % 2) )
      mvcopy8(m/2, n, (double *) A, lda/2, (double *) buff);
/*
 * Otherwise, must use 4 byte packing
 */
   else

You also need to delete basically the same lines from
BLACS/SRC/NX/INTERNAL/vmcopy4.c:

   long iaddr;

   iaddr = (long) A;
/*
 * If address is on a 8 byte boundary, and lda and m are evenly divisible
 * by 2, can use double sized pointers for faster packing
 */
   if ( !(iaddr % 8) && !(lda % 2) && !(m % 2) )
      vmcopy8(m/2, n, (double *) A, lda/2, (double *) buff);
/*
 * Otherwise, must use 4 byte packing
 */
   else

*******************************************************************************
*******************************************************************************
*                              PVMBLACS SECTION                               *
*******************************************************************************
*******************************************************************************

=============================== PVM ERROR #1 ==================================
WHERE:  SUNMP PVM
WHAT:   PVM3.3.11 ERROR
STATUS: NOT FIXED

SUNMP PVM is broken.  Your best bet is to rig your PVM_ARCH so that it thinks
it is a SUN4SOL2, and use that version of PVM.

=============================== PVM ERROR #2 ==================================
WHERE:  SGI5/new gcc
WHAT:   COMPILER ERROR
STATUS: NOT FIXED

This appears to be a compiler problem with including files within the
brackets of a routine: system files must be included before the scope of the
routine begins.  Therefore, in BLACS/SRC/PVM/blacs_setup_.c, move the
#include line to the second line of the file (after #include "Bdef.h").

=============================== PVM ERROR #3 ==================================
WHERE:  SGI5
WHAT:   COMPILER ERROR
STATUS: NOT FIXED

The compiler does not accept the -o (renaming option) if optimization is
turned on.  This breaks the compilation of the C interface.  Bmake.PVM-SGI5
defaults to using gcc.  If you can't use gcc, you may be able to use a
workaround like the following in BLACS/SRC/PVM/Makefile.

Line 166 of the original Makefile:

.SUFFIXES: .o .C
.c.C:
	$(CC) -c $(CCFLAGS) -o C$*.o $(BLACSDEFS) -DCallFromC $<
	mv C$*.o $*.C

SGI error workaround:

.SUFFIXES: .o .C
.c.C:
	ln -s $*.c C$*.c
	$(CC) -c $(CCFLAGS) $(BLACSDEFS) -DCallFromC C$*.c
	mv C$*.o $*.C
	rm -f C$*.c