[mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash

Antonio Molins amolins at MIT.EDU
Wed Oct 14 17:54:59 EDT 2009


After some more testing, I ran into the same problem again. I am now
using MKL 10.2.2.025 and the MVAPICH 1.1 version you sent me (that is,
the mvapich-1.1-2009-10-09 nightly build). If you run the same code I
provided:

% mpirun -np 36 ~/shadowfax/EMsmooth_work/bin/crash 10000 1
Exit code -5 signaled from compute-0-71
Killing remote processes...MPI process terminated unexpectedly
PMGR_COLLECTIVE ERROR: unexpected value: received 1, expecting 7 @ file pmgr_collective_mpispawn.c:144
MPI process terminated unexpectedly
DONE

Any idea why this is happening?
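
For what it's worth, a bare-bones MPI-only program along the lines of the
sketch below (no MKL or ScaLAPACK at all), launched with the same mpirun
command, should at least show whether the launcher and the PMGR layer are
healthy on their own; if it runs cleanly with -np 36, the problem
presumably lives in the PBLAS/ScaLAPACK calls.

#include <stdio.h>
#include "mpi.h"

/* minimal MPI-only sanity check: every rank contributes its rank number
 * to an allreduce, so a clean run with -np 36 should report a sum of 630 */
int main( int argc, char* argv[] )
{
    int rank, nprocs, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("allreduce over %d ranks gave %d\n", nprocs, sum);
    MPI_Finalize();
    return 0;
}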

Best,
A

On Oct 10, 2009, at 8:32 PM, Dhabaleswar Panda panda at cse.ohio-state.edu wrote:

> Good to know that the 1.1 branch version fixes the problem. Let us know
> the outcome once you have done more extensive experiments.
>
> Thanks,
>
> DK
>
> On Sat, 10 Oct 2009, Antonio Molins wrote:
>
>> Thanks so much, DK!
>>
>> I was actually using MVAPICH 1.1, the last version available on the
>> webpage, not the nightly build. I have been trying the one you sent
>> me, and it seems to fix the problem. I will tell you more once I have
>> tried it more thoroughly.
>>
>> Best,
>> A
>>
>> On Oct 9, 2009, at 9:16 PM, Dhabaleswar Panda wrote:
>>
>>> Thanks for the report. Your note indicates that you are using the
>>> MVAPICH 1.0 version. This is a very old version (released in February
>>> '08). Can you check whether this happens with the latest MVAPICH 1.1
>>> version? Please use the branch version obtained from the following
>>> URL. It has all the bug-fixes since the 1.1 release.
>>>
>>> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/
>>>
>>> In the Changelog
>>> (http://mvapich.cse.ohio-state.edu/download/mvapich/changes.shtml), I
>>> see that a fix for a segmentation fault while using BLACS was made in
>>> the 1.0.1 version.
>>>
>>> Let us know whether the problem persists with this latest 1.1 branch
>>> version and we will take a look at this in-depth.
>>>
>>> Thanks,
>>>
>>> DK
>>>
>>> On Fri, 9 Oct 2009, Antonio Molins wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> I am new to the list, and have been programming in MPI for a year
>>>> or so. I am running into what I think is a bug in either MVAPICH or
>>>> MKL that is totally driving me nuts :(
>>>>
>>>> The following code has reliably produced a crash using MKL 10.2
>>>> update 2 (both the sequential and the threaded versions) and the
>>>> latest revision of MVAPICH, on two different clusters. Can anybody
>>>> tell me what the problem is here? It does not always crash, but it
>>>> does crash when the right combination of number of MPI processes and
>>>> matrix size is selected. Sometimes it does not crash but instead
>>>> slows down by several orders of magnitude, even though the actual
>>>> data being processed is exactly the same.
>>>>
>>>> A
>>>>
>>>> /*
>>>>  * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on
>>>>  * 64-bit Linux, linked against either the serial or the threaded
>>>>  * MKL libraries, when run as: mpirun -np 36 crash 5000 10
>>>>  */
>>>>
>>>>
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>> #include <math.h>
>>>> #include "mpi.h"
>>>> #include "mkl_scalapack.h"
>>>>
>>>>
>>>>
>>>> extern "C" {
>>>> /* BLACS C interface */
>>>> void Cblacs_get( int context, int request, int* value );
>>>> int Cblacs_gridinit( int* context, char* order, int np_row, int np_col );
>>>> void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* my_row, int* my_col );
>>>> int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs );
>>>> /* PBLAS */
>>>> void pdgemm_( char *TRANSA, char *TRANSB, int *M, int *N, int *K, double *ALPHA,
>>>>               double *A, int *IA, int *JA, int *DESCA,
>>>>               double *B, int *IB, int *JB, int *DESCB,
>>>>               double *BETA, double *C, int *IC, int *JC, int *DESCC );
>>>> }
>>>>
>>>>
>>>>
>>>> #define BLOCK_SIZE 65
>>>>
>>>>
>>>>
>>>> int main( int argc, char* argv[] )
>>>> {
>>>> int iam, nprocs;
>>>> MPI_Init(&argc,&argv);       /* starts MPI */
>>>> MPI_Comm_rank(MPI_COMM_WORLD, &iam);
>>>> MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>
>>>>
>>>> // get done with the ones that are not part of the grid
>>>> int blacs_pgrid_size = floor(sqrt(nprocs));
>>>> if (iam>=blacs_pgrid_size*blacs_pgrid_size) {
>>>> printf("Bye bye world from process %d of %d. BLACS had no place for me...\n",iam,nprocs);
>>>> MPI_Finalize();
>>>> return 0; // without this return, leftover ranks would fall through into the BLACS calls below
>>>> }
>>>>
>>>>
>>>> // start BLACS with square processor grid
>>>> if(iam==0)
>>>> printf("starting BLACS...");
>>>> int ictxt,nprow,npcol,myrow,mycol;
>>>> Cblacs_get( -1, 0, &ictxt );
>>>> Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size );
>>>> Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );
>>>> if(iam==0)
>>>> printf("done.\n");
>>>>
>>>>
>>>> double timing;
>>>> int m,n,k,lm,ln,nbm,nbn,rounds;
>>>> int myzero=0,myone=1;
>>>> sscanf(argv[1],"%d",&m);
>>>> n=m;
>>>> k=m;
>>>> sscanf(argv[2],"%d",&rounds);
>>>>
>>>>
>>>>
>>>> nbm = BLOCK_SIZE;
>>>> nbn = BLOCK_SIZE;
>>>> lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow);
>>>> ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol);
>>>>
>>>>
>>>> int info;
>>>> int *ipiv = new int[lm+nbm+10000000]; // adding a "little" bit of extra space just in case
>>>> char ta = 'N',tb = 'T';
>>>> double alpha = 1.0, beta = 0.0;
>>>>
>>>>
>>>>
>>>> double* test1data = new double[lm*ln];
>>>> double* test2data = new double[lm*ln];
>>>> double* test3data = new double[lm*ln];
>>>>
>>>>
>>>> for(int i=0;i<lm*ln;i++)
>>>> test1data[i]=(double)(rand()%100)/10000.0;
>>>>
>>>>
>>>> int *test1desc  = new int[9];
>>>> int *test2desc  = new int[9];
>>>> int *test3desc  = new int[9];
>>>>
>>>>
>>>> test1desc[0] = 1; // descriptor type
>>>> test1desc[1] = ictxt; // blacs context
>>>> test1desc[2] = m; // global number of rows
>>>> test1desc[3] = n; // global number of columns
>>>> test1desc[4] = nbm; // row block size
>>>> test1desc[5] = nbn; // column block size (defined equal to the row block size)
>>>> test1desc[6] = 0; // initial process row (defined 0)
>>>> test1desc[7] = 0; // initial process column (defined 0)
>>>> test1desc[8] = lm; // leading dimension of local array
>>>>
>>>>
>>>> memcpy(test2desc,test1desc,9*sizeof(int));
>>>> memcpy(test3desc,test1desc,9*sizeof(int));
>>>>
>>>>
>>>>
>>>> for(int iter=0;iter<rounds;iter++)
>>>> {
>>>> if(iam==0)
>>>> printf("iter %i - ",iter);
>>>> //test2 = test1
>>>> memcpy(test2data,test1data,lm*ln*sizeof(double));
>>>> //test3 = test1*test2
>>>> timing=MPI_Wtime();
>>>> pdgemm_(&ta,&tb,&m,&n,&k,
>>>> &alpha,
>>>> test1data,&myone,&myone,test1desc,
>>>> test2data,&myone,&myone, test2desc,
>>>> &beta,
>>>> test3data,&myone,&myone, test3desc);
>>>> if(iam==0)
>>>> printf(" PDGEMM = %f |",MPI_Wtime()-timing);
>>>> //test3 = LU(test3)
>>>> timing=MPI_Wtime();
>>>> pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, &info);
>>>> if(iam==0)
>>>> printf(" PDGETRF = %f.\n",MPI_Wtime()-timing);
>>>> }
>>>> delete[] ipiv;
>>>> // note: "delete[] a, b, c;" only frees the first array (comma operator), so free each one explicitly
>>>> delete[] test1data; delete[] test2data; delete[] test3data;
>>>> delete[] test1desc; delete[] test2desc; delete[] test3desc;
>>>>
>>>>
>>>> MPI_Finalize();
>>>> return 0;
>>>> }
>>>>
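>>>> (A side note on the descriptors: I fill them in by hand above. A
>>>> hypothetical equivalent using ScaLAPACK's descinit_ would look roughly
>>>> like the sketch below, reusing the same context, block sizes and local
>>>> row count; descinit_ also returns an info code, which makes it easy to
>>>> catch a malformed descriptor such as a bad local leading dimension.)
>>>>
>>>> // sketch only: a small hypothetical helper that builds a descriptor via
>>>> // descinit_, declared here like the other BLACS/ScaLAPACK routines above
>>>> extern "C" void descinit_( int *desc, int *m, int *n, int *mb, int *nb,
>>>>                            int *irsrc, int *icsrc, int *ictxt, int *lld, int *info );
>>>>
>>>> int make_desc( int *desc, int m, int n, int mb, int nb, int ictxt, int lld )
>>>> {
>>>>     int zero = 0, info = 0;
>>>>     if (lld < 1) lld = 1;   // ScaLAPACK requires lld >= max(1, local row count)
>>>>     descinit_( desc, &m, &n, &mb, &nb, &zero, &zero, &ictxt, &lld, &info );
>>>>     return info;            // nonzero means one of the descriptor arguments was bad
>>>> }
>>>>
>>>> // usage in main, instead of filling test1desc by hand:
>>>> //   int descinfo = make_desc( test1desc, m, n, nbm, nbn, ictxt, lm );
>>>> //   if (descinfo != 0) printf("rank %d: descinit_ info = %d\n", iam, descinfo);
>>>>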
>>>> --------------------------------------------------------------------------------
>>>> Antonio Molins, PhD Candidate
>>>> Medical Engineering and Medical Physics
>>>> Harvard - MIT Division of Health Sciences and Technology
>>>> --
>>>> "Y así del poco dormir y del mucho leer,
>>>> se le secó el cerebro de manera que vino
>>>> a perder el juicio".
>>>>                                      Miguel de Cervantes
>>>> --------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>> --------------------------------------------------------------------------------
>> Antonio Molins, PhD Candidate
>> Medical Engineering and Medical Physics
>> Harvard - MIT Division of Health Sciences and Technology
>> --
>> "Y así del poco dormir y del mucho leer,
>> se le secó el cerebro de manera que vino
>> a perder el juicio".
>>                                        Miguel de Cervantes
>> --------------------------------------------------------------------------------
>>
>>
>>
>>
>

--------------------------------------------------------------------------------
Antonio Molins, PhD Candidate
Medical Engineering and Medical Physics
Harvard - MIT Division of Health Sciences and Technology
--
"Far better an approximate answer to the right question,
which is often vague,
than an exact answer to the wrong question,
which can always be made precise."

						John W. Tukey, 1962
--------------------------------------------------------------------------------



