[mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Oct 15 18:00:21 EDT 2009


On Wed, Oct 14, 2009 at 05:54:59PM -0400, Antonio Molins wrote:
> After some more testing, I found the same problem again. I am now using
> MKL 10.2.2.025 and the version of MVAPICH 1.1 that you sent me (that is,
> the mvapich-1.1-2009-10-09 nightly build). If you run the same code I
> provided:
> 
> % mpirun -np 36 ~/shadowfax/EMsmooth_work/bin/crash 10000 1
> Exit code -5 signaled from compute-0-71
> Killing remote processes...MPI process terminated unexpectedly
> PMGR_COLLECTIVE ERROR: unexpected value: received 1, expecting 7 @ file pmgr_collective_mpispawn.c:144
> MPI process terminated unexpectedly
> DONE
> 
> Any idea why this is happening?

Not right now.  Is this segfaulting?  See if you can gather a backtrace
(you can send it to me offline).
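
For example, something along these lines should work (a rough sketch;
whether core dumps are enabled on the compute nodes and what the core
file is named depend on your system configuration):

    % ulimit -c unlimited             # on the compute nodes, before the run
    % mpirun -np 36 ./crash 10000 1
    % gdb ./crash core                # the core file may be named core.<pid>
    (gdb) bt full

Building the application (and, ideally, the libraries) with -g, e.g.
mpicc/mpiCC -g, will make the backtrace much more informative.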

It may also be worthwhile to recompile with the option --enable-debug
added to the configure line of the make.mvapich.gen2 script, and to make
sure that debug symbols are available in your application and the other
libraries.  Hopefully this will give us a good starting point.
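
Roughly speaking, the change would look like this (a sketch only: the
"..." stands for whatever options the configure invocation in your copy
of make.mvapich.gen2 already passes, and the exact line may differ in
your version of the script):

    ./configure ... --enable-debug

then rebuild and reinstall the library and relink your application
against the debug build.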

> 
> Best,
> A
> 
> On Oct 10, 2009, at 8:32 PM, Dhabaleswar Panda panda at cse.ohio-state.edu wrote:
> 
> > Good to know that the 1.1 branch version fixes the problem. Let us  
> > know
> > the outcome once you have done more extensive experiments.
> >
> > Thanks,
> >
> > DK
> >
> > On Sat, 10 Oct 2009, Antonio Molins wrote:
> >
> >> Thanks so much, DK!
> >>
> >> I was actually using MVAPICH 1.1, the latest version available on the
> >> webpage, not the nightly build. I have been trying the one you sent
> >> me, and it seems to fix the problem. I will let you know once I have
> >> tried it more thoroughly.
> >>
> >> Best,
> >> A
> >>
> >> On Oct 9, 2009, at 9:16 PM, Dhabaleswar Panda wrote:
> >>
> >>> Thanks for the report. Your note indicates that you are using the
> >>> MVAPICH 1.0 version. This is a very old version (released in
> >>> February '08). Can you check whether this happens with the latest
> >>> MVAPICH 1.1 version? Please use the branch version obtained from the
> >>> following URL; it has all the bug-fixes since the 1.1 release.
> >>>
> >>> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/
> >>>
> >>> In the Changelog
> >>> (http://mvapich.cse.ohio-state.edu/download/mvapich/changes.shtml), I
> >>> see that a fix related to a segmentation fault while using BLACS was
> >>> made for the 1.0.1 version.
> >>>
> >>> Let us know whether the problem persists with this latest 1.1 branch
> >>> version and we will take a look at this in-depth.
> >>>
> >>> Thanks,
> >>>
> >>> DK
> >>>
> >>> On Fri, 9 Oct 2009, Antonio Molins wrote:
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> I am new to the list and have been programming in MPI for a year
> >>>> or so. I am running into what I think is a bug in either MVAPICH or
> >>>> MKL that is totally driving me nuts :(
> >>>>
> >>>> The following code has reliably produced a crash with MKL 10.2
> >>>> update 2 (both the sequential and the threaded version) and the
> >>>> latest revision of MVAPICH, on two different clusters. Can anybody
> >>>> tell me what the problem is here? It does not always crash, but it
> >>>> does crash when the right number of MPI processes and matrix sizes
> >>>> are selected. Sometimes it does not crash but instead slows down by
> >>>> several orders of magnitude, even though the actual data being
> >>>> processed is exactly the same.
> >>>>
> >>>> A
> >>>>
> >>>> /*
> >>>>  * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on Linux 64-bit,
> >>>>  * linked with either the serial or threaded libraries,
> >>>>  * when run as: mpirun -np 36 crash 5000 10
> >>>>  */
> >>>>
> >>>>
> >>>>
> >>>> #include <stdio.h>
> >>>> #include <stdlib.h>
> >>>> #include <string.h>
> >>>> #include <math.h>
> >>>> #include "mpi.h"
> >>>> #include "mkl_scalapack.h"
> >>>>
> >>>>
> >>>>
> >>>> extern "C" {
> >>>> /* BLACS C interface */
> >>>> void Cblacs_get( int context, int request, int* value );
> >>>> int Cblacs_gridinit( int* context, char* order, int np_row, int np_col );
> >>>> void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* my_row, int* my_col );
> >>>> int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs );
> >>>> /* PBLAS */
> >>>> void pdgemm_( char *TRANSA, char *TRANSB, int *M, int *N, int *K, double *ALPHA,
> >>>>               double *A, int *IA, int *JA, int *DESCA, double *B, int *IB, int *JB, int *DESCB,
> >>>>               double *BETA, double *C, int *IC, int *JC, int *DESCC );
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>> #define BLOCK_SIZE 65
> >>>>
> >>>>
> >>>>
> >>>> int main( int argc, char* argv[] )
> >>>> {
> >>>> int iam, nprocs;
> >>>> MPI_Init(&argc,&argv);       /* starts MPI */
> >>>> MPI_Comm_rank(MPI_COMM_WORLD, &iam);
> >>>> MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >>>>
> >>>>
> >>>> // dismiss the processes that are not part of the square grid
> >>>> int blacs_pgrid_size = (int)floor(sqrt((double)nprocs));
> >>>> if (iam>=blacs_pgrid_size*blacs_pgrid_size) {
> >>>> printf("Bye bye world from process %d of %d. BLACS had no place for me...\n",iam,nprocs);
> >>>> MPI_Finalize();
> >>>> return 0; // without this, the excluded ranks would fall through into the BLACS calls below
> >>>> }
> >>>>
> >>>>
> >>>> // start BLACS with square processor grid
> >>>> if(iam==0)
> >>>> printf("starting BLACS...");
> >>>> int ictxt,nprow,npcol,myrow,mycol;
> >>>> Cblacs_get( -1, 0, &ictxt );
> >>>> Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size );
> >>>> Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );
> >>>> if(iam==0)
> >>>> printf("done.\n");
> >>>>
> >>>>
> >>>> double timing;
> >>>> int m,n,k,lm,ln,nbm,nbn,rounds;
> >>>> int myzero=0,myone=1;
> >>>> sscanf(argv[1],"%d",&m);
> >>>> n=m;
> >>>> k=m;
> >>>> sscanf(argv[2],"%d",&rounds);
> >>>>
> >>>>
> >>>>
> >>>> nbm = BLOCK_SIZE;
> >>>> nbn = BLOCK_SIZE;
> >>>> lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow);
> >>>> ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol);
> >>>>
> >>>>
> >>>> int info;
> >>>> int *ipiv = new int[lm+nbm+10000000]; // adding a "little" bit of extra space just in case
> >>>> char ta = 'N',tb = 'T';
> >>>> double alpha = 1.0, beta = 0.0;
> >>>>
> >>>>
> >>>>
> >>>> double* test1data = new double[lm*ln];
> >>>> double* test2data = new double[lm*ln];
> >>>> double* test3data = new double[lm*ln];
> >>>>
> >>>>
> >>>> for(int i=0;i<lm*ln;i++)
> >>>> test1data[i]=(double)(rand()%100)/10000.0;
> >>>>
> >>>>
> >>>> int *test1desc  = new int[9];
> >>>> int *test2desc  = new int[9];
> >>>> int *test3desc  = new int[9];
> >>>>
> >>>>
> >>>> test1desc[0] = 1; // descriptor type
> >>>> test1desc[1] = ictxt; // BLACS context
> >>>> test1desc[2] = m; // global number of rows
> >>>> test1desc[3] = n; // global number of columns
> >>>> test1desc[4] = nbm; // row block size
> >>>> test1desc[5] = nbn; // column block size (defined equal to the row block size)
> >>>> test1desc[6] = 0; // initial process row (defined 0)
> >>>> test1desc[7] = 0; // initial process column (defined 0)
> >>>> test1desc[8] = lm; // leading dimension of local array
> >>>>
> >>>>
> >>>> memcpy(test2desc,test1desc,9*sizeof(int));
> >>>> memcpy(test3desc,test1desc,9*sizeof(int));
> >>>>
> >>>>
> >>>>
> >>>> for(int iter=0;iter<rounds;iter++)
> >>>> {
> >>>> if(iam==0)
> >>>> printf("iter %i - ",iter);
> >>>> //test2 = test1
> >>>> memcpy(test2data,test1data,lm*ln*sizeof(double));
> >>>> //test3 = test1*test2
> >>>> timing=MPI_Wtime();
> >>>> pdgemm_(&ta,&tb,&m,&n,&k,
> >>>> &alpha,
> >>>> test1data,&myone,&myone,test1desc,
> >>>> test2data,&myone,&myone, test2desc,
> >>>> &beta,
> >>>> test3data,&myone,&myone, test3desc);
> >>>> if(iam==0)
> >>>> printf(" PDGEMM = %f |",MPI_Wtime()-timing);
> >>>> //test3 = LU(test3)
> >>>> timing=MPI_Wtime();
> >>>> pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, &info);
> >>>> if(iam==0)
> >>>> printf(" PDGETRF = %f.\n",MPI_Wtime()-timing);
> >>>> }
> >>>> delete[] ipiv;
> >>>> // note: "delete[] a, b, c;" only frees the first array (comma operator),
> >>>> // so each array has to be freed with its own delete[]
> >>>> delete[] test1data; delete[] test2data; delete[] test3data;
> >>>> delete[] test1desc; delete[] test2desc; delete[] test3desc;
> >>>>
> >>>>
> >>>> MPI_Finalize();
> >>>> return 0;
> >>>> }
> >>>>
> >>>> --------------------------------------------------------------------------------
> >>>> Antonio Molins, PhD Candidate
> >>>> Medical Engineering and Medical Physics
> >>>> Harvard - MIT Division of Health Sciences and Technology
> >>>> --
> >>>> "Y así del poco dormir y del mucho leer,
> >>>> se le secó el cerebro de manera que vino
> >>>> a perder el juicio".
> >>>>                                      Miguel de Cervantes
> >>>> --------------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >> --------------------------------------------------------------------------------
> >> Antonio Molins, PhD Candidate
> >> Medical Engineering and Medical Physics
> >> Harvard - MIT Division of Health Sciences and Technology
> >> --
> >> "Y así del poco dormir y del mucho leer,
> >> se le secó el cerebro de manera que vino
> >> a perder el juicio".
> >>                                        Miguel de Cervantes
> >> --------------------------------------------------------------------------------
> >>
> >>
> >>
> >>
> >
> 
> --------------------------------------------------------------------------------
> Antonio Molins, PhD Candidate
> Medical Engineering and Medical Physics
> Harvard - MIT Division of Health Sciences and Technology
> --
> "Far better an approximate answer to the right question,
> which is often vague,
> than an exact answer to the wrong question,
> which can always be made precise. "
> 
> 						John W. Tukey, 1962
> --------------------------------------------------------------------------------
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo