[mvapich-discuss] MVAPICH problem in MPI_Finalize

Mark Potts potts at hpcapplications.com
Tue Jul 17 00:06:46 EDT 2007


Amith,
    The patch seems to do the job.  I can no longer induce any
    MPI_Finalize() seg faults in big jobs.
    Thanks.  We'll roll your patch into our builds.
         regards,

amith rajith mamidala wrote:
> Hi Mark,
> 
> Attached is the patch which should resolve the issue. Can you please try
> this out and let us know if it works?
> 
> thanks,
> 
> -Amith.
> 
> 
> On Wed, 11 Jul 2007, Mark Potts wrote:
> 
>> Hi,
>>     I've finally tracked down an intermittent problem that causes
>>     MVAPICH processes to generate segmentation faults during shutdown.
>>     It seems to happen only on fairly large jobs on a 256-node cluster
>>     (8-32 cores/node).  Below is the backtrace from the core file of
>>     one of the failed processes from a purposely simple program
>>     (simpleprint_c).  This particular job ran with 1024 processes.
>>     We are using ch_gen2 MVAPICH 0.9.9, single rail, with _SMP turned
>>     on.  The segmentation fault occurs across a host of different
>>     programs, but never on all processes, and seemingly at random
>>     from one run to the next.
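>>
>>     For reference, the test program is nothing more elaborate than the
>>     usual init/print/finalize sequence, something along these lines
>>     (not the exact source, but it captures everything the test does):
>>
>>     #include <stdio.h>
>>     #include <mpi.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank, size;
>>
>>         MPI_Init(&argc, &argv);                /* set up the MPI job      */
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank     */
>>         MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count     */
>>         printf("rank %d of %d checking in\n", rank, size);
>>         MPI_Finalize();                        /* seg fault shows up here */
>>         return 0;
>>     }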
>>
>>     From the core dump, the seg fault is triggered by the call to
>>     MPI_Finalize() but ultimately occurs in the free() function of
>>     ptmalloc2/malloc.c.  Some cursory code examination suggests the
>>     error is hit while trying to unmap a memory segment.  Since the
>>     seg fault shows up seemingly at random, could this be a timing
>>     issue in which processes within an SMP node get confused about
>>     who should be unmapping/freeing memory?
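>>
>>     Whatever the trigger, the symptom is free() being handed a pointer
>>     it no longer (or never did) own.  As a stripped-down illustration
>>     (not MVAPICH code, just the shape of the failure), releasing the
>>     same buffer twice is enough to produce this sort of crash inside
>>     ptmalloc2:
>>
>>     #include <stdlib.h>
>>     #include <string.h>
>>
>>     int main(void)
>>     {
>>         char *buf = malloc(1 << 20);  /* stand-in for a per-communicator
>>                                          shared buffer */
>>         memset(buf, 1, 1 << 20);
>>         free(buf);                    /* first release: fine */
>>         free(buf);                    /* second release corrupts the
>>                                          allocator's bookkeeping; the
>>                                          crash can surface later in an
>>                                          apparently unrelated free() */
>>         return 0;
>>     }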
>>
>>
>> gdb simpleprint_c core.9334
>> :
>> :
>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>> 3455    ptmalloc2/malloc.c: No such file or directory.
>>          in ptmalloc2/malloc.c
>> (gdb) bt
>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at
>> create_2level_comm.c:49
>> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at
>> comm_free.c:187
>> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at
>> comm_free.c:217
>> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
>> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
>> (gdb)
>>
>>         regards,
>>
>> ------------------------------------------------------------------------
>>
>> Index: comm_free.c
>> ===================================================================
>> --- comm_free.c	(revision 1380)
>> +++ comm_free.c	(working copy)
>> @@ -59,6 +59,9 @@
>>  #define DBG(a) 
>>  #define OUTFILE stdout
>>  
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> +int flag = 0;
>> +#endif
>>  extern int enable_rdma_collectives;
>>  #ifdef _SMP_
>>  extern int enable_shmem_collectives;
>> @@ -183,7 +186,7 @@
>>  #endif
>>  
>>  #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> -        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
>> +        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
>>              free_2level_comm(comm);
>>          }
>>  #endif
>> @@ -214,7 +217,15 @@
>>  	/* Free collective communicator (unless it refers back to myself) */
>>  	if ( comm->comm_coll != comm ) {
>>  	    MPI_Comm ctmp = comm->comm_coll->self;
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> +        if (comm->self == MPI_COMM_SELF){
>> +            flag = 1;
>> +        }
>> +#endif
>>  	    MPI_Comm_free ( &ctmp );
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> +        flag = 0;
>> +#endif
>>  	}
>>  
>>  	/* Put this after freeing the collective comm because it may have

-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

