[mvapich-discuss] MVAPICH problem in MPI_Finalize

Pavel Shamis (Pasha) pasha at dev.mellanox.co.il
Tue Jul 17 02:58:09 EDT 2007


Amith,
Please commit the patch to the 0.9.9 branch. (I would like to have it in
a future OFED bugfix release.)

Regards,
Pasha

Mark Potts wrote:
> Amith,
>    The patch seems to do the job.  I can no longer induce any
>    MPI_Finalize() seg faults in big jobs.
>    Thanks.  We'll roll your patch into our builds.
>         regards,
>
> amith rajith mamidala wrote:
>> Hi Mark,
>>
>> Attached is the patch which should resolve the issue. Can you please try
>> it out and let us know if it works?
>>
>> thanks,
>>
>> -Amith.
>>
>>
>> On Wed, 11 Jul 2007, Mark Potts wrote:
>>
>>> Hi,
>>>     I've finally tracked down an intermittent problem that causes
>>>     MVAPICH processes to generate segmentation faults during their
>>>     shutdown.  It seems to happen only on fairly large jobs on a
>>>     256-node cluster (8-32 cores/node).  The following is the
>>>     backtrace from the core file of one of the failed processes from
>>>     a purposely simple program (simpleprint_c).  This particular job
>>>     ran with 1024 processes.  We are using ch_gen2 MVAPICH 0.9.9
>>>     singlerail with _SMP turned on.  This segmentation fault occurs
>>>     across a host of different programs, but never on all processes,
>>>     and seemingly at random from one run to the next.
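>>>
>>>     (For reference, simpleprint_c is about as minimal as an MPI
>>>     program gets.  Its exact source isn't reproduced here, but it is
>>>     roughly the following sketch; the names are illustrative:)
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> /* Hypothetical stand-in for simpleprint_c: each rank just prints its
>>>  * identity and exits.  The seg fault shows up inside MPI_Finalize(). */
>>> int main(int argc, char **argv)
>>> {
>>>     int rank, size;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     printf("Hello from rank %d of %d\n", rank, size);
>>>     MPI_Finalize();        /* crashes here on a few ranks, at random */
>>>     return 0;
>>> }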
>>>
>>>     From the core dump, the seg fault is triggered by the call to
>>>     MPI_Finalize() but ultimately occurs in the free() function of
>>>     ptmalloc2/malloc.c.
>>>     From some cursory code examination, it appears the error is hit
>>>     when trying to unmap a memory segment.  Since the seg fault
>>>     occurs seemingly at random, is this perhaps a timing issue in
>>>     which processes within an SMP node get confused about who should
>>>     be unmapping/freeing memory?
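>>>
>>>     (To make that suspicion concrete: below is a contrived,
>>>     stand-alone illustration of the failure mode, nothing
>>>     MVAPICH-specific and all names made up, in which the same
>>>     shared-memory bookkeeping is freed from two places.  Freeing
>>>     stale heap memory like this is undefined behaviour that
>>>     typically aborts or segfaults inside the allocator, much like
>>>     the garbage "mem=" address in the backtrace further down.)
>>>
>>> #include <stdlib.h>
>>>
>>> /* Contrived illustration only; not MVAPICH code. */
>>> struct coll_bookkeeping {
>>>     void *shmem_state;          /* per-communicator shared-memory state */
>>> };
>>>
>>> static void teardown(struct coll_bookkeeping *bk)
>>> {
>>>     free(bk->shmem_state);      /* first caller frees legitimately...    */
>>>     /* ...but the pointer is never cleared, so a second caller that
>>>      * reaches this path frees stale heap memory and corrupts the
>>>      * allocator's bookkeeping.                                          */
>>> }
>>>
>>> int main(void)
>>> {
>>>     struct coll_bookkeeping bk;
>>>     bk.shmem_state = malloc(4096);
>>>     teardown(&bk);              /* e.g. while freeing the communicator    */
>>>     teardown(&bk);              /* e.g. while freeing its collective comm */
>>>     return 0;
>>> }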
>>>
>>>
>>> gdb simpleprint_c core.9334
>>> :
>>> :
>>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>> 3455    ptmalloc2/malloc.c: No such file or directory.
>>>          in ptmalloc2/malloc.c
>>> (gdb) bt
>>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
>>> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
>>> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
>>> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
>>> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
>>> (gdb)
>>>
>>>         regards,
>>> -- 
>>> ***********************************
>>>  >> Mark J. Potts, PhD
>>>  >>
>>>  >> HPC Applications Inc.
>>>  >> phone: 410-992-8360 Bus
>>>  >>        410-313-9318 Home
>>>  >>        443-418-4375 Cell
>>>  >> email: potts at hpcapplications.com
>>>  >>        potts at excray.com
>>> ***********************************
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> Index: comm_free.c
>>> ===================================================================
>>> --- comm_free.c    (revision 1380)
>>> +++ comm_free.c    (working copy)
>>> @@ -59,6 +59,9 @@
>>>  #define DBG(a)
>>>  #define OUTFILE stdout
>>>  
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> +int flag = 0;
>>> +#endif
>>>  extern int enable_rdma_collectives;
>>>  #ifdef _SMP_
>>>  extern int enable_shmem_collectives;
>>> @@ -183,7 +186,7 @@
>>>  #endif
>>>  
>>>  #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> -        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
>>> +        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
>>>              free_2level_comm(comm);
>>>          }
>>>  #endif
>>> @@ -214,7 +217,15 @@
>>>      /* Free collective communicator (unless it refers back to myself) */
>>>      if ( comm->comm_coll != comm ) {
>>>          MPI_Comm ctmp = comm->comm_coll->self;
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> +        if (comm->self == MPI_COMM_SELF){
>>> +            flag = 1;
>>> +        }
>>> +#endif
>>>          MPI_Comm_free ( &ctmp );
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> +        flag = 0;
>>> +#endif
>>>      }
>>>  
>>>      /* Put this after freeing the collective comm because it may have
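>>>
>>> (Distilled, the patch appears to work as follows: a file-scope flag is
>>> raised around the recursive MPI_Comm_free() of the collective
>>> communicator when the outer communicator is MPI_COMM_SELF, so that the
>>> nested call skips free_2level_comm() and the shared-memory teardown is
>>> not attempted a second time.  A stand-alone sketch of that guard
>>> pattern, with illustrative names rather than MVAPICH's real data
>>> structures:)
>>>
>>> #include <stdio.h>
>>>
>>> struct comm {
>>>     struct comm *comm_coll;        /* associated collective communicator */
>>> };
>>>
>>> static int flag = 0;               /* mirrors the patch's file-scope flag */
>>>
>>> static void free_2level_comm_sketch(struct comm *c)
>>> {
>>>     printf("tearing down 2-level/shmem state of %p\n", (void *) c);
>>> }
>>>
>>> static void comm_free_sketch(struct comm *c)
>>> {
>>>     /* First hunk: tear down the shmem collective state only when the
>>>      * guard flag is not raised.                                        */
>>>     if (c->comm_coll == c) {
>>>         if (!flag)
>>>             free_2level_comm_sketch(c);
>>>         else
>>>             printf("guard set, skipping teardown of %p\n", (void *) c);
>>>     }
>>>
>>>     /* Second hunk: raise the flag before recursively freeing the
>>>      * collective communicator (the real patch raises it only when the
>>>      * outer comm is MPI_COMM_SELF), so the nested call does not repeat
>>>      * the teardown that was crashing in ptmalloc.                      */
>>>     if (c->comm_coll != c) {
>>>         flag = 1;
>>>         comm_free_sketch(c->comm_coll);
>>>         flag = 0;
>>>     }
>>> }
>>>
>>> int main(void)
>>> {
>>>     struct comm coll = { &coll };  /* collective comm refers to itself  */
>>>     struct comm self = { &coll };  /* stand-in for MPI_COMM_SELF        */
>>>     comm_free_sketch(&self);       /* without the flag, the teardown    */
>>>     return 0;                      /* would also run on 'coll'          */
>>> }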
>


