[mvapich-discuss] MVAPICH problem in MPI_Finalize
Pavel Shamis (Pasha)
pasha at dev.mellanox.co.il
Tue Jul 17 02:58:09 EDT 2007
Amith,
Please commit the patch to the 0.9.9 branch. (I would like to have it in
a future OFED bugfix release.)
Regards,
Pasha
Mark Potts wrote:
> Amith,
> The patch seems to do the job. I can no longer induce any
> MPI_Finalize() seg faults in big jobs.
> Thanks. We'll roll your patch into our builds.
> regards,
>
> amith rajith mamidala wrote:
>> Hi Mark,
>>
>> Attached is the patch that should resolve the issue. Can you please try
>> it out and let us know if it works?
>>
>> thanks,
>>
>> -Amith.
>>
>>
>> On Wed, 11 Jul 2007, Mark Potts wrote:
>>
>>> Hi,
>>> I've finally tracked down an intermittent problem that causes MVAPICH
>>> processes to generate segmentation faults during their shutdown.
>>> It seems to happen only on fairly large jobs on a 256-node cluster
>>> (8-32 cores/node). The following is the backtrace from the core
>>> file of one of the failed processes from a purposely simple program
>>> (simpleprint_c). This particular job ran with 1024 processes.
>>> We are using ch_gen2 MVAPICH 0.9.9 single-rail with _SMP_ turned on.
>>> This segmentation fault occurs across a host of different programs,
>>> but never on all processes, and seemingly at random from one run
>>> to the next.
>>>
>>> From the core dump, the seg fault occurs as a result of the call
>>> to MPI_Finalize(), but ultimately lies in the free() function of
>>> ptmalloc2/malloc.c.
>>> From some cursory code examination, it appears that the error
>>> is hit while trying to unmap a memory segment. Since the
>>> seg fault occurs seemingly at random, is this perhaps a
>>> timing issue in which processes within an SMP node get confused
>>> about who should be unmapping/freeing memory?
>>>
>>>
>>> gdb simpleprint_c core.9334
>>> :
>>> :
>>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>> 3455 ptmalloc2/malloc.c: No such file or directory.
>>> in ptmalloc2/malloc.c
>>> (gdb) bt
>>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>> #1 0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
>>> #2 0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
>>> #3 0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
>>> #4 0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
>>> #5 0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
>>> (gdb)
>>>
>>> regards,
>>> --
>>> ***********************************
>>> Mark J. Potts, PhD
>>>
>>> HPC Applications Inc.
>>> phone: 410-992-8360 Bus
>>>        410-313-9318 Home
>>>        443-418-4375 Cell
>>> email: potts at hpcapplications.com
>>>        potts at excray.com
>>> ***********************************
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> Index: comm_free.c
>>> ===================================================================
>>> --- comm_free.c (revision 1380)
>>> +++ comm_free.c (working copy)
>>> @@ -59,6 +59,9 @@
>>>  #define DBG(a)
>>>  #define OUTFILE stdout
>>>
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> +int flag = 0;
>>> +#endif
>>> extern int enable_rdma_collectives;
>>> #ifdef _SMP_
>>> extern int enable_shmem_collectives;
>>> @@ -183,7 +186,7 @@
>>> #endif
>>>
>>> #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> -    if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
>>> +    if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
>>> free_2level_comm(comm);
>>> }
>>> #endif
>>> @@ -214,7 +217,15 @@
>>>          /* Free collective communicator (unless it refers back to myself) */
>>> if ( comm->comm_coll != comm ) {
>>> MPI_Comm ctmp = comm->comm_coll->self;
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> + if (comm->self == MPI_COMM_SELF){
>>> + flag = 1;
>>> + }
>>> +#endif
>>> MPI_Comm_free ( &ctmp );
>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>> + flag = 0;
>>> +#endif
>>> }
>>>
>>> /* Put this after freeing the collective comm because it may have
>
More information about the mvapich-discuss mailing list