[mvapich-discuss] MVAPICH problem in MPI_Finalize
Pavel Shamis (Pasha)
pasha at dev.mellanox.co.il
Tue Jul 17 12:47:03 EDT 2007
Thanks!
amith rajith mamidala wrote:
> Pasha,
>
> I checked in the patch to the 0.9.9 branch.
>
> thanks,
>
> -Amith
>
> On Tue, 17 Jul 2007, Pavel Shamis (Pasha) wrote:
>
>
>> Amith,
>> Please commit the patch to the 0.9.9 branch. (I would like to have it in
>> a future OFED bugfix release.)
>>
>> Regards,
>> Pasha
>>
>> Mark Potts wrote:
>>
>>> Amith,
>>> The patch seems to do the job. I can no longer induce any
>>> MPI_Finalize() seg faults in big jobs.
>>> Thanks. We'll roll your patch into our builds.
>>> regards,
>>>
>>> amith rajith mamidala wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> Attached is the patch that should resolve the issue. Can you please try
>>>> this out and let us know if it works?
>>>>
>>>> thanks,
>>>>
>>>> -Amith.
>>>>
>>>>
>>>> On Wed, 11 Jul 2007, Mark Potts wrote:
>>>>
>>>>
>>>>> Hi,
>>>>> I've finally tracked down an intermittent problem that causes MVAPICH
>>>>> processes to generate segmentation faults during their shutdown.
>>>>> It seems to happen only on fairly large jobs on a 256-node cluster
>>>>> (8-32 cores/node). The following is the backtrace from the core
>>>>> file of one of the failed processes of a deliberately simple program
>>>>> (simpleprint_c). This particular job ran with 1024 processes.
>>>>> We are using ch_gen2 MVAPICH 0.9.9 single-rail with _SMP_ turned on.
>>>>> This segmentation fault occurs across a host of different programs,
>>>>> but never on all processes, and seemingly at random from one run to
>>>>> the next.
>>>>>
>>>>> From the core dump, the seg fault is triggered by the call
>>>>> to MPI_Finalize() but ultimately occurs in the free() function of
>>>>> ptmalloc2/malloc.c.
>>>>> From some cursory code examination it appears that the error
>>>>> is hit when trying to unmap a memory segment. Since the
>>>>> seg fault occurrence is seemingly random, is this perhaps a
>>>>> timing issue in which processes within an SMP node get confused
>>>>> about who should be unmapping/freeing memory?
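Nothing unusual in the application is needed to hit this: the backtrace that follows goes from main() straight into MPI_Finalize(). The actual source of simpleprint_c is not shown in the thread, but a stand-in as small as the sketch below (the rank/size printf is a guess) is representative:

/* simpleprint-style stand-in -- a guess at the program; the actual
 * simpleprint_c source is not part of this thread.  The crash happens
 * entirely inside communicator teardown in MPI_Finalize(), not in
 * anything the application itself does. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();        /* the reported segfault fires inside this call */
    return 0;
}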
>>>>>
>>>>>
>>>>> gdb simpleprint_c core.9334
>>>>> :
>>>>> :
>>>>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>>>> 3455 ptmalloc2/malloc.c: No such file or directory.
>>>>> in ptmalloc2/malloc.c
>>>>> (gdb) bt
>>>>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>>>>> #1 0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
>>>>> #2 0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
>>>>> #3 0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
>>>>> #4 0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
>>>>> #5 0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
>>>>> (gdb)
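Reading the trace: frame #3 is the outer PMPI_Comm_free() recursing at comm_free.c:217 to release the collective communicator, frame #2 is the nested PMPI_Comm_free() reaching the free_2level_comm() call site at comm_free.c:187, and frame #0 shows free() being handed a garbage pointer. One plausible model of that failure, with invented names rather than MVAPICH internals, is a recursive cleanup that releases the same shared-memory state twice:

/* double_free_sketch.c -- illustrative only; the struct and function names
 * are invented, not MVAPICH source.  It models one plausible reading of the
 * trace above: the nested communicator free releases shared-memory state
 * the outer free already released, so free() eventually sees a bad pointer. */
#include <stdlib.h>

struct comm {
    struct comm *comm_coll;    /* collective communicator (may differ from comm) */
    void        *shmem_state;  /* stands in for the 2-level shared-memory data   */
};

static void free_2level(struct comm *c)      /* free_2level_comm() analogue */
{
    free(c->shmem_state);                    /* a second call on the same block
                                                is where frame #0 blows up */
}

static void comm_free(struct comm *c)        /* unguarded MPI_Comm_free() path */
{
    free_2level(c);                          /* comm_free.c:187 analogue */
    if (c->comm_coll != c)
        comm_free(c->comm_coll);             /* comm_free.c:217 analogue */
}

int main(void)
{
    struct comm coll = { NULL, NULL };
    struct comm self = { &coll, malloc(64) };

    coll.comm_coll   = &coll;                /* collective comm refers to itself  */
    coll.shmem_state = self.shmem_state;     /* ...but shares the outer comm's
                                                shared-memory block */
    comm_free(&self);                        /* nested call double-frees the block */
    return 0;
}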
>>>>>
>>>>> regards,
>>>>> --
>>>>> ***********************************
>>>>> >> Mark J. Potts, PhD
>>>>> >>
>>>>> >> HPC Applications Inc.
>>>>> >> phone: 410-992-8360 Bus
>>>>> >> 410-313-9318 Home
>>>>> >> 443-418-4375 Cell
>>>>> >> email: potts at hpcapplications.com
>>>>> >> potts at excray.com
>>>>> ***********************************
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Index: comm_free.c
>>>>> ===================================================================
>>>>> --- comm_free.c (revision 1380)
>>>>> +++ comm_free.c (working copy)
>>>>> @@ -59,6 +59,9 @@
>>>>> #define DBG(a)
>>>>> #define OUTFILE stdout
>>>>>
>>>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>>>> +int flag = 0;
>>>>> +#endif
>>>>> extern int enable_rdma_collectives;
>>>>> #ifdef _SMP_
>>>>> extern int enable_shmem_collectives;
>>>>> @@ -183,7 +186,7 @@
>>>>> #endif
>>>>>
>>>>> #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>>>> - if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
>>>>> + if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
>>>>> free_2level_comm(comm);
>>>>> }
>>>>> #endif
>>>>> @@ -214,7 +217,15 @@
>>>>> /* Free collective communicator (unless it refers back to myself) */
>>>>> if ( comm->comm_coll != comm ) {
>>>>> MPI_Comm ctmp = comm->comm_coll->self;
>>>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>>>> + if (comm->self == MPI_COMM_SELF){
>>>>> + flag = 1;
>>>>> + }
>>>>> +#endif
>>>>> MPI_Comm_free ( &ctmp );
>>>>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>>>>> + flag = 0;
>>>>> +#endif
>>>>> }
>>>>>
>>>>> /* Put this after freeing the collective comm because it may have
>>>>>
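In essence, the patch guards that nested free: before the outer PMPI_Comm_free() recurses into the collective communicator, it raises a file-scope flag (only when comm->self == MPI_COMM_SELF), and the nested invocation then skips free_2level_comm(), so the 2-level state is released exactly once. Transplanted onto the toy model above (again with invented names, and simplified by dropping the MPI_COMM_SELF test), the guarded path looks roughly like this:

static int freeing_coll = 0;                 /* analogue of the patch's file-scope "flag" */

static void comm_free_guarded(struct comm *c)
{
    if (!freeing_coll)                       /* the check the patch adds at the
                                                free_2level_comm() call site */
        free_2level(c);

    if (c->comm_coll != c) {
        freeing_coll = 1;                    /* set before the nested free ...    */
        comm_free_guarded(c->comm_coll);     /* ... which now skips free_2level() */
        freeing_coll = 0;
    }
}

Since the flag is a single file-scope int rather than per-communicator state, the fix relies on this path recursing only one level deep, which appears to be the case for the comm_coll cleanup shown in the patch.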
>
>
>