[mvapich-discuss] MVAPICH problem in MPI_Finalize
Mark Potts
potts at hpcapplications.com
Tue Jul 17 00:06:46 EDT 2007
Amith,
The patch seems to do the job. I can no longer induce any
MPI_Finalize() seg faults in big jobs.
Thanks. We'll roll your patch into our builds.
regards,
amith rajith mamidala wrote:
> Hi Mark,
>
> Attached is the patch, which should resolve the issue. Can you please try
> it out and let us know whether it works?
>
> thanks,
>
> -Amith.
>
>
> On Wed, 11 Jul 2007, Mark Potts wrote:
>
>> Hi,
>> I've finally tracked down an intermittent problem that causes MVAPICH
>> processes to generate segmentation faults during shutdown. It seems
>> to happen only on fairly large jobs on a 256-node cluster (8-32
>> cores/node). Below is the backtrace from the core file of one of the
>> failed processes from a purposely simple program (simpleprint_c).
>> This particular job ran with 1024 processes. We are using ch_gen2
>> MVAPICH 0.9.9 single-rail with _SMP_ turned on. The segmentation
>> fault occurs across a host of different programs, but never on all
>> processes, and seemingly at random from one run to the next.
>>
>> From the core dump, the seg fault is triggered by the call to
>> MPI_Finalize() but ultimately occurs in the free() function of
>> ptmalloc2/malloc.c. From some cursory code examination, it appears
>> the error is hit while trying to unmap a memory segment. Since the
>> seg fault occurs seemingly at random, is this perhaps a timing issue
>> in which processes within an SMP node get confused about which of
>> them should be unmapping/freeing the memory?
>>
>>
>> gdb simpleprint_c core.9334
>> :
>> :
>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>> 3455 ptmalloc2/malloc.c: No such file or directory.
>> in ptmalloc2/malloc.c
>> (gdb) bt
>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
>> #1 0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at
>> create_2level_comm.c:49
>> #2 0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at
>> comm_free.c:187
>> #3 0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at
>> comm_free.c:217
>> #4 0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
>> #5 0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
>> (gdb)
>>
>> regards,
>> --
>> ***********************************
>> Mark J. Potts, PhD
>>
>> HPC Applications Inc.
>> phone: 410-992-8360 Bus
>>        410-313-9318 Home
>>        443-418-4375 Cell
>> email: potts at hpcapplications.com
>>        potts at excray.com
>> ***********************************
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>> ------------------------------------------------------------------------
>>
>> Index: comm_free.c
>> ===================================================================
>> --- comm_free.c (revision 1380)
>> +++ comm_free.c (working copy)
>> @@ -59,6 +59,9 @@
>> #define DBG(a)
>> #define OUTFILE stdout
>>
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> +int flag = 0;
>> +#endif
>> extern int enable_rdma_collectives;
>> #ifdef _SMP_
>> extern int enable_shmem_collectives;
>> @@ -183,7 +186,7 @@
>> #endif
>>
>> #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> - if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
>> + if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
>> free_2level_comm(comm);
>> }
>> #endif
>> @@ -214,7 +217,15 @@
>> /* Free collective communicator (unless it refers back to myself) */
>> if ( comm->comm_coll != comm ) {
>> MPI_Comm ctmp = comm->comm_coll->self;
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> + if (comm->self == MPI_COMM_SELF){
>> + flag = 1;
>> + }
>> +#endif
>> MPI_Comm_free ( &ctmp );
>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
>> + flag = 0;
>> +#endif
>> }
>>
>> /* Put this after freeing the collective comm because it may have
--
***********************************
Mark J. Potts, PhD

HPC Applications Inc.
phone: 410-992-8360 Bus
       410-313-9318 Home
       443-418-4375 Cell
email: potts at hpcapplications.com
       potts at excray.com
***********************************