[mvapich-discuss] MVAPICH problem in MPI_Finalize
amith rajith mamidala
mamidala at cse.ohio-state.edu
Tue Jul 17 12:04:14 EDT 2007
Pasha,
I checked in the patch to the 0.9.9 branch.
thanks,
-Amith
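
For readers who hit the same crash: the patch (quoted in full at the
bottom of this message) guards against free_2level_comm() running a
second time on shared state. PMPI_Comm_free() recursively frees a
communicator's collective communicator, and when the communicator being
torn down is MPI_COMM_SELF that recursion apparently ended up freeing
2-level shmem-collective state that is also freed through another path,
hence the double free in the backtrace below. In outline, the patched
control flow looks roughly like this (a compilable sketch with
hypothetical types and field names, not the actual MVAPICH source):

#include <stdio.h>

/* Simplified stand-in for MVAPICH's communicator structure. */
struct comm {
    struct comm *comm_coll;   /* associated collective communicator */
    int is_self;              /* models comm->self == MPI_COMM_SELF */
};

static int flag = 0;          /* the guard the patch introduces */

static void free_2level_comm(struct comm *c)
{
    printf("freeing 2-level shmem-collective state of %p\n", (void *)c);
}

static void comm_free(struct comm *c)
{
    /* patched hunk near comm_free.c:186: skip the free while the
       guard is raised */
    if (c->comm_coll == c && !flag)
        free_2level_comm(c);

    /* patched hunk near comm_free.c:217: raise the guard before
       recursing into MPI_COMM_SELF's collective communicator, so its
       shared 2-level state is not freed again */
    if (c->comm_coll != c) {
        if (c->is_self)
            flag = 1;
        comm_free(c->comm_coll);
        flag = 0;
    }
}

int main(void)
{
    struct comm coll = { &coll, 0 };   /* coll comm refers to itself */
    struct comm self = { &coll, 1 };   /* models MPI_COMM_SELF */
    comm_free(&self);  /* guard keeps the recursion from freeing
                          coll's shared state a second time */
    return 0;
}
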
On Tue, 17 Jul 2007, Pavel Shamis (Pasha) wrote:
> Amith,
> Please commit the patch to the 0.9.9 branch. (I would like to have it
> in a future OFED bugfix release.)
>
> Regards,
> Pasha
>
> Mark Potts wrote:
> > Amith,
> > The patch seems to do the job. I can no longer induce any
> > MPI_Finalize() seg faults in big jobs.
> > Thanks. We'll roll your patch into our builds.
> > regards,
> >
> > amith rajith mamidala wrote:
> >> Hi Mark,
> >>
> >> Attached is the patch, which should resolve the issue. Can you please
> >> try this out and let us know if it works?
> >>
> >> thanks,
> >>
> >> -Amith.
> >>
> >>
> >> On Wed, 11 Jul 2007, Mark Potts wrote:
> >>
> >>> Hi,
> >>> I've finally tracked down an intermittent problem that causes
> >>> MVAPICH processes to generate segmentation faults during their
> >>> shutdown. It seems to happen only on fairly large jobs on a
> >>> 256-node cluster (8-32 cores/node). The following is the backtrace
> >>> from the core file of one of the failed processes from a purposely
> >>> simple program (simpleprint_c). This particular job ran with 1024
> >>> processes. We are using ch_gen2 MVAPICH 0.9.9 singlerail with
> >>> _SMP_ turned on. The segmentation fault occurs across a host of
> >>> different programs, but never on all processes, and seemingly at
> >>> random from one run to the next.
> >>>
> >>> From the core dump, the seg fault is triggered by the call to
> >>> MPI_Finalize() but ultimately occurs in the free() function of
> >>> ptmalloc2/malloc.c. From some cursory code examination it appears
> >>> that the error is hit when trying to unmap a memory segment. Since
> >>> the seg fault occurrence is seemingly random, is this perhaps a
> >>> timing issue in which processes within an SMP node get confused
> >>> about who should be unmapping/freeing memory?
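> >>>
> >>> To illustrate the failure mode I suspect: a double free through two
> >>> owners of the same pointer. A minimal standalone sketch (not the
> >>> MVAPICH code) that can produce this kind of crash inside free():
> >>>
> >>> #include <stdlib.h>
> >>>
> >>> int main(void)
> >>> {
> >>>     /* two owners end up holding the same heap pointer */
> >>>     char *owner_a = malloc(64);
> >>>     char *owner_b = owner_a;
> >>>
> >>>     free(owner_a);   /* first, legitimate free */
> >>>     free(owner_b);   /* double free: the allocator's chunk metadata
> >>>                         is now inconsistent, so this call can abort
> >>>                         or segfault */
> >>>     return 0;
> >>> }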
> >>>
> >>>
> >>> gdb simpleprint_c core.9334
> >>> :
> >>> :
> >>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> >>> 3455 ptmalloc2/malloc.c: No such file or directory.
> >>> in ptmalloc2/malloc.c
> >>> (gdb) bt
> >>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> >>> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
> >>> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
> >>> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
> >>> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
> >>> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
> >>> (gdb)
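> >>>
> >>> For reference, the test program is essentially just MPI_Init, a
> >>> print, and MPI_Finalize. Roughly (a sketch; the exact simple.c is
> >>> not reproduced here):
> >>>
> >>> #include <stdio.h>
> >>> #include <mpi.h>
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     int rank, size;
> >>>     MPI_Init(&argc, &argv);
> >>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>     printf("rank %d of %d\n", rank, size);
> >>>     MPI_Finalize();   /* the segfault fires inside this call
> >>>                          (simple.c:18 in the backtrace above) */
> >>>     return 0;
> >>> }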
> >>>
> >>> regards,
> >>> --
> >>> ***********************************
> >>> >> Mark J. Potts, PhD
> >>> >>
> >>> >> HPC Applications Inc.
> >>> >> phone: 410-992-8360 Bus
> >>> >> 410-313-9318 Home
> >>> >> 443-418-4375 Cell
> >>> >> email: potts at hpcapplications.com
> >>> >> potts at excray.com
> >>> ***********************************
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>>
> >>> Index: comm_free.c
> >>> ===================================================================
> >>> --- comm_free.c (revision 1380)
> >>> +++ comm_free.c (working copy)
> >>> @@ -59,6 +59,9 @@
> >>>  #define DBG(a)
> >>>  #define OUTFILE stdout
> >>>
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +int flag = 0;
> >>> +#endif
> >>>  extern int enable_rdma_collectives;
> >>>  #ifdef _SMP_
> >>>  extern int enable_shmem_collectives;
> >>> @@ -183,7 +186,7 @@
> >>>  #endif
> >>>
> >>>  #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> -    if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
> >>> +    if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
> >>>          free_2level_comm(comm);
> >>>      }
> >>>  #endif
> >>> @@ -214,7 +217,15 @@
> >>>      /* Free collective communicator (unless it refers back to myself) */
> >>>      if ( comm->comm_coll != comm ) {
> >>>          MPI_Comm ctmp = comm->comm_coll->self;
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +        if (comm->self == MPI_COMM_SELF){
> >>> +            flag = 1;
> >>> +        }
> >>> +#endif
> >>>          MPI_Comm_free ( &ctmp );
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +        flag = 0;
> >>> +#endif
> >>>      }
> >>>
> >>>      /* Put this after freeing the collective comm because it may have
> >
>