[mvapich-discuss] MVAPICH problem in MPI_Finalize

amith rajith mamidala mamidala at cse.ohio-state.edu
Tue Jul 17 12:04:14 EDT 2007


Pasha,

I checked in the patch to the 0.9.9 branch.

thanks,

-Amith

On Tue, 17 Jul 2007, Pavel Shamis (Pasha) wrote:

> Amith,
> Please commit the patch to the 0.9.9 branch. (I would like to have it in
> a future OFED bugfix release.)
>
> Regards,
> Pasha
>
> Mark Potts wrote:
> > Amith,
> >    The patch seems to do the job.  I can no longer induce any
> >    MPI_Finalize() seg faults in big jobs.
> >    Thanks.  We'll roll your patch into our builds.
> >         regards,
> >
> > amith rajith mamidala wrote:
> >> Hi Mark,
> >>
> >> Attached is the patch that should resolve the issue. Can you please try
> >> it out and let us know if it works?
> >>
> >> thanks,
> >>
> >> -Amith.
> >>
> >>
> >> On Wed, 11 Jul 2007, Mark Potts wrote:
> >>
> >>> Hi,
> >>>     I've finally tracked down an intermittent problem that causes
> >>>     MVAPICH processes to generate segmentation faults during shutdown.
> >>>     It seems to happen only on fairly large jobs on a 256-node cluster
> >>>     (8-32 cores/node).  The following is the backtrace from the core
> >>>     file of one of the failed processes from a purposely simple program
> >>>     (simpleprint_c).  This particular job ran with 1024 processes.
> >>>     We are using ch_gen2 MVAPICH 0.9.9 singlerail with _SMP turned on.
> >>>     This segmentation fault occurs across a host of different programs,
> >>>     but never on all processes, and seemingly at random from one run
> >>>     to the next.
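
The simpleprint_c source is not included in the thread; a minimal stand-in
consistent with the backtrace below (main() does a trivial print and then
calls MPI_Finalize()) would look roughly like this -- the exact contents are
an assumption:

    /* Hypothetical stand-in for simpleprint_c; the real source is not shown.
     * All that matters for the failure is that main() ends in MPI_Finalize(). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();   /* the seg fault fires inside this call */
        return 0;
    }
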
> >>>
> >>>     From the core dump, the seg fault is triggered by the call to
> >>>     MPI_Finalize() but ultimately occurs in the free() function of
> >>>     ptmalloc2/malloc.c.
> >>>     From some cursory code examination it appears that the error
> >>>     is hit when trying to unmap a memory segment.  Since the
> >>>     seg fault occurrence is seemingly random, is this perhaps a
> >>>     timing issue in which processes within an SMP node get confused
> >>>     about who should be unmapping/freeing memory?
> >>>
> >>>
> >>> gdb simpleprint_c core.9334
> >>> :
> >>> :
> >>> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> >>> 3455    ptmalloc2/malloc.c: No such file or directory.
> >>>          in ptmalloc2/malloc.c
> >>> (gdb) bt
> >>> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> >>> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
> >>> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
> >>> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
> >>> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
> >>> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
> >>> (gdb)
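
The session above comes from a single core file; on a 1024-process job it is
not obvious ahead of time which ranks will fault on a given run. One possible
aid (an addition here, not something used in the thread) is to install a
SIGSEGV handler right after MPI_Init() that tags the faulting rank and host
before re-raising the signal, so the default action still writes the core:

    /* Debugging aid (assumed helper, not part of MVAPICH or simpleprint_c):
     * report which rank/host faulted, then re-raise so a core is still written. */
    #include <mpi.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static int g_rank = -1;

    static void segv_report(int sig)
    {
        char host[64] = "?";
        gethostname(host, sizeof host);
        /* fprintf() is not async-signal-safe, but the process is dying anyway */
        fprintf(stderr, "rank %d on %s caught signal %d\n", g_rank, host, sig);
        signal(sig, SIG_DFL);   /* restore the default action ...           */
        raise(sig);             /* ... and re-raise to get the core dump    */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
        signal(SIGSEGV, segv_report);

        /* ... application work ... */

        MPI_Finalize();
        return 0;
    }
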
> >>>
> >>>         regards,
> >>> --
> >>> ***********************************
> >>>  >> Mark J. Potts, PhD
> >>>  >>
> >>>  >> HPC Applications Inc.
> >>>  >> phone: 410-992-8360 Bus
> >>>  >>        410-313-9318 Home
> >>>  >>        443-418-4375 Cell
> >>>  >> email: potts at hpcapplications.com
> >>>  >>        potts at excray.com
> >>> ***********************************
> >>> _______________________________________________
> >>> mvapich-discuss mailing list
> >>> mvapich-discuss at cse.ohio-state.edu
> >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>>
> >>> Index: comm_free.c
> >>> ===================================================================
> >>> --- comm_free.c    (revision 1380)
> >>> +++ comm_free.c    (working copy)
> >>> @@ -59,6 +59,9 @@
> >>>  #define DBG(a)
> >>>  #define OUTFILE stdout
> >>>
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +int flag = 0;
> >>> +#endif
> >>>  extern int enable_rdma_collectives;
> >>>  #ifdef _SMP_
> >>>  extern int enable_shmem_collectives;
> >>> @@ -183,7 +186,7 @@
> >>>  #endif
> >>>
> >>>  #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> -        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
> >>> +        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
> >>>              free_2level_comm(comm);
> >>>          }
> >>>  #endif
> >>> @@ -214,7 +217,15 @@
> >>>      /* Free collective communicator (unless it refers back to myself) */
> >>>      if ( comm->comm_coll != comm ) {
> >>>          MPI_Comm ctmp = comm->comm_coll->self;
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +        if (comm->self == MPI_COMM_SELF){
> >>> +            flag = 1;
> >>> +        }
> >>> +#endif
> >>>          MPI_Comm_free ( &ctmp );
> >>> +#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
> >>> +        flag = 0;
> >>> +#endif
> >>>      }
> >>>
> >>>      /* Put this after freeing the collective comm because it may have
> >
>
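
Read together, the two hunks in the patch above make the nested
PMPI_Comm_free call that runs while MPI_COMM_SELF is being freed skip
free_2level_comm() for its collective companion, and the guard is cleared
again as soon as that nested call returns. A stripped-down, self-contained
model of that control flow (hypothetical names and structures, not MVAPICH
source) is:

    #include <stdio.h>

    struct comm {
        const char  *name;
        struct comm *comm_coll;   /* collective companion; may point back to itself */
        int          is_self;     /* models comm->self == MPI_COMM_SELF             */
    };

    static int flag = 0;          /* the guard variable added by the patch */

    static void free_2level_comm_model(struct comm *c)
    {
        printf("free_2level_comm(%s)\n", c->name);
    }

    static void comm_free_model(struct comm *c)
    {
        /* first hunk: the 2-level teardown is skipped while the guard is set */
        if (c->comm_coll == c && !flag)
            free_2level_comm_model(c);

        /* second hunk: freeing the "self" comm sets the guard around the
         * nested free of its collective companion, then clears it again */
        if (c->comm_coll != c) {
            if (c->is_self)
                flag = 1;
            comm_free_model(c->comm_coll);
            flag = 0;
        }
    }

    int main(void)
    {
        struct comm coll  = { "coll_of_self", NULL,  0 };
        struct comm self  = { "self",         &coll, 1 };
        struct comm world = { "world",        NULL,  0 };

        coll.comm_coll  = &coll;    /* the companion refers back to itself  */
        world.comm_coll = &world;   /* an ordinary intracommunicator, ditto */

        comm_free_model(&world);    /* prints: free_2level_comm(world) */
        comm_free_model(&self);     /* the nested call on "coll_of_self" is
                                       skipped because the guard is set */
        return 0;
    }

Ordinary communicators still tear down their 2-level state as before; only
the nested free performed on behalf of MPI_COMM_SELF is exempted, which is
what the added (!flag) test in the first hunk expresses.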


