[mvapich-discuss] MVAPICH problem in MPI_Finalize

amith rajith mamidala mamidala at cse.ohio-state.edu
Fri Jul 13 18:32:01 EDT 2007


Hi Mark,

Attached is a patch that should resolve the issue. Could you please try
it out and let us know if it works?
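
In outline, the change makes PMPI_Comm_free() skip the free_2level_comm()
call while it is recursively freeing the collective communicator behind
MPI_COMM_SELF (that is what the new "flag" in the diff below controls),
which is the call that was blowing up in your backtrace.  Since
MPI_Finalize() itself walks this PMPI_Comm_free() path, any program that
initializes and finalizes MPI should exercise the fix; something like the
sketch below is enough (just an assumed stand-in for simpleprint_c, not
its actual source):

/* Minimal finalize test -- an assumed stand-in for simpleprint_c.
 * The interesting work happens inside MPI_Finalize(). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();   /* the reported segfault surfaced in here,
                         via PMPI_Comm_free -> free_2level_comm */
    return 0;
}

Please run it at the scale that showed the failure (e.g. 1024 processes
with shared-memory collectives enabled), since the crash was intermittent
and only showed up on large runs.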

thanks,

-Amith.


On Wed, 11 Jul 2007, Mark Potts wrote:

> Hi,
>     I've finally tracked down an intermittent problem that causes
>     MVAPICH processes to generate segmentation faults during shutdown.
>     It seems to happen only on fairly large jobs on a 256-node cluster
>     (8-32 cores/node).  The following is the backtrace from the core
>     file of one of the failed processes of a purposely simple program
>     (simpleprint_c).  This particular job ran with 1024 processes.
>     We are using ch_gen2 MVAPICH 0.9.9 singlerail with _SMP turned on.
>     The segmentation fault occurs across a host of different programs,
>     but never on all processes, and seemingly at random from one run
>     to the next.
>
>     From the core dump, the seg fault is triggered by the call to
>     MPI_Finalize() but ultimately occurs inside the free() function of
>     ptmalloc2/malloc.c.
>     From some cursory code examination it appears that the error is
>     hit when trying to unmap a memory segment.  Since the seg fault
>     occurs seemingly at random, is this perhaps a timing issue in
>     which processes within an SMP node get confused about who should
>     be unmapping/freeing memory?
>
>
> gdb simpleprint_c core.9334
> :
> :
> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
> Program terminated with signal 11, Segmentation fault.
> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> 3455    ptmalloc2/malloc.c: No such file or directory.
>          in ptmalloc2/malloc.c
> (gdb) bt
> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at
> create_2level_comm.c:49
> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at
> comm_free.c:187
> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at
> comm_free.c:217
> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
> (gdb)
>
>         regards,
> --
> ***********************************
>  >> Mark J. Potts, PhD
>  >>
>  >> HPC Applications Inc.
>  >> phone: 410-992-8360 Bus
>  >>        410-313-9318 Home
>  >>        443-418-4375 Cell
>  >> email: potts at hpcapplications.com
>  >>        potts at excray.com
> ***********************************
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
-------------- next part --------------
Index: comm_free.c
===================================================================
--- comm_free.c	(revision 1380)
+++ comm_free.c	(working copy)
@@ -59,6 +59,9 @@
 #define DBG(a) 
 #define OUTFILE stdout
 
+#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
+int flag = 0;
+#endif
 extern int enable_rdma_collectives;
 #ifdef _SMP_
 extern int enable_shmem_collectives;
@@ -183,7 +186,7 @@
 #endif
 
 #if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
-        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives)) {
+        if((comm->comm_coll == comm) && (comm->comm_type == MPIR_INTRA) && (enable_shmem_collectives) && (!flag)) {
             free_2level_comm(comm);
         }
 #endif
@@ -214,7 +217,15 @@
 	/* Free collective communicator (unless it refers back to myself) */
 	if ( comm->comm_coll != comm ) {
 	    MPI_Comm ctmp = comm->comm_coll->self;
+#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
+        if (comm->self == MPI_COMM_SELF){
+            flag = 1;
+        }
+#endif
 	    MPI_Comm_free ( &ctmp );
+#if (defined(_SMP_) && (defined(CH_GEN2))) ||defined(CH_SMP)
+        flag = 0;
+#endif
 	}
 
 	/* Put this after freeing the collective comm because it may have
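
For anyone reading the diff quickly: the new "flag" is raised around the
recursive MPI_Comm_free() of the collective communicator that sits behind
MPI_COMM_SELF, and free_2level_comm() is skipped while it is set.  The
standalone toy below models just that flag-guarded recursion -- it is an
illustration only, with no MPI and with made-up names, not MVAPICH code:

/* Toy model of the flag-guarded recursion in the patch above.
 * "comm_free" releases a communicator-like object and then recursively
 * releases the collective communicator it points to; the global
 * "skip_2level_free" flag tells the inner call to leave the shared
 * two-level storage alone, because its real owner frees it elsewhere. */
#include <stdio.h>
#include <stdlib.h>

struct toy_comm {
    struct toy_comm *comm_coll;  /* backing "collective" communicator  */
    int is_self;                 /* models comm->self == MPI_COMM_SELF */
    int *two_level;              /* models the 2-level shmem storage   */
};

static int skip_2level_free = 0; /* plays the role of "flag" in the patch */

static void comm_free(struct toy_comm *comm)
{
    /* Counterpart of the first hunk: release the 2-level storage only
     * when the outer call has not asked us to skip it. */
    if (comm->comm_coll == comm && !skip_2level_free)
        free(comm->two_level);

    /* Counterpart of the second hunk: when freeing the collective
     * communicator behind the "self" communicator, raise the flag
     * around the recursive call. */
    if (comm->comm_coll != comm) {
        if (comm->is_self)
            skip_2level_free = 1;
        comm_free(comm->comm_coll);
        skip_2level_free = 0;
    }

    free(comm);
}

int main(void)
{
    int *shared = malloc(16 * sizeof(int));  /* storage owned elsewhere */

    struct toy_comm *coll = malloc(sizeof *coll);
    struct toy_comm *self = malloc(sizeof *self);

    coll->comm_coll = coll;      /* coll is its own collective comm     */
    coll->is_self   = 0;
    coll->two_level = shared;    /* aliases storage it does not own     */

    self->comm_coll = coll;
    self->is_self   = 1;
    self->two_level = NULL;

    comm_free(self);             /* frees self and coll; the flag keeps
                                    the inner call away from "shared"   */
    free(shared);                /* the real owner releases it once     */

    printf("no double free\n");
    return 0;
}

Without the flag, the inner call would free "shared" as well, and the
second free would die inside malloc -- the same kind of crash inside
free() that the backtrace above shows.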

