[mvapich-discuss] MVAPICH problem in MPI_Finalize

amith rajith mamidala mamidala at cse.ohio-state.edu
Thu Jul 12 14:03:34 EDT 2007


Hi Mark,

Thanks for reporting this problem. We are looking into this.

thanks,
-Amith.

On Wed, 11 Jul 2007, Mark Potts wrote:

> Hi,
>     I've finally tracked down an intermittent problem that causes
>     MVAPICH processes to generate segmentation faults during shutdown.
>     It seems to happen only on fairly large jobs on a 256-node cluster
>     (8-32 cores/node).  The following is the backtrace from the core
>     file of one of the failed processes from a purposely simple program
>     (simpleprint_c).  This particular job ran with 1024 processes.
>     We are using ch_gen2 MVAPICH 0.9.9 singlerail with _SMP turned on.
>     The segmentation fault occurs across a host of different programs,
>     but never on all processes, and seemingly at random(?) from one run
>     to the next.
>
>     From the core dump, the seg fault is triggered by the call to
>     MPI_Finalize() but ultimately occurs in the free() function of
>     ptmalloc2/malloc.c.
>     From a cursory examination of the code, it appears that the error
>     is hit when trying to unmap a memory segment.  Since the seg fault
>     is seemingly random, is this perhaps a timing issue in which
>     processes within an SMP node get confused about who should be
>     unmapping/freeing memory?
>
>
> gdb simpleprint_c core.9334
> :
> :
> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
> Program terminated with signal 11, Segmentation fault.
> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> 3455    ptmalloc2/malloc.c: No such file or directory.
>          in ptmalloc2/malloc.c
> (gdb) bt
> #0  free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> #1  0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
> #2  0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
> #3  0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
> #4  0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
> #5  0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
> (gdb)
>
>         regards,
> --
> ***********************************
>  >> Mark J. Potts, PhD
>  >>
>  >> HPC Applications Inc.
>  >> phone: 410-992-8360 Bus
>  >>        410-313-9318 Home
>  >>        443-418-4375 Cell
>  >> email: potts at hpcapplications.com
>  >>        potts at excray.com
> ***********************************
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


