[mvapich-discuss] MVAPICH problem in MPI_Finalize
amith rajith mamidala
mamidala at cse.ohio-state.edu
Thu Jul 12 14:03:34 EDT 2007
Hi Mark,
Thanks for reporting this problem. We are looking into this.
thanks,
-Amith.
On Wed, 11 Jul 2007, Mark Potts wrote:
> Hi,
> I've finally tracked down an intermittent problem that causes MVAPICH
> processes to generate segmentation faults during their shutdown.
> It seems to happen only on fairly large jobs on a 256-node cluster
> (8-32 cores/node). The following is the backtrace from the core
> file of one of the failed processes from a purposely simple program
> (simpleprint_c). This particular job ran with 1024 processes.
> We are using ch_gen2 MVAPICH 0.9.9 singlerail with _SMP turned on.
> This segmentation fault occurs across a host of different programs,
> but never on all processes, and apparently at random from one run to
> the next.
>
> From the core dump, the seg fault occurs as a result of the call
> to MPI_Finalize() but ultimately lies in the free() function of
> ptmalloc2/malloc.c.
> From some cursory code examination it appears that the error
> is hit when trying to unmap a memory segment. Since the
> seg fault occurrence is seemingly random, is this perhaps a
> timing issue in which processes within an SMP node get confused
> about who should be unmapping/freeing memory?
>
>
> gdb simpleprint_c core.9334
> :
> :
> Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
> Program terminated with signal 11, Segmentation fault.
> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> 3455 ptmalloc2/malloc.c: No such file or directory.
> in ptmalloc2/malloc.c
> (gdb) bt
> #0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
> #1 0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at create_2level_comm.c:49
> #2 0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at comm_free.c:187
> #3 0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at comm_free.c:217
> #4 0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
> #5 0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
> (gdb)
>
> regards,
> --
> ***********************************
> >> Mark J. Potts, PhD
> >>
> >> HPC Applications Inc.
> >> phone: 410-992-8360 Bus
> >> 410-313-9318 Home
> >> 443-418-4375 Cell
> >> email: potts at hpcapplications.com
> >> potts at excray.com
> ***********************************
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>