[mvapich-discuss] MVAPICH problem in MPI_Finalize
Mark Potts
potts at hpcapplications.com
Wed Jul 11 23:01:17 EDT 2007
Hi,
I've finally tracked down an intermittent problem that causes MVAPICH
processes to generate segmentation faults during shutdown.
It seems to happen only on fairly large jobs on a 256-node cluster
(8-32 cores/node). The following is the backtrace from the core
file of one of the failed processes from a purposely simple program
(simpleprint_c). This particular job ran with 1024 processes.
We are using ch_gen2 MVAPICH 0.9.9 single-rail with _SMP enabled.
This segmentation fault occurs across a host of different programs,
but never on all processes, and seemingly at random from one run
to the next.
From the core dump, the seg fault is triggered by the call to
MPI_Finalize() but ultimately occurs inside the free() function of
ptmalloc2/malloc.c.
From some cursory code examination, it appears the error is hit
while trying to unmap a memory segment. Since the seg fault
occurrence is seemingly random, could this be a timing issue in
which processes within an SMP node get confused about which one
should be unmapping/freeing the memory?
gdb simpleprint_c core.9334
:
:
Core was generated by `/var/tmp/mjpworkspace/simpleprint_c'.
Program terminated with signal 11, Segmentation fault.
#0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
3455 ptmalloc2/malloc.c: No such file or directory.
in ptmalloc2/malloc.c
(gdb) bt
#0 free (mem=0xfa00940af900940a) at ptmalloc2/malloc.c:3455
#1 0x00002b70b40489c5 in free_2level_comm (comm_ptr=0x57a720) at
create_2level_comm.c:49
#2 0x00002b70b40461af in PMPI_Comm_free (commp=0x7ffff6bb4e44) at
comm_free.c:187
#3 0x00002b70b404604f in PMPI_Comm_free (commp=0x7ffff6bb4e70) at
comm_free.c:217
#4 0x00002b70b404d56e in PMPI_Finalize () at finalize.c:159
#5 0x0000000000400814 in main (argc=1, argv=0x7ffff6bb4fa8) at simple.c:18
(gdb)
regards,
--
***********************************
>> Mark J. Potts, PhD
>>
>> HPC Applications Inc.
>> phone: 410-992-8360 Bus
>> 410-313-9318 Home
>> 443-418-4375 Cell
>> email: potts at hpcapplications.com
>> potts at excray.com
***********************************