[mvapich-discuss] segment fault when ENABLE_AFFINITY?

M Xie xmxmxie at gmail.com
Tue Aug 7 00:58:30 EDT 2012


Hello,

We ran into a problem with the mvapich2-1.8 release.

Until now we have been using mvapich2-1.8a2, and programs built with
that version run fine. We recently upgraded to the mvapich2-1.8
release, and programs now fail.

The problem is: with MV2_ENABLE_AFFINITY=1 (the default), the program
segfaults during initialization. With MV2_ENABLE_AFFINITY=0, the
program runs fine.
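
To make the comparison above repeatable, here is a minimal sketch that runs
the same command once with MV2_ENABLE_AFFINITY=1 and once with
MV2_ENABLE_AFFINITY=0 and reports how each run ended. The helper names
(run_with_affinity, describe) are made up for illustration, and the launch
command is whatever you pass on the command line. When the job goes through
a launcher, the launcher may report a nonzero exit status rather than the
signal itself.

```python
import os
import signal
import subprocess
import sys


def run_with_affinity(cmd, enabled):
    """Run cmd with MV2_ENABLE_AFFINITY set to 1 or 0 and return its exit code."""
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["MV2_ENABLE_AFFINITY"] = "1" if enabled else "0"
    return subprocess.run(cmd, env=env).returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


if __name__ == "__main__":
    # Hypothetical usage: python affinity_check.py <launcher and program>
    cmd = sys.argv[1:]
    if cmd:
        for enabled in (True, False):
            rc = run_with_affinity(cmd, enabled)
            print("MV2_ENABLE_AFFINITY=%d: %s" % (int(enabled), describe(rc)))
```

The environment is copied rather than modified in place, so the two runs
differ only in the one variable.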

After browsing the source code, it seems there were many code changes
between 1.8a2 and the 1.8 release in the process affinity and hwloc
interface parts. I replaced the hwloc bundled with the 1.8 release
(version 1.4.1) with the one from 1.8a2 (version 1.3.1), but the
segfault persists.

The attachment contains a gdb session on the core dump of the failing
program.

The process manager (PM) we use is SLURM, and the OS is Red Hat 5.5,
kernel version 2.6.18-194. We use the Intel compiler 11.1. LiMIC2 is
also used for intra-node communication.

Thanks for your help.
-------------- next part --------------
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/osu_bw...(no debugging symbols found)...done.
Reading symbols from /usr/lib64/libpmi.so.0...done.
Loaded symbols for /usr/lib64/libpmi.so.0
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib/liblimic2.so.0...done.
Loaded symbols for /usr/lib/liblimic2.so.0
Reading symbols from /usr/lib64/librdmacm.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/librdmacm.so.1
Reading symbols from /usr/lib64/libibverbs.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libibverbs.so.1
Reading symbols from /usr/lib64/libibumad.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libibumad.so.2
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libslurm.so.22...done.
Loaded symbols for /usr/lib64/libslurm.so.22
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/lib64/slurm/auth_none.so...done.
Loaded symbols for /usr/lib64/slurm/auth_none.so
Reading symbols from /usr/lib64/libmlx4-rdmav2.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmlx4-rdmav2.so
Reading symbols from /usr/lib64/libmthca-rdmav2.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmthca-rdmav2.so
Core was generated by `/tmp/osu_bw'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000045323f in _int_malloc ()
(gdb) bt
#0  0x000000000045323f in _int_malloc ()
#1  0x000000000045754a in calloc ()
#2  0x00000000004207a8 in MPIDI_CH3I_SMP_init ()
#3  0x00000000004cca16 in MPIDI_CH3_Init ()
#4  0x000000000047eee6 in MPID_Init ()
#5  0x0000000000413947 in MPIR_Init_thread ()
#6  0x000000000041348b in PMPI_Init ()
#7  0x000000000040e2c1 in main ()
(gdb) 

