[mvapich-discuss] Occasional failure initializing

Martin Pokorny mpokorny at nrao.edu
Mon Jul 27 18:12:41 EDT 2015


I'm in the process of upgrading an mvapich2 installation from version 
1.9a2 to version 2.1. The upgrade has mostly been successful so far, 
but I've run into one odd issue. I can work around it by setting 
MV2_USE_SHMEM_COLL=0, but I was hoping not to have to do that on a 
continuing basis.
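
For reference, applying the workaround just means setting the variable 
at launch time. A sketch assuming mpirun_rsh ("hostfile" and 
"configfile" are placeholder names for the attachments below; -export 
propagates the local environment to the MPI processes):

    $ export MV2_USE_SHMEM_COLL=0
    $ mpirun_rsh -export -hostfile hostfile -config configfile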

The attached program is sufficient to trigger the problem -- you'll 
notice that it's trivial. Also attached are a host file, a config file, 
and a backtrace. From the backtrace you can see that the failure occurs 
in the call to MPI_Init_thread. I have a core file that I can send, in 
case that's interesting. Note that I've only seen the problem when 
running in MPMD mode using a config file, which matches my case of 
interest, but I'm not sure that's strictly necessary. In this test case, 
I'm simply providing the same executable under two names. Also note that 
the problem only occurs in about one out of ten or twenty trials. 
Various other settings can change the frequency of occurrence, but I 
figure that's just a further sign of the non-deterministic nature of 
the problem.
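
In outline, the test program does nothing beyond initializing and 
finalizing MPI. The sketch below is representative (the thread level 
requested is illustrative; the crash happens inside MPI_Init_thread 
itself, as the backtrace shows):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;
        /* The backtrace below shows the failure inside this call;
           the requested thread level here is illustrative. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        printf("provided thread level: %d\n", provided);
        MPI_Finalize();
        return 0;
    }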

And here's the output of "mpiname -a":

> $ mpiname -a
> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>
> Compilation
> CC: gcc    -DNDEBUG -DNVALGRIND -O2
> CXX: g++   -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib   -O2
> FC: gfortran   -O2
>
> Configuration
> --prefix=/opt/cbe-local/stow/mvapich2-2.1 --enable-romio --with-file-system=lustre --with-limic2

-- 
Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 309 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150727/f6bc2499/attachment.bin>
-------------- next part --------------
-n 2 : /users/mpokorny/tmp/mpitest/testA
-n 2 : /users/mpokorny/tmp/mpitest/testB
-------------- next part --------------
cbe-node-08
cbe-node-09
-------------- next part --------------
[cbe-node-09:mpi_rank_3][error_sighandler] Caught error: Bus error (signal 7)
[cbe-node-09:mpi_rank_3][print_backtrace]   0: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f7dbbaaecbe]
[cbe-node-09:mpi_rank_3][print_backtrace]   1: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(error_sighandler+0x59) [0x7f7dbbaaedc9]
[cbe-node-09:mpi_rank_3][print_backtrace]   2: /lib64/libc.so.6() [0x3558032920]
[cbe-node-09:mpi_rank_3][print_backtrace]   3: /lib64/libc.so.6() [0x3558083716]
[cbe-node-09:mpi_rank_3][print_backtrace]   4: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_Mmap+0x27c) [0x7f7dbb83e8fc]
[cbe-node-09:mpi_rank_3][print_backtrace]   5: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SMP_init+0x1376) [0x7f7dbba6a1f6]
[cbe-node-09:mpi_rank_3][print_backtrace]   6: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3_Init+0x2dd) [0x7f7dbba61a0d]
[cbe-node-09:mpi_rank_3][print_backtrace]   7: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPID_Init+0x1ba) [0x7f7dbba568ba]
[cbe-node-09:mpi_rank_3][print_backtrace]   8: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIR_Init_thread+0x2a4) [0x7f7dbb9d3984]
[cbe-node-09:mpi_rank_3][print_backtrace]   9: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(PMPI_Init_thread+0x74) [0x7f7dbb9d3ab4]
[cbe-node-09:mpi_rank_3][print_backtrace]  10: /users/mpokorny/tmp/mpitest/testB() [0x4006b5]
[cbe-node-09:mpi_rank_3][print_backtrace]  11: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
[cbe-node-09:mpi_rank_3][print_backtrace]  12: /users/mpokorny/tmp/mpitest/testB() [0x4005c9]