[mvapich-discuss] Occasional failure initializing
Martin Pokorny
mpokorny at nrao.edu
Mon Jul 27 18:12:41 EDT 2015
I'm currently in the process of upgrading an mvapich2 installation from
version 1.9a2 to version 2.1. The upgrade has mostly been successful so
far, but I've encountered one odd issue. I can work around it by setting
MV2_USE_SHMEM_COLL=0, but I was hoping not to have to do that on a
continuing basis.
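For reference, a sketch of how the workaround can be applied per
invocation rather than in the environment, assuming the Hydra mpiexec
launcher that ships with MVAPICH2 (the file names "hosts" and "config"
are placeholders for the attached host and config files):

```shell
# Disable shared-memory collectives for this run only, as a workaround.
# -genv sets an environment variable for all launched processes,
# -f names the host file, -configfile names the MPMD config file.
# (Option names are Hydra mpiexec's; mpirun_rsh takes different flags.)
mpiexec -genv MV2_USE_SHMEM_COLL 0 -f hosts -configfile config
```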
The attached program is sufficient to trigger the problem -- you'll
notice that it's trivial. Also attached are a host file, a config file,
and a backtrace. From the backtrace you can see that the failure occurs
in the call to MPI_Init_thread. I have a core file that I can send, in
case that's interesting. Note that I've only seen the problem when
running in MPMD mode using a config file, which matches my case of
interest, but I'm not sure that's strictly necessary. In this test case,
I'm simply providing the same executable under two names. Also note that
the problem only occurs in about one out of ten or twenty trials.
Various other settings can change the frequency of occurrence, but I
figure that's just a further sign of the non-deterministic nature of
the problem.
And here's the output of "mpiname -a":
> $ mpiname -a
> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>
> Compilation
> CC: gcc -DNDEBUG -DNVALGRIND -O2
> CXX: g++ -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib -O2
> FC: gfortran -O2
>
> Configuration
> --prefix=/opt/cbe-local/stow/mvapich2-2.1 --enable-romio --with-file-system=lustre --with-limic2
--
Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 309 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150727/f6bc2499/attachment.bin>
-------------- next part --------------
-n 2 : /users/mpokorny/tmp/mpitest/testA
-n 2 : /users/mpokorny/tmp/mpitest/testB
-------------- next part --------------
cbe-node-08
cbe-node-09
-------------- next part --------------
[cbe-node-09:mpi_rank_3][error_sighandler] Caught error: Bus error (signal 7)
[cbe-node-09:mpi_rank_3][print_backtrace] 0: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f7dbbaaecbe]
[cbe-node-09:mpi_rank_3][print_backtrace] 1: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(error_sighandler+0x59) [0x7f7dbbaaedc9]
[cbe-node-09:mpi_rank_3][print_backtrace] 2: /lib64/libc.so.6() [0x3558032920]
[cbe-node-09:mpi_rank_3][print_backtrace] 3: /lib64/libc.so.6() [0x3558083716]
[cbe-node-09:mpi_rank_3][print_backtrace] 4: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_Mmap+0x27c) [0x7f7dbb83e8fc]
[cbe-node-09:mpi_rank_3][print_backtrace] 5: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SMP_init+0x1376) [0x7f7dbba6a1f6]
[cbe-node-09:mpi_rank_3][print_backtrace] 6: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3_Init+0x2dd) [0x7f7dbba61a0d]
[cbe-node-09:mpi_rank_3][print_backtrace] 7: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPID_Init+0x1ba) [0x7f7dbba568ba]
[cbe-node-09:mpi_rank_3][print_backtrace] 8: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIR_Init_thread+0x2a4) [0x7f7dbb9d3984]
[cbe-node-09:mpi_rank_3][print_backtrace] 9: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(PMPI_Init_thread+0x74) [0x7f7dbb9d3ab4]
[cbe-node-09:mpi_rank_3][print_backtrace] 10: /users/mpokorny/tmp/mpitest/testB() [0x4006b5]
[cbe-node-09:mpi_rank_3][print_backtrace] 11: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
[cbe-node-09:mpi_rank_3][print_backtrace] 12: /users/mpokorny/tmp/mpitest/testB() [0x4005c9]