[mvapich-discuss] Occasional failure initializing

Martin Pokorny mpokorny at nrao.edu
Tue Jul 28 11:29:50 EDT 2015


On 07/28/2015 09:18 AM, Jonathan Perkins wrote:
> Setting MV2_USE_MPIRUN_MAPPING=0 causes our library to do a
> collective over PMI to determine which ranks are local to each other.
> This is as opposed to an optimization where mpirun_rsh would attempt to
> tell the library the mapping directly.
>
> Can you tell us a little more about the other errors that you were
> facing?  Was there some sort of backtrace or error stack that you can
> share?  There might be some unexpected interaction taking place.

Here's an example (edited slightly):

> mpokorny@cbe-node-12:~/tmp/mpitest$ MV2_USE_RDMA_CM=1 MV2_ENABLE_AFFINITY=0 mpirun_rsh -export -config configfile -hostfile hostfile
> [cbe-node-09:mpi_rank_3][error_sighandler] Caught error: Bus error (signal 7)
> [cbe-node-09:mpi_rank_3][print_backtrace]   0: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f5c12a76cbe]
> [cbe-node-09:mpi_rank_3][print_backtrace]   1: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(error_sighandler+0x59) [0x7f5c12a76dc9]
> [cbe-node-09:mpi_rank_3][print_backtrace]   2: /lib64/libc.so.6() [0x3558032920]
> [cbe-node-09:mpi_rank_3][print_backtrace]   3: /lib64/libc.so.6() [0x3558083716]
> [cbe-node-09:mpi_rank_3][print_backtrace]   4: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_Mmap+0x27c) [0x7f5c128068fc]
> [cbe-node-09:mpi_rank_3][print_backtrace]   5: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SMP_init+0x1376) [0x7f5c12a321f6]
> [cbe-node-09:mpi_rank_3][print_backtrace]   6: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3_Init+0x2dd) [0x7f5c12a29a0d]
> [cbe-node-09:mpi_rank_3][print_backtrace]   7: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPID_Init+0x1ba) [0x7f5c12a1e8ba]
> [cbe-node-09:mpi_rank_3][print_backtrace]   8: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIR_Init_thread+0x2a4) [0x7f5c1299b984]
> [cbe-node-09:mpi_rank_3][print_backtrace]   9: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(PMPI_Init_thread+0x74) [0x7f5c1299bab4]
> [cbe-node-09:mpi_rank_3][print_backtrace]  10: /users/mpokorny/tmp/mpitest/testB() [0x4006b5]
> [cbe-node-09:mpi_rank_3][print_backtrace]  11: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
> [cbe-node-09:mpi_rank_3][print_backtrace]  12: /users/mpokorny/tmp/mpitest/testB() [0x4005c9]
> [cbe-node-09:mpi_rank_1][cm_qp_conn_create] ../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1818: Failed to modify QP to INIT
> : Invalid argument (22)
> [cbe-node-08:mpi_rank_0][cm_qp_conn_create] ../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1818: Failed to modify QP to INIT
> : Invalid argument (22)

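This example shows both errors: the bus error on rank 3 (cbe-node-09),
raised under MPIDI_CH3I_SHMEM_COLL_Mmap during MPI initialization, and
the "Failed to modify QP to INIT" IB error on ranks 0 and 1. I've also
seen instances in which the bus error doesn't occur, but the IB error
does.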
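
For comparing runs like the one above, here is a minimal Python sketch
that groups the non-backtrace messages by host and rank. It is not part
of MVAPICH2; the function name summarize and the regular expression are
assumptions based only on the "[host:mpi_rank_N][function] message" line
format visible in the output above.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the output above:
#   [host:mpi_rank_N][function] message
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]"
    r"\[(?P<func>\w+)\]\s*(?P<msg>.*)"
)


def summarize(log_text):
    """Return {(host, rank): [messages]}, skipping backtrace frames."""
    errors = defaultdict(list)
    for line in log_text.splitlines():
        # Drop mail quoting ("> ") and trailing whitespace, if any.
        line = line.lstrip("> ").rstrip()
        m = LINE_RE.match(line)
        if m is None or m.group("func") == "print_backtrace":
            # Continuation lines and backtrace frames are ignored.
            continue
        key = (m.group("host"), int(m.group("rank")))
        errors[key].append(m.group("msg"))
    return dict(errors)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize.py < run.log
    for (host, rank), msgs in sorted(summarize(sys.stdin.read()).items()):
        for msg in msgs:
            print(f"{host} rank {rank}: {msg}")
```

Run over several captured logs, this makes it easier to see which runs
hit only the IB error and which also hit the bus error.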

-- 
Martin

