[mvapich-discuss] Occasional failure initializing

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Jul 28 11:33:39 EDT 2015


Thanks for sending this out.  We'll try to reproduce this issue and see how
best to resolve it.  In the meantime, it should be safe for you to continue
using MV2_USE_MPIRUN_MAPPING=0 as a workaround.
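For example, with the launch command from your example below, the workaround
just adds the variable to the environment settings already on the command line
(the config and hostfile names are the ones from your example):

    MV2_USE_MPIRUN_MAPPING=0 MV2_USE_RDMA_CM=1 MV2_ENABLE_AFFINITY=0 \
        mpirun_rsh -export -config configfile -hostfile hostfile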

On Tue, Jul 28, 2015 at 11:30 AM Martin Pokorny <mpokorny at nrao.edu> wrote:

> On 07/28/2015 09:18 AM, Jonathan Perkins wrote:
> > The MV2_USE_MPIRUN_MAPPING=0 variable causes our library to do a
> > collective over PMI to determine which ranks are local to each other.
> > This is as opposed to an optimization where mpirun_rsh would attempt to
> > tell the library directly.
> >
> > Can you tell us a little more about the other errors that you were
> > facing?  Was there some sort of backtrace or error stack that you can
> > share?  There might be some unexpected interaction taking place.
>
> Here's an example (edited slightly):
>
> > mpokorny at cbe-node-12:~/tmp/mpitest$ MV2_USE_RDMA_CM=1 MV2_ENABLE_AFFINITY=0 mpirun_rsh -export -config configfile -hostfile hostfile
> > [cbe-node-09:mpi_rank_3][error_sighandler] Caught error: Bus error (signal 7)
> > [cbe-node-09:mpi_rank_3][print_backtrace]   0: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f5c12a76cbe]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   1: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(error_sighandler+0x59) [0x7f5c12a76dc9]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   2: /lib64/libc.so.6() [0x3558032920]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   3: /lib64/libc.so.6() [0x3558083716]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   4: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_Mmap+0x27c) [0x7f5c128068fc]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   5: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3I_SMP_init+0x1376) [0x7f5c12a321f6]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   6: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIDI_CH3_Init+0x2dd) [0x7f5c12a29a0d]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   7: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPID_Init+0x1ba) [0x7f5c12a1e8ba]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   8: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(MPIR_Init_thread+0x2a4) [0x7f5c1299b984]
> > [cbe-node-09:mpi_rank_3][print_backtrace]   9: /opt/cbe-local/stow/mvapich2-2.1/lib/libmpi.so.12(PMPI_Init_thread+0x74) [0x7f5c1299bab4]
> > [cbe-node-09:mpi_rank_3][print_backtrace]  10: /users/mpokorny/tmp/mpitest/testB() [0x4006b5]
> > [cbe-node-09:mpi_rank_3][print_backtrace]  11: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
> > [cbe-node-09:mpi_rank_3][print_backtrace]  12: /users/mpokorny/tmp/mpitest/testB() [0x4005c9]
> > [cbe-node-09:mpi_rank_1][cm_qp_conn_create] ../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1818: Failed to modify QP to INIT
> > : Invalid argument (22)
> > [cbe-node-08:mpi_rank_0][cm_qp_conn_create] ../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1818: Failed to modify QP to INIT
> > : Invalid argument (22)
>
> This example shows both errors. I've also seen instances in
> which the bus error doesn't occur, but the IB error does.
>
> --
> Martin
>
