[mvapich-discuss] multicast difficulties
Martin Pokorny
mpokorny at nrao.edu
Thu Jan 31 13:06:57 EST 2013
Hello everyone.
I'm having some difficulties getting multicast to work on my system. I
don't have a lot of experience with InfiniBand, so I've probably got
something misconfigured, but I only see the problem whenever I try to
run any program linked to the mvapich2 libraries, so I thought I'd ask
here first. I'm using mvapich2-1.9a2 on a cluster running RHEL 6.3, with
Mellanox MT26428 HCAs. mvapich2 was built with the following configure
options:
> ./configure --prefix=/opt/cbe-local/stow/mvapich2-1.9a2 --enable-romio --with-file-system=lustre --enable-shared --enable-sharedlibs=gcc --with-rdma-cm --enable-fast=O3 --with-limic2 --enable-g=dbg,log
When I run a program with MV2_USE_MCAST=1 and MV2_USE_RDMA_CM=1 it fails
with output like the following for all nodes:
> Failed to modify QP to INIT
> Error in creating UD QP
> [cbe-node-11:mpi_rank_2][mv2_mcast_prepare_ud_ctx] MCAST UD QP creation failed[cbe-node-11:mpi_rank_2][MPIDI_CH3_Init] Error in create multicast UD context for multicast
When I run the same program with only MV2_USE_MCAST=1, a segfault
occurs, with the following backtrace (obtained using
MV2_DEBUG_SHOW_BACKTRACE):
> [cbe-node-09:mpi_rank_0][print_backtrace] 0: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(print_backtrace+0x1e) [0x7f99d1089f5e]
> [cbe-node-09:mpi_rank_0][print_backtrace] 1: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(error_sighandler+0x59) [0x7f99d108a069]
> [cbe-node-09:mpi_rank_0][print_backtrace] 2: /lib64/libpthread.so.0() [0x355840f500]
> [cbe-node-09:mpi_rank_0][print_backtrace] 3: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_MRAILI_Eager_send+0x2de) [0x7f99d1050e1e]
> [cbe-node-09:mpi_rank_0][print_backtrace] 4: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3_iStartMsg+0x27c) [0x7f99d1039a1c]
> [cbe-node-09:mpi_rank_0][print_backtrace] 5: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(+0x13c55d) [0x7f99d108655d]
> [cbe-node-09:mpi_rank_0][print_backtrace] 6: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(mv2_process_mcast_msg+0xfc) [0x7f99d108680c]
> [cbe-node-09:mpi_rank_0][print_backtrace] 7: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_MRAILI_Cq_poll+0x11fc) [0x7f99d10661cc]
> [cbe-node-09:mpi_rank_0][print_backtrace] 8: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_read_progress+0x18f) [0x7f99d103d07f]
> [cbe-node-09:mpi_rank_0][print_backtrace] 9: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_Progress+0x13a) [0x7f99d103c55a]
> [cbe-node-09:mpi_rank_0][print_backtrace] 10: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(mv2_mcast_progress_comm_ready+0x69) [0x7f99d1086db9]
> [cbe-node-09:mpi_rank_0][print_backtrace] 11: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(create_2level_comm+0x1109) [0x7f99d1154259]
> [cbe-node-09:mpi_rank_0][print_backtrace] 12: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIR_Init_thread+0x4d8) [0x7f99d1177f18]
> [cbe-node-09:mpi_rank_0][print_backtrace] 13: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPI_Init+0xdb) [0x7f99d117727b]
> [cbe-node-09:mpi_rank_0][print_backtrace] 14: ./bdfsim1() [0x403cee]
> [cbe-node-09:mpi_rank_0][print_backtrace] 15: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
> [cbe-node-09:mpi_rank_0][print_backtrace] 16: ./bdfsim1() [0x401949]
The program runs correctly with only MV2_USE_RDMA_CM=1.
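
For reference, the three configurations described above can be laid out side by side. The following is only a sketch that prints candidate launch lines without running anything; the launcher (mpirun_rsh), the process count, and the hostfile name "hosts" are assumptions for illustration and are not taken from the actual job setup.

```python
import shlex

# The three MV2_* combinations described in the post, with the reported outcome.
CASES = [
    ({"MV2_USE_MCAST": "1", "MV2_USE_RDMA_CM": "1"}, "UD QP creation fails"),
    ({"MV2_USE_MCAST": "1"}, "segfault during MPI_Init"),
    ({"MV2_USE_RDMA_CM": "1"}, "runs correctly"),
]


def launch_command(env, program="./bdfsim1", nprocs=4, hostfile="hosts"):
    """Build an mpirun_rsh command line as a string.

    The launcher, nprocs, and hostfile are hypothetical placeholders;
    mpirun_rsh accepts VAR=value settings before the executable name.
    """
    settings = [f"{key}={value}" for key, value in sorted(env.items())]
    argv = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile,
            *settings, program]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Dry run: print each command with the outcome reported in the post.
    for env, outcome in CASES:
        print(f"{launch_command(env)}   # reported: {outcome}")
```

Printing the commands rather than executing them keeps the sketch independent of any particular cluster; the point is only to make the difference between the three runs explicit.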
--
Martin Pokorny
Software Engineer - Karl G. Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations