[mvapich-discuss] multicast difficulties

Martin Pokorny mpokorny at nrao.edu
Thu Jan 31 13:06:57 EST 2013


Hello everyone.

I'm having some difficulties getting multicast to work on my system. I 
don't have a lot of experience with InfiniBand, so I've probably 
misconfigured something, but since I only see the problem when I run 
programs linked against the mvapich2 libraries, I thought I'd ask here 
first. I'm using mvapich2-1.9a2 on a cluster running RHEL 6.3, with 
Mellanox MT26428 HCAs. mvapich2 was built with the following configure 
options:

> ./configure --prefix=/opt/cbe-local/stow/mvapich2-1.9a2 --enable-romio --with-file-system=lustre --enable-shared --enable-sharedlibs=gcc --with-rdma-cm --enable-fast=O3 --with-limic2 --enable-g=dbg,log

When I run a program with MV2_USE_MCAST=1 and MV2_USE_RDMA_CM=1, it 
fails on all nodes with output like the following:

> Failed to modify QP to INIT
> Error in creating UD QP
> [cbe-node-11:mpi_rank_2][mv2_mcast_prepare_ud_ctx] MCAST UD QP creation failed
> [cbe-node-11:mpi_rank_2][MPIDI_CH3_Init] Error in create multicast UD context for multicast
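
For reference, I'm passing the environment variables on the launch 
command line, roughly like this (a sketch only; the mpirun_rsh 
launcher, process count, and hostfile shown here are placeholders, not 
my exact command):

> mpirun_rsh -np 8 -hostfile hosts MV2_USE_MCAST=1 MV2_USE_RDMA_CM=1 ./bdfsim1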

When I run the same program with only MV2_USE_MCAST=1, it segfaults 
with the following backtrace (obtained using 
MV2_DEBUG_SHOW_BACKTRACE):

> [cbe-node-09:mpi_rank_0][print_backtrace]   0: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(print_backtrace+0x1e) [0x7f99d1089f5e]
> [cbe-node-09:mpi_rank_0][print_backtrace]   1: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(error_sighandler+0x59) [0x7f99d108a069]
> [cbe-node-09:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0() [0x355840f500]
> [cbe-node-09:mpi_rank_0][print_backtrace]   3: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_MRAILI_Eager_send+0x2de) [0x7f99d1050e1e]
> [cbe-node-09:mpi_rank_0][print_backtrace]   4: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3_iStartMsg+0x27c) [0x7f99d1039a1c]
> [cbe-node-09:mpi_rank_0][print_backtrace]   5: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(+0x13c55d) [0x7f99d108655d]
> [cbe-node-09:mpi_rank_0][print_backtrace]   6: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(mv2_process_mcast_msg+0xfc) [0x7f99d108680c]
> [cbe-node-09:mpi_rank_0][print_backtrace]   7: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_MRAILI_Cq_poll+0x11fc) [0x7f99d10661cc]
> [cbe-node-09:mpi_rank_0][print_backtrace]   8: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_read_progress+0x18f) [0x7f99d103d07f]
> [cbe-node-09:mpi_rank_0][print_backtrace]   9: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIDI_CH3I_Progress+0x13a) [0x7f99d103c55a]
> [cbe-node-09:mpi_rank_0][print_backtrace]  10: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(mv2_mcast_progress_comm_ready+0x69) [0x7f99d1086db9]
> [cbe-node-09:mpi_rank_0][print_backtrace]  11: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(create_2level_comm+0x1109) [0x7f99d1154259]
> [cbe-node-09:mpi_rank_0][print_backtrace]  12: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPIR_Init_thread+0x4d8) [0x7f99d1177f18]
> [cbe-node-09:mpi_rank_0][print_backtrace]  13: /opt/cbe-local/stow/mvapich2-1.9a2/lib/libmpich.so.8(MPI_Init+0xdb) [0x7f99d117727b]
> [cbe-node-09:mpi_rank_0][print_backtrace]  14: ./bdfsim1() [0x403cee]
> [cbe-node-09:mpi_rank_0][print_backtrace]  15: /lib64/libc.so.6(__libc_start_main+0xfd) [0x355801ecdd]
> [cbe-node-09:mpi_rank_0][print_backtrace]  16: ./bdfsim1() [0x401949]
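
Every interesting frame above is under MPI_Init, so the application 
code shouldn't matter much; a minimal init/finalize program along the 
lines of the sketch below (not the actual bdfsim1 source) ought to hit 
the same path when launched with MV2_USE_MCAST=1 across the same nodes:

> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
>     /* The segfault is inside MPI_Init itself (create_2level_comm ->
>        mv2_mcast_progress_comm_ready in the backtrace above), so this
>        minimal program should be enough to trigger it when run with
>        MV2_USE_MCAST=1. */
>     MPI_Init(&argc, &argv);
>
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf("rank %d initialized\n", rank);
>
>     MPI_Finalize();
>     return 0;
> }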

The program runs correctly with only MV2_USE_RDMA_CM=1.

-- 
Martin Pokorny
Software Engineer - Karl G. Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

