[mvapich-discuss] infrequent error in ibv_channel_manager
Martin Pokorny
mpokorny at nrao.edu
Fri Mar 10 11:06:52 EST 2017
We've recently been seeing errors like the following at a small yet
noticeable rate:
> [cbe-node-24:mpi_rank_9][async_thread] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152: Got FATAL event 3
> : Invalid argument (22)
> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to 9, wc_opcode=0
> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10, wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 = MPIDI_CH3_PKT_RPUT_FINISH
> [cbe-node-28:mpi_rank_24][handle_cqe] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 10, vendor code=0x88, dest rank=9
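To tally how often these completions occur across job logs, and between which ranks, the handle_cqe lines can be picked apart mechanically. The following is only a minimal sketch: the regular expression is fitted to the single "Got completion with error" line quoted above, the function name parse_cqe_error is made up for illustration, and the small status table assumes the numeric values follow the enum order in <infiniband/verbs.h> (worth confirming with ibv_wc_status_str() on the system in question).

```python
import re

# Assumption: numeric values follow the ibv_wc_status enum order in
# <infiniband/verbs.h>. Only a few entries are listed here; confirm
# them with ibv_wc_status_str() on the target system.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
}

# Pattern fitted to the log line quoted in this message; other
# MVAPICH2 versions may format the message differently.
CQE_ERROR = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]\[handle_cqe\].*"
    r"Got completion with error (?P<status>\d+), "
    r"vendor code=(?P<vendor>0x[0-9a-fA-F]+), dest rank=(?P<dest>\d+)"
)


def parse_cqe_error(line):
    """Return the fields of a 'Got completion with error' line, or None."""
    m = CQE_ERROR.search(line)
    if m is None:
        return None
    status = int(m.group("status"))
    return {
        "host": m.group("host"),
        "rank": int(m.group("rank")),
        "status": status,
        "status_name": WC_STATUS.get(status, "unknown"),
        "vendor_code": int(m.group("vendor"), 16),
        "dest_rank": int(m.group("dest")),
    }


line = (
    "[cbe-node-28:mpi_rank_24][handle_cqe] "
    "../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: "
    "[] Got completion with error 10, vendor code=0x88, dest rank=9"
)
print(parse_cqe_error(line))
```

Running the same function over every line of a job's stderr and counting the (rank, dest_rank) pairs would show whether the failures cluster on particular nodes or rank pairs.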
Unfortunately, I can't send the source for the program that is
experiencing this error, nor am I able to come up with a simpler
reproducer. I'm hoping you might have some advice to help me diagnose
the cause of the error. For example, is there an environment variable
that might be worth looking at?
I'm using mvapich2-2.1 on a cluster with an InfiniBand (IB) network. I
built mvapich2 as follows:
../configure --enable-romio --with-file-system=lustre --enable-debuginfo \
    --enable-g=dbg,log --with-limic2 --enable-rdma-cm
--
Martin Pokorny
Software Engineer
Jansky Very Large Array correlator backend and CASA software
National Radio Astronomy Observatory - New Mexico Operations