[mvapich-discuss] infrequent error in ibv_channel_manager

Martin Pokorny mpokorny at nrao.edu
Fri Mar 10 11:06:52 EST 2017


We've recently been seeing errors like the following at a small yet 
noticeable rate:

> [cbe-node-24:mpi_rank_9][async_thread] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152: Got FATAL event 3
> : Invalid argument (22)
> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to 9, wc_opcode=0
> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10, wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 = MPIDI_CH3_PKT_RPUT_FINISH
> [cbe-node-28:mpi_rank_24][handle_cqe] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 10, vendor code=0x88, dest rank=9
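
For reference, the numeric codes in the log above can be mapped to 
libibverbs names. The following is only a minimal sketch, assuming the 
numbers MVAPICH2 prints are the raw enum ibv_event_type and enum 
ibv_wc_status values from <infiniband/verbs.h>, and that "Got completion 
with error N" reports the same wc.status. The tables are partial, and the 
helper name decode_line is hypothetical.

```python
import re

# Partial tables. ASSUMPTION: values mirror enum ibv_event_type and
# enum ibv_wc_status as declared in <infiniband/verbs.h>.
ASYNC_EVENT = {
    0: "IBV_EVENT_CQ_ERR",
    1: "IBV_EVENT_QP_FATAL",
    2: "IBV_EVENT_QP_REQ_ERR",
    3: "IBV_EVENT_QP_ACCESS_ERR",
    4: "IBV_EVENT_COMM_EST",
}
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# (pattern, label, lookup table) for each message shape seen in the log.
_PATTERNS = [
    (re.compile(r"Got FATAL event (\d+)"), "async event", ASYNC_EVENT),
    (re.compile(r"wc\.status=(\d+)"), "wc.status", WC_STATUS),
    (re.compile(r"Got completion with error (\d+)"), "wc.status", WC_STATUS),
]


def decode_line(line):
    """Return a readable string for every known numeric code found in line."""
    decoded = []
    for pattern, label, table in _PATTERNS:
        for match in pattern.finditer(line):
            code = int(match.group(1))
            decoded.append(f"{label} {code} = {table.get(code, 'unknown')}")
    return decoded


# Abbreviated copies of two of the log lines quoted above.
sample = [
    "[cbe-node-24:mpi_rank_9][async_thread] Got FATAL event 3",
    "[cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10",
]
for entry in sample:
    for text in decode_line(entry):
        print(text)
```

Under those assumptions, event 3 reads as IBV_EVENT_QP_ACCESS_ERR and 
status 10 as IBV_WC_REM_ACCESS_ERR, which would point toward a remote 
access violation on an RDMA operation rather than a link-level fault.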

Unfortunately, I can't share the source for the program that is 
experiencing this error, nor have I been able to come up with a simpler 
reproducer. I'm hoping that you might have some advice to help me 
diagnose the cause of the error. For example, is there some environment 
variable that might be worth looking at?

I'm using mvapich2-2.1 on a cluster with an InfiniBand network. I built 
mvapich2 as follows:
    ../configure --enable-romio --with-file-system=lustre --enable-debuginfo \
        --enable-g=dbg,log --with-limic2 --enable-rdma-cm

-- 
Martin Pokorny
Software Engineer
Jansky Very Large Array correlator backend and CASA software
National Radio Astronomy Observatory - New Mexico Operations

