[mvapich-discuss] Hardware problem or code bug?

Lana Deere lana.deere at gmail.com
Tue Jul 21 13:53:56 EDT 2020


The following error occurs intermittently when running an MPI program.  I am
not able to interpret it, so any help decoding it would be appreciated.  In
particular, does anyone know whether it indicates an InfiniBand hardware
error or an intermittent software bug?

mlx5: compute-0-8.local: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 12006802 00004016 1d35add3
[compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6, wc_opcode=0
[compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2, wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[compute-0-8.local:mpi_rank_6][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 2, vendor code=0x68, dest rank=6
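
As a starting point for decoding, the numeric wc.status in the last line can
be mapped to its symbolic name. The sketch below is only an illustration: it
assumes the status numbering of enum ibv_wc_status in libibverbs'
<infiniband/verbs.h>, and the function name decode_completion_error is
hypothetical (it is not part of MVAPICH2 or libibverbs). The vendor code is
device-specific and is only extracted here, not interpreted.

```python
import re

# Assumed to match enum ibv_wc_status in <infiniband/verbs.h>;
# the list index is the numeric status value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
]

# Matches the tail of the handle_cqe message shown in the log above.
PATTERN = re.compile(
    r"completion with error (\d+), vendor code=(0x[0-9a-fA-F]+), dest rank=(\d+)"
)


def decode_completion_error(line):
    """Hypothetical helper: pull status, vendor code and rank out of one log line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "UNKNOWN"
    return {
        "status": status,
        "status_name": name,
        "vendor_err": int(match.group(2), 16),  # device-specific, not decoded
        "dest_rank": int(match.group(3)),
    }


line = "Got completion with error 2, vendor code=0x68, dest rank=6"
print(decode_completion_error(line))
```

Under that assumption, status 2 corresponds to IBV_WC_LOC_QP_OP_ERR. The name
alone does not settle whether the cause is hardware or software.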

.. Lana (lana.deere at gmail.com)