[mvapich-discuss] Hardware problem or code bug?

Subramoni, Hari subramoni.1 at osu.edu
Tue Jul 21 13:59:56 EDT 2020


Hi, Lana.

We have seen this in the past when the MPI library tries to perform an IB operation that the HCA deems illegal. This can be caused either by a genuinely wrong operation by MVAPICH2 or by the remote process (with which the sender was trying to communicate) having died or entered a bad state.

Some further details would help us narrow down the issue:


  1.  What version of MVAPICH2 were you using?
  2.  What application were you using? What MPI operations does it perform? Is it possible to give us a reproducer/access to the system where the error occurs?

Regarding the hexdump and vendor_error code – these are likely to provide more hints as to what sort of illegal operation was executed. However, these are proprietary and only Mellanox/NVIDIA folks would know how to interpret them ☹.
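
The generic completion status in the quoted log is a different matter from the vendor-specific fields, since it is part of the standard verbs API. As a rough sketch, the Python snippet below maps the numeric status to the names of `enum ibv_wc_status` from libibverbs' `<infiniband/verbs.h>`. The helper name `decode_completion_error` and the regular expression are hypothetical, written only for this illustration, and the sketch assumes that the number after "Got completion with error" is the same value printed as `wc.status` on the preceding log line. The vendor code is extracted but deliberately left uninterpreted.

```python
import re

# Names of `enum ibv_wc_status` in libibverbs, indexed by numeric value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",              # 0
    "IBV_WC_LOC_LEN_ERR",          # 1
    "IBV_WC_LOC_QP_OP_ERR",        # 2
    "IBV_WC_LOC_EEC_OP_ERR",       # 3
    "IBV_WC_LOC_PROT_ERR",         # 4
    "IBV_WC_WR_FLUSH_ERR",         # 5
    "IBV_WC_MW_BIND_ERR",          # 6
    "IBV_WC_BAD_RESP_ERR",         # 7
    "IBV_WC_LOC_ACCESS_ERR",       # 8
    "IBV_WC_REM_INV_REQ_ERR",      # 9
    "IBV_WC_REM_ACCESS_ERR",       # 10
    "IBV_WC_REM_OP_ERR",           # 11
    "IBV_WC_RETRY_EXC_ERR",        # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",    # 13
    "IBV_WC_LOC_RDD_VIOL_ERR",     # 14
    "IBV_WC_REM_INV_RD_REQ_ERR",   # 15
    "IBV_WC_REM_ABORT_ERR",        # 16
    "IBV_WC_INV_EECN_ERR",         # 17
    "IBV_WC_INV_EEC_STATE_ERR",    # 18
    "IBV_WC_FATAL_ERR",            # 19
    "IBV_WC_RESP_TIMEOUT_ERR",     # 20
    "IBV_WC_GENERAL_ERR",          # 21
]

# Hypothetical pattern, modelled only on the log line quoted in this thread.
LINE_RE = re.compile(
    r"Got completion with error (\d+), "
    r"vendor code=(0x[0-9a-fA-F]+), dest rank=(\d+)"
)


def decode_completion_error(line):
    """Return the fields of a 'Got completion with error' line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    # Fall back to a placeholder for values outside the table.
    name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "unknown"
    return {
        "status": status,
        "status_name": name,
        # Vendor code is kept as a number only; its meaning is proprietary.
        "vendor_code": int(match.group(2), 16),
        "dest_rank": int(match.group(3)),
    }


if __name__ == "__main__":
    line = "[] Got completion with error 2, vendor code=0x68, dest rank=6"
    print(decode_completion_error(line)["status_name"])
```

For the line in this report the status name comes out as `IBV_WC_LOC_QP_OP_ERR`, a local QP operation error. This names the error category only; it does not by itself tell us whether the cause was a bad work request or a remote process that had gone away.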

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Lana Deere
Sent: Tuesday, July 21, 2020 1:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Hardware problem or code bug?

The following error is occurring rarely when running an MPI program.  I am not able to interpret what the error is saying, so any help decoding this would be appreciated.  In particular, does anyone know whether this is a sign of an infiniband hardware error or rather an intermittent software bug?

mlx5: compute-0-8.local: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 12006802 00004016 1d35add3
[compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6, wc_opcode=0
[compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2, wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[compute-0-8.local:mpi_rank_6][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 2, vendor code=0x68, dest rank=6

.. Lana (lana.deere at gmail.com)
