[mvapich-discuss] Hardware problem or code bug?

Subramoni, Hari subramoni.1 at osu.edu
Tue Jul 21 14:24:22 EDT 2020


Hi, Lana.

OK, I understand. The list of MPI calls seems simple enough.

MVAPICH2 2.3.1 was released on 03/01/2019. If possible, can you try the MVAPICH2 2.3.4 GA release? We have made several bug fixes in the code since then that could potentially address this issue.

Best,
Hari.

From: Lana Deere <lana.deere at gmail.com>
Sent: Tuesday, July 21, 2020 2:16 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Hardware problem or code bug?

It's MVAPICH2 2.3.1.  The application is proprietary and the system is not one I can give access to.  The error started recently but only happens maybe once every week or ten days.  The systems are running CentOS 7.6 with the distribution's built-in InfiniBand driver.  I will see whether I can get Mellanox/NVIDIA to help decode the error.

The MPI APIs we use include the ones listed below, but I'm not sure which one reported the error (one way we might narrow that down is sketched after the list).
    MPI_Allgather
    MPI_Allreduce
    MPI_Barrier
    MPI_Comm_get_parent
    MPI_Comm_rank
    MPI_Comm_size
    MPI_Comm_spawn
    MPI_Finalize
    MPI_Get_processor_name
    MPI_Info_create
    MPI_Info_set
    MPI_Init
    MPI_Intercomm_merge
    MPI_Irecv
    MPI_Isend
    MPI_Recv
    MPI_Send
    MPI_Waitall
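
If it helps, here is a minimal sketch of how we could try to pinpoint the failing call: switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL handler to MPI_ERRORS_RETURN and check every return code. The CHECK_MPI macro and the sample calls below are purely illustrative, not our actual code, and a low-level transport failure like the handle_cqe message may still abort inside the library before any return code reaches us.

    #include <mpi.h>
    #include <stdio.h>

    /* Report the failing call site and the MPI error text instead of
     * letting the library abort without context. */
    #define CHECK_MPI(call)                                               \
        do {                                                              \
            int rc_ = (call);                                             \
            if (rc_ != MPI_SUCCESS) {                                     \
                char msg_[MPI_MAX_ERROR_STRING];                          \
                int len_ = 0;                                             \
                MPI_Error_string(rc_, msg_, &len_);                       \
                fprintf(stderr, "%s:%d: %s failed: %s\n",                 \
                        __FILE__, __LINE__, #call, msg_);                 \
                MPI_Abort(MPI_COMM_WORLD, rc_);                           \
            }                                                             \
        } while (0)

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* The default handler (MPI_ERRORS_ARE_FATAL) aborts before the
         * caller ever sees an error code; MPI_ERRORS_RETURN hands the
         * code back so CHECK_MPI can log where it happened. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        CHECK_MPI(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
        CHECK_MPI(MPI_Barrier(MPI_COMM_WORLD));

        MPI_Finalize();
        return 0;
    }

Built with mpicc, this would at least tell us which call sees the failure whenever the error does surface as a return code.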

.. Lana (lana.deere at gmail.com)



On Tue, Jul 21, 2020 at 2:00 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:
Hi, Lana.

We have seen this in the past when the MPI library attempts an IB operation that the HCA deems illegal. This can be caused either by a genuinely incorrect operation on MVAPICH2's part or by the remote process (the one the sender was trying to communicate with) having died or gone into a bad state.

Some further details would help us narrow down the issue:


  1.  What version of MVAPICH2 are you using?
  2.  What application are you running? What MPI operations does it perform? Would it be possible to give us a reproducer or access to the system where the error occurs?

Regarding the hexdump and the vendor error code: these are likely to provide more hints as to what sort of illegal operation was executed. However, they are proprietary and only the Mellanox/NVIDIA folks would know how to interpret them ☹.
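
For what it's worth, the generic wc.status field in your log is defined by the verbs API itself; only the vendor code needs Mellanox/NVIDIA tooling. The sketch below is a rough illustration of the kind of check the library performs when it polls the completion queue (it is not the actual MVAPICH2 handle_cqe code; cq stands for an already-created struct ibv_cq, and building it needs the libibverbs headers and -libverbs at link time). If I am reading verbs.h correctly, status 2 should correspond to IBV_WC_LOC_QP_OP_ERR, a local QP operation error.

    #include <infiniband/verbs.h>
    #include <stdio.h>

    /* Poll one completion from an existing CQ and report any error.
     * Returns 0 if nothing completed or the completion was clean,
     * -1 if the completion came back with an error status. */
    static int check_one_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking poll */
        if (n < 0) {
            fprintf(stderr, "ibv_poll_cq failed\n");
            return -1;
        }
        if (n == 0)
            return 0;                      /* nothing completed yet */

        if (wc.status != IBV_WC_SUCCESS) {
            /* wc.status is standard verbs and maps to a readable name;
             * wc.vendor_err (0x68 in your log) is HCA-specific and needs
             * vendor documentation to decode. */
            fprintf(stderr,
                    "completion error: status=%d (%s), vendor_err=0x%x, wr_id=0x%llx\n",
                    (int)wc.status, ibv_wc_status_str(wc.status),
                    wc.vendor_err, (unsigned long long)wc.wr_id);
            return -1;
        }
        return 0;
    }
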

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Lana Deere
Sent: Tuesday, July 21, 2020 1:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Hardware problem or code bug?

The following error occurs rarely when running an MPI program.  I am not able to interpret what the error is saying, so any help decoding it would be appreciated.  In particular, does anyone know whether this is a sign of an InfiniBand hardware error or of an intermittent software bug?

mlx5: compute-0-8.local: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 12006802 00004016 1d35add3
[compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6, wc_opcode=0
[compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2, wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[compute-0-8.local:mpi_rank_6][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 2, vendor code=0x68, dest rank=6

.. Lana (lana.deere at gmail.com)