[mvapich-discuss] Hardware problem or code bug?

Lana Deere lana.deere at gmail.com
Tue Jul 21 14:15:37 EDT 2020


It's MVAPICH2 2.3.1.  The application is proprietary and the system is not
one I can give access to.  The error started recently but only happens
roughly once every week to ten days.  The systems are running CentOS 7.6
with the distribution's built-in InfiniBand driver.  I will see if I can
figure out any way to get some decoding of the error from Mellanox/NVIDIA.

As regards the MPI APIs we use, they include the ones below, but I'm not
sure which one reported the error (one way to narrow that down is sketched
after the list).
    MPI_Allgather
    MPI_Allreduce
    MPI_Barrier
    MPI_Comm_get_parent
    MPI_Comm_rank
    MPI_Comm_size
    MPI_Comm_spawn
    MPI_Finalize
    MPI_Get_processor_name
    MPI_Info_create
    MPI_Info_set
    MPI_Init
    MPI_Intercomm_merge
    MPI_Irecv
    MPI_Isend
    MPI_Recv
    MPI_Send
    MPI_Waitall
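
For reference, here is a minimal sketch of one way we might pinpoint the
failing call: switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL
handler to MPI_ERRORS_RETURN and test the return code at each call site.
This assumes a plain C MPI program, and the check() helper is hypothetical;
I also can't say whether MVAPICH2 routes this particular completion-queue
failure through the error handler rather than aborting, but it costs little
to try.

    /* sketch: report which MPI call returns an error */
    #include <stdio.h>
    #include <mpi.h>

    /* hypothetical helper: print the failing call and the MPI error string */
    static void check(int rc, const char *where)
    {
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len = 0;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "%s failed: %s\n", where, msg);
            MPI_Abort(MPI_COMM_WORLD, rc);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* errors now return to the caller instead of killing the job outright */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        check(MPI_Comm_rank(MPI_COMM_WORLD, &rank), "MPI_Comm_rank");
        check(MPI_Barrier(MPI_COMM_WORLD), "MPI_Barrier");

        MPI_Finalize();
        return 0;
    }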

.. Lana (lana.deere at gmail.com)




On Tue, Jul 21, 2020 at 2:00 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:

> Hi, Lana.
>
> We have seen this in the past when the MPI library tries to perform an IB
> operation that the HCA deems illegal. This can be caused either by a
> genuinely wrong operation by MVAPICH2 or because the remote process (with
> which the sender was trying to communicate) died or went into some bad
> state.
>
> Some further details would help us narrow down the issue:
>
>    1. What version of MVAPICH2 were you using?
>    2. What application were you using? What MPI operations does it
>    perform? Is it possible to give us a reproducer/access to the system where
>    the error occurs?
>
> Regarding the hexdump and vendor_error code – these are likely to provide
> more hints as to what sort of illegal operation was executed. However,
> these are proprietary and only Mellanox/NVIDIA folks would know how to
> interpret them ☹.
>
> Best,
>
> Hari.
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu <
> mvapich-discuss-bounces at mailman.cse.ohio-state.edu> *On Behalf Of *Lana
> Deere
> *Sent:* Tuesday, July 21, 2020 1:54 PM
> *To:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] Hardware problem or code bug?
>
> The following error occurs rarely when running an MPI program.  I am not
> able to interpret what it is saying, so any help decoding it would be
> appreciated.  In particular, does anyone know whether this is a sign of an
> InfiniBand hardware error or rather an intermittent software bug?
>
> mlx5: compute-0-8.local: got completion with error:
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000005 00000000 00000000 00000000
> 00000000 12006802 00004016 1d35add3
> [compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6,
> wc_opcode=0
> [compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2,
> wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 =
> MPIDI_CH3_PKT_EAGER_SEND
> [compute-0-8.local:mpi_rank_6][handle_cqe]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
> completion with error 2, vendor code=0x68, dest rank=6
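
(Side note: the generic half of that status can be decoded with libibverbs
itself; it is the vendor code 0x68 and the raw hexdump that are
Mellanox/NVIDIA-specific.  A minimal sketch, assuming wc.status above is the
raw ibv_wc status value, saved as a hypothetical decode_wc.c and built with
something like "gcc decode_wc.c -libverbs":)

    /* sketch: translate the generic completion status; the vendor code
     * and the raw CQE hexdump remain HCA-specific */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        enum ibv_wc_status status = 2;  /* wc.status=2 from the log above */

        /* ibv_wc_status_str() returns the standard name for a status code */
        printf("wc.status=%d -> %s\n", (int) status, ibv_wc_status_str(status));
        return 0;
    }

In a stock libibverbs, status 2 should come back as IBV_WC_LOC_QP_OP_ERR.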
>
> .. Lana (lana.deere at gmail.com)