[mvapich-discuss] Hardware problem or code bug?

Lana Deere lana.deere at gmail.com
Tue Jul 21 14:30:08 EDT 2020


I will try to put the 2.3.4 GA version in place and see if that makes the
problem go away.
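
To confirm at run time which library the rebuilt binaries actually pick up,
something like this minimal sketch (not our real code) can print the version
string from rank 0 via MPI_Get_library_version():

    /* Minimal sketch: rank 0 reports the MPI library version string. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_library_version(version, &len);
        if (rank == 0)
            printf("MPI library: %s\n", version);
        MPI_Finalize();
        return 0;
    }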

.. Lana (lana.deere at gmail.com)




On Tue, Jul 21, 2020 at 2:24 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:

> Hi, Lana.
>
>
>
> OK. I understand. The list of MPI calls seems simple enough.
>
>
>
> MVAPICH2 2.3.1 was released on 03/01/2019. If possible, can you try the
> MVAPICH2 2.3.4 GA release? We have made several bug fixes since then that
> could potentially address this issue.
>
>
>
> Best,
>
> Hari.
>
>
>
> *From:* Lana Deere <lana.deere at gmail.com>
> *Sent:* Tuesday, July 21, 2020 2:16 PM
> *To:* Subramoni, Hari <subramoni.1 at osu.edu>
> *Cc:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* Re: [mvapich-discuss] Hardware problem or code bug?
>
>
>
> It's MVAPICH2 2.3.1.  The application is proprietary and the system is not
> one I can give access to.  The error started recently but only happens
> maybe once every 7 to 10 days.  The systems are running CentOS 7.6 with the
> distribution's built-in InfiniBand driver.  I will see if I can figure out
> any way to get some decoding of the error from Mellanox/NVIDIA.
>
>
>
> As for the MPI APIs we use, they include the ones below, though I'm not
> sure which one reported the error; a rough sketch of how they fit together
> follows the list.
>
>     MPI_Allgather
>     MPI_Allreduce
>     MPI_Barrier
>     MPI_Comm_get_parent
>     MPI_Comm_rank
>     MPI_Comm_size
>     MPI_Comm_spawn
>     MPI_Finalize
>     MPI_Get_processor_name
>     MPI_Info_create
>     MPI_Info_set
>     MPI_Init
>     MPI_Intercomm_merge
>     MPI_Irecv
>     MPI_Isend
>     MPI_Recv
>     MPI_Send
>     MPI_Waitall
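>
> Roughly, the calls fit together as in the sketch below (simplified, not our
> actual code; the worker count and the info key/value are made up for
> illustration):
>
> /* Simplified sketch of our usage pattern, not the real application. */
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Comm parent, inter, merged;
>     char host[MPI_MAX_PROCESSOR_NAME];
>     int rank, size, len;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parent);
>
>     if (parent == MPI_COMM_NULL) {
>         /* Original launch: spawn workers and merge into one intracomm. */
>         MPI_Info info;
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "host", "somehost");   /* illustrative only */
>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, info, 0,
>                        MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
>         MPI_Intercomm_merge(inter, 0, &merged);
>         MPI_Info_free(&info);
>     } else {
>         /* Spawned worker: merge with the parent side. */
>         MPI_Intercomm_merge(parent, 1, &merged);
>     }
>
>     MPI_Comm_rank(merged, &rank);
>     MPI_Comm_size(merged, &size);
>     MPI_Get_processor_name(host, &len);
>
>     /* Non-blocking point-to-point around a ring, then collectives. */
>     int out = rank, in = -1, sum = 0;
>     MPI_Request reqs[2];
>     MPI_Irecv(&in, 1, MPI_INT, (rank + size - 1) % size, 0, merged, &reqs[0]);
>     MPI_Isend(&out, 1, MPI_INT, (rank + 1) % size, 0, merged, &reqs[1]);
>     MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
>     MPI_Allreduce(&out, &sum, 1, MPI_INT, MPI_SUM, merged);
>     MPI_Barrier(merged);
>
>     MPI_Finalize();
>     return 0;
> }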
>
>
> .. Lana (lana.deere at gmail.com)
>
>
>
>
>
> On Tue, Jul 21, 2020 at 2:00 PM Subramoni, Hari <subramoni.1 at osu.edu>
> wrote:
>
> Hi, Lana.
>
>
>
> We have seen this in the past when the MPI library tries to perform an IB
> operation that the HCA deems illegal. This can be caused either by a
> genuinely wrong operation issued by MVAPICH2 or by the remote process (with
> which the sender was trying to communicate) dying or going into some bad
> state.
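>
> For context, at the verbs level this shows up when the completion queue is
> polled and a work completion comes back with a non-success status. Here is
> a simplified sketch (not MVAPICH2's actual handle_cqe code, just an
> illustration of where the fields in your log come from):
>
> /* Simplified sketch: detect a completion error while draining a CQ.
>  * Assumes 'cq' was set up elsewhere; error handling is omitted. */
> #include <stdio.h>
> #include <infiniband/verbs.h>
>
> static void drain_cq(struct ibv_cq *cq)
> {
>     struct ibv_wc wc;
>
>     while (ibv_poll_cq(cq, 1, &wc) > 0) {
>         if (wc.status != IBV_WC_SUCCESS) {
>             /* This is roughly the point at which MVAPICH2 reports
>              * "Got completion with error". */
>             fprintf(stderr,
>                     "completion error: status=%d (%s), opcode=%d, "
>                     "wr_id=0x%llx, vendor_err=0x%x\n",
>                     (int)wc.status, ibv_wc_status_str(wc.status),
>                     (int)wc.opcode, (unsigned long long)wc.wr_id,
>                     (unsigned)wc.vendor_err);
>         }
>     }
> }
>
> The wc.status, wc.opcode, wc.wr_id, and wc.vendor_err fields are what appear
> in your log as wc.status, wc.opcode, wc.wr_id, and the vendor code.
> ibv_wc_status_str() will at least give you the symbolic name for the status
> value; the vendor error code itself is HCA-specific.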
>
>
>
> Some further details would help us narrow down the issue:
>
>
>
>    1. What version of MVAPICH2 were you using?
>    2. What application were you using? What MPI operations does it
>    perform? Is it possible to give us a reproducer/access to the system where
>    the error occurs?
>
>
>
> Regarding the hexdump and vendor_error code – these are likely to provide
> more hints as to what sort of illegal operation was executed. However,
> these are proprietary and only Mellanox/NVIDIA folks would know how to
> interpret them ☹.
>
>
>
> Best,
>
> Hari.
>
>
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu <
> mvapich-discuss-bounces at mailman.cse.ohio-state.edu> *On Behalf Of *Lana
> Deere
> *Sent:* Tuesday, July 21, 2020 1:54 PM
> *To:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] Hardware problem or code bug?
>
>
>
> The following error occurs rarely when running an MPI program.  I am not
> able to interpret what the error is saying, so any help decoding it would
> be appreciated.  In particular, does anyone know whether this is a sign of
> an InfiniBand hardware error or rather an intermittent software bug?
>
>
>
> mlx5: compute-0-8.local: got completion with error:
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000005 00000000 00000000 00000000
> 00000000 12006802 00004016 1d35add3
> [compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6,
> wc_opcode=0
> [compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2,
> wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 =
> MPIDI_CH3_PKT_EAGER_SEND
> [compute-0-8.local:mpi_rank_6][handle_cqe]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
> completion with error 2, vendor code=0x68, dest rank=6
>
>
> .. Lana (lana.deere at gmail.com)
>
>