[mvapich-discuss] Hardware problem or code bug?
Lana Deere
lana.deere at gmail.com
Tue Jul 21 14:30:08 EDT 2020
I will try to put the 2.3.4 GA version in place and see if that makes the
problem go away.
.. Lana (lana.deere at gmail.com)
On Tue, Jul 21, 2020 at 2:24 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:
> Hi, Lana.
>
>
>
> OK. I understand. The list of MPI calls seems simple enough.
>
>
>
> MVAPICH2 2.3.1 was released on 03/01/2019. If possible, can you try the
> MVAPICH2 2.3.4 GA release? We have made several bug fixes in the code since
> then that could potentially address this issue.
>
>
>
> Best,
>
> Hari.
>
>
>
> *From:* Lana Deere <lana.deere at gmail.com>
> *Sent:* Tuesday, July 21, 2020 2:16 PM
> *To:* Subramoni, Hari <subramoni.1 at osu.edu>
> *Cc:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* Re: [mvapich-discuss] Hardware problem or code bug?
>
>
>
> It's MVAPICH2 2.3.1. The application is proprietary and the system is not
> one I can give access to. The error started recently but is only happening
> maybe once every week or 10 days. The systems are running CentOS 7.6 with the
> distribution's built-in InfiniBand driver. I will see if I can figure out
> any way to get some decoding of the error from Mellanox/NVIDIA.
>
>
>
> As regards the MPI APIs we use, they include these, but I'm not sure which
> one reported the error.
>
> MPI_Allgather
> MPI_Allreduce
> MPI_Barrier
> MPI_Comm_get_parent
> MPI_Comm_rank
> MPI_Comm_size
> MPI_Comm_spawn
> MPI_Finalize
> MPI_Get_processor_name
> MPI_Info_create
> MPI_Info_set
> MPI_Init
> MPI_Intercomm_merge
> MPI_Irecv
> MPI_Isend
> MPI_Recv
> MPI_Send
> MPI_Waitall
>
>
> .. Lana (lana.deere at gmail.com)
>
>
>
>
>
> On Tue, Jul 21, 2020 at 2:00 PM Subramoni, Hari <subramoni.1 at osu.edu>
> wrote:
>
> Hi, Lana.
>
>
>
> We have seen this in the past when the MPI library tries to do some IB
> operation that the HCA deems illegal. This can be caused either by a
> genuinely wrong operation by MVAPICH2 or because the remote process (with
> which the sender was trying to communicate) died or went into some bad
> state.
>
>
>
> Some further details would help us narrow down the issue:
>
>
>
> 1. What version of MVAPICH2 were you using?
> 2. What application were you using? What MPI operations does it
> perform? Is it possible to give us a reproducer/access to the system where
> the error occurs?
>
>
>
> Regarding the hexdump and vendor_error code – these are likely to provide
> more hints as to what sort of illegal operation was executed. However,
> these are proprietary and only Mellanox/NVIDIA folks would know how to
> interpret them ☹.
>
>
>
> Best,
>
> Hari.
>
>
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu <
> mvapich-discuss-bounces at mailman.cse.ohio-state.edu> *On Behalf Of *Lana
> Deere
> *Sent:* Tuesday, July 21, 2020 1:54 PM
> *To:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] Hardware problem or code bug?
>
>
>
> The following error occurs rarely when running an MPI program. I am
> not able to interpret what the error is saying, so any help decoding it
> would be appreciated. In particular, does anyone know whether this is a
> sign of an InfiniBand hardware error or rather an intermittent software bug?
>
>
>
> mlx5: compute-0-8.local: got completion with error:
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000005 00000000 00000000 00000000
> 00000000 12006802 00004016 1d35add3
> [compute-0-8.local:mpi_rank_6][handle_cqe] Send desc error in msg to 6,
> wc_opcode=0
> [compute-0-8.local:mpi_rank_6][handle_cqe] Msg from 6: wc.status=2,
> wc.wr_id=0xc58e040, wc.opcode=0, vbuf->phead->type=0 =
> MPIDI_CH3_PKT_EAGER_SEND
> [compute-0-8.local:mpi_rank_6][handle_cqe]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
> completion with error 2, vendor code=0x68, dest rank=6
>
>
> .. Lana (lana.deere at gmail.com)
>
>