[Mvapich-discuss] RDMA CM messages

Lana Deere lana.deere at gmail.com
Thu Jul 29 12:47:09 EDT 2021


As far as I can tell, all the IB ports have IPoIB addresses assigned.  I
will try with MV2_USE_RDMA_CM=0.  A rerun without any change got a different
error.  I think I was seeing this kind of error last summer and fall, but
it went away.
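For anyone hitting the same symptom: a quick way to confirm that every IB port
really has an IPoIB address is to list the addresses on the ib* interfaces on
each node. This is a generic iproute2 sketch, not anything MVAPICH-specific,
and the interface names (ib0, ib1, ...) are assumptions that may differ on
your fabric:

```shell
# List IPv4 addresses on IPoIB interfaces (names typically begin with "ib").
# A port that RDMA CM cannot use will usually appear here with no address,
# or not appear at all.
ip -o -4 addr show | awk '$2 ~ /^ib/ {print $2, $4}'
```

Running this on every node (via ssh or a parallel shell) and looking for
nodes that print nothing should show whether the RDMA CM warning is pointing
at a real configuration gap.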

mlx5: host9: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000
00000000 12006802 000039a6 0210c3d2
[host9:mpi_rank_6][handle_cqe] Send desc error in msg to 10, wc_opcode=0
[host9:mpi_rank_6][handle_cqe] Msg from 10: wc.status=2 (local QP operation
error), wc.wr_id=0xc21fcc0, wc.opcode=0, vbuf->phead->type=32 =
MPIDI_CH3_PKT_RNDV_REQ_TO_SEND
[host9:mpi_rank_6][mv2_print_wc_status_error] IBV_WC_LOC_QP_OP_ERR: This
event is generated when a QP error occurs. For example, it may be generated
if a) user neglects to specify responder_resources and initiator_depth
values in struct rdma_conn_param before calling rdma_connect() on the
client side and rdma_accept() on the server side, b) a Work Request that
was posted in a local Send Queue of a UD QP contains an Address Handle that
is associated with a Protection Domain to a QP which is associated with a
different Protection Domain, or c) an opcode which is not supported by the
transport type of the QP is not supported (for example: RDMA Write over a
UD QP).
[host9:mpi_rank_6][handle_cqe]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:499: [] Got
completion with error 2, vendor code=0x68, dest rank=10
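In case it helps anyone searching the archives later, the workaround from the
earlier reply is applied at launch time as an environment variable. A
hypothetical invocation (the launcher, rank count, hostfile, and binary name
are placeholders for whatever your job actually uses):

```shell
# Disable RDMA CM-based connection setup; MVAPICH2 then falls back to its
# default (non-RDMA-CM) connection establishment over IB verbs.
MV2_USE_RDMA_CM=0 mpirun -np 16 -hostfile ./hosts ./my_mpi_app
```

Note this only bypasses the RDMA CM connection-setup path; it does not
disable RDMA transfers themselves.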


.. Lana (lana.deere at gmail.com)




On Wed, Jul 28, 2021 at 7:46 AM Subramoni, Hari <subramoni.1 at osu.edu> wrote:

> Hi, Lana.
>
>
>
> It looks like IP addresses were not assigned to all the IB ports.
>
>
>
> As a workaround, can you please set MV2_USE_RDMA_CM=0 and try?
>
>
>
> Thx,
>
> Hari.
>
>
>
> PS: Please try and move to MVAPICH2 2.3.6. It has a lot of fixes and
> performance enhancements compared to the 2.3.5 release.
>
>
>
> *From:* Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> *On
> Behalf Of *Lana Deere via Mvapich-discuss
> *Sent:* Tuesday, July 27, 2021 6:24 PM
> *To:* mvapich-discuss at lists.osu.edu
> *Subject:* [Mvapich-discuss] RDMA CM messages
>
>
>
> I'm using MVAPICH2 2.3.5 on CentOS 7.
>
>
>
> I've got an MPI job which is failing intermittently.  One of the failure
> symptoms is a hang in MPI_Init_thread, with this traceback:
>
> /lib64/libpthread.so.0  read
> libmpi.so.12            PMIU_readline
> libmpi.so.12
> libmpi.so.12            UPMI_BARRIER
> libmpi.so.12            rdma_cm_exchange_hostid
> libmpi.so.12            MPIDI_CH3I_RDMA_CM_Init
> libmpi.so.12            MPIDI_CH3_Init
> libmpi.so.12            MPID_Init
> libmpi.so.12            MPIR_Init_thread
> libmpi.so.12            MPI_Init_thread
>
>
>
> A run which didn't fail produced this warning:
>
> Warning: RDMA CM Initialization failed. Continuing without RDMA CM
> support. Please set MV2_USE_RDMA_CM=0 to disable RDMA CM.
>
>
>
> Does anyone have advice on tracking this down?  Does it suggest a software
> issue?  An infiniband hardware issue?
>
>
>
> Thanks.
>
>
> .. Lana (lana.deere at gmail.com)
>
>

