<div dir="ltr"><div>I've been running the same program on the same dataset repeatedly while trying to reproduce a different issue. One of the runs failed with the message included below. I'm using mvapich2 2.3.5-1. The vendor code 0x68 it references is "malformed WQE (Work Queue Element)". Does anyone have any ideas on the cause of this? I'm not sure yet how repeatable this will turn out to be.<br></div><div><br></div><div>mlx5: worker15.local: got completion with error:</div>00000000 00000000 00000000 00000000<br>00000000 00000000 00000000 00000000<br>00000006 00000000 00000000 00000000<br>00000000 12006802 000022c5 06746ed2<br>[worker15.local:mpi_rank_7][handle_cqe] Send desc error in msg to 7, wc_opcode=0<br>[worker15.local:mpi_rank_7][handle_cqe] Msg from 7: wc.status=2 (local QP operation error), wc.wr_id=0xe4eaa50, wc.opcode=0, vbuf->phead->type=2 = MPIDI_CH3_PKT_FAST_EAGER_SEND<br>[worker15.local:mpi_rank_7][mv2_print_wc_status_error] IBV_WC_LOC_QP_OP_ERR: This event is generated when a QP error occurs. For example, it may be generated if a) user neglects to specify responder_resources and initiator_depth values in struct rdma_conn_param before calling rdma_connect() on the client side and rdma_accept() on the server side, b) a Work Request that was posted in a local Send Queue of a UD QP contains an Address Handle that is associated with a Protection Domain to a QP which is associated with a different Protection Domain, or c) an opcode which is not supported by the transport type of the QP is not supported (for example: RDMA Write over a UD QP).<br>[worker15.local:mpi_rank_7][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:499: [] Got completion with error 2, vendor code=0x68, dest rank=7<br><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><br>.. Lana (<a href="mailto:lana.deere@gmail.com" target="_blank">lana.deere@gmail.com</a>)<br><br><br></div></div></div>