[mvapich-discuss] infrequent error in ibv_channel_manager

Hari Subramoni subramoni.1 at osu.edu
Fri Mar 10 11:37:34 EST 2017


Hi Martin,

Sorry to hear that you're facing issues.

Event 3 is IBV_EVENT_QP_ACCESS_ERR. According to the man pages, it can have
one of the following causes:

 1. Misaligned atomic request
 2. Too many RDMA Read or Atomic requests
 3. R_Key violation
 4. Length errors without immediate data
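
For reference, here is a rough sketch of how the numeric codes in the log
can be turned back into names. The tables are transcribed from enum
ibv_event_type and enum ibv_wc_status in <infiniband/verbs.h>, so please
check them against the header installed on your system. The helper names
(decode, decode_log_line) are made up for this example and are not part of
MVAPICH2 or libibverbs.

```python
import re

# Transcribed from enum ibv_event_type in <infiniband/verbs.h>;
# list index == numeric value. Verify against your installed header.
IBV_EVENT_TYPE = [
    "IBV_EVENT_CQ_ERR", "IBV_EVENT_QP_FATAL", "IBV_EVENT_QP_REQ_ERR",
    "IBV_EVENT_QP_ACCESS_ERR", "IBV_EVENT_COMM_EST", "IBV_EVENT_SQ_DRAINED",
    "IBV_EVENT_PATH_MIG", "IBV_EVENT_PATH_MIG_ERR", "IBV_EVENT_DEVICE_FATAL",
    "IBV_EVENT_PORT_ACTIVE", "IBV_EVENT_PORT_ERR", "IBV_EVENT_LID_CHANGE",
    "IBV_EVENT_PKEY_CHANGE", "IBV_EVENT_SM_CHANGE", "IBV_EVENT_SRQ_ERR",
    "IBV_EVENT_SRQ_LIMIT_REACHED", "IBV_EVENT_QP_LAST_WQE_REACHED",
    "IBV_EVENT_CLIENT_REREGISTER", "IBV_EVENT_GID_CHANGE",
]

# Transcribed from enum ibv_wc_status in <infiniband/verbs.h>.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS", "IBV_WC_LOC_LEN_ERR", "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR", "IBV_WC_LOC_PROT_ERR", "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR", "IBV_WC_BAD_RESP_ERR", "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR", "IBV_WC_REM_ACCESS_ERR", "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR", "IBV_WC_RNR_RETRY_EXC_ERR",
]


def decode(table, code):
    """Map a numeric code to its enum name, or flag it as unknown."""
    return table[code] if 0 <= code < len(table) else f"unknown ({code})"


def decode_log_line(line):
    """Return names for any async event or wc.status number in one log line."""
    found = {}
    m = re.search(r"Got FATAL event (\d+)", line)
    if m:
        found["event"] = decode(IBV_EVENT_TYPE, int(m.group(1)))
    m = re.search(r"wc\.status=(\d+)", line)
    if m:
        found["wc_status"] = decode(IBV_WC_STATUS, int(m.group(1)))
    return found
```

Applied to the log you posted, this gives IBV_EVENT_QP_ACCESS_ERR for the
event on rank 9 and IBV_WC_REM_ACCESS_ERR for wc.status=10 on rank 24, so
both sides are reporting the same access failure from opposite ends of the
connection.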

Of these, #2 could be related to the application's communication pattern.
Do you think the application is issuing several back-to-back large-message
send operations or MPI-3 RMA operations?

For the others, the cause could be an issue inside the MVAPICH2 library.
Since you're using MVAPICH2-2.1, which is more than a year old, may I
request that you retry the application with MVAPICH2-2.2-GA? We've fixed
several issues since MVAPICH2-2.1, and those fixes are available in
MVAPICH2-2.2-GA.

Could you give us some more details about the underlying IB fabric?
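
If it helps, something along these lines could gather the basics in one
go. It is only a sketch: it assumes the standard ibv_devinfo and ibstat
tools are installed on the compute nodes and skips any that are missing.
The function names are hypothetical. The rd_atom lines in the verbose
ibv_devinfo output are the device limits on outstanding RDMA Read and
Atomic operations, which is what cause #2 above is about.

```python
import shutil
import subprocess

# Common InfiniBand diagnostic tools; any that are not installed are skipped.
COMMANDS = {
    "ibv_devinfo": ["ibv_devinfo", "-v"],
    "ibstat": ["ibstat"],
}


def collect_fabric_info(commands=COMMANDS):
    """Run each tool that is on PATH and return its stdout keyed by name.

    Tools that cannot be found are recorded as None.
    """
    report = {}
    for name, argv in commands.items():
        if shutil.which(argv[0]) is None:
            report[name] = None
            continue
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        report[name] = result.stdout
    return report


def rd_atom_limits(devinfo_text):
    """Return the lines of `ibv_devinfo -v` output that mention rd_atom."""
    return [ln.strip() for ln in devinfo_text.splitlines() if "rd_atom" in ln]


if __name__ == "__main__":
    info = collect_fabric_info()
    for name, text in info.items():
        print(f"=== {name} ===")
        print(text if text is not None else "(not installed)")
    if info.get("ibv_devinfo"):
        print("=== RDMA Read / Atomic limits ===")
        print("\n".join(rd_atom_limits(info["ibv_devinfo"])))
```

Running it on one of the affected nodes (cbe-node-24 or cbe-node-28, for
example) and posting the output would tell us the HCA model, firmware
version, and link state.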

Regards,
Hari.

On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:

> We've recently been seeing the following sorts of errors at a small yet
> noticeable rate:
>
> [cbe-node-24:mpi_rank_9][async_thread] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152: Got FATAL event 3
> : Invalid argument (22)
> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to 9, wc_opcode=0
> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10, wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 = MPIDI_CH3_PKT_RPUT_FINISH
> [cbe-node-28:mpi_rank_24][handle_cqe] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 10, vendor code=0x88, dest rank=9
>
> Unfortunately, I can't send the source for the program that is
> experiencing this error, nor am I able to come up with a simpler
> reproducer. I'm hoping that you might have some advice to help me
> diagnose the cause of the error. For example, is there some environment
> variable that might be worth looking at?
>
> I'm using mvapich2-2.1 on a cluster with an IB network. I built mvapich2
> as follows:
> ../configure --enable-romio --with-file-system=lustre --enable-debuginfo
> --enable-g=dbg,log --with-limic2 --enable-rdma-cm
>
> --
> Martin Pokorny
> Software Engineer
> Jansky Very Large Array correlator backend and CASA software
> National Radio Astronomy Observatory - New Mexico Operations