[mvapich-discuss] infrequent error in ibv_channel_manager

Hari Subramoni subramoni.1 at osu.edu
Fri Mar 10 16:34:13 EST 2017


Hi Martin,

Thank you for the details. Can you also check whether a segfault in any
process is causing this failure?

The output of "ibv_devinfo -v" will also help.
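
As an aid to reading the quoted log below, the numeric codes can be looked
up against the libibverbs enums. The following is a minimal sketch in
Python, assuming the enum ordering in <infiniband/verbs.h>; the tables are
partial and the decode() helper is hypothetical, not part of MVAPICH2 or
libibverbs.

```python
# Partial tables transcribed by hand from the enum order in
# <infiniband/verbs.h> (assumption: standard libibverbs headers).
IBV_EVENT_TYPES = [
    "IBV_EVENT_CQ_ERR",          # 0
    "IBV_EVENT_QP_FATAL",        # 1
    "IBV_EVENT_QP_REQ_ERR",      # 2
    "IBV_EVENT_QP_ACCESS_ERR",   # 3
    "IBV_EVENT_COMM_EST",        # 4
]

IBV_WC_STATUSES = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
]


def decode(table, code):
    """Return the enum name for a numeric code, or a marker if out of range."""
    if 0 <= code < len(table):
        return table[code]
    return f"unknown ({code})"


# Codes taken from the log: "Got FATAL event 3" and "wc.status=10".
print(decode(IBV_EVENT_TYPES, 3))
print(decode(IBV_WC_STATUSES, 10))
```

In C code that links against libibverbs, ibv_event_type_str() and
ibv_wc_status_str() return these names directly.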

Regards,
Hari.

On Fri, Mar 10, 2017 at 11:57 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:

> Hi Hari,
>
> Please see below for my comments.
>
> On 03/10/2017 09:37 AM, Hari Subramoni wrote:
>
>> Sorry to hear that you're facing issues.
>>
>> Event 3 is IBV_EVENT_QP_ACCESS_ERR. According to the man pages, this can
>> be caused by one of the following:
>>
>>  1. Misaligned atomic request
>>  2. Too many RDMA Read or Atomic requests
>>  3. R_Key violation
>>  4. Length errors without immediate data
>>
>> Out of these, #2 could be related to the application communication
>> pattern. Do you think the application is issuing several
>> back-to-back large message send operations or MPI-3 RMA operations?
>>
>
> The majority of MPI traffic is from MPI-IO. I don't recall seeing lots of
> RMA operations in the source code of the Lustre ADIO module (with which I'm
> somewhat familiar), but I'll have another look at that.
>
>> For the others, it could be some issue inside the MVAPICH2 library.
>> Since you're using MVAPICH2-2.1, which is more than a year old, may I
>> request that you retry the application with MVAPICH2-2.2-GA? We've fixed
>> several issues since MVAPICH2-2.1, and the fixes are available in
>> MVAPICH2-2.2-GA.
>>
>
> That's on my list of things to try, but it will have to wait until I can
> get some testing time, meaning mid next week at the earliest.
>
>> Could you give us some more details about the underlying IB fabric?
>>
>
> Sure -- what sorts of details might be useful?
>
>
>> Regards,
>> Hari.
>>
>> On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
>>
>>     We've recently been seeing the following sorts of errors at a small
>>     yet noticeable rate:
>>
>>         [cbe-node-24:mpi_rank_9][async_thread]
>>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152:
>>         Got FATAL event 3
>>         : Invalid argument (22)
>>         [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to
>>         9, wc_opcode=0
>>         [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10,
>>         wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 =
>>         MPIDI_CH3_PKT_RPUT_FINISH
>>         [cbe-node-28:mpi_rank_24][handle_cqe]
>>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
>>         [] Got completion with error 10, vendor code=0x88, dest rank=9
>>
>>
>>     Unfortunately, I can't send the source for the program that is
>>     experiencing this error, nor am I able to come up with a simpler
>>     reproducer. I'm hoping that perhaps you might have some advice for
>>     helping me diagnose the cause of the error. For example, is there
>>     some environment variable that might be worth looking at?
>>
>>     I'm using mvapich2-2.1 on a cluster with IB network. I built
>>     mvapich2 as follows:
>>     ../configure --enable-romio --with-file-system=lustre
>>     --enable-debuginfo --enable-g=dbg,log --with-limic2 --enable-rdma-cm
>>
>>     --
>>     Martin Pokorny
>>     Software Engineer
>>     Jansky Very Large Array correlator backend and CASA software
>>     National Radio Astronomy Observatory - New Mexico Operations
>>
>>
>>
>
> --
> Martin Pokorny
> Software Engineer
> Jansky Very Large Array correlator backend and CASA software
> National Radio Astronomy Observatory - New Mexico Operations
>