[mvapich-discuss] infrequent error in ibv_channel_manager
Hari Subramoni
subramoni.1 at osu.edu
Fri Mar 10 16:34:13 EST 2017
Hi Martin,
Thank you for the details. Could you also check whether a segfault at any
process is causing this failure?
The output of "ibv_devinfo -v" will also help.
Regards,
Hari.
On Fri, Mar 10, 2017 at 11:57 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
> Hi Hari,
>
> Please see below for my comments.
>
> On 03/10/2017 09:37 AM, Hari Subramoni wrote:
>
>> Sorry to hear that you're facing issues.
>>
>> Event 3 is IBV_EVENT_QP_ACCESS_ERR. From the man pages, this can be
>> caused by one of the following:
>>
>> 1. Misaligned atomic request
>> 2. Too many RDMA Read or Atomic requests
>> 3. R_Key violation
>> 4. Length errors without immediate data
>>
>> Out of these, #2 could be related to the application communication
>> pattern. Do you think the application is issuing several
>> back-to-back large-message sends or MPI-3 RMA operations?
>>
>
> The majority of MPI traffic is from MPI-IO. I don't recall seeing lots of
> RMA operations in the source code of the Lustre ADIO module (with which I'm
> somewhat familiar), but I'll have another look at that.
>
>> For the others, it could be some issue inside the MVAPICH2 library.
>> Since you're using MVAPICH2-2.1, which is more than a year old, may I
>> request that you retry the application with MVAPICH2-2.2-GA? We've fixed
>> several issues since MVAPICH2-2.1, and those fixes are available in
>> MVAPICH2-2.2-GA.
>>
>
> That's on my list of things to try, but it will have to wait until I can
> get some testing time, meaning mid next week at the earliest.
>
>> Could you give us some more details about the underlying IB fabric?
>>
>
> Sure -- what sorts of details might be useful?
>
>
>> Regards,
>> Hari.
>>
>> On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
>>
>> We've recently been seeing the following sorts of errors at a small
>> yet noticeable rate:
>>
>> [cbe-node-24:mpi_rank_9][async_thread]
>> ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152:
>> Got FATAL event 3 : Invalid argument (22)
>> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to 9,
>> wc_opcode=0
>> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10,
>> wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 =
>> MPIDI_CH3_PKT_RPUT_FINISH
>> [cbe-node-28:mpi_rank_24][handle_cqe]
>> ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
>> [] Got completion with error 10, vendor code=0x88, dest rank=9
>>
>>
>> Unfortunately, I can't send the source for the program that is
>> experiencing this error, nor am I able to come up with a simpler
>> reproducer. I'm hoping that perhaps you might have some advice for
>> helping me diagnose the cause of the error. For example, is there
>> an environment variable that might be worth looking at?
>>
>> I'm using mvapich2-2.1 on a cluster with an IB network. I built
>> mvapich2 as follows:
>> ../configure --enable-romio --with-file-system=lustre
>> --enable-debuginfo --enable-g=dbg,log --with-limic2 --enable-rdma-cm
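A rebuild against MVAPICH2-2.2-GA with the same feature set might look like
the following sketch (the tarball name and build layout are assumptions; the
configure flags are copied from the 2.1 build above):

```shell
# Hypothetical rebuild against MVAPICH2-2.2-GA, reusing the 2.1 flags
tar xzf mvapich2-2.2.tar.gz
cd mvapich2-2.2
mkdir build && cd build
../configure --enable-romio --with-file-system=lustre \
    --enable-debuginfo --enable-g=dbg,log --with-limic2 --enable-rdma-cm
make -j"$(nproc)" && make install
```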
>>
>> --
>> Martin Pokorny
>> Software Engineer
>> Jansky Very Large Array correlator backend and CASA software
>> National Radio Astronomy Observatory - New Mexico Operations
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
> --
> Martin Pokorny
> Software Engineer
> Jansky Very Large Array correlator backend and CASA software
> National Radio Astronomy Observatory - New Mexico Operations
>