[mvapich-discuss] infrequent error in ibv_channel_manager

Martin Pokorny mpokorny at nrao.edu
Fri Mar 10 11:57:49 EST 2017


Hi Hari,

Please see below for my comments.

On 03/10/2017 09:37 AM, Hari Subramoni wrote:
> Sorry to hear that you're facing issues.
>
> Event 3 is IBV_EVENT_QP_ACCESS_ERR. From the man pages, this can be
> caused by one of the following:
>
>  1. Misaligned atomic request
>  2. Too many RDMA Read or Atomic requests
>  3. R_Key violation
>  4. Length errors without immediate data
>
> Out of these, #2 could be related to the application communication
> pattern. Do you think the application is issuing several
> back-to-back large-message send operations or MPI3-RMA operations?

The majority of MPI traffic is from MPI-IO. I don't recall seeing lots 
of RMA operations in the source code of the Lustre ADIO module (with 
which I'm somewhat familiar), but I'll have another look at that.
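
For cross-referencing the numeric codes in the log quoted below, here is a
minimal sketch of a decoder. The tables are my transcription of the enum
values in libibverbs' <infiniband/verbs.h>, so they should be checked
against the header installed on the cluster; the function name and the
regular expressions are hypothetical, written only to match the two
message shapes in that log. By this table, wc.status=10 reads as
IBV_WC_REM_ACCESS_ERR, which appears consistent with the access error
reported on rank 9.

```python
import re

# Values transcribed from <infiniband/verbs.h> (enum ibv_event_type).
# Assumption: the installed header uses the same numbering; verify locally.
IBV_EVENT_TYPE = {
    0: "IBV_EVENT_CQ_ERR",
    1: "IBV_EVENT_QP_FATAL",
    2: "IBV_EVENT_QP_REQ_ERR",
    3: "IBV_EVENT_QP_ACCESS_ERR",
    4: "IBV_EVENT_COMM_EST",
    5: "IBV_EVENT_SQ_DRAINED",
}

# Values transcribed from <infiniband/verbs.h> (enum ibv_wc_status).
# Only the codes most relevant to this kind of failure are listed.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Patterns matching the two message shapes seen in the quoted log.
EVENT_RE = re.compile(r"Got FATAL event (\d+)")
STATUS_RE = re.compile(r"wc\.status=(\d+)")


def decode_line(line):
    """Return the symbolic names of any verbs codes found in one log line."""
    names = []
    match = EVENT_RE.search(line)
    if match:
        code = int(match.group(1))
        names.append(IBV_EVENT_TYPE.get(code, "unknown event %d" % code))
    match = STATUS_RE.search(line)
    if match:
        code = int(match.group(1))
        names.append(IBV_WC_STATUS.get(code, "unknown status %d" % code))
    return names


if __name__ == "__main__":
    import sys

    # Read a job log on stdin and print only the lines that carry a code.
    for raw in sys.stdin:
        decoded = decode_line(raw)
        if decoded:
            print(raw.rstrip(), "->", ", ".join(decoded))
```

Piping the job's stderr through this script would annotate each matching
line with its symbolic name, which makes it easier to count how often each
failure occurs across runs.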

> For the others, it could be some issue inside the MVAPICH2 library.
> Since you're using MVAPICH2-2.1, which is more than a year old, may I
> request that you retry the application with MVAPICH2-2.2-GA? We've fixed
> several issues since MVAPICH2-2.1, and those fixes are available in
> MVAPICH2-2.2-GA.

That's on my list of things to try, but it will have to wait until I can
get some testing time, which means the middle of next week at the earliest.

> Could you give us some more details about the underlying IB fabric?

Sure -- what sorts of details might be useful?

>
> Regards,
> Hari.
>
> On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny <mpokorny at nrao.edu
> <mailto:mpokorny at nrao.edu>> wrote:
>
>     We've recently been seeing the following sorts of errors at a small
>     yet noticeable rate
>
>         [cbe-node-24:mpi_rank_9][async_thread]
>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152:
>         Got FATAL event 3
>         : Invalid argument (22)
>         [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to
>         9, wc_opcode=0
>         [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10,
>         wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 =
>         MPIDI_CH3_PKT_RPUT_FINISH
>         [cbe-node-28:mpi_rank_24][handle_cqe]
>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
>         [] Got completion with error 10, vendor code=0x88, dest rank=9
>
>
>     Unfortunately, I can't send the source for the program that is
>     experiencing this error, nor am I able to come up with a simpler
>     reproducer. I'm hoping that perhaps you might have some advice for
>     helping me diagnose the cause of the error. For example, is there
>     some environment variable that might be worth looking at?
>
>     I'm using mvapich2-2.1 on a cluster with an IB network. I built
>     mvapich2 as follows:
>     ../configure --enable-romio --with-file-system=lustre
>     --enable-debuginfo --enable-g=dbg,log --with-limic2 --enable-rdma-cm
>
>     --
>     Martin Pokorny
>     Software Engineer
>     Jansky Very Large Array correlator backend and CASA software
>     National Radio Astronomy Observatory - New Mexico Operations
>
>


-- 
Martin Pokorny
Software Engineer
Jansky Very Large Array correlator backend and CASA software
National Radio Astronomy Observatory - New Mexico Operations

