[mvapich-discuss] infrequent error in ibv_channel_manager
Martin Pokorny
mpokorny at nrao.edu
Fri Mar 10 11:57:49 EST 2017
Hi Hari,
Please see below for my comments.
On 03/10/2017 09:37 AM, Hari Subramoni wrote:
> Sorry to hear that you're facing issues.
>
> Event 3 is IBV_EVENT_QP_ACCESS_ERR. According to the man pages, this
> can be caused by one of the following:
>
> 1. Misaligned atomic request
> 2. Too many RDMA Read or Atomic requests
> 3. R_Key violation
> 4. Length errors without immediate data
>
> Of these, #2 could be related to the application communication
> pattern. Do you think the application is issuing several
> back-to-back large-message send operations or MPI3-RMA operations?
The majority of MPI traffic is from MPI-IO. I don't recall seeing lots
of RMA operations in the source code of the Lustre ADIO module (with
which I'm somewhat familiar), but I'll have another look at that.
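For reference, the two numeric codes in the quoted log (event 3 and
wc.status=10) can be looked up in the libibverbs enums. Below is a minimal
sketch in Python, assuming the tables (hand-copied and truncated from the
enum order in <infiniband/verbs.h>) match the installed headers. The names
decode_line, EVENT_RE and STATUS_RE are hypothetical helpers for
illustration, not part of MVAPICH2 or libibverbs.

```python
import re

# First entries of enum ibv_event_type, in header order (index == value).
# Hand-copied assumption: verify against your own <infiniband/verbs.h>.
IBV_EVENT_TYPE = [
    "IBV_EVENT_CQ_ERR",
    "IBV_EVENT_QP_FATAL",
    "IBV_EVENT_QP_REQ_ERR",
    "IBV_EVENT_QP_ACCESS_ERR",
    "IBV_EVENT_COMM_EST",
    "IBV_EVENT_SQ_DRAINED",
    "IBV_EVENT_PATH_MIG",
    "IBV_EVENT_PATH_MIG_ERR",
    "IBV_EVENT_DEVICE_FATAL",
]

# First entries of enum ibv_wc_status, in header order (index == value).
# Same caveat: hand-copied, truncated, and worth checking locally.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
]

# Patterns taken from the wording of the log lines quoted in this thread.
EVENT_RE = re.compile(r"Got FATAL event (\d+)")
STATUS_RE = re.compile(r"wc\.status=(\d+)")


def decode_line(line):
    """Return (kind, code, name) tuples for every code found in one log line."""
    found = []
    for kind, pattern, table in (
        ("event", EVENT_RE, IBV_EVENT_TYPE),
        ("wc_status", STATUS_RE, IBV_WC_STATUS),
    ):
        for match in pattern.finditer(line):
            code = int(match.group(1))
            # Codes beyond the truncated tables are reported as UNKNOWN.
            name = table[code] if code < len(table) else "UNKNOWN"
            found.append((kind, code, name))
    return found


if __name__ == "__main__":
    for sample in ("Got FATAL event 3", "Msg from 9: wc.status=10,"):
        print(decode_line(sample))
```

If those tables are right, rank 9 reports IBV_EVENT_QP_ACCESS_ERR and rank 24
reports IBV_WC_REM_ACCESS_ERR for its message to rank 9, which appears
consistent with a single access failure seen from both ends.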
> For the others, it could be some issue inside the MVAPICH2 library.
> Since you're using MVAPICH2-2.1, which is more than a year old, may I
> request that you retry the application with MVAPICH2-2.2-GA? We've fixed
> several issues since MVAPICH2-2.1, and the fixes are available in
> MVAPICH2-2.2-GA.
That's on my list of things to try, but it will have to wait until I can
get some testing time, meaning mid next week at the earliest.
> Could you give us some more details about the underlying IB fabric?
Sure -- what sorts of details might be useful?
>
> Regards,
> Hari.
>
> On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny <mpokorny at nrao.edu>
> wrote:
>
> We've recently been seeing the following sorts of errors at a small
> yet noticeable rate
>
> [cbe-node-24:mpi_rank_9][async_thread]
> ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152:
> Got FATAL event 3
> : Invalid argument (22)
> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to
> 9, wc_opcode=0
> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10,
> wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 =
> MPIDI_CH3_PKT_RPUT_FINISH
> [cbe-node-28:mpi_rank_24][handle_cqe]
> ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
> [] Got completion with error 10, vendor code=0x88, dest rank=9
>
>
> Unfortunately, I can't send the source for the program that is
> experiencing this error, nor am I able to come up with a simpler
> reproducer. I'm hoping that perhaps you might have some advice for
> helping me diagnose the cause of the error. For example, is there
> some environment variable that might be worth looking at?
>
> I'm using mvapich2-2.1 on a cluster with IB network. I built
> mvapich2 as follows:
> ../configure --enable-romio --with-file-system=lustre
> --enable-debuginfo --enable-g=dbg,log --with-limic2 --enable-rdma-cm
>
> --
> Martin Pokorny
> Software Engineer
> Jansky Very Large Array correlator backend and CASA software
> National Radio Astronomy Observatory - New Mexico Operations
--
Martin Pokorny
Software Engineer
Jansky Very Large Array correlator backend and CASA software
National Radio Astronomy Observatory - New Mexico Operations