[mvapich-discuss] infrequent error in ibv_channel_manager
Martin Pokorny
mpokorny at nrao.edu
Fri Mar 10 17:07:25 EST 2017
Hi Hari,
On 03/10/2017 02:34 PM, Hari Subramoni wrote:
> Thank you for the details. Can you also see if there is a segfault
> happening at any process causing this failure?
No evidence of such that I can find. You should know that the
application that's failing is part of a streaming data acquisition
system that runs continuously. I usually have only log files to look at
after the fact, occasionally a core file, but not this time. You did get
me looking at a few more log files, and I noticed that the failures
coincide with times that a new MPI job was starting. The vast majority
of times, a job start isn't coincident with any failure, but the last
two failures did occur at such times. It's just something I happened to
notice, and may not be significant.
> Output of "ibv_devinfo -v" will help.
Here it is:
> $ ibv_devinfo -v
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.9.1000
> node_guid: 0002:c903:0028:25ca
> sys_image_guid: 0002:c903:0028:25cd
> vendor_id: 0x02c9
> vendor_part_id: 26428
> hw_ver: 0xB0
> board_id: MT_0D90110009
> phys_port_cnt: 1
> max_mr_size: 0xffffffffffffffff
> page_size_cap: 0xfffffe00
> max_qp: 163256
> max_qp_wr: 16351
> device_cap_flags: 0x007c9c76
> max_sge: 32
> max_sge_rd: 0
> max_cq: 65408
> max_cqe: 4194303
> max_mr: 524272
> max_pd: 32764
> max_qp_rd_atom: 16
> max_ee_rd_atom: 0
> max_res_rd_atom: 2612096
> max_qp_init_rd_atom: 128
> max_ee_init_rd_atom: 0
> atomic_cap: ATOMIC_HCA (1)
> max_ee: 0
> max_rdd: 0
> max_mw: 0
> max_raw_ipv6_qp: 0
> max_raw_ethy_qp: 0
> max_mcast_grp: 8192
> max_mcast_qp_attach: 248
> max_total_mcast_qp_attach: 2031616
> max_ah: 0
> max_fmr: 0
> max_srq: 65472
> max_srq_wr: 16383
> max_srq_sge: 31
> max_pkeys: 128
> local_ca_ack_delay: 15
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 3
> port_lid: 28
> port_lmc: 0x00
> link_layer: InfiniBand
> max_msg_sz: 0x40000000
> port_cap_flags: 0x02510868
> max_vl_num: 4 (3)
> bad_pkey_cntr: 0x0
> qkey_viol_cntr: 0x0
> sm_sl: 0
> pkey_tbl_len: 128
> gid_tbl_len: 128
> subnet_timeout: 18
> init_type_reply: 0
> active_width: 4X (2)
> active_speed: 10.0 Gbps (4)
> phys_state: LINK_UP (5)
> GID[ 0]: fe80:0000:0000:0000:0002:c903:0028:25cb
>
> Regards,
> Hari.
>
> On Fri, Mar 10, 2017 at 11:57 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
>
> Hi Hari,
>
> Please see below for my comments.
>
> On 03/10/2017 09:37 AM, Hari Subramoni wrote:
>
> Sorry to hear that you're facing issues.
>
> Event 3 is IBV_EVENT_QP_ACCESS_ERR. From the man pages, this can be
> caused by one of the following:
>
> 1. Misaligned atomic request
> 2. Too many RDMA Read or Atomic requests
> 3. R_Key violation
> 4. Length errors without immediate data
>
> Of these, #2 could be related to the application's communication
> pattern. Do you think the application is issuing several
> back-to-back large-message send operations or MPI-3 RMA operations?
>
>
> The majority of MPI traffic is from MPI-IO. I don't recall seeing
> lots of RMA operations in the source code of the Lustre ADIO module
> (with which I'm somewhat familiar), but I'll have another look at that.
>
> For the others, it could be some issue inside the MVAPICH2 library.
> Since you're using MVAPICH2-2.1, which is more than a year old, may I
> request that you retry the application with MVAPICH2-2.2-GA? We've
> fixed several issues since MVAPICH2-2.1, and those fixes are
> available in MVAPICH2-2.2-GA.
>
>
> That's on my list of things to try, but it will have to wait until I
> can get some testing time, meaning mid next week at the earliest.
>
> Could you give us some more details about the underlying IB fabric?
>
>
> Sure -- what sorts of details might be useful?
>
>
> Regards,
> Hari.
>
> On Fri, Mar 10, 2017 at 11:06 AM, Martin Pokorny
> <mpokorny at nrao.edu> wrote:
>
> We've recently been seeing the following sorts of errors at a small
> yet noticeable rate:
>
> [cbe-node-24:mpi_rank_9][async_thread] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1152: Got FATAL event 3
> : Invalid argument (22)
> [cbe-node-28:mpi_rank_24][handle_cqe] Send desc error in msg to 9, wc_opcode=0
> [cbe-node-28:mpi_rank_24][handle_cqe] Msg from 9: wc.status=10, wc.wr_id=0x249f5b0, wc.opcode=0, vbuf->phead->type=4 = MPIDI_CH3_PKT_RPUT_FINISH
> [cbe-node-28:mpi_rank_24][handle_cqe] ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 10, vendor code=0x88, dest rank=9
>
>
> Unfortunately, I can't send the source for the program that is
> experiencing this error, nor am I able to come up with a simpler
> reproducer. I'm hoping that perhaps you might have some advice for
> helping me diagnose the cause of the error. For example, is there
> some environment variable that might be worth looking at?
>
> I'm using mvapich2-2.1 on a cluster with an IB network. I built
> mvapich2 as follows:
>
> ../configure --enable-romio --with-file-system=lustre
> --enable-debuginfo --enable-g=dbg,log --with-limic2
> --enable-rdma-cm
>
> --
> Martin Pokorny
> Software Engineer
> Jansky Very Large Array correlator backend and CASA software
> National Radio Astronomy Observatory - New Mexico Operations
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
--
Martin Pokorny
Software Engineer
Jansky Very Large Array correlator back-end and CASA software
National Radio Astronomy Observatory - New Mexico Operations