[mvapich-discuss] mvapich2 problems on node with active mlx4_0 and nes0 hcas
Michael Wang
mwang at fnal.gov
Thu Sep 12 10:01:44 EDT 2013
Hi Devendar,
We upgraded to mvapich2-1.9 and reran the osu_bw test (with
MV2_IBA_HCA=mlx4_0), as you recommended, but still get the same problem:
[mwang at dsfr1 ~]$ mpiexec -launcher rsh -hosts dseb2,dsag -n 2 -genv
MV2_IBA_HCA=mlx4_0 /usr/local/mvapich2-1.9/libexec/mvapich2/osu_bw
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:166]: PMI_KVS_Get
error
[dsag:mpi_rank_1][rdma_cm_exchange_hostid]
src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:954: PMI Lookup name
failed
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 253
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at dseb2] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at dseb2] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at dseb2] main (./pm/pmiserv/pmip.c:206): demux engine error
waiting for event
Do you have any suggestions?
Thanks,
Mike Wang
On 09/05/2013 03:08 PM, Devendar Bureddy wrote:
> Hi Mike
>
> mvapich2-1.7 is quite old, and we made several fixes related to
> heterogeneous environments in later releases. Can you please try
> our latest release (mvapich2-1.9 or 2.0a) with the MV2_IBA_HCA=mlx4_0
> environment option?
>
> - Devendar
>
>
> On Thu, Sep 5, 2013 at 3:42 PM, Michael Wang <mwang at fnal.gov> wrote:
>
> Hi,
>
> We are using mvapich2-1.7 and are having issues with one node on our
> cluster that has both a Mellanox MT27500 IB adapter and a NetEffect
> NE020 10Gb ethernet adapter (this problem goes away when the iw_nes
> driver is disabled; see the note after the ibv_devinfo output below).
> Here is the ibv_devinfo output for this node:
>
>
> hca_id: nes0
>     transport:          iWARP (1)
>     fw_ver:             3.21
>     node_guid:          0012:5503:5cf0:0000
>     sys_image_guid:     0012:5503:5cf0:0000
>     vendor_id:          0x1255
>     vendor_part_id:     256
>     hw_ver:             0x5
>     board_id:           NES020 Board ID
>     phys_port_cnt:      1
>         port:   1
>             state:          PORT_ACTIVE (4)
>             max_mtu:        4096 (5)
>             active_mtu:     1024 (3)
>             sm_lid:         0
>             port_lid:       1
>             port_lmc:       0x00
>             link_layer:     Ethernet
>
> hca_id: mlx4_0
>     transport:          InfiniBand (0)
>     fw_ver:             2.10.700
>     node_guid:          0002:c903:00fd:ace0
>     sys_image_guid:     0002:c903:00fd:ace3
>     vendor_id:          0x02c9
>     vendor_part_id:     4099
>     hw_ver:             0x0
>     board_id:           MT_1060110018
>     phys_port_cnt:      1
>         port:   1
>             state:          PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:     2048 (4)
>             sm_lid:         1
>             port_lid:       5
>             port_lmc:       0x00
>             link_layer:     IB
>
>
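> (By "disabled" above I mean unloading the kernel module, e.g.:
>
>     # modprobe -r iw_nes    # unload the NetEffect iWARP driver
>
> assuming the 10GbE driver is loaded as the "iw_nes" module and
> nothing else depends on it; loading it again with "modprobe iw_nes"
> brings the problem back.)
>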
> To demonstrate the problem, I use the utility program "osu_bw" to
> run a simple test between two nodes on the IB cluster:
>
>
> $ mpiexec -launcher rsh -hosts dseb2,dsag -n 2 \
> /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_bw
>
>
> which results in the following error:
>
>
> [ring_startup.c:184]: PMI_KVS_Get error
>
> [1] Abort: PMI Lookup name failed
> at line 951 in file
> /var/tmp/OFED_topdir/BUILD/mvapich2-1.7-r5140/src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c
>
>
> The node with the IB and 10GbE adapters is "dsag". If I replace
> this node in the command above with another node that only has the
> Mellanox hca but not the NetEffect 10GbE adapter, then everything
> runs fine and the bandwidth results are printed out.
>
> I am not an expert, but re-running the above command with "-v" for
> verbose output produces some PMI-related messages that may help the
> experts on this list troubleshoot this problem.
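> Concretely, the verbose re-run is the same command as before with
> "-v" added:
>
>     $ mpiexec -v -launcher rsh -hosts dseb2,dsag -n 2 \
>       /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_bw
>
> and among its output I see lines like: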
>
>
> [proxy:0:1 at dsag] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put
> kvsname=kvs_19955_0 key=HOST-1 value=-32873218
> .
> .
> .
> .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=-1
> msg=key_MVAPICH2_0001_not_found value=unknown
>
>
> This is in contrast to a successful run where the corresponding
> lines would look like:
>
>
> [proxy:0:1 at dseb3] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put
> kvsname=kvs_19934_0 key=MVAPICH2_0001
> value=00000008:0048004a:0048004b:
> .
> .
> .
> .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19934_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get
> kvsname=kvs_19934_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=0
> msg=success value=00000008:0048004a:0048004b:
>
> I have tried passing environment variables like MV2_IBA_HCA=mlx4_0
> to mpirun_rsh or mpiexec, and even using a hostfile with
> node:rank:hca lines to force use of the IB hca, but to no avail.
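> For reference, the hostfile variant looked roughly like this (from
> memory, so the exact hca-selection syntax may be slightly off), run
> with mpirun_rsh:
>
>     $ cat hosts
>     dseb2:1:mlx4_0
>     dsag:1:mlx4_0
>     $ mpirun_rsh -np 2 -hostfile hosts ./osu_bw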
>
> I would greatly appreciate any help or insight I can get on this
> from the experts on this list.
>
> Thanks in advance,
>
> Mike Wang
> _________________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
> --
> Devendar