[mvapich-discuss] mvapich2 problems on node with active mlx4_0 and nes0 hcas
Devendar Bureddy
bureddy at cse.ohio-state.edu
Thu Sep 5 16:08:18 EDT 2013
Hi Mike,
mvapich2-1.7 is quite old, and later releases include fixes related to
heterogeneous environments. Could you please try our latest release
(mvapich2-1.9 or 2.0a) with the MV2_IBA_HCA=mlx4_0 environment option?
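
As a rough sketch of what that retry might look like (not a tested recipe): with the Hydra mpiexec used in the original report, an environment variable can be passed to every rank with -genv, while mpirun_rsh takes NAME=VALUE pairs between the host list and the executable. The snippet below only prints the two command lines rather than running them. The ./osu_bw path is a placeholder for wherever the newer release installs the OSU benchmarks, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: print (do not run) launch commands that restrict
# MVAPICH2 to the InfiniBand HCA. BENCH is a placeholder path.
HCA=mlx4_0
BENCH=./osu_bw

# Hydra mpiexec: "-genv NAME VALUE" sets the variable for every rank.
hydra_cmd() {
    echo "mpiexec -launcher rsh -hosts dseb2,dsag -n 2 -genv MV2_IBA_HCA $HCA $BENCH"
}

# mpirun_rsh: NAME=VALUE pairs go after the host list, before the program.
mpirun_rsh_cmd() {
    echo "mpirun_rsh -np 2 dseb2 dsag MV2_IBA_HCA=$HCA $BENCH"
}

hydra_cmd
mpirun_rsh_cmd
```

Copy whichever printed line matches your launcher and adjust the benchmark path before running it.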
- Devendar
On Thu, Sep 5, 2013 at 3:42 PM, Michael Wang <mwang at fnal.gov> wrote:
> Hi,
>
> We are using mvapich2-1.7 and are having issues with one node on our
> cluster that has both a Mellanox MT27500 IB adapter and a NetEffect NE020
> 10Gb ethernet adapter (this problem goes away when the iw_nes driver is
> disabled). Here is the ibv_devinfo output for this node:
>
>
> hca_id: nes0
> transport: iWARP (1)
> fw_ver: 3.21
> node_guid: 0012:5503:5cf0:0000
> sys_image_guid: 0012:5503:5cf0:0000
> vendor_id: 0x1255
> vendor_part_id: 256
> hw_ver: 0x5
> board_id: NES020 Board ID
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 1
> port_lmc: 0x00
> link_layer: Ethernet
>
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.10.700
> node_guid: 0002:c903:00fd:ace0
> sys_image_guid: 0002:c903:00fd:ace3
> vendor_id: 0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id: MT_1060110018
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 1
> port_lid: 5
> port_lmc: 0x00
> link_layer: IB
>
>
> To demonstrate the problem, I use the utility program "osu_bw" to run a
> simple test between two nodes on the IB cluster:
>
>
> $ mpiexec -launcher rsh -hosts dseb2,dsag -n 2 \
> /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_bw
>
>
> which results in the following error:
>
>
> [ring_startup.c:184]: PMI_KVS_Get error
>
> [1] Abort: PMI Lookup name failed
> at line 951 in file /var/tmp/OFED_topdir/BUILD/mvapich2-1.7-r5140/src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c
> [mwang at dsfr1 ~]$ mpiexec -launcher rsh -hosts dseb2,dsag -n 2
> /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_bw
> [ring_startup.c:184]: PMI_KVS_Get error
>
> [1] Abort: PMI Lookup name failed
> at line 951 in file /var/tmp/OFED_topdir/BUILD/mvapich2-1.7-r5140/src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c
>
>
> The node with the IB and 10GbE adapters is "dsag". If I replace this node
> in the command above with another node that only has the Mellanox hca but
> not the NetEffect 10GbE adapter, then everything runs fine and the
> bandwidth results are printed out.
>
> I am not an expert but if I try re-running the above command with "-v" for
> a verbose output, I see the following PMI related messages which may be
> relevant to the experts on this list in helping troubleshoot this problem:
>
>
> [proxy:0:1 at dsag] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put
> kvsname=kvs_19955_0 key=HOST-1 value=-32873218
> .
> .
> .
> .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=-1
> msg=key_MVAPICH2_0001_not_found value=unknown
>
>
> This is in contrast to a successful run where the corresponding lines
> would look like:
>
>
> [proxy:0:1 at dseb3] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put kvsname=kvs_19934_0
> key=MVAPICH2_0001 value=00000008:0048004a:0048004b:
> .
> .
> .
> .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19934_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get kvsname=kvs_19934_0
> key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=0
> msg=success value=00000008:0048004a:0048004b:
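
One way to spot this difference quickly, assuming the "mpiexec -v" output has been saved to a file, is to search for get_result responses with a non-zero rc. This is only a sketch: run.log is a hypothetical file name and failed_pmi_lookups is a made-up helper, not part of MVAPICH2 or Hydra.

```shell
#!/bin/sh
# Sketch: list PMI get_result responses that failed (rc != 0) in a
# saved "mpiexec -v" log. LOG defaults to a placeholder file name.
LOG=${1:-run.log}

failed_pmi_lookups() {
    # Verbose output may wrap, so match only the response line itself:
    # "rc=0" is skipped, "rc=-1" (or any other non-zero code) is printed.
    grep -E 'cmd=get_result rc=-?[1-9]' "$1"
}

if [ -f "$LOG" ]; then
    failed_pmi_lookups "$LOG" || echo "no failed PMI lookups in $LOG"
fi
```

In the failing run quoted above, this would print the response line carrying rc=-1 for the MVAPICH2_0001 key; in the successful run it would report no failed lookups.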
>
> I have tried passing environment variables like MV2_IBA_HCA=mlx4_0 to
> mpirun_rsh or mpiexec or even using a hostfile with node:rank:hca lines to
> force usage of the IB hca, but to no avail.
>
> I would greatly appreciate any help or insight I can get on this from the
> experts on this list.
>
> Thanks in advance,
>
> Mike Wang
--
Devendar