[mvapich-discuss] mvapich2 problems on node with active mlx4_0 and nes0 hcas

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu Sep 5 16:08:18 EDT 2013


Hi Mike

mvapich2-1.7 is very old, and several fixes related to heterogeneous
environments went into later releases. Could you please try our latest
release (mvapich2-1.9 or 2.0a) with the MV2_IBA_HCA=mlx4_0 environment
option?
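
For example, assuming the 1.9 benchmarks are installed under a path
similar to your 1.7 ones (adjust for your cluster), something like:

  $ mpiexec -launcher rsh -hosts dseb2,dsag -n 2 \
      -env MV2_IBA_HCA mlx4_0 \
      /usr/mpi/gcc/mvapich2-1.9/tests/osu_benchmarks/osu_bw

With mpirun_rsh, the variable can instead be given on the command line
before the executable:

  $ mpirun_rsh -np 2 dseb2 dsag MV2_IBA_HCA=mlx4_0 \
      /usr/mpi/gcc/mvapich2-1.9/tests/osu_benchmarks/osu_bw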

- Devendar


On Thu, Sep 5, 2013 at 3:42 PM, Michael Wang <mwang at fnal.gov> wrote:

> Hi,
>
> We are using mvapich2-1.7 and are having issues with one node on our
> cluster that has both a Mellanox MT27500 IB adapter and a NetEffect NE020
> 10Gb Ethernet adapter (the problem goes away when the iw_nes driver is
> disabled; see the module commands after the device listing below).  Here
> is the ibv_devinfo output for this node:
>
>
> hca_id: nes0
>         transport:                      iWARP (1)
>         fw_ver:                         3.21
>         node_guid:                      0012:5503:5cf0:0000
>         sys_image_guid:                 0012:5503:5cf0:0000
>         vendor_id:                      0x1255
>         vendor_part_id:                 256
>         hw_ver:                         0x5
>         board_id:                       NES020 Board ID
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             1024 (3)
>                         sm_lid:                 0
>                         port_lid:               1
>                         port_lmc:               0x00
>                         link_layer:             Ethernet
>
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.10.700
>         node_guid:                      0002:c903:00fd:ace0
>         sys_image_guid:                 0002:c903:00fd:ace3
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       MT_1060110018
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               5
>                         port_lmc:               0x00
>                         link_layer:             IB
>
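> For completeness: to disable the iw_nes driver I unload its kernel
> module (and reload it to re-enable), e.g. as root:
>
>   # modprobe -r iw_nes
>   # modprobe iw_nes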
>
> To demonstrate the problem, I use the utility program "osu_bw" to run a
> simple test between two nodes on the IB cluster:
>
>
>   $ mpiexec -launcher rsh -hosts dseb2,dsag -n 2 \
>     /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_bw
>
>
> which results in the following error:
>
>
> [ring_startup.c:184]: PMI_KVS_Get error
>
> [1] Abort: PMI Lookup name failed
>  at line 951 in file /var/tmp/OFED_topdir/BUILD/mvapich2-1.7-r5140/src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c
>
>
> The node with the IB and 10GbE adapters is "dsag".  If I replace this node
> in the command above with another node that only has the Mellanox hca but
> not the NetEffect 10GbE adapter, then everything runs fine and the
> bandwidth results are printed out.
>
> I am not an expert, but when I re-run the above command with "-v" for
> verbose output, I see the following PMI-related messages, which may be
> relevant for the experts on this list in troubleshooting this problem:
>
>
> [proxy:0:1 at dsag] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put
> kvsname=kvs_19955_0 key=HOST-1 value=-32873218
>         .
>         .
>         .
>         .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [proxy:0:1 at dsag] [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get
> kvsname=kvs_19955_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=-1
> msg=key_MVAPICH2_0001_not_found value=unknown
>
>
> This is in contrast to a successful run where the corresponding lines
> would look like:
>
>
> [proxy:0:1 at dseb3] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=put kvsname=kvs_19934_0
> key=MVAPICH2_0001 value=00000008:0048004a:0048004b:
>         .
>         .
>         .
>         .
> [proxy:0:0 at dseb2] got pmi command (from 4): get
> kvsname=kvs_19934_0 key=MVAPICH2_0001
> [mpiexec at dsfr1] [pgid: 0] got PMI command: cmd=get kvsname=kvs_19934_0
> key=MVAPICH2_0001
> [mpiexec at dsfr1] PMI response to fd 12 pid 4: cmd=get_result rc=0
> msg=success value=00000008:0048004a:0048004b:
>
> I have tried passing environment variables like MV2_IBA_HCA=mlx4_0 to
> mpirun_rsh or mpiexec, and even using a hostfile with node:rank:hca lines
> to force use of the IB HCA, but to no avail (the forms I tried are shown
> below).
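>
> The forms I tried looked roughly like this (paths abbreviated):
>
>   $ mpirun_rsh -np 2 -hostfile ./hosts MV2_IBA_HCA=mlx4_0 ./osu_bw
>
> where ./hosts contains node:rank:hca lines pinning the IB HCA:
>
>   dseb2:1:mlx4_0
>   dsag:1:mlx4_0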
>
> I would greatly appreciate any help or insight I can get on this from the
> experts on this list.
>
> Thanks in advance,
>
> Mike Wang
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Devendar