[mvapich-discuss] problems with MVAPICH2 over 10GbE
Konz, Jeffrey (SSA Solution Centers)
jeffrey.konz at hp.com
Wed Aug 17 16:48:33 EDT 2011
Jonathan,
One open issue is selecting the right port on the Mellanox NIC, which has two ports: one IB and one 10GigE.
I am not sure how to do that.
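For reference, here is the kind of launch command I plan to try, following the MV2_IBA_HCA suggestion quoted below. MV2_DEFAULT_PORT=2 is only my guess at how to force the Ethernet port (port 2), and the hostfile lists plain hostnames rather than the 10GbE IP addresses; I have not verified either:

# hosts: node hostnames, one per line (not the 10GbE IP addresses)
# MV2_DEFAULT_PORT=2 is an assumption on my part for selecting port 2 (the Ethernet port)
# ./a.out stands in for the MPI application binary
mpirun_rsh -np 2 -hostfile ./hosts \
    MV2_USE_RDMAOE=1 MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=2 ./a.out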
#ibstat
CA 'mlx4_0'
    CA type: MT26438
    Number of ports: 2
    Firmware version: 2.7.9100
    Hardware version: b0
    Node GUID: 0x78e7d10300214bbc
    System image GUID: 0x78e7d10300214bbf
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 13
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x78e7d10300214bbd
        Link layer: IB
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x7ae7d1fffe214bbd
        Link layer: Ethernet
#ibv_devinfo -v
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.7.9100
    node_guid: 78e7:d103:0021:4bbc
    sys_image_guid: 78e7:d103:0021:4bbf
    vendor_id: 0x02c9
    vendor_part_id: 26438
    hw_ver: 0xB0
    board_id: HP_0200000003
    phys_port_cnt: 2
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffe00
    max_qp: 260032
    max_qp_wr: 16351
    device_cap_flags: 0x007c9c76
    max_sge: 32
    max_sge_rd: 0
    max_cq: 65408
    max_cqe: 4194303
    max_mr: 524272
    max_pd: 32764
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4160512
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 2
    max_mcast_grp: 8192
    max_mcast_qp_attach: 56
    max_total_mcast_qp_attach: 458752
    max_ah: 0
    max_fmr: 0
    max_srq: 65472
    max_srq_wr: 16383
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 15
    port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 2048 (4)
        active_mtu: 2048 (4)
        sm_lid: 1
        port_lid: 13
        port_lmc: 0x00
        link_layer: IB
        max_msg_sz: 0x40000000
        port_cap_flags: 0x02510868
        max_vl_num: 8 (4)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 128
        gid_tbl_len: 128
        subnet_timeout: 18
        init_type_reply: 0
        active_width: 4X (2)
        active_speed: 10.0 Gbps (4)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:78e7:d103:0021:4bbd
    port: 2
        state: PORT_ACTIVE (4)
        max_mtu: 2048 (4)
        active_mtu: 1024 (3)
        sm_lid: 0
        port_lid: 0
        port_lmc: 0x00
        link_layer: Ethernet
        max_msg_sz: 0x40000000
        port_cap_flags: 0x00010000
        max_vl_num: 8 (4)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 1
        gid_tbl_len: 128
        subnet_timeout: 0
        init_type_reply: 0
        active_width: 4X (2)
        active_speed: 10.0 Gbps (4)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:7ae7:d1ff:fe21:4bbd
-Jeff
> -----Original Message-----
> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
> Sent: Wednesday, August 17, 2011 12:25 PM
> To: Konz, Jeffrey (SSA Solution Centers)
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] problems with MVAPICH2 over 10GbE
>
> Thanks for your report, I'm checking with some of the other developers
> to verify the way this should work. I believe that you do not need to
> use the IP addresses of the RDMAoE port but instead specify the
> HCA name using MV2_IBA_HCA in addition to the MV2_USE_RDMAOE=1 option.
>
> The name of the HCA can be found by using the ibstat command and
> should look something like mlx4_...
>
> On Wed, Aug 17, 2011 at 11:17 AM, Konz, Jeffrey (SSA Solution Centers)
> <jeffrey.konz at hp.com> wrote:
> > I am running on a cluster with the Mellanox LOM that supports both IB and 10 GbE.
> > Both ports on the interface are active; one is on the IB network, the other on the 10 GbE network.
> >
> > I built mvapich2-1.7rc1 with these options: "--with-device=ch3:mrail --with-rdma=gen2"
> >
> > Running over IB works fine.
> >
> > When I try to run over the 10GbE network with the "MV2_USE_RDMAOE=1" option I get this error:
> >
> > Fatal error in MPI_Init:
> > Internal MPI error!
> >
> > [atl3-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > [atl3-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [atl3-13:mpispawn_0][child_handler] MPI process (rank: 0, pid: 23500) exited with status 1
> > [atl3-13:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 10.10.0.149 aborted: Error while reading a PMI socket (4)
> >
> > In the hostfile I specified the IP addresses of the 10 GbE ports.
> >
> > Am I running incorrectly, or have I not built mvapich with the correct options?
> >
> > Thanks,
> >
> > -Jeff
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo