[mvapich-discuss] mvapich2-2.3.3 over connectX-5 regression issue
Subramoni, Hari
subramoni.1 at osu.edu
Fri Apr 10 09:52:20 EDT 2020
Hi, Honggang.
It looks like your systems have multiple network adapters that have been set up in different modes (InfiniBand and Ethernet). In such a scenario, I would recommend explicitly setting the network adapter you want MVAPICH2 to use.
e.g. MV2_IBA_HCA=mlx5_0 or MV2_IBA_HCA=mlx5_1
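For illustration, here is a minimal sketch of how the variable could be passed to each launcher. It assumes the Hydra-based mpirun accepts `-genv NAME VALUE` and that mpirun_rsh accepts `NAME=VALUE` pairs before the executable, as described in the MVAPICH2 user guide. The helper names `hydra_cmd` and `rsh_cmd` are hypothetical; the paths and hostfile are copied from the original report. The functions only print the command lines so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: build launch commands that pin MVAPICH2 to one HCA.
# Paths are taken from the original report; adjust them for your system.
MV2_BIN=/usr/lib64/mvapich2/bin
HOSTFILE=/root/hfile_one_core

# Print the mpirun (Hydra) command line for the HCA named in $1.
# Assumption: -genv NAME VALUE exports the variable to every rank.
hydra_cmd() {
    printf '%s\n' "$MV2_BIN/mpirun -genv MV2_IBA_HCA $1 -np 2 -hostfile $HOSTFILE $MV2_BIN/mpitests-osu_latency"
}

# Print the mpirun_rsh command line for the HCA named in $1.
# Assumption: NAME=VALUE pairs are placed before the executable.
rsh_cmd() {
    printf '%s\n' "$MV2_BIN/mpirun_rsh -np 2 -hostfile $HOSTFILE MV2_IBA_HCA=$1 $MV2_BIN/mpitests-osu_latency"
}

hydra_cmd mlx5_0
rsh_cmd mlx5_0
```

Once a printed line looks right, it can be pasted into the shell. Trying mlx5_0 and mlx5_1 in turn shows whether either InfiniBand adapter avoids the failure.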
Best,
Hari.
-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Honggang LI
Sent: Friday, April 10, 2020 3:56 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] mvapich2-2.3.3 over connectX-5 regression issue
hi
short summary:
+----------+----------+------------+
| mvapich2 | mpirun   | mpirun_rsh |
| version  |          |            |
+----------+----------+------------+
| 2.3.2    | works    | hang       |
+----------+----------+------------+
| 2.3.3    | failed   | hang       |
+----------+----------+------------+
Is it possible to run something like 'git bisect' to narrow down the source of the regression? It seems no git repo is publicly available.
I don't know how to run 'git bisect' with the SVN repo.
thanks
[root at rdma-virt-02 ~]$ cat hfile_one_core
172.31.0.202
172.31.0.203
[root at rdma-virt-02 ~]$ ip addr show | grep -w 172.31.0.202
inet 172.31.0.202/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx5_ib0
[root at rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Msg from 1: wc.status=12, wc.wr_id=0x560c8bac9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=1
: Protocol not supported (93)
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Send desc error in msg to 0, wc_opcode=0
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Msg from 0: wc.status=12, wc.wr_id=0x563896cf9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=0
: Protocol not supported (93)
[root at rdma-virt-02 ~]$ dnf downgrade mvapich2
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Last metadata expiration check: 2:24:17 ago on Fri 10 Apr 2020 01:18:01 AM EDT.
Dependencies resolved.
========================================================================================================================================
Package Architecture Version Repository Size
========================================================================================================================================
Downgrading:
mvapich2 x86_64 2.3.2-2.el8 beaker-AppStream 3.1 M
Transaction Summary
========================================================================================================================================
Downgrade 1 Package
Total download size: 3.1 M
Is this ok [y/N]: y
Downloading Packages:
mvapich2-2.3.2-2.el8.x86_64.rpm 39 MB/s | 3.1 MB 00:00
----------------------------------------------------------------------------------------------------------------------------------------
Total 39 MB/s | 3.1 MB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Downgrading : mvapich2-2.3.2-2.el8.x86_64 1/2
Cleanup : mvapich2-2.3.3-1.el8.x86_64 2/2
Running scriptlet: mvapich2-2.3.3-1.el8.x86_64 2/2
Verifying : mvapich2-2.3.2-2.el8.x86_64 1/2
Verifying : mvapich2-2.3.3-1.el8.x86_64 2/2
Installed products updated.
Downgraded:
mvapich2-2.3.2-2.el8.x86_64
Complete!
[root at rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
# OSU MPI Latency Test v5.4.1
# Size Latency (us)
0 1.24
1 1.29
2 1.29
4 1.29
8 1.29
16 1.34
32 1.35
64 1.36
128 1.42
256 1.82
512 1.92
1024 2.11
2048 2.53
4096 3.48
8192 5.19
16384 7.37
32768 10.12
65536 15.00
131072 24.69
262144 44.15
524288 82.97
1048576 160.92
2097152 316.19
4194304 626.91
[root at rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun_rsh -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
(hangs; no output)
[root at rdma-virt-03 ~]$ ibstat
CA 'mlx5_bond_0'
CA type: MT4117
Number of ports: 1
Firmware version: 14.25.1020
Hardware version: 0
Node GUID: 0xe41d2d0300fda736
System image GUID: 0xe41d2d0300fda736
Port 1:
State: Active
Physical state: LinkUp
Rate: 25
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xe61d2dfffefda736
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4115
Number of ports: 1
Firmware version: 12.25.1020
Hardware version: 0
Node GUID: 0xe41d2d0300e70e87
System image GUID: 0xe41d2d0300e70e86
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 30
LMC: 0
SM lid: 1
Capability mask: 0x2659e848
Port GUID: 0xe41d2d0300e70e87
Link layer: InfiniBand
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.25.1020
Hardware version: 0
Node GUID: 0xe41d2d0300e70e86
System image GUID: 0xe41d2d0300e70e86
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 20
LMC: 0
SM lid: 13
Capability mask: 0x2659e848
Port GUID: 0xe41d2d0300e70e86
Link layer: InfiniBand
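As a rough sketch, the adapters whose port reports an InfiniBand link layer could be picked out of ibstat output like this. The function name `ib_hcas` is hypothetical, and the parsing assumes the `CA '<name>'` and `Link layer:` lines look like the ones shown above; other ibstat versions may format them differently.

```shell
#!/bin/sh
# Sketch only: list CA names whose port reports "Link layer: InfiniBand".
# Reads ibstat-style text on stdin and prints one device name per line.
ib_hcas() {
    awk -F"'" '
        # A header line such as: CA <quote>mlx5_0<quote>
        # With the quote as field separator, field 2 is the device name.
        NF >= 3 && $1 ~ /^[ \t]*CA $/ { ca = $2 }
        # Print the current CA when one of its ports is in InfiniBand mode.
        $0 ~ /Link layer:[ \t]*InfiniBand/ { print ca }
    '
}

# Run against the live system only when ibstat is installed.
if command -v ibstat >/dev/null 2>&1; then
    ibstat | ib_hcas
fi
```

For the ibstat output above, this prints mlx5_1 and mlx5_0 and skips mlx5_bond_0, whose link layer is Ethernet.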
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss