[mvapich-discuss] mvapich2-2.3.3 over connectX-5 regression issue

Subramoni, Hari subramoni.1 at osu.edu
Fri Apr 10 09:52:20 EDT 2020


Hi, Honggang.

It looks like your systems have multiple network adapters that have been set up in different modes (IB and Ethernet). In such a scenario, I would recommend explicitly setting the network adapter you want MVAPICH2 to use.

e.g. MV2_IBA_HCA=mlx5_0 or MV2_IBA_HCA=mlx5_1
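
As a rough sketch of how to find the adapter names to choose from, the snippet below filters ibstat output for HCAs whose port reports an InfiniBand link layer. The helper name ib_hcas is hypothetical, and it assumes the "CA '<name>'" / "Link layer:" layout seen in the ibstat output quoted further down.

```shell
# ib_hcas (hypothetical helper): read ibstat-style output on stdin and
# print the CA name once for every port that reports an InfiniBand link layer.
ib_hcas() {
    awk -F"'" '
        /^CA /                   { ca = $2 }   # remember the current adapter name
        /Link layer: InfiniBand/ { print ca }  # emit it when a port is InfiniBand
    '
}

# Typical use on a node (needs the ibstat tool to be installed):
#   ibstat | ib_hcas
```

For the ibstat output quoted in this thread, that should list mlx5_1 and mlx5_0 and skip mlx5_bond_0. With mpirun_rsh the chosen value goes on the command line before the executable, e.g. "mpirun_rsh -np 2 -hostfile /root/hfile_one_core MV2_IBA_HCA=mlx5_0 /usr/lib64/mvapich2/bin/mpitests-osu_latency"; with the Hydra-based mpirun it can be exported in the environment or passed with "-genv MV2_IBA_HCA mlx5_0".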

Best,
Hari.

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Honggang LI
Sent: Friday, April 10, 2020 3:56 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] mvapich2-2.3.3 over connectX-5 regression issue

hi

short summary:
+----------+----------+-----------+
|mvapich2  | mpirun   | mpirun_rsh|
|version   |          |           |
+----------+----------+-----------+
|2.3.2     | works    | hang      |
+----------+----------+-----------+
|2.3.3     | failed   | hang      |
+----------+----------+-----------+

Is it possible to run something like 'git bisect' to narrow down the source of the regression? There seems to be no public git repo, and I don't know how to run 'git bisect' against the SVN repo.

thanks

[root@rdma-virt-02 ~]$ cat hfile_one_core
172.31.0.202
172.31.0.203
[root@rdma-virt-02 ~]$ ip addr show | grep -w 172.31.0.202
    inet 172.31.0.202/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx5_ib0


[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Msg from 1: wc.status=12, wc.wr_id=0x560c8bac9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=1
: Protocol not supported (93)
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Send desc error in msg to 0, wc_opcode=0
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Msg from 0: wc.status=12, wc.wr_id=0x563896cf9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=0
: Protocol not supported (93)

[root@rdma-virt-02 ~]$ dnf downgrade mvapich2
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Last metadata expiration check: 2:24:17 ago on Fri 10 Apr 2020 01:18:01 AM EDT.
Dependencies resolved.
========================================================================================================================================
 Package                       Architecture                Version                          Repository                             Size
========================================================================================================================================
Downgrading:
 mvapich2                      x86_64                      2.3.2-2.el8                      beaker-AppStream                      3.1 M

Transaction Summary
========================================================================================================================================
Downgrade  1 Package

Total download size: 3.1 M
Is this ok [y/N]: y
Downloading Packages:
mvapich2-2.3.2-2.el8.x86_64.rpm                                                                          39 MB/s | 3.1 MB     00:00
----------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                    39 MB/s | 3.1 MB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                1/1
  Downgrading      : mvapich2-2.3.2-2.el8.x86_64                                                                                    1/2
  Cleanup          : mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
  Running scriptlet: mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
  Verifying        : mvapich2-2.3.2-2.el8.x86_64                                                                                    1/2
  Verifying        : mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
Installed products updated.

Downgraded:
  mvapich2-2.3.2-2.el8.x86_64

Complete!
[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                       1.24
1                       1.29
2                       1.29
4                       1.29
8                       1.29
16                      1.34
32                      1.35
64                      1.36
128                     1.42
256                     1.82
512                     1.92
1024                    2.11
2048                    2.53
4096                    3.48
8192                    5.19
16384                   7.37
32768                  10.12
65536                  15.00
131072                 24.69
262144                 44.15
524288                 82.97
1048576               160.92
2097152               316.19
4194304               626.91
[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun_rsh -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency

(hangs, no output)

[root@rdma-virt-03 ~]$ ibstat
CA 'mlx5_bond_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda736
	System image GUID: 0xe41d2d0300fda736
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xe61d2dfffefda736
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300e70e87
	System image GUID: 0xe41d2d0300e70e86
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 30
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0xe41d2d0300e70e87
		Link layer: InfiniBand
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300e70e86
	System image GUID: 0xe41d2d0300e70e86
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 20
		LMC: 0
		SM lid: 13
		Capability mask: 0x2659e848
		Port GUID: 0xe41d2d0300e70e86
		Link layer: InfiniBand

