[mvapich-discuss] mvapich2-2.3.3 over connectX-5 regression issue

Honggang LI honli at redhat.com
Fri Apr 10 03:55:40 EDT 2020


Hi,

Short summary:
+----------+----------+-----------+
|mvapich2  | mpirun   | mpirun_rsh|
|version   |          |           |
+----------+----------+-----------+
|2.3.2     | works    | hang      |
+----------+----------+-----------+
|2.3.3     | failed   | hang      |
+----------+----------+-----------+

Is it possible to run something like 'git bisect' to narrow down the
source of this regression? There seems to be no public git repository,
and I don't know how to run 'git bisect' against the SVN repo.
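
For reference, the search that 'git bisect' performs can be done by hand over any ordered list of revisions, including SVN revision numbers. The following is a minimal sketch, not a tested workflow: the revision numbers are hypothetical, and `is_bad` stands in for whatever procedure checks out a revision, builds it, and runs the failing osu_latency test.

```python
from typing import Callable, Sequence


def first_bad_revision(revisions: Sequence[int],
                       is_bad: Callable[[int], bool]) -> int:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions are sorted oldest to newest, the first one is
    known good, the last one is known bad, and the failure persists
    once introduced (the same assumption 'git bisect' makes).
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # failure already present: look earlier
        else:
            lo = mid  # still works: look later
    return revisions[hi]


if __name__ == "__main__":
    # Hypothetical revision range; a real is_bad() would build the
    # revision and run the osu_latency test from this report.
    revs = list(range(100, 121))
    print(first_bad_revision(revs, lambda r: r >= 113))
```

Each step halves the candidate range, so about log2(N) builds are needed for N revisions between the good and the bad release.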

Thanks

[root@rdma-virt-02 ~]$ cat hfile_one_core
172.31.0.202
172.31.0.203
[root@rdma-virt-02 ~]$ ip addr show | grep -w 172.31.0.202
    inet 172.31.0.202/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx5_ib0


[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun  -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] Msg from 1: wc.status=12, wc.wr_id=0x560c8bac9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=1
: Protocol not supported (93)
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Send desc error in msg to 0, wc_opcode=0
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] Msg from 0: wc.status=12, wc.wr_id=0x563896cf9040, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[rdma-virt-03.lab.bos.redhat.com:mpi_rank_1][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 12, vendor code=0x81, dest rank=0
: Protocol not supported (93)
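
The `wc.status=12` in the lines above is a value of `enum ibv_wc_status` from libibverbs (`<infiniband/verbs.h>`), where 12 is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). A small sketch for decoding such log lines follows; the table is only a subset of the enum, and the helper name `decode_wc_status` is made up for illustration.

```python
import re

# Subset of enum ibv_wc_status from <infiniband/verbs.h> (libibverbs).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def decode_wc_status(line):
    """Return the symbolic name for the wc.status value in a log line.

    Returns None when the line carries no wc.status field, and an
    'unknown (N)' string for codes missing from the subset above.
    """
    match = re.search(r"wc\.status=(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return IBV_WC_STATUS.get(code, f"unknown ({code})")


if __name__ == "__main__":
    log = ("[rdma-virt-02.lab.bos.redhat.com:mpi_rank_0][handle_cqe] "
           "Msg from 1: wc.status=12, wc.wr_id=0x560c8bac9040, wc.opcode=0")
    print(decode_wc_status(log))
```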

[root@rdma-virt-02 ~]$ dnf downgrade mvapich2
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Last metadata expiration check: 2:24:17 ago on Fri 10 Apr 2020 01:18:01 AM EDT.
Dependencies resolved.
========================================================================================================================================
 Package                       Architecture                Version                          Repository                             Size
========================================================================================================================================
Downgrading:
 mvapich2                      x86_64                      2.3.2-2.el8                      beaker-AppStream                      3.1 M

Transaction Summary
========================================================================================================================================
Downgrade  1 Package

Total download size: 3.1 M
Is this ok [y/N]: y
Downloading Packages:
mvapich2-2.3.2-2.el8.x86_64.rpm                                                                          39 MB/s | 3.1 MB     00:00
----------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                    39 MB/s | 3.1 MB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                1/1
  Downgrading      : mvapich2-2.3.2-2.el8.x86_64                                                                                    1/2
  Cleanup          : mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
  Running scriptlet: mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
  Verifying        : mvapich2-2.3.2-2.el8.x86_64                                                                                    1/2
  Verifying        : mvapich2-2.3.3-1.el8.x86_64                                                                                    2/2
Installed products updated.

Downgraded:
  mvapich2-2.3.2-2.el8.x86_64

Complete!
[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun  -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                       1.24
1                       1.29
2                       1.29
4                       1.29
8                       1.29
16                      1.34
32                      1.35
64                      1.36
128                     1.42
256                     1.82
512                     1.92
1024                    2.11
2048                    2.53
4096                    3.48
8192                    5.19
16384                   7.37
32768                  10.12
65536                  15.00
131072                 24.69
262144                 44.15
524288                 82.97
1048576               160.92
2097152               316.19
4194304               626.91
[root@rdma-virt-02 ~]$ /usr/lib64/mvapich2/bin/mpirun_rsh  -np 2 -hostfile /root/hfile_one_core /usr/lib64/mvapich2/bin/mpitests-osu_latency

(hangs, no output)

[root@rdma-virt-03 ~]$ ibstat
CA 'mlx5_bond_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda736
	System image GUID: 0xe41d2d0300fda736
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xe61d2dfffefda736
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300e70e87
	System image GUID: 0xe41d2d0300e70e86
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 30
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0xe41d2d0300e70e87
		Link layer: InfiniBand
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.25.1020
	Hardware version: 0
	Node GUID: 0xe41d2d0300e70e86
	System image GUID: 0xe41d2d0300e70e86
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 20
		LMC: 0
		SM lid: 13
		Capability mask: 0x2659e848
		Port GUID: 0xe41d2d0300e70e86
		Link layer: InfiniBand
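
The ibstat output above lists three CAs, one with an Ethernet link layer (mlx5_bond_0) and two with InfiniBand (mlx5_1, mlx5_0). As a rough sketch, such output can be summarized per device as below. The parser assumes the `CA '<name>'` / `key: value` layout shown above with one port per CA, and the function names are made up for illustration.

```python
import re


def parse_ibstat(text):
    """Return one dict per CA with its name, CA type, rate and link layer.

    Assumes the layout printed by ibstat above, with one port per CA.
    """
    cas = []
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"CA '(.+)'$", line)
        if header:
            cas.append({"name": header.group(1)})
            continue
        if not cas or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        # Lines such as "Port 1:" have no value and are skipped.
        if key in ("CA type", "Rate", "Link layer") and value:
            cas[-1][key] = value
    return cas


def infiniband_devices(cas):
    """Names of the CAs whose port reports an InfiniBand link layer."""
    return [ca["name"] for ca in cas if ca.get("Link layer") == "InfiniBand"]


if __name__ == "__main__":
    # Abbreviated copy of the ibstat output above.
    sample = (
        "CA 'mlx5_bond_0'\n\tCA type: MT4117\n\tPort 1:\n"
        "\t\tRate: 25\n\t\tLink layer: Ethernet\n"
        "CA 'mlx5_1'\n\tCA type: MT4115\n\tPort 1:\n"
        "\t\tRate: 100\n\t\tLink layer: InfiniBand\n"
    )
    for ca in parse_ibstat(sample):
        print(ca["name"], ca["CA type"], ca["Rate"], ca["Link layer"])
```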



