[mvapich-discuss] Unable to run mpi hello world program on KVM guests with SR-IOV

Jie Zhang zhang.2794 at buckeyemail.osu.edu
Fri Apr 20 16:04:51 EDT 2018


Hi, Pharthiphan,

Can you please try basic IB primitive tests, such as ib_send_lat, across two KVM guests to see whether they work well?
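
For example, with the perftest tools installed on both guests, something along these lines (assuming vcn01 and vcn02 can reach each other, as in your mpirun_rsh runs):

On vcn02 (server side, waits for a connection):
# ib_send_lat

On vcn01 (client side, pointing at the server):
# ib_send_lat vcn02

If the verbs-level test already hangs or fails between the two guests, the problem is likely at the SR-IOV/MOFED layer rather than in MVAPICH2.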


> On Apr 20, 2018, at 7:43 AM, Pharthiphan Asokan <pasokan at ddn.com> wrote:
> 
> Hi Folks,
> 
> Unable to run an MPI hello world program on KVM guests with SR-IOV: it hangs [no output on stdout], and error messages appear on stdout when I press CTRL+C (see below).
> 
> Note: it worked straightforwardly when I had CentOS 7.3 + MOFED 4.3 with the latest mvapich2 release, but my application requirements push me to CentOS 7.4 + MOFED 4.2.
> 
> Environment details are included in this mail. Please help us get mvapich2 working.
> 
> 
> System program (hostname) on two KVM guests
> 
> # mpirun_rsh -np 2 vcn01 vcn02 hostname
> vcn01
> vcn02
> #
> 
> 
> MPI hello world program on a single guest
> 
> # mpirun_rsh -np 1 vcn01 /home/pasokan/a.out
> Hello world from processor vcn01, rank 0 out of 1 processors
> #
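> 
> For reference, a.out is essentially the standard MPI hello world (a sketch matching the output above, compiled with something like mpicc hello.c -o a.out; not necessarily the exact source):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         /* Start the MPI runtime */
>         MPI_Init(&argc, &argv);
> 
>         /* Total number of ranks and this process's rank */
>         int world_size, world_rank;
>         MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>         MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> 
>         /* Name of the host this rank runs on */
>         char processor_name[MPI_MAX_PROCESSOR_NAME];
>         int name_len;
>         MPI_Get_processor_name(processor_name, &name_len);
> 
>         printf("Hello world from processor %s, rank %d out of %d processors\n",
>                processor_name, world_rank, world_size);
> 
>         /* Shut down MPI cleanly */
>         MPI_Finalize();
>         return 0;
>     }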
> 
> 
> MPI hello world program on two KVM guests
> 
> 
> # mpirun_rsh -np 2 vcn01 vcn02  /home/pasokan/a.out
> ^C[vcn01:mpirun_rsh][signal_processor] Caught signal 2, killing job
> [root at vcn01 pasokan]# ^C
> [root at vcn01 pasokan]# [vcn01:mpispawn_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> /usr/bin/bash: line 1:  4444 Segmentation fault      /usr/bin/env LD_LIBRARY_PATH=/home/pasokan/mvapich2-2.3rc1/lib:/opt/ddn/ime/lib MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=vcn01 MPISPAWN_MPIRUN_HOSTIP=10.52.100.1 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=52794 MPISPAWN_MPIRUN_PORT=52794 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=4439 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_338_vcn01_4439 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='/home/pasokan/a.out' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/pasokan MPISPAWN_MPIRUN_RANK_0=0 /home/pasokan/mvapich2-2.3rc1/bin/mpispawn 0
> [vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [vcn02:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [vcn02:mpispawn_1][report_error] connect() failed: Connection refused (111)
> 
> #
> 
> MVAPICH2 Version
> 
> mvapich2-2.3rc1
> 
> ulimit
> 
> clush -b -w vcn[01-02] ulimit -l 
> ---------------
> vcn[01-02] (2)
> ---------------
> unlimited
> 
> 
> KVM Host Information :-
> 
> OS version
> 
> CentOS Linux release 7.3.1611 (Core)
> 
> Kernel Version
> 
> 3.10.0-514.el7.x86_64
> 
> OFED info
> 
> MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)
> 
> IB Card Info
> 
> # ibv_devinfo
> hca_id:    mlx5_0
>     transport:            InfiniBand (0)
>     fw_ver:                10.16.1200
>     node_guid:            248a:0703:00e2:f4b0
>     sys_image_guid:            248a:0703:00e2:f4b0
>     vendor_id:            0x02c9
>     vendor_part_id:            4113
>     hw_ver:                0x0
>     board_id:            MT_1230110019
>     phys_port_cnt:            1
>     Device ports:
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        4096 (5)
>             active_mtu:        4096 (5)
>             sm_lid:            4
>             port_lid:        18
>             port_lmc:        0x00
>             link_layer:        InfiniBand
> 
> 
> KVM Version
> 
> # /usr/libexec/qemu-kvm --version
> QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-126.el7), Copyright (c) 2003-2008 Fabrice Bellard
> 
> libvirt version
> 
> libvirt-3.2.0-14.el7_4.9.x86_64
> 
> KVM Guest Information :-
> 
> OS version
> 
> CentOS Linux release 7.4.1708 (Core)
> 
> Kernel Version
> 
> 3.10.0-693.17.1.el7.x86_64
> 
> 
> OFED info
> 
> MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)
> 
> IB Card Info
> 
> 
> # ibv_devinfo
> hca_id:    mlx5_0
>     transport:            InfiniBand (0)
>     fw_ver:                10.16.1200
>     node_guid:            0111:3344:7766:7790
>     sys_image_guid:            248a:0703:00e2:f4b0
>     vendor_id:            0x02c9
>     vendor_part_id:            4114
>     hw_ver:                0x0
>     board_id:            MT_1230110019
>     phys_port_cnt:            1
>     Device ports:
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        4096 (5)
>             active_mtu:        4096 (5)
>             sm_lid:            4
>             port_lid:        18
>             port_lmc:        0x00
>             link_layer:        InfiniBand
> 
> Regards,
> Pharthiphan Asokan
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

