[mvapich-discuss] Unable to run mpi hello world program on KVM guests with SR-IOV

Pharthiphan Asokan pasokan at ddn.com
Fri Apr 20 07:43:49 EDT 2018


Hi Folks,

I am unable to run an MPI hello world program on KVM guests with SR-IOV: the job hangs (no output on stdout), and the error messages below appear on stdout when I press Ctrl+C.

Note: this worked straightforwardly with CentOS 7.3 + MOFED 4.3 and the latest MVAPICH2 release, but my application requirements force me to use CentOS 7.4 + MOFED 4.2.

Environment details are included below. Please help us get MVAPICH2 running.


System program (hostname) on two KVM guests

# mpirun_rsh -np 2 vcn01 vcn02 hostname
vcn01
vcn02
#


MPI hello world program on a single guest

# mpirun_rsh -np 1 vcn01 /home/pasokan/a.out
Hello world from processor vcn01, rank 0 out of 1 processors
#
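
For reference, a.out is built from the standard MPI hello world. A minimal sketch of the source is below (hello.c is just my local file name; this is my reconstruction of the usual textbook example, which matches the output above):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Set up the MPI environment */
    MPI_Init(&argc, &argv);

    /* Query the total number of ranks and this process's rank */
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Get the processor (host) name for the greeting */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    /* Tear down MPI before exiting */
    MPI_Finalize();
    return 0;
}

Compiled with the MVAPICH2 compiler wrapper:

# mpicc hello.c -o /home/pasokan/a.out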


MPI hello world program on two KVM guests


# mpirun_rsh -np 2 vcn01 vcn02  /home/pasokan/a.out
^C[vcn01:mpirun_rsh][signal_processor] Caught signal 2, killing job
[root@vcn01 pasokan]# ^C
[root@vcn01 pasokan]# [vcn01:mpispawn_0][error_sighandler] Caught error: Segmentation fault (signal 11)
/usr/bin/bash: line 1:  4444 Segmentation fault      /usr/bin/env LD_LIBRARY_PATH=/home/pasokan/mvapich2-2.3rc1/lib:/opt/ddn/ime/lib MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=vcn01 MPISPAWN_MPIRUN_HOSTIP=10.52.100.1 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=52794 MPISPAWN_MPIRUN_PORT=52794 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=4439 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_338_vcn01_4439 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='/home/pasokan/a.out' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/pasokan MPISPAWN_MPIRUN_RANK_0=0 /home/pasokan/mvapich2-2.3rc1/bin/mpispawn 0
[vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vcn02:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
[vcn02:mpispawn_1][report_error] connect() failed: Connection refused (111)

#
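
If it helps narrow this down, I can re-run the failing job with MVAPICH2's backtrace debugging enabled to get a stack trace from the segfaulting mpispawn. MV2_DEBUG_SHOW_BACKTRACE is the run-time parameter from the MVAPICH2 user guide; the invocation below is just a sketch of how I would pass it (mpirun_rsh forwards NAME=VALUE pairs placed before the executable as environment variables to the launched processes):

# mpirun_rsh -np 2 vcn01 vcn02 MV2_DEBUG_SHOW_BACKTRACE=1 /home/pasokan/a.out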

MVAPICH2 Version

mvapich2-2.3rc1

ulimit

# clush -b -w vcn[01-02] ulimit -l
---------------
vcn[01-02] (2)
---------------
unlimited


KVM Host Information:

OS version

CentOS Linux release 7.3.1611 (Core)

Kernel Version

3.10.0-514.el7.x86_64

OFED info

MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)

IB Card Info

# ibv_devinfo
hca_id:    mlx5_0
    transport:            InfiniBand (0)
    fw_ver:                10.16.1200
    node_guid:            248a:0703:00e2:f4b0
    sys_image_guid:            248a:0703:00e2:f4b0
    vendor_id:            0x02c9
    vendor_part_id:            4113
    hw_ver:                0x0
    board_id:            MT_1230110019
    phys_port_cnt:            1
    Device ports:
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            4
            port_lid:        18
            port_lmc:        0x00
            link_layer:        InfiniBand


KVM Version

# /usr/libexec/qemu-kvm --version
QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-126.el7), Copyright (c) 2003-2008 Fabrice Bellard

libvirt version

libvirt-3.2.0-14.el7_4.9.x86_64

KVM Guest Information:

OS version

CentOS Linux release 7.4.1708 (Core)

Kernel Version

3.10.0-693.17.1.el7.x86_64


OFED info

MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)

IB Card Info


# ibv_devinfo
hca_id:    mlx5_0
    transport:            InfiniBand (0)
    fw_ver:                10.16.1200
    node_guid:            0111:3344:7766:7790
    sys_image_guid:            248a:0703:00e2:f4b0
    vendor_id:            0x02c9
    vendor_part_id:            4114
    hw_ver:                0x0
    board_id:            MT_1230110019
    phys_port_cnt:            1
    Device ports:
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            4
            port_lid:        18
            port_lmc:        0x00
            link_layer:        InfiniBand

Regards,
Pharthiphan Asokan
