[mvapich-discuss] Unable to run MPI hello world program on KVM guests with SR-IOV
Pharthiphan Asokan
pasokan at ddn.com
Fri Apr 20 07:43:49 EDT 2018
Hi Folks,
I am unable to run an MPI hello world program on KVM guests with SR-IOV: the job hangs with no output on stdout, and the error messages below appear on stdout only after I press Ctrl+C.
Note: this worked straightforwardly with CentOS 7.3 + MOFED 4.3 and the latest MVAPICH2 release, but my application requirements force me to use CentOS 7.4 + MOFED 4.2.
Environment details are included below. Please help us get MVAPICH2 working.
Running a system program (hostname) on two KVM guests works:
# mpirun_rsh -np 2 vcn01 vcn02 hostname
vcn01
vcn02
#
Running the MPI hello world program on a single client works (source sketched after this run):
# mpirun_rsh -np 1 vcn01 /home/pasokan/a.out
Hello world from processor vcn01, rank 0 out of 1 processors
#
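For reference, the a.out above is just the usual MPI hello world, along these lines (hello.c is a placeholder file name, and the mpicc path assumes the install prefix visible in the log below):

/* Minimal MPI hello world: prints processor name, rank, and world size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                      /* start the MPI runtime */

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of ranks */

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* this process's rank */

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();                              /* shut down the MPI runtime */
    return 0;
}

Compiled with:
# /home/pasokan/mvapich2-2.3rc1/bin/mpicc hello.c -o a.out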
Running the MPI hello world program on two KVM guests hangs until interrupted:
# mpirun_rsh -np 2 vcn01 vcn02 /home/pasokan/a.out
^C[vcn01:mpirun_rsh][signal_processor] Caught signal 2, killing job
[root@vcn01 pasokan]# ^C
[root@vcn01 pasokan]# [vcn01:mpispawn_0][error_sighandler] Caught error: Segmentation fault (signal 11)
/usr/bin/bash: line 1: 4444 Segmentation fault /usr/bin/env LD_LIBRARY_PATH=/home/pasokan/mvapich2-2.3rc1/lib:/opt/ddn/ime/lib MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=vcn01 MPISPAWN_MPIRUN_HOSTIP=10.52.100.1 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=52794 MPISPAWN_MPIRUN_PORT=52794 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=4439 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_338_vcn01_4439 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='/home/pasokan/a.out' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/pasokan MPISPAWN_MPIRUN_RANK_0=0 /home/pasokan/mvapich2-2.3rc1/bin/mpispawn 0
[vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vcn02:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vcn02:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
[vcn02:mpispawn_1][report_error] connect() failed: Connection refused (111)
#
MVAPICH2 Version
mvapich2-2.3rc1
ulimit
clush -b -w vcn[01-02] ulimit -l
---------------
vcn[01-02] (2)
---------------
unlimited
KVM Host Information:
OS version
CentOS Linux release 7.3.1611 (Core)
Kernel Version
3.10.0-514.el7.x86_64
OFED info
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)
IB Card Info
# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 10.16.1200
node_guid: 248a:0703:00e2:f4b0
sys_image_guid: 248a:0703:00e2:f4b0
vendor_id: 0x02c9
vendor_part_id: 4113
hw_ver: 0x0
board_id: MT_1230110019
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 4
port_lid: 18
port_lmc: 0x00
link_layer: InfiniBand
KVM Version
# /usr/libexec/qemu-kvm --version
QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-126.el7), Copyright (c) 2003-2008 Fabrice Bellard
libvirt version
libvirt-3.2.0-14.el7_4.9.x86_64
KVM Guest Information:
OS version
CentOS Linux release 7.4.1708 (Core)
Kernel Version
3.10.0-693.17.1.el7.x86_64
OFED info
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0)
IB Card Info
# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 10.16.1200
node_guid: 0111:3344:7766:7790
sys_image_guid: 248a:0703:00e2:f4b0
vendor_id: 0x02c9
vendor_part_id: 4114
hw_ver: 0x0
board_id: MT_1230110019
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 4
port_lid: 18
port_lmc: 0x00
link_layer: InfiniBand
Regards,
Pharthiphan Asokan