[mvapich-discuss] [MVAPICH2, RoCE problem]
Min-Woo Ahn
minwoo.ahn at csl.skku.edu
Tue Dec 5 06:17:35 EST 2017
Hello MVAPICH2 users,
I'm trying to use MVAPICH2 to run MPI programs (Linpack and the NAS
Parallel Benchmarks) over RDMA (RoCE, not InfiniBand) across 2
servers (16 cores per server).
The two servers are connected through a Mellanox SX1036 40GbE switch, each
has a ConnectX-3 Pro adapter, and CentOS 7 runs as the host OS. I verified
RoCE communication between the two servers with ib_write_bw and ib_send_bw.
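For reference, the bandwidth test I ran looked roughly like this (the device
name mlx4_0 and the GID index are reconstructed from memory, so they may not
be exact):
---------------------------Command-----------------------------
# server side, on 192.168.20.11
$ib_write_bw -d mlx4_0 -x 0
# client side, on 192.168.20.12, pointing at the server
$ib_write_bw -d mlx4_0 -x 0 192.168.20.11
----------------------------------------------------------------------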
However, when I run an MPI program it fails with runtime errors.
I tried to launch NPB with both mpiexec and mpirun_rsh, but neither was able
to run.
First, I tried to run NPB (workload: ft, class: C, # of processes: 16) with
the "mpiexec" command as below:
---------------------------Command-----------------------------
$mpiexec -ppn 8 -np 16 -hostfile /NAS/hostfile ./ft.C.16
----------------------------------------------------------------------
It shows the following message:
-------------------------------Debug message-------------------------------
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(449).......:
MPID_Init(365)..............: channel initialization failed
MPIDI_CH3_Init(313).........:
MPIDI_CH3I_RDMA_init(170)...:
rdma_setup_startup_ring(389): cannot open hca device
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[cli_8], [cli_9], [cli_14], and [cli_15] abort with the same "cannot open
hca device" error stack, followed by the same BAD TERMINATION banner.
------------------------------------------------------------------------------------
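I was not sure whether any RoCE-specific runtime parameters are needed here;
the variant I was planning to try next looks like this (MV2_USE_RoCE=1 is
taken from the MVAPICH2 user guide, so please correct me if that is the
wrong parameter):
---------------------------Command-----------------------------
$mpiexec -ppn 8 -np 16 -hostfile /NAS/hostfile -env MV2_USE_RoCE 1 ./ft.C.16
----------------------------------------------------------------------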
Next, I tried to run NPB (workload: ft, class: C, # of processes: 16) with
the "mpirun_rsh" command as below:
----------------------------------Command---------------------------------
$mpirun_rsh -debug -np 16 -hostfile /NAS/hostfile ./ft.C.16
---------------------------------------------------------------------------------
It shows the following message:
--------------------------Debug message----------------------------
execv: No such file or directory
/usr/bin/xterm -e /usr/bin/ssh -q 192.168.20.11 cd
/NAS/benchmarks/NPB3.3.1/NPB3.3-MPI/bin; /usr/bin/env
LD_LIBRARY_PATH=/usr/lib64/mvapich2/lib:/usr/lib64:/NAS/benchmarks/GotoBLAS2:/root/GotoBLAS2
MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=nfs1
MPISPAWN_MPIRUN_HOSTIP=10.201.209.154 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=40293 MPISPAWN_MPIRUN_PORT=40293 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=16 MPISPAWN_MPIRUN_ID=1385 MPISPAWN_ARGC=1
MPISPAWN_ARGV_0=/usr/bin/gdb MPDMAN_KVS_TEMPLATE=kvs_588_nfs1_1385
MPISPAWN_LOCAL_NPROCS=8 MPISPAWN_ARGV_1='./ft.C.16' MPISPAWN_ARGC=2
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/NAS/benchmarks/NPB3.3.1/NPB3.3-MPI/bin
MPISPAWN_MPIRUN_RANK_0=0 MPISPAWN_MPIRUN_RANK_1=2 MPISPAWN_MPIRUN_RANK_2=4
MPISPAWN_MPIRUN_RANK_3=6 MPISPAWN_MPIRUN_RANK_4=8 MPISPAWN_MPIRUN_RANK_5=10
MPISPAWN_MPIRUN_RANK_6=12 MPISPAWN_MPIRUN_RANK_7=14
/usr/lib64/mvapich2/bin/mpispawn 0 (null)
execv: No such file or directory
/usr/bin/xterm -e /usr/bin/ssh -q 192.168.20.12 cd
/NAS/benchmarks/NPB3.3.1/NPB3.3-MPI/bin; /usr/bin/env
LD_LIBRARY_PATH=/usr/lib64/mvapich2/lib:/usr/lib64:/NAS/benchmarks/GotoBLAS2:/root/GotoBLAS2
MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=nfs1
MPISPAWN_MPIRUN_HOSTIP=10.201.209.154 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=40293 MPISPAWN_MPIRUN_PORT=40293 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=16 MPISPAWN_MPIRUN_ID=1385 MPISPAWN_ARGC=1
MPISPAWN_ARGV_0=/usr/bin/gdb MPDMAN_KVS_TEMPLATE=kvs_588_nfs1_1385
MPISPAWN_LOCAL_NPROCS=8 MPISPAWN_ARGV_1='./ft.C.16' MPISPAWN_ARGC=2
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1
MPISPAWN_WORKING_DIR=/NAS/benchmarks/NPB3.3.1/NPB3.3-MPI/bin
MPISPAWN_MPIRUN_RANK_0=1 MPISPAWN_MPIRUN_RANK_1=3 MPISPAWN_MPIRUN_RANK_2=5
MPISPAWN_MPIRUN_RANK_3=7 MPISPAWN_MPIRUN_RANK_4=9 MPISPAWN_MPIRUN_RANK_5=11
MPISPAWN_MPIRUN_RANK_6=13 MPISPAWN_MPIRUN_RANK_7=15
/usr/lib64/mvapich2/bin/mpispawn 0 (null)
[nfs1:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2
mpispawn connections)
-----------------------------------------------------------------------------
192.168.20.11 and 192.168.20.12 are the IP addresses assigned to the RoCE
(40Gb/s) interfaces, and I execute mpiexec and mpirun_rsh on the first
server (192.168.20.11).
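For reference, /NAS/hostfile simply lists these two RoCE addresses, roughly
like this (reconstructed from memory):
---------------------------Hostfile-----------------------------
192.168.20.11
192.168.20.12
----------------------------------------------------------------------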
Does anyone have a suggestion for how to solve this problem? I would
appreciate your help.
Thank you.
---------------------------------------------
Minwoo Ahn
Researcher/M.S. Candidate
Computer Systems Laboratory
Sungkyunkwan University
More information: http://csl.skku.edu/People/MWAhn
---------------------------------------------