[mvapich-discuss] Re: connect [mt_checkin]: Connection refused

201621070526 at std.uestc.edu.cn 201621070526 at std.uestc.edu.cn
Wed Mar 1 03:24:36 EST 2017


To supplement my earlier message:

When I try to use another process manager to run the test on multiple nodes, both mpirun and mpiexec hang for a long time without any output. In addition, I used top to check whether the executable was running, and I found processes running on both nodes. The weird thing is that CPU usage on the master node, which launches the processes, is close to 100%.

Best regards,

Prince.
  



201621070526 at std.uestc.edu.cn
 
From: 201621070526 at std.uestc.edu.cn
Sent: 2017-02-28 16:26
To: mvapich-discuss
Cc: subramoni.1; ammar.ahmad.awan
Subject: connect [mt_checkin]: Connection refused
Hi,

I ran into a problem while running an MVAPICH2 program on multiple nodes: "connect [mt_checkin]: Connection refused". It works well on a single node.
To clarify, I have already set up passwordless login properly, and the same code runs on multiple nodes with Open MPI. Here are more details about my settings and hardware information.
Any help will be greatly appreciated.


prince at root-220:~$ cat single_hosts
172.16.18.220
prince at root-200:~$ mpirun_rsh -n 2 -hostfile single_hosts MV2_SMP_USE_CMA=0 ./cpi
Process 1 on root-200
Process 0 on root-200
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.000165
prince at root-200:~$ cat hosts
172.16.18.220
172.16.18.158
prince at root-200:~$ mpirun_rsh -n 2 -hostfile hosts MV2_SMP_USE_CMA=0 ./cpi
connect [mt_checkin]: Connection refused
[root-200:mpirun_rsh][child_handler] Error in init phase, aborting! (1/2 mpispawn connections)
prince at root-200:~$
prince at root-200:~$ mpiname -a
MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib   -O2
FC: gfortran   -O2

Configuration
--prefix=/usr/local/mvapich2 --with-cuda --with-device=ch3:mrail --with-rdma=gen2

prince at root-200:~$ ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 1
    Firmware version: 2.36.5000
    Hardware version: 1
    Node GUID: 0xe41d2d0300bf45c0
    System image GUID: 0xe41d2d0300bf45c3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 13
        LMC: 0
        SM lid: 7
        Capability mask: 0x02514868
        Port GUID: 0xe41d2d0300bf45c1
        Link layer: InfiniBand
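
The passwordless-login precondition mentioned above can be re-checked for every entry in the hosts file with a small loop. This is only a minimal sketch, not a diagnosis of the mt_checkin error: it assumes one host per line (as in the hosts file shown above), and both the function name check_hosts and the SSH_CMD override are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# check_hosts HOSTFILE
# Attempt a non-interactive ssh login to every host listed in HOSTFILE.
# Assumption: one host per line, blank lines and '#' comments ignored.
# SSH_CMD is a hypothetical override so the loop can be dry-run without ssh.
check_hosts() {
    hostfile=$1
    failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh does not swallow the rest of the hostfile.
        if ${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5} "$host" true \
                </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done < "$hostfile"
    return $failed
}
```

Running check_hosts hosts on the launching node, and again on each of the other nodes, shows whether a login in either direction still asks for a password or is refused.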
 

Regards,
Prince.



201621070526 at std.uestc.edu.cn