[mvapich-discuss] connect [mt_checkin]: Connection refused

201621070526 at std.uestc.edu.cn 201621070526 at std.uestc.edu.cn
Wed Feb 15 21:09:19 EST 2017


Hi, Ammar.
 
I know that MVAPICH2 and MVAPICH2-GDR are two different libraries. At the beginning, I tried MVAPICH2-GDR, following the instructions from http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/, and I got the problem below (for more details please refer to the first email):

mpirun_rsh -ssh -export -np 10 -hostfile mf ../get_local_rank collective/osu_allreduce D D
connect [mt_checkin]: Connection refused
[root0-SCW4350-220:mpirun_rsh][child_handler] Error in init phase, aborting! (1/2 mpispawn connections)
huang at root0-SCW4350-220:~/program/mvapich2-gdr/libexec/osu-micro-benchmarks/mpi$ [root0-SCW4350-220:mpispawn_0][report_error] connect() failed: Connection refused (111)

Because I use Ubuntu 14.04, I used rpm2cpio to extract the library; I wonder whether that matters. After that I tried MVAPICH2 compiled with CUDA enabled, and got the problem mentioned in my last email. Both MVAPICH2 and MVAPICH2-GDR work well on a single node, but report these problems in the multi-node case. In addition, root has installed Open MPI on these machines, so I always use the full path to launch the job.
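
For reference, the extraction was done roughly like this (the RPM file name and directories below are only examples, not the exact ones used; the goal was just to make the RPM contents usable under my home directory without rpm/yum):

cd ~/program/mvapich2-gdr
rpm2cpio /path/to/mvapich2-gdr-*.rpm | cpio -idmv      # unpack the RPM payload on Ubuntu
# then point the environment at the extracted bin/ and lib64/ directories
export PATH=$HOME/program/mvapich2-gdr/bin:$PATH
export LD_LIBRARY_PATH=$HOME/program/mvapich2-gdr/lib64:$LD_LIBRARY_PATH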




201621070526 at std.uestc.edu.cn
 
From: Ammar Ahmad Awan
Date: 2017-02-16 04:11
To: 201621070526 at std.uestc.edu.cn
CC: mvapich-discuss; Hari Subramoni
Subject: Re: [mvapich-discuss] connect [mt_checkin]: Connection refused
Hello,

I think there is a library mismatch here. Please note that MVAPICH2 and MVAPICH2-GDR are two different libraries. 

If you want to use MVAPICH2-GDR, you need to download the appropriate RPMs from here: http://mvapich.cse.ohio-state.edu/downloads/

In your last email, the error you are reporting seems to be coming from your MVAPICH2 build (maybe located somewhere else on your system) and not the MVAPICH2-GDR library. 

Please double check the paths for your builds and let us know. 
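
For example, something along these lines on each node (using the paths from your earlier emails; adjust if your layout differs) would show which launcher and library are actually being picked up:

which mpirun_rsh mpiexec.hydra
~/program/mvapich2-gdr/bin/mpiname -a      # should report MVAPICH2-GDR 2.2, not a plain MVAPICH2 build
ldd ~/program/mvapich2-gdr/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce | grep -i mpi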


Regards,
Ammar


On Sun, Feb 12, 2017 at 9:48 PM, 201621070526 at std.uestc.edu.cn <201621070526 at std.uestc.edu.cn> wrote:
hi Hari,

I have tested mpiexec.hydra in both the intra-node and inter-node cases. In the intra-node case it works properly, just like the other job launchers, but in the inter-node case it failed again... Here is the crash info.

hd at root0-SCW4350-220:~/program/mvapich2$ ./bin/mpiexec.hydra -np 2 -f ../nccl/hosts_2 ./libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce 
GPU CUDA support is not configured. Please reconfigure MVAPICH2 library with --enable-cuda option.
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(514): 
MPID_Init(365).......: channel initialization failed
MPIDI_CH3_Init(137)..: 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4163 RUNNING AT 172.16.18.158
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at root0-SCW4350-220] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:0 at root0-SCW4350-220] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at root0-SCW4350-220] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at root0-SCW4350-220] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at root0-SCW4350-220] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at root0-SCW4350-220] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at root0-SCW4350-220] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion


hd at root0-SCW4350-220:~/program/mvapich2$ cat ../nccl/hosts_2 
172.16.18.220
172.16.18.158

The compile configuration:

./configure --with-device=ch3:mrail --with-rdma=gen2  --enable-cuda=/usr/local/cuda --prefix=/home/hd/program/mvapich2
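
(For reference, a quick check like the following, a sketch assuming the same home-directory prefix exists on both nodes, should show whether the remote node sees the same CUDA-enabled build:

~/program/mvapich2/bin/mpiname -a | grep -i cuda
ssh 172.16.18.158 '~/program/mvapich2/bin/mpiname -a | grep -i cuda' )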


That is the related info, hope it helps...


thanks again.
regards, 
hd.




201621070526 at std.uestc.edu.cn
 
From: Hari Subramoni
Date: 2017-02-10 21:02
To: 201621070526 at std.uestc.edu.cn
CC: mvapich-discuss
Subject: Re: [mvapich-discuss] connect [mt_checkin]: Connection refused
Can you please try using the hydra job launcher (mpiexec.hydra) to see if that works? You should be able to find it in the same place as mpirun_rsh. 
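
For example, an invocation along these lines (following the paths from your mpirun_rsh command; adjust the mpiexec.hydra path to wherever it ended up in your extracted tree):

~/program/mvapich2-gdr/bin/mpiexec.hydra -np 10 -f mf ../get_local_rank collective/osu_allreduce D D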

Thx, 
Hari. 

On Feb 10, 2017 3:19 AM, "201621070526 at std.uestc.edu.cn" <201621070526 at std.uestc.edu.cn> wrote:
hi,  Hari

I have already set up password-less ssh login between both nodes, and even localhost has been set up as well. I am not the root user. As for the firewall you mentioned, I think that is not the case, because I have tested Open MPI and it works well.
So I suspect there is something wrong with the /etc/hosts setting?

As XIAOYI mentioned in the following discussion, he solved the problem by properly setting /etc/hosts, but I have no idea about the details....

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2017-January/006289.html 
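
(If it helps, a sketch of what "properly set" might look like here, based only on the addresses in this thread and not a confirmed fix, is that each node's own hostname resolves to its cluster IP rather than the 127.0.1.1 loopback line, e.g. on the .220 node:

127.0.0.1      localhost
172.16.18.220  root0-SCW4350-220 node1
172.16.18.158  node2 )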


sincerely, HD 



201621070526 at std.uestc.edu.cn
 
From: Hari Subramoni
Date: 2017-02-10 08:46
To: 201621070526 at std.uestc.edu.cn
CC: mvapich-discuss
Subject: Re: [mvapich-discuss] connect [mt_checkin]: Connection refused
It looks like a system issue. It could be that password-less ssh is not set up; this is very likely if the user is root. There could also be a firewall blocking access to the nodes in the host file. Can you please check on these?
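
For reference, checks along these lines would cover both points (standard commands; "node2" here is just the name from your /etc/hosts, substitute whatever your host file uses):

ssh-keygen -t rsa        # only if no key exists yet
ssh-copy-id node2        # repeat in the other direction from node2
ssh node2 hostname       # must return without a password prompt
sudo iptables -L -n      # or 'sudo ufw status' on Ubuntu, to inspect firewall rules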

Regards, 
Hari. 


On Feb 9, 2017 6:43 PM, "201621070526 at std.uestc.edu.cn" <201621070526 at std.uestc.edu.cn> wrote:
Hi, I am using MVAPICH2-GDR 2.2 and got the same problem.

mpirun_rsh -ssh -export -np 10 -hostfile mf ../get_local_rank collective/osu_allreduce D D
connect [mt_checkin]: Connection refused
[root0-SCW4350-220:mpirun_rsh][child_handler] Error in init phase, aborting! (1/2 mpispawn connections)
huang at root0-SCW4350-220:~/program/mvapich2-gdr/libexec/osu-micro-benchmarks/mpi$ [root0-SCW4350-220:mpispawn_0][report_error] connect() failed: Connection refused (111)


Here is the content of my /etc/hosts:
127.0.0.1 localhost
127.0.1.1 root0-SCW4350-220
172.16.18.220 node1
172.16.18.158 node2

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters


Here is the info about my MVAPICH2 and IB setup:

mpiname -a
MVAPICH2-GDR 2.2 Tue Oct 25 22:00:00 EST 2016 ch3:mrail

Compilation
CC: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic   -DNDEBUG -DNVALGRIND -O2
CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic  -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/gfortran/modules  -O2
FC: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/gfortran/modules  -O2
Configuration
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2/gdr/2.2/cuda8.0/gnu --exec-prefix=/opt/mvapich2/gdr/2.2/cuda8.0/gnu --bindir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin --sbindir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/sbin --sysconfdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/etc --datadir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share --includedir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/include --libdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64 --libexecdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share/man --infodir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --disable-mcast --without-hydra-ckpointlib --with-core-direct --enable-cuda CPPFLAGS=-I/usr/local/cuda-8.0/include LDFLAGS=-L/usr/local/cuda-8.0/lib64 -Wl,-rpath,/usr/local/cuda-8.0/lib64 -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id

ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.34.5000
Hardware version: 1
Node GUID: 0xe41d2d0300bf45c0
System image GUID: 0xe41d2d0300bf45c3
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 13
LMC: 0
SM lid: 7
Capability mask: 0x02514868
Port GUID: 0xe41d2d0300bf45c1
Link layer: InfiniBand

Any help would be greatly appreciated...




201621070526 at std.uestc.edu.cn


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

