[mvapich-discuss] connect [mt_checkin]: Connection refused

Ammar Ahmad Awan ammar.ahmad.awan at gmail.com
Wed Feb 15 15:11:37 EST 2017


Hello,

I think there is a library mismatch here. Please note that MVAPICH2 and
MVAPICH2-GDR are two different libraries.

If you want to use MVAPICH2-GDR, you need to download the appropriate RPMs
from here: http://mvapich.cse.ohio-state.edu/downloads/

In your last email, the error you are reporting seems to be coming from
your MVAPICH2 build (maybe located somewhere else on your system) and not
the MVAPICH2-GDR library.

Please double check the paths for your builds and let us know.
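
For example, a quick sketch of the checks (the benchmark path is the one from your earlier mail; adjust it if your layout differs):

    which mpiexec.hydra mpirun_rsh
    mpiname -a
    ldd ./libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce | grep -i mpi

If these point at the self-built /home/hd/program/mvapich2 tree rather than the MVAPICH2-GDR install under /opt/mvapich2/gdr, then the GDR library is not the one actually being used.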


Regards,
Ammar


On Sun, Feb 12, 2017 at 9:48 PM, 201621070526 at std.uestc.edu.cn <
201621070526 at std.uestc.edu.cn> wrote:

> hi Hari,
>
> I have tested mpiexec.hydra in both the intranode and internode cases. In the intranode
> case it works properly, just like the other job launchers, but in the internode
> case it failed again...  here is the crash info.
>
> hd at root0-SCW4350-220:~/program/mvapich2$ ./bin/mpiexec.hydra -np 2 -f ../nccl/hosts_2 ./libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce
> GPU CUDA support is not configured. Please reconfigure MVAPICH2 library with --enable-cuda option.
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(514):
> MPID_Init(365).......: channel initialization failed
> MPIDI_CH3_Init(137)..:
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 4163 RUNNING AT 172.16.18.158
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at root0-SCW4350-220] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:0 at root0-SCW4350-220] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at root0-SCW4350-220] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec at root0-SCW4350-220] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at root0-SCW4350-220] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at root0-SCW4350-220] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec at root0-SCW4350-220] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
>
> hd at root0-SCW4350-220:~/program/mvapich2$ cat ../nccl/hosts_2
> 172.16.18.220
> 172.16.18.158
>
> The compile configuration:
>
> ./configure --with-device=ch3:mrail --with-rdma=gen2 --enable-cuda=/usr/local/cuda --prefix=/home/hd/program/mvapich2
>
>
> That's the related info, hope it helps...
>
>
> thanks again.
> regards,
> hd.
>
>
> ------------------------------
> 201621070526 at std.uestc.edu.cn
>
>
> *From:* Hari Subramoni <subramoni.1 at osu.edu>
> *Date:* 2017-02-10 21:02
> *To:* 201621070526 at std.uestc.edu.cn
> *CC:* mvapich-discuss <mvapich-discuss at cse.ohio-state.edu>
> *Subject:* Re: [mvapich-discuss] connect [mt_checkin]: Connection refused
> Can you please try using the hydra job launcher (mpiexec.hydra) to see if
> that works? You should be able to find it in the same place as mpirun_rsh.
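>
> For example (the hostfile name and install path below are only placeholders), a hydra launch would look roughly like:
>
>     /path/to/mvapich2/bin/mpiexec.hydra -np 2 -f hostfile ./osu_allreduce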
>
> Thx,
> Hari.
>
> On Feb 10, 2017 3:19 AM, "201621070526 at std.uestc.edu.cn" <
> 201621070526 at std.uestc.edu.cn> wrote:
>
>> hi Hari,
>>
>> I have already set up passwordless ssh login for both nodes (even localhost has been set up as well), and I am not the root user. You mentioned it might be caused by a firewall, but I think that is not the case, because I have tested Open MPI and it works well.
>> So I suspect there is something wrong with my */etc/hosts* setting?
>>
>> As XIAOYI mentioned in the following discussion, he solved the problem by properly setting */etc/hosts*, but I have NO idea about the details....
>>
>> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2017-January/006289.html
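>>
>> (A rough guess at what "properly set" might mean here: each node's real hostname should resolve to the address the other node can actually reach, rather than a 127.0.1.1 loopback entry, e.g. something like
>>
>> 172.16.18.220 root0-SCW4350-220
>> 172.16.18.158 <hostname-of-node2>
>>
>> where the second hostname is only a placeholder. But I am not sure about the details.)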
>>
>>
>> sincerely, HD
>>
>> ------------------------------
>> 201621070526 at std.uestc.edu.cn
>>
>>
>> *From:* Hari Subramoni <subramoni.1 at osu.edu>
>> *Date:* 2017-02-10 08:46
>> *To:* 201621070526 at std.uestc.edu.cn
>> *CC:* mvapich-discuss <mvapich-discuss at cse.ohio-state.edu>
>> *Subject:* Re: [mvapich-discuss] connect [mt_checkin]: Connection refused
>> It looks like a system issue. It could be that passwordless ssh is not
>> set up; this is very likely if the user is root. There could also be a
>> firewall blocking access to the nodes in the host file. Can you please
>> check on these?
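>>
>> For example (the node names below are placeholders for the entries in your host file), a quick check could be:
>>
>>     ssh node1 hostname    # should complete without a password prompt
>>     ssh node2 hostname
>>
>> and for firewalls, checking the status on each node (e.g. iptables -L, ufw status, or firewall-cmd --state, depending on the distribution) while re-trying the run.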
>>
>> Regards,
>> Hari.
>>
>>
>> On Feb 9, 2017 6:43 PM, "201621070526 at std.uestc.edu.cn" <
>> 201621070526 at std.uestc.edu.cn> wrote:
>>
>> Hi, I use MVAPICH2-GDR 2.2 and got the same problem.
>>
>> mpirun_rsh -ssh -export -np 10 -hostfile mf ../get_local_rank collective/osu_allreduce D D
>> connect [mt_checkin]: Connection refused
>> [root0-SCW4350-220:mpirun_rsh][child_handler] Error in init phase, aborting! (1/2 mpispawn connections)
>> huang at root0-SCW4350-220:~/program/mvapich2-gdr/libexec/osu-micro-benchmarks/mpi$ [root0-SCW4350-220:mpispawn_0][report_error] connect() failed: Connection refused (111)
>>
>>
>> *Here is my /etc/hosts:*
>> 127.0.0.1 localhost
>> 127.0.1.1 root0-SCW4350-220
>> 172.16.18.220 node1
>> 172.16.18.158 node2
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1     ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>>
>>
>> *Here is the info of my MVAPICH2 and IB:*
>>
>> mpiname -a
>> MVAPICH2-GDR 2.2 Tue Oct 25 22:00:00 EST 2016 ch3:mrail
>>
>> Compilation
>> CC: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DNDEBUG -DNVALGRIND -O2
>> CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DNDEBUG -DNVALGRIND -O2
>> F77: gfortran -L/lib -L/lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -I/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/gfortran/modules -O2
>> FC: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -I/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/gfortran/modules -O2
>> Configuration
>> --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2/gdr/2.2/cuda8.0/gnu --exec-prefix=/opt/mvapich2/gdr/2.2/cuda8.0/gnu --bindir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin --sbindir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/sbin --sysconfdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/etc --datadir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share --includedir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/include --libdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64 --libexecdir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share/man --infodir=/opt/mvapich2/gdr/2.2/cuda8.0/gnu/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --disable-mcast --without-hydra-ckpointlib --with-core-direct --enable-cuda CPPFLAGS=-I/usr/local/cuda-8.0/include LDFLAGS=-L/usr/local/cuda-8.0/lib64 -Wl,-rpath,/usr/local/cuda-8.0/lib64 -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id
>>
>> ibstat
>> CA 'mlx4_0'
>> CA type: MT4099
>> Number of ports: 1
>> Firmware version: 2.34.5000
>> Hardware version: 1
>> Node GUID: 0xe41d2d0300bf45c0
>> System image GUID: 0xe41d2d0300bf45c3
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 13
>> LMC: 0
>> SM lid: 7
>> Capability mask: 0x02514868
>> Port GUID: 0xe41d2d0300bf45c1
>> Link layer: InfiniBand
>>
>> Any help would be greatly appreciated...
>>
>>
>> ------------------------------
>> 201621070526 at std.uestc.edu.cn
>>

