[mvapich-discuss] Problems when I run the example of cpi using mpirun_rsh

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Dec 31 08:42:37 EST 2014


On Wed, Dec 31, 2014 at 11:26:10AM +0800, 马凯 wrote:
> Hello, I am a new user of MVAPICH2, and I ran into trouble getting started with it.
> First, I think I have installed it successfully, with these steps:
>     ./configure --disable-fortran --enable-cuda
>     make -j 4
>     make install
> There were no errors.
> 
> 
> But when I tried to run the cpi example in the examples directory, I ran into the following:
>     (1) I can connect between nodes gpu-cluster-1 and gpu-cluster-4 through ssh without a password;
>     (2) I ran the cpi example separately on gpu-cluster-1 and gpu-cluster-4 using mpirun_rsh, and it worked OK, like this:
> run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
> Process 0 of 2 is on gpu-cluster-1
> Process 1 of 2 is on gpu-cluster-1
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.000089
> 
> 
> run at gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
> Process 0 of 2 is on gpu-cluster-4
> Process 1 of 2 is on gpu-cluster-4
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.000134
>     (3) I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpiexec, and it worked OK, like this:
> run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
> Process 0 of 2 is on gpu-cluster-1
> Process 1 of 2 is on gpu-cluster-4
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.000352
>     The content in hostfile is "gpu-cluster-1\ngpu-cluster-4"
> 
> 
>     (4) But when I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpirun_rsh, a problem came up:
> 
> 
> run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
> Process 1 of 2 is on gpu-cluster-4
> ----------------- It got stuck here and did not proceed ------------------------
> After a long time, I pressed Ctrl + C, and it printed this:
> 
> 
> ^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
> run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)
> I have been confused for a long time. Could you give me some help resolving this problem?

Hello.  Sorry that you're facing this trouble.  It looks like this might
be related to an interaction with a firewall.  It seems that you're able
to ssh between the machines, but the connection back to the initial
machine through another port is being refused (note the "connect()
failed: Connection refused (111)" reported by mpispawn on
gpu-cluster-4).  Can you check your firewall to ensure that the initial
machine can accept connections back from the compute machines?
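
One rough way to check this independently of MVAPICH2 is a small TCP
test between the two nodes.  The sketch below is only an illustration:
the helper names `listen_once` and `can_connect` and the port number
5000 are made up for this example, and the port that mpirun_rsh
actually uses may differ, so a successful test on one port does not
prove that every port is open.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name resolution failures.
        return False


def listen_once(port, bind_addr="0.0.0.0"):
    """Accept a single TCP connection on port and return the peer address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((bind_addr, port))
        srv.listen(1)
        conn, peer = srv.accept()  # blocks until a client connects
        conn.close()
        return peer
```

For example, run `listen_once(5000)` on gpu-cluster-1 (the port is an
arbitrary choice), then call `can_connect("gpu-cluster-1", 5000)` on
gpu-cluster-4.  If the second call returns False while ssh between the
nodes works, a firewall rule on gpu-cluster-1 is a likely suspect.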

-- 
Jonathan Perkins

