[mvapich-discuss] Problems when I run the example of cpi using mpirun_rsh

马凯 makailove123 at 163.com
Tue Dec 30 22:26:10 EST 2014


Hello, I am a new user of MVAPICH2, and I ran into trouble when getting started with it.
First, I believe I installed it successfully with:
    ./configure --disable-fortran --enable-cuda
    make -j 4
    make install
There were no errors.
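For reference, a build with an explicit prefix is often recommended here, since mpirun_rsh generally expects the MVAPICH2 installation to be available at the same path on every node (e.g. on a shared filesystem). A minimal sketch, not from the original post; the install path is a hypothetical example:

```shell
# Hypothetical prefix; pick a path visible from all nodes (e.g. NFS-mounted).
PREFIX=$HOME/mvapich2-2.1rc1-install

./configure --disable-fortran --enable-cuda --prefix=$PREFIX
make -j 4
make install

# Make the tools visible in PATH, including for non-interactive ssh
# sessions on every node (e.g. via ~/.bashrc).
export PATH=$PREFIX/bin:$PATH
```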


But when I attempted to run the cpi example from the examples directory, here is what I found:
    (1) I can connect to nodes gpu-cluster-1 and gpu-cluster-4 via ssh without a password;
    (2) I ran the cpi example on gpu-cluster-1 and on gpu-cluster-4 separately using mpirun_rsh, and it worked fine, like this:
run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-1
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000089


run at gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
Process 0 of 2 is on gpu-cluster-4
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000134
    (3) I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpiexec, and it worked fine, like this:
run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000352
    The hostfile contains "gpu-cluster-1\ngpu-cluster-4", i.e. one hostname per line.
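For completeness, the two-line hostfile described above can be created like this (a minimal sketch using the hostnames from this post):

```shell
# Write one hostname per line, the format both mpiexec and
# mpirun_rsh accept for a hostfile.
printf 'gpu-cluster-1\ngpu-cluster-4\n' > hostfile
cat hostfile
```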


    (4) But when I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpirun_rsh, a problem came up:


run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
Process 1 of 2 is on gpu-cluster-4
----------------- It got stuck here and would not continue ------------------------
After a long time, I pressed Ctrl + C, and it printed this:


^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
run at gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
[gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)
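The hang followed by "Connection refused" during teardown usually points at connectivity between the launcher and the remote mpispawn rather than at the build itself. A hedged checklist, not from the original post, using the hostnames above and run from gpu-cluster-1:

```shell
# 1) mpirun_rsh needs password-less ssh in BOTH directions,
#    including from the remote node back to the launching node:
ssh gpu-cluster-4 'ssh gpu-cluster-1 hostname'

# 2) Both nodes should resolve each other's hostname consistently
#    (mismatched /etc/hosts entries can break the connect-back):
getent hosts gpu-cluster-4
ssh gpu-cluster-4 'getent hosts gpu-cluster-1'

# 3) mpispawn connects back to mpirun_rsh over TCP; a firewall on
#    either node can produce exactly this hang. Inspect the rules:
sudo iptables -L -n
```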
I have been confused about this for a long time; could you give me some help to resolve this problem?

