[mvapich-discuss] Problem: MPI process (rank: 0, pid: 3109) exited with status 1...

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Aug 1 13:43:54 EDT 2013


Great news!  I'm cc'ing this to mvapich-discuss for everyone's
information.
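
For the archives: setting the limit in /etc/security/limits.conf works
because pam_limits applies it to every new login session, including the
non-interactive ssh sessions that mpirun_rsh uses to launch the remote
processes.  Running `ulimit -l unlimited' by hand only affects the
current shell (and a non-root user cannot raise the hard limit anyway).
A quick way to confirm that the new limit is seen by remotely launched
processes (a sketch, assuming passwordless ssh and that sshd applies
PAM limits):

    ssh gpu1-ib 'ulimit -l'    # should now print "unlimited"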

On Thu, Aug 01, 2013 at 01:58:24PM +0800, li.luo at siat.ac.cn wrote:
> Hey man,
> 
> I have fixed the problem. The key is to add the following two lines to /etc/security/limits.conf:
> 
> *               soft    memlock          unlimited
> *               hard    memlock          unlimited
> 
> 
> This works, but just running ulimit -l unlimited in the shell doesn't.
> 
> Thanks a lot.
> 
> //////////////
> 
> [liluo at gpu2 examples]$ mpirun -np 8 -hostfile hosts ./cpi
> Process 0 of 8 is on gpu1
> Process 1 of 8 is on gpu1
> Process 2 of 8 is on gpu1
> Process 3 of 8 is on gpu1
> Process 4 of 8 is on gpu2
> Process 5 of 8 is on gpu2
> Process 6 of 8 is on gpu2
> Process 7 of 8 is on gpu2
> pi is approximately 3.1415926544231247, Error is 0.0000000008333316
> wall clock time = 0.014795
> 
> 
>  
> 
> 
> 
> > -----Original Messages-----
> > From: "Jonathan Perkins" <perkinjo at cse.ohio-state.edu>
> > Sent Time: Wednesday, July 31, 2013
> > To: li.luo at siat.ac.cn
> > Cc: 
> > Subject: Re: Re: Re: Re: [mvapich-discuss] Problem: MPI process (rank: 0, pid: 3109) exited with status 1...
> > 
> > If you've rebooted the machine or changed runlevels since then, then
> > yes.  You can try
> > 
> > service iptables stop
> > 
> > if you want to be sure.
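> > 
> > To just see whether the firewall is currently active without changing
> > anything (on a RHEL-style init system):
> > 
> >     service iptables status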
> > 
> > On Wed, Jul 31, 2013 at 11:06:33AM +0800, li.luo at siat.ac.cn wrote:
> > > 
> > > My command line to stop firewall is :
> > > 
> > > chkconfig --level 2345 iptables off
> > > 
> > > Is that OK?
> > > 
> > > 
> > > 
> > > > -----Original Messages-----
> > > > From: "Jonathan Perkins" <perkinjo at cse.ohio-state.edu>
> > > > Sent Time: Wednesday, July 31, 2013
> > > > To: li.luo at siat.ac.cn
> > > > Cc: 
> > > > Subject: Re: Re: Re: [mvapich-discuss] Problem: MPI process (rank: 0, pid: 3109) exited with status 1...
> > > > 
> > > > The hostname test proved that mpirun_rsh is able to launch programs
> > > > fine.  There must be a problem elsewhere.
> > > > 
> > > > Can you try the following?
> > > > 
> > > >     mpirun_rsh -np 8 gpu2 gpu2 ./cpi
> > > > 
> > > > and
> > > > 
> > > >     mpirun_rsh -np 2 gpu2 gpu2 MV2_USE_SHARED_MEM=0 ./cpi
> > > > 
> > > > If the first one works and the second one fails, it would seem to
> > > > indicate that you have a problem with your InfiniBand network.
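> > > > 
> > > > In that case, a quick check of the HCA port state on each node may
> > > > help (assuming the OFED diagnostic tools are installed):
> > > > 
> > > >     ibstat | grep -i state    # ports should report "State: Active"
> > > > 
> > > > A port stuck in Down or Initializing usually points to a cable or
> > > > subnet manager problem.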
> > > > 
> > > > On Wed, Jul 31, 2013 at 10:22:09AM +0800, li.luo at siat.ac.cn wrote:
> > > > > 1.//////////////////////////////
> > > > > 
> > > > > [liluo at gpu2 examples]$ mpirun_rsh -np 8 ./cpi
> > > > > Without hostfile option, hostnames must be specified on command line.
> > > > > usage: mpirun_rsh [-v] [-sg group] [-rsh|-ssh] [-debug] -[tv] [-xterm] [-show] [-legacy] [-export] -np N(-hostfile hfile | h1 h2 ... hN) a.out args | -config configfile (-hostfile hfile | h1 h2 ... hN)]
> > > > > Where:
> > > > >         sg         => execute the processes as different group ID
> > > > >         rsh        => to use rsh for connecting
> > > > >         ssh        => to use ssh for connecting
> > > > >         debug      => run each process under the control of gdb
> > > > >         tv         => run each process under the control of totalview
> > > > >         xterm      => run remote processes under xterm
> > > > >         show       => show command for remote execution but don't run it
> > > > >         legacy     => use old startup method (1 ssh/process)
> > > > >         export     => automatically export environment to remote processes
> > > > >         np         => specify the number of processes
> > > > >         h1 h2...   => names of hosts where processes should run
> > > > > or      hostfile   => name of file containing hosts, one per line
> > > > >         a.out      => name of MPI binary
> > > > >         args       => arguments for MPI binary
> > > > >         config     => name of file containing the exe information: each line has the form -n numProc : exe args
> > > > > 
> > > > > 
> > > > > 2.//////////////////////////////
> > > > > [liluo at gpu2 examples]$ mpirun_rsh -np 8 -hostfile hosts hostname
> > > > > gpu1
> > > > > gpu1
> > > > > gpu1
> > > > > gpu1
> > > > > [gpu2:mpirun_rsh][connect_socket] could not connect to gpu1-ib:35724
> > > > > [gpu2:mpirun_rsh][wfe_thread] Internal error: transition failed
> > > > > gpu2
> > > > > gpu2
> > > > > gpu2
> > > > > gpu2
> > > > > 
> > > > > 
> > > > > > -----Original Messages-----
> > > > > > From: "Jonathan Perkins" <perkinjo at cse.ohio-state.edu>
> > > > > > Sent Time: Tuesday, July 30, 2013
> > > > > > To: li.luo at siat.ac.cn
> > > > > > Cc: 
> > > > > > Subject: Re: Re: [mvapich-discuss] Problem: MPI process (rank: 0, pid: 3109) exited with status 1...
> > > > > > 
> > > > > > I noticed that when you're running on one node you're using mpirun and
> > > > > > when you're trying the failed case you're using mpirun_rsh.  Can you try
> > > > > > using mpirun_rsh in both cases?
> > > > > > 
> > > > > > Also, can you replace "./cpi" with "hostname" in the failed case to see
> > > > > > if a non-mpi program is working with mpirun_rsh between the two nodes?
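> > > > > > 
> > > > > > For example:
> > > > > > 
> > > > > >     mpirun_rsh -np 8 -hostfile hosts hostname
> > > > > > 
> > > > > > If every process prints its node's name, the launcher and the ssh
> > > > > > setup are working.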
> > > > > > 
> > > > > > On Tue, Jul 30, 2013 at 03:48:21PM +0800, li.luo at siat.ac.cn wrote:
> > > > > > > I have shut down the firewall and run 'ulimit -l unlimited' on both nodes,
> > > > > > > 
> > > > > > > 
> > > > > > > then modified the configure command line to:
> > > > > > > ./configure --prefix=/opt/mvapich2-1.9-gnu --disable-mcast --disable-fast --enable-g=dbg
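> > > > > > > and rebuilt with the same steps as before:
> > > > > > > 
> > > > > > >     make -j4
> > > > > > >     make install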
> > > > > > > 
> > > > > > > 
> > > > > > > When I ran on a single node, it worked:
> > > > > > > 
> > > > > > > 
> > > > > > > [liluo at gpu2 examples]$ mpirun -np 8 ./cpi
> > > > > > > Process 5 of 8 is on gpu2
> > > > > > > Process 6 of 8 is on gpu2
> > > > > > > Process 7 of 8 is on gpu2
> > > > > > > Process 0 of 8 is on gpu2
> > > > > > > Process 1 of 8 is on gpu2
> > > > > > > Process 2 of 8 is on gpu2
> > > > > > > Process 3 of 8 is on gpu2
> > > > > > > Process 4 of 8 is on gpu2
> > > > > > > pi is approximately 3.1415926544231247, Error is 0.0000000008333316
> > > > > > > wall clock time = 0.000132
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > But if I use -hostfile and my hosts file is:
> > > > > > > gpu1-ib
> > > > > > > gpu2-ib
> > > > > > > an error occurs:
> > > > > > > 
> > > > > > > 
> > > > > > > [liluo at gpu2 examples]$ mpirun_rsh -np 8 -hostfile hosts ./cpi
> > > > > > > [cli_5]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [cli_6]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > > > > > > [gpu2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > > > > > [gpu2:mpispawn_1][child_handler] MPI process (rank: 5, pid: 7363) exited with status 1
> > > > > > > [gpu2:mpispawn_1][child_handler] MPI process (rank: 6, pid: 7364) exited with status 1
> > > > > > > [cli_4]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu2:mpispawn_1][child_handler] MPI process (rank: 4, pid: 7362) exited with status 1
> > > > > > > [cli_7]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu2:mpispawn_1][child_handler] MPI process (rank: 7, pid: 7365) exited with status 1
> > > > > > > [cli_3]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [cli_0]: [gpu1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> > > > > > > [gpu1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > > > > > aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu1:mpispawn_0][child_handler] MPI process (rank: 3, pid: 11098) exited with status 1
> > > > > > > [gpu1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 11095) exited with status 1
> > > > > > > [cli_1]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu1:mpispawn_0][child_handler] MPI process (rank: 1, pid: 11096) exited with status 1
> > > > > > > [cli_2]: aborting job:
> > > > > > > Fatal error in MPI_Init:
> > > > > > > Other MPI error
> > > > > > > 
> > > > > > > 
> > > > > > > [gpu1:mpispawn_0][child_handler] MPI process (rank: 2, pid: 11097) exited with status 1
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Li Luo
> > > > > > > Shenzhen Institutes of Advanced Technology 
> > > > > > > Address: 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, P.R.China
> > > > > > > Tel: +86-755-86392312,+86-15899753087
> > > > > > > Email: li.luo at siat.ac.cn
> > > > > > > 
> > > > > > > 
> > > > > > > > -----Original Messages-----
> > > > > > > > From: "Jonathan Perkins" <perkinjo at cse.ohio-state.edu>
> > > > > > > > Sent Time: Tuesday, July 23, 2013
> > > > > > > > To: li.luo at siat.ac.cn
> > > > > > > > Cc: mvapich-discuss at cse.ohio-state.edu
> > > > > > > > Subject: Re: [mvapich-discuss] Problem: MPI process (rank: 0, pid: 3109) exited with status 1...
> > > > > > > > 
> > > > > > > > Hello.  Can you try a new debug build to see if we can get more output
> > > > > > > > from this failure?  Try adding the following to your configure line and
> > > > > > > > rebuild.
> > > > > > > > 
> > > > > > > >     --disable-fast --enable-g=dbg
> > > > > > > > 
> > > > > > > > Just some things to think about.  Do you have your locked memory limit
> > > > > > > > set high enough?  You can check your current value via `ulimit -l'.  We
> > > > > > > > suggest setting this to unlimited.  Also, do you have an active firewall
> > > > > > > > between the two nodes?  Both mpirun_rsh and mpiexec need to be able to
> > > > > > > > connect to each of the machines used by the MPI application using ports
> > > > > > > > other than those used by ssh.
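> > > > > > > > 
> > > > > > > > For example, on each node:
> > > > > > > > 
> > > > > > > >     ulimit -l    # often defaults to 64 (KB), which is far too
> > > > > > > >                  # small for InfiniBand memory registration
> > > > > > > > 
> > > > > > > > A small value here is a common reason for MPI_Init to fail while
> > > > > > > > registering memory with the HCA.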
> > > > > > > > 
> > > > > > > > On Mon, Jul 22, 2013 at 09:08:00PM +0800, li.luo at siat.ac.cn wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > I want to use MVAPICH2 for GPU-GPU communication. I have installed MVAPICH2 1.9 (as root) on my two nodes with the configuration:
> > > > > > > > > 
> > > > > > > > > ./configure --prefix=/opt/mvapich2-1.9-gnu --enable-shared --enable-cuda --with-cuda=/home/liluo/lib/cuda_5.0/ --disable-mcast
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > and make by:
> > > > > > > > > 
> > > > > > > > > make -j4
> > > > > > > > > make install
> > > > > > > > > 
> > > > > > > > > Now I want to run the example cpi by my personal account liluo.
> > > > > > > > > 
> > > > > > > > > For np=2 on a single node, it works.
> > > > > > > > > 
> > > > > > > > > But it doesn't work for 2 nodes with this hostfile:
> > > > > > > > > 
> > > > > > > > > gpu1-ib
> > > > > > > > > 
> > > > > > > > > gpu2-ib
> > > > > > > > > 
> > > > > > > > > The error output is the following:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > [liluo at gpu1 programs]$ mpirun_rsh -n 2 -hostfile hostfile ./cpi
> > > > > > > > > [cli_0]: aborting job:
> > > > > > > > > Fatal error in MPI_Init:
> > > > > > > > > Other MPI error
> > > > > > > > > 
> > > > > > > > > [gpu1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3109) exited with status 1
> > > > > > > > > [gpu1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > > > > > > > > [gpu1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > > > > > > > [cli_1]: aborting job:
> > > > > > > > > Fatal error in MPI_Init:
> > > > > > > > > Other MPI error
> > > > > > > > > 
> > > > > > > > > [gpu2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > > > > > > > > [gpu2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > > > > > > > [gpu2:mpispawn_1][child_handler] MPI process (rank: 1, pid: 3144) exited with status 1
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > //////////
> > > > > > > > > I use node gpu2-ib as the host node.
> > > > > > > > > I can successfully ping gpu1-ib from gpu2-ib.
> > > > > > > > > 
> > > > > > > > > And the installation folder /opt/mvapich2-1.9-gnu and the current folder (where ./cpi is) on node gpu2-ib have been exported to node gpu1-ib.
> > > > > > > > > 
> > > > > > > > > What can I do?

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

