[mvapich-discuss] Errors spawning processes with mpirun_rsh

Jaidev Sridhar sridharj at cse.ohio-state.edu
Mon Feb 23 12:25:12 EST 2009


Hi Rafael,

On Mon, 2009-02-23 at 18:08 +0100, Rafael Arco Arredondo wrote:
> Hi Jaidev,
> 
> Thank you for your prompt reply.
> 
> > The message indicates that the application terminated with a non-zero
> > error code or crashed after launching. Can you check if it leaves any
> > core files? You may need to set ulimit to unlimited; for example, add
> > ulimit -c unlimited to your ~/.bashrc.
> 
> Yes, a core file is generated after adding 'ulimit -c unlimited' to
> $HOME/.bashrc.

Can you send us the backtrace from this core file -
	$ gdb ./mpihello core.xyz
	(gdb) bt

If you have core files from both the mvapich and mvapich2 runs, we'd
like to see backtraces from both. This will give us more insight.
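
For the record, a minimal test program along these lines is enough to
exercise the failing initialization path (a sketch; your mpihello will
differ in its details):

	/* mpihello.c: minimal MPI test program (sketch) */
	#include <stdio.h>
	#include <mpi.h>

	int main(int argc, char **argv)
	{
	    int rank, size;
	    /* the failures you report happen during initialization */
	    MPI_Init(&argc, &argv);
	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank   */
	    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
	    printf("Hello from rank %d of %d\n", rank, size);
	    MPI_Finalize();
	    return 0;
	}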

It'll be more useful if you can compile the libraries and your
application with debug symbols (example commands follow the list):
  * For mvapich2, configure the library with --enable-g=dbg and
    compile your application with mpicc -g
  * For mvapich, edit make.mvapich.gen2 to add -g to CFLAGS, and
    compile your application with mpicc -g
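
For example (a sketch; the configure options mirror the ones from your
mvapich2 build, and file names are placeholders):

	$ ./configure --enable-g=dbg --enable-sharedlibs=gcc \
	      CC=pathcc F77=pathf90 F90=pathf90 CXX=pathCC
	$ make && make install
	$ mpicc -g -o mpihello mpihello.c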

-Jaidev

> 
> > Can you also give us details of the cluster and any options you've 
> > enabled with MVAPICH / MVAPICH2?
> 
> It is a cluster of servers with AMD Opteron (AMD64) processors, an
> InfiniBand network, and Sun Grid Engine 6.2 as the batch scheduler
> (the error appears both when SGE controls the jobs and when
> mpirun_rsh is executed directly from the command line).
> 
> MVAPICH was compiled with the PathScale compiler (the
> make.mvapich.gen2 script was edited accordingly), shared-library
> support was enabled, and the -DXRC flag was removed. The remaining
> options, including the configuration files in $MVAPICH_HOME/etc, were
> not modified (i.e., default values are used).
> 
> As for MVAPICH2, it was compiled by invoking the configure script this
> way:
> 
> ./configure --enable-sharedlibs=gcc CC=pathcc F77=pathf90 F90=pathf90 \
> CXX=pathCC
> 
> And then plain 'make' and 'make install'. Again, the other options
> weren't changed.
> 
> MVAPICH and MVAPICH2 compile with no problems, and so do programs
> built with mpicc. However, programs crash during the initialization
> stage after launching, as you said.
> 
> Any ideas?
> 
> Thanks again,
> 
> Rafa
> 
> > On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote:
> > > Hello,
> > > 
> > > I'm having some issues with mpirun_rsh with both MVAPICH 1.1 and
> > > MVAPICH2 1.2p1. As I mentioned in another email to the list some time
> > > ago, mpirun_rsh is the only mechanism we can use to create MPI
> > > processes in our configuration.
> > > 
> > > The command issued is:
> > > mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello
> > > 
> > > And the error reported by mpirun_rsh is:
> > > 
> > > Exit code -5 signaled from localhost
> > > MPI process terminated unexpectedly
> > > Killing remote processes...DONE
> > > 
> > > We also got this on some of our machines:
> > > 
> > > Child exited abnormally!
> > > Killing remote processes...DONE
> > > 
> > > mpihello is a simple hello-world program, and this happens even when
> > > the processes are launched on localhost only.
> > > 
> > > OFED 1.2 provides the underlying InfiniBand libraries, and both
> > > MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2
> > > single-rail option, without XRC, as the user's guide indicates for
> > > OFED libraries prior to version 1.3.
> > > 
> > > Any help will be greatly appreciated.
> > > 
> > > Thank you in advance,
> > > 
> > > Rafa



More information about the mvapich-discuss mailing list