[mvapich-discuss] timeout problems with mpiexec and bash

Andy Wettstein ajw at illinois.edu
Tue Apr 12 10:43:54 EDT 2011


On Mon, Apr 11, 2011 at 04:21:29PM -0400, Jonathan Perkins wrote:
> Andy:
> Hello, it looks like the timeout may be during the PMI
> communication/setup phase.  Are you using mvapich2-1.6?  If so, you
> can try to use mpiexec.hydra to see if you have the same problem.
> This launcher is installed in the default build of mvapich2 and is
> located in the bin directory of the mvapich2 installation.

This is with mvapich2 1.6.

I started up several jobs with both mpiexec.hydra and mpirun_rsh.
All jobs started normally with mpiexec.hydra.

About 3 out of 20 jobs failed to start with mpirun_rsh. This was the
error message:
handle_mt_peer: fail to read...: Cannot allocate memory
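
For anyone who wants to reproduce this kind of count, here is a minimal
sketch of a repeated-launch test. It is not the script used for the runs
above. The function name, the hostfile name "hosts", and the binary
"./a.out" are placeholders chosen for illustration. A run is counted as
failed if the launcher exits non-zero or exceeds an optional timeout.

```python
import subprocess


def count_failed_launches(cmd, runs=20, timeout=None):
    """Run cmd `runs` times and return how many runs failed.

    A run counts as failed if the process exits with a non-zero
    status or does not finish within `timeout` seconds.
    """
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # A hung startup is treated the same as a failed one.
            failures += 1
    return failures


# Hypothetical usage ("hosts" and "./a.out" are placeholders):
#   count_failed_launches(
#       ["mpirun_rsh", "-np", "2304", "-hostfile", "hosts", "./a.out"])
#   count_failed_launches(
#       ["mpiexec.hydra", "-n", "2304", "-f", "hosts", "./a.out"])
```

Running the same count for each launcher gives a like-for-like failure
rate, which is easier to compare than one-off launches.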



> 
> On Mon, Apr 11, 2011 at 2:38 PM, Andy Wettstein <ajw at illinois.edu> wrote:
> > Hello,
> >
> > I've been having some problems with launching a 2000+ core job using
> > mpiexec 0.84 and mvapich2 1.6 when using bash as the shell. We're
> > running on Scientific Linux 6 (aka RHEL 6).
> >
> > I get errors like this:
> >
> > [unset]: connect failed with timeout
> > [unset]: Unable to connect to taub511 on 39404
> > Fatal error in MPI_Init_thread:
> > Other MPI error, error stack:
> > MPIR_Init_thread(413): Initialization failed
> > MPID_Init(203).......: channel initialization failed
> > MPID_Init(514).......: PMI_Init returned -1
> >
> >
> > The machines we are using have 12 cores. Right now I'm launching on 192
> > x 12 so 2304 cores total.
> >
> > Smaller core counts seem to work ok. For instance, a 1200 core job just
> > launched fine. Switching the shell to tcsh also allows me to launch these
> > jobs. I haven't seen tcsh fail yet in starting this job.
> >
> > I'll attach the environment and limits that are set for these jobs.
> >
> > I asked on the mpiexec mailing list and they believed that I must be
> > hitting some timeout in the mvapich2 startup code.
> >
> > If you need any more info, just let me know.
> >
> > Thanks
> > andy
> >
> >
> > --
> > andy wettstein
> > unix administrator
> > department of physics
> > university of illinois at urbana-champaign
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
> 
> 
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo

-- 
andy wettstein
unix administrator
department of physics
university of illinois at urbana-champaign


