[mvapich-discuss] timeout problems with mpiexec and bash

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Apr 12 11:12:31 EDT 2011


Good to hear that Hydra and (mostly) mpirun_rsh are working for you.
Can you let me know how much memory is available on the node that
you're launching from?
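In case it's useful, here is a quick way to check (just a sketch;
adjust to your environment):

    # free physical memory and swap on the launch node
    free -m

    # per-process resource limits for the shell running the launcher
    ulimit -a

Comparing the ulimit -a output under bash against tcsh's limit output
may also be interesting, given that tcsh starts these jobs reliably.
One more thought below the quoted mail.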

On Tue, Apr 12, 2011 at 10:43 AM, Andy Wettstein <ajw at illinois.edu> wrote:
> On Mon, Apr 11, 2011 at 04:21:29PM -0400, Jonathan Perkins wrote:
>> Andy:
>> Hello, it looks like the timeout may be during the PMI
>> communication/setup phase.  Are you using mvapich2-1.6?  If so, you
>> can try to use mpiexec.hydra to see if you have the same problem.
>> This launcher is installed in the default build of mvapich2 and is
>> located in the bin directory of the mvapich2 installation.
>
> This is with mvapich2 1.6.
>
> I started up several jobs with both mpiexec.hydra and mpirun_rsh.
> All jobs started normally with mpiexec.hydra.
>
> I had about 3 out of 20 fail to start with mpirun_rsh. This was the
> error message:
> handle_mt_peer: fail to read...: Cannot allocate memory
>
>
>
>>
>> On Mon, Apr 11, 2011 at 2:38 PM, Andy Wettstein <ajw at illinois.edu> wrote:
>> > Hello,
>> >
>> > I've been having some problems launching a 2000+ core job using
>> > mpiexec 0.84 and mvapich2 1.6 when using bash as the shell. We're
>> > running on Scientific Linux 6 (aka rhel 6).
>> >
>> > I get errors like this:
>> >
>> > [unset]: connect failed with timeout
>> > [unset]: Unable to connect to taub511 on 39404
>> > Fatal error in MPI_Init_thread:
>> > Other MPI error, error stack:
>> > MPIR_Init_thread(413): Initialization failed
>> > MPID_Init(203).......: channel initialization failed
>> > MPID_Init(514).......: PMI_Init returned -1
>> >
>> >
>> > The machines we are using have 12 cores each. Right now I'm launching
>> > on 192 nodes x 12 cores, so 2304 cores in total.
>> >
>> > Smaller core counts seem to work OK. For instance, a 1200 core job just
>> > launched fine. Switching the shell to tcsh also lets me launch these
>> > jobs; I haven't yet seen tcsh fail to start this job.
>> >
>> > I'll attach the environment and limits that are set for these jobs.
>> >
>> > I asked on the mpiexec mailing list and they believed that I must be
>> > hitting some timeout in the mvapich2 startup code.
>> >
>> > If you need any more info, just let me know.
>> >
>> > Thanks
>> > andy
>>
>>
>>
>
> --
> andy wettstein
> unix administrator
> department of physics
> university of illinois at urbana-champaign
>
>
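The "handle_mt_peer: fail to read...: Cannot allocate memory" error
suggests that mpirun_rsh could not allocate memory while setting up
its startup connections, which is why the memory on the launch node
is interesting. Until we track that down, mpiexec.hydra seems like a
reasonable workaround. A sketch, assuming an installation prefix of
/usr/local/mvapich2-1.6 and that ./hosts and ./a.out are your
hostfile and application (substitute your own paths):

    # launch 2304 ranks (192 nodes x 12 cores) with the Hydra launcher
    /usr/local/mvapich2-1.6/bin/mpiexec.hydra -f ./hosts -n 2304 ./a.out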



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


