[mvapich-discuss] Followup: mvapich2 issue regarding mpd timeout in mpiexec

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Jun 5 23:27:02 EDT 2008


Hi David,

Thanks for your note. Please feel free to use a higher value of 200 for
larger clusters. We are exploring whether we can dynamically adjust this
value based on the system size. We are also forwarding this note to the
MPICH2 folks.

FYI, in the upcoming MVAPICH2 release, we will be providing a
non-MPD-based scalable startup mechanism (mpirun_rsh, similar to the one
used in MVAPICH). This will help launch MPI jobs on multi-thousand-node
clusters with very little overhead. The upcoming release will be available
in a few weeks.

Thanks,

DK

On Thu, 29 May 2008 David_Kewley at Dell.com wrote:

> This is a followup to this thread:
>
> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2007-May/000834.html
>
> between Greg Bauer and Qi Gao.
>
> We had the same problem that Greg saw -- failure of mpiexec, with the
> characteristic error message "no msg recvd from mpd when expecting ack
> of request".  It was resolved for us by setting recvTimeout in
> mpiexec.py to a higher value, just as Greg suggested and Qi concurred.
> The default value is 20; we chose 200 (we did not experiment with values
> in between, so a lower value may work in many cases).
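>
> For reference, the change we made is just this one-line assignment in
> mpiexec.py (the MPD-based launcher).  The exact location of the line may
> differ between MPICH2/MVAPICH2 versions, so treat this as a sketch rather
> than a patch:
>
>     # in mpiexec.py (MPD process manager); the default is recvTimeout = 20.
>     # Raising it lets mpiexec wait longer for acks from the mpd ring
>     # before giving up; 200 was enough for our runs of 3000 processes.
>     recvTimeout = 200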
>
> I think this change should be made permanent in MVAPICH2.  I do not
> think it will negatively impact anyone, because in all four places where
> this timeout is used, mpiexec exits immediately with an error when the
> timeout expires anyway.  So the worst consequence is that mpiexec would
> take longer to fail (the timeout is in seconds, so 3 minutes longer if
> 200 is used instead of 20).
> The user who encounters this timeout has to fix the root cause of the
> timeout in order to get any work done, so they are not likely to
> encounter it repeatedly and thereby lose lots of runtime simply because
> the timeout is large.  Is this analysis correct?
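>
> To make the fail-fast behavior concrete, here is a deliberately
> simplified sketch of the pattern (hypothetical Python, not the real MPD
> internals): the ack is awaited with a timeout, and expiry is always
> fatal, so a larger recvTimeout only delays an error exit that was going
> to happen anyway.
>
>     import socket
>     import sys
>
>     recvTimeout = 200    # seconds; the value we chose
>
>     def wait_for_ack(sock):
>         # Wait up to recvTimeout seconds for an ack from the mpd ring.
>         sock.settimeout(recvTimeout)
>         try:
>             return sock.recv(4096)
>         except socket.timeout:
>             # Same outcome as the real launcher: report the error and
>             # exit immediately.
>             print('no msg recvd from mpd when expecting ack of request')
>             sys.exit(-1)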
>
> Meanwhile, this change would clearly help at least some people with
> large clusters.  We see failure with the default recvTimeout between 900
> and 1000 processes; larger recvTimeout allows us to scale to 3000
> processes and beyond.
>
> The default setting does not cause failure if I make a simple, direct
> call to mpiexec.  I only see the failure when I use mpirun.lsf to launch
> a large job.  I think the failure in the LSF case is due to the longer
> time it presumably takes to launch LSF's TaskStarter for every process;
> the time required seems to be O(#processes) in the LSF case.  (We have
> LSF 6.2, with a local custom wrapper script for TaskStarter.)
>
> If you agree that this change to the value of recvTimeout is OK, please
> implement this one-line change in MVAPICH2, and consider contributing it
> upstream to MPICH2 as well.
>
> If you decline to make this change, at least it's now on the web that
> this change does fix the problem. :)
>
> Thanks,
> David
>
> David Kewley
> Dell Infrastructure Consulting Services
> Onsite Engineer at the Maui HPC Center
> Cell: 602-460-7617
> David_Kewley at Dell.com
>
> Dell Services: http://www.dell.com/services/
> How am I doing? Email my manager Russell_Kelly at Dell.com with any
> feedback.
>
>


