[mvapich-discuss] suggested minor feature

Jonathan L. Perkins perkinjo at cse.ohio-state.edu
Thu Sep 6 15:31:34 EDT 2007


Mark Potts wrote:
> Hi,
>    Despite the many issues I've raised about MVAPICH job cleanup and
>    timeouts (all resolved now it appears), I'd like to raise another
>    related issue -- a suggestion.
> 
>    We've found that a job in which all processes correctly call
>    MPI_Finalize() at the end of their communication stages cannot
>    let any process terminate if even a single thread needs to
>    continue working.  That is, once MPI_Finalize() has been called
>    and any process correctly terminates, there is only a 10 second
>    window in which the remaining processes are allowed to run
>    before mpirun_rsh kills the remaining children.  This
>    presents a problem for codes that naturally complete the job's
>    task in serial mode or codes in which debugging of a process is
>    needed after MPI_Finalize().
> 
>    The suggestion would be:
>       to provide the timeout period (currently 10 seconds) as a
>       VIADEV_* env variable, with default of 10, which users could
>       then modify when 10 seconds was too little time for a remaining
>       process.  By the same token this env variable could be used
>       to trim the timeout period to a smaller value, when a user
>       deemed 10 seconds to not be aggressive enough.
>          regards,

Mark:
In light of your suggestion we took a look at how mpirun_rsh handles the 
termination of its child processes.  With a small change in the 
semantics we managed to remove the "timeout" entirely.

We now allow processes to exit cleanly without affecting the lifespan of 
other processes.  If a process doesn't exit cleanly, the 
other processes will still be destroyed as before.
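
As a rough illustration only (this is not the attached patch, and
mpirun_rsh itself is written in C), the exit handling described above
can be sketched in Python.  The function name run_children and the
polling interval are made up for this example, and a "clean" exit is
assumed to mean an exit status of 0.

```python
import subprocess
import sys
import time


def run_children(commands, poll_interval=0.05):
    """Launch every command and wait for all of them to finish.

    A child that exits with status 0 has no effect on the others.
    The first child that exits with a non-zero status causes all
    children that are still running to be killed.

    Returns the exit codes in launch order.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    codes = [None] * len(procs)
    remaining = set(range(len(procs)))
    failed = False

    while remaining:
        for i in list(remaining):
            rc = procs[i].poll()
            if rc is None:
                continue  # still running
            codes[i] = rc
            remaining.discard(i)
            if rc != 0 and not failed:
                # Unclean exit: tear down everything still alive.
                failed = True
                for j in remaining:
                    procs[j].kill()
        if remaining:
            time.sleep(poll_interval)

    return codes


if __name__ == "__main__":
    py = sys.executable
    # One child exits at once, the other keeps working; no timeout applies.
    print(run_children([[py, "-c", "pass"],
                        [py, "-c", "import time; time.sleep(1)"]]))
```

Note that there is no timer anywhere in the loop: a survivor may run for
as long as it likes after its peers have exited cleanly.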

Can you try out the attached patch and let us know whether everything 
works in the way that an end user would expect?  We also welcome any 
further suggestions.  Thanks.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpirun_rsh.patch
Type: text/x-patch
Size: 1589 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070906/1da828ed/mpirun_rsh.bin

