[mvapich-discuss] suggested minor feature
Jonathan L. Perkins
perkinjo at cse.ohio-state.edu
Thu Sep 6 16:53:30 EDT 2007
Mark Potts wrote:
> Jonathan,
> It'll take me a few days to be able to get to look at this.
>
> In the meantime you could probably save me a little searching by
> telling me if this patch is to the baseline mvapich-0.9.9
> mpirun_rsh.c or to that same routine with the later patches. You
> guys have provided us with several patches that relate to
> timeouts and job cleanup in the past few weeks and I want to
> make sure I apply the patch to the right bits.
This patch should be applied to the already patched version of
mpirun_rsh.c. If you would like a patch against the mvapich-0.9.9
release version of mpirun_rsh.c just let me know.
> Thanks.
>
> regards,
>
> Jonathan L. Perkins wrote:
>> Mark Potts wrote:
>>> Hi,
>>> Despite the many issues I've raised about MVAPICH job cleanup and
>>> timeouts (all resolved now it appears), I'd like to raise another
>>> related issue -- a suggestion.
>>>
>>> We've found that in a job where every process correctly calls
>>> MPI_Finalize() at the end of its communication stage, no process
>>> can be permitted to terminate if even a single thread needs to
>>> continue working. That is, once MPI_Finalize() has been called
>>> and any process terminates, there is only a 10 second window in
>>> which the remaining processes are allowed to run before
>>> mpirun_rsh kills the remaining children. This
>>> presents a problem for codes that naturally complete the job's
>>> task in serial mode or codes in which debugging of a process is
>>> needed after MPI_Finalize().
>>>
>>> The suggestion would be:
>>> to provide the timeout period (currently 10 seconds) as a
>>> VIADEV_* env variable, with default of 10, which users could
>>> then modify when 10 seconds was too little time for a remaining
>>> process. By the same token this env variable could be used
>>> to trim the timeout period to a smaller value, when a user
>>> deemed 10 seconds to not be aggressive enough.
>>> regards,
>>
>> Mark:
>> In light of your suggestion we took a look at how mpirun_rsh handles
>> the termination of its child processes. With a small change in the
>> semantics we managed to remove the "timeout" entirely.
>>
>> We now allow processes that exit cleanly to not affect the lifespan of
>> other processes. In the case that a process doesn't exit cleanly, the
>> other processes will still be destroyed as usual.
>>
>> Can you try out the attached patch and let us know whether everything
>> works in the way that an end user would expect? We also welcome any
>> further suggestions. Thanks.
>>
>
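For anyone reading this in the archive, the semantics quoted above can be
sketched roughly as follows. This is not the attached patch and not the
actual mpirun_rsh.c code. It is a minimal illustration that assumes a
"clean" exit means a normal exit with status 0. The names exited_cleanly,
should_kill_siblings and reap_demo_child are hypothetical and exist only
for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the wait status describes a normal exit with code 0.
 * Assumption: this is what "exits cleanly" means in the thread. */
int exited_cleanly(int status)
{
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Hypothetical policy: a clean exit leaves the other children alone,
 * anything else (non-zero exit code, death by signal) triggers cleanup. */
int should_kill_siblings(int status)
{
    return !exited_cleanly(status);
}

/* Demo helper: fork a child that exits with exit_code (or aborts when
 * exit_code is negative) and return the raw status from waitpid(). */
int reap_demo_child(int exit_code)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        if (exit_code < 0)
            abort();            /* child dies from SIGABRT */
        _exit(exit_code);       /* child exits normally */
    }
    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    assert(reaped == pid);
    return status;
}
```

With this sketch, should_kill_siblings(reap_demo_child(0)) is 0, while a
child that exits with a non-zero code or dies from a signal yields 1, so
only the unclean cases would lead to the remaining children being killed.
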
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
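
The original suggestion in this thread (a timeout read from an
environment variable, defaulting to 10 seconds) could look roughly like
the sketch below. It is an illustration only and is not taken from
mpirun_rsh.c. The variable name VIADEV_EXAMPLE_TIMEOUT is made up for
this example and is not an actual MVAPICH setting, and the helper names
parse_timeout and timeout_from_env are likewise hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a non-negative number of seconds from text.
 * Returns fallback when text is missing, empty, malformed,
 * negative, or too large to fit in an int. */
int parse_timeout(const char *text, int fallback)
{
    assert(fallback >= 0);
    if (text == NULL || *text == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Look up the timeout in the environment variable called name.
 * The caller supplies the default, e.g. 10 as in this thread. */
int timeout_from_env(const char *name, int fallback)
{
    return parse_timeout(getenv(name), fallback);
}
```

A launcher could then call timeout_from_env("VIADEV_EXAMPLE_TIMEOUT", 10)
once at startup. Falling back to the default on malformed input keeps a
typo in the environment from silently disabling cleanup.
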