[mvapich-discuss] suggested minor feature
Mark Potts
potts at hpcapplications.com
Thu Sep 6 15:59:37 EDT 2007
Jonathan,
It will take me a few days to get a look at this.
In the meantime you could probably save me a little searching by
telling me whether this patch is against the baseline mvapich-0.9.9
mpirun_rsh.c or against that same file with the later patches. You
guys have provided us with several patches that relate to
timeouts and job cleanup in the past few weeks, and I want to
be sure I apply the patch to the right bits.
Thanks.
regards,
Jonathan L. Perkins wrote:
> Mark Potts wrote:
>> Hi,
>> Despite the many issues I've raised about MVAPICH job cleanup and
>> timeouts (all resolved now it appears), I'd like to raise another
>> related issue -- a suggestion.
>>
>> We've found that in a job where every process correctly calls
>> MPI_Finalize() at the end of its communication stages, no
>> process can be allowed to terminate if even a single thread
>> is to continue working. That is, once MPI_Finalize() has been
>> called and any process terminates correctly, there is only a
>> 10 second window in which the remaining processes are allowed
>> to run before mpirun_rsh kills the remaining children. This
>> is a problem for codes that naturally complete the job's task
>> in serial mode, or codes in which a process needs to be
>> debugged after MPI_Finalize().
>>
>> The suggestion would be:
>> to expose the timeout period (currently 10 seconds) as a
>> VIADEV_* env variable, with a default of 10, which users could
>> then raise when 10 seconds is too little time for a remaining
>> process. By the same token, this env variable could be used
>> to trim the timeout period to a smaller value when a user
>> deems 10 seconds not aggressive enough.
>> regards,
>
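A minimal sketch of the configurable timeout suggested above, in C.
The variable name VIADEV_EXAMPLE_CLEANUP_TIMEOUT and both function
names are hypothetical; the thread only says "VIADEV_*", and nothing
here is taken from mpirun_rsh.c. The only facts used from the
suggestion are the default of 10 seconds and that users may raise or
lower it.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical name: the thread only specifies the "VIADEV_" prefix. */
#define CLEANUP_TIMEOUT_ENV "VIADEV_EXAMPLE_CLEANUP_TIMEOUT"
/* Seconds; matches the fixed window described in the thread. */
#define CLEANUP_TIMEOUT_DEFAULT 10

/* Parse a timeout given as a decimal string. Returns the default when
 * the string is missing, empty, negative, out of range, or followed by
 * trailing characters. */
int parse_cleanup_timeout(const char *s)
{
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return CLEANUP_TIMEOUT_DEFAULT;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX)
        return CLEANUP_TIMEOUT_DEFAULT;

    return (int)v;
}

/* Look the value up in the environment of the launcher process. */
int cleanup_timeout_seconds(void)
{
    return parse_cleanup_timeout(getenv(CLEANUP_TIMEOUT_ENV));
}
```

With this shape, an unset or malformed variable keeps today's
behaviour, and a value of 0 would mean "kill the remaining children
immediately". Whether 0 should be allowed is a policy choice the
sketch does not settle.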
> Mark:
> In light of your suggestion we took a look at how mpirun_rsh handles the
> termination of its child processes. With a small change in the
> semantics we managed to remove the "timeout" entirely.
>
> We now allow processes that exit cleanly to not affect the lifespan of
> other processes. If a process doesn't exit cleanly, the
> other processes will still be destroyed as before.
>
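The semantics described above could be sketched roughly as follows.
This is not the attached patch and not code from mpirun_rsh.c; the
function names are hypothetical, and it assumes that "exits cleanly"
means a normal exit with status 0 and that the remaining children are
destroyed with SIGKILL.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: "clean" means a normal exit with status 0. */
int exited_cleanly(int status)
{
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Reap the n children listed in pids. A clean exit leaves the other
 * children running with no time limit. The first unclean exit kills
 * every child that is still alive. Returns the number of children
 * that did not exit cleanly, including any that were killed here. */
int wait_for_children(pid_t *pids, int n)
{
    int remaining = n;
    int unclean = 0;

    while (remaining > 0) {
        int status, i, j;
        pid_t pid = waitpid(-1, &status, 0);

        if (pid < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            break;          /* no children left, or a real error */
        }

        /* Mark the reaped child as gone. */
        for (i = 0; i < n; i++) {
            if (pids[i] == pid) {
                pids[i] = 0;
                remaining--;
                break;
            }
        }
        if (i == n)
            continue;       /* not one of ours, ignore it */

        if (exited_cleanly(status))
            continue;       /* clean exit: others keep running */

        /* First unclean exit: destroy everything still alive. */
        if (unclean++ == 0) {
            for (j = 0; j < n; j++) {
                if (pids[j] > 0)
                    kill(pids[j], SIGKILL);
            }
        }
    }
    return unclean;
}
```

The killed children are reaped by later iterations of the same loop,
so they are counted as unclean too and no zombies are left behind.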
> Can you try out the attached patch and let us know whether everything
> works in the way that an end user would expect? We also welcome any
> further suggestions. Thanks.
>
--
***********************************
Mark J. Potts, PhD

HPC Applications Inc.
phone: 410-992-8360 Bus
       410-313-9318 Home
       443-418-4375 Cell
email: potts at hpcapplications.com
       potts at excray.com
***********************************