[mvapich-discuss] suggested minor feature

Mark Potts potts at hpcapplications.com
Wed Sep 5 11:14:04 EDT 2007


Hi,
    Despite the many issues I've raised about MVAPICH job cleanup and
    timeouts (all resolved now, it appears), I'd like to raise one more
    related issue, this time as a suggestion.

    We've found that a job in which all processes correctly call
    MPI_Finalize() at the end of their communication stages cannot
    let any process terminate if even a single thread needs to
    continue working.  That is, once MPI_Finalize() has been called
    and any process correctly terminates, the remaining processes
    get only a 10 second window in which to run before mpirun_rsh
    kills the remaining children.  This is a problem for codes that
    naturally complete the job's task in serial mode, and for codes
    in which a process needs to be debugged after MPI_Finalize().

    The suggestion would be:
       to expose the timeout period (currently 10 seconds) as a
       VIADEV_* env variable with a default of 10, which users could
       then raise when 10 seconds is too little time for a remaining
       process.  By the same token, the same env variable could be
       used to trim the timeout to a smaller value when a user deems
       10 seconds not aggressive enough.
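
    As a rough sketch of the kind of lookup I have in mind (not
    actual mpirun_rsh code): the variable name VIADEV_CLEANUP_TIMEOUT
    and both function names below are hypothetical, chosen only for
    illustration.  An unset or malformed value falls back to the
    current 10 second behavior.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Current hard-coded behavior described above: 10 seconds. */
#define DEFAULT_CLEANUP_TIMEOUT 10

/*
 * Parse a timeout given in whole seconds.
 * Returns DEFAULT_CLEANUP_TIMEOUT when the value is unset, empty,
 * not a plain decimal integer, negative, or too large for an int.
 */
int parse_cleanup_timeout(const char *value)
{
    char *end = NULL;
    long secs;

    if (value == NULL || *value == '\0')
        return DEFAULT_CLEANUP_TIMEOUT;

    errno = 0;
    secs = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' ||
        secs < 0 || secs > INT_MAX)
        return DEFAULT_CLEANUP_TIMEOUT;

    return (int)secs;
}

/*
 * Read the timeout from the environment.
 * VIADEV_CLEANUP_TIMEOUT is a hypothetical name, not an existing
 * MVAPICH parameter.
 */
int cleanup_timeout_from_env(void)
{
    return parse_cleanup_timeout(getenv("VIADEV_CLEANUP_TIMEOUT"));
}
```

    With something like this, a user could export the variable with
    a larger value before launching a job whose last process finishes
    in serial mode, or with a smaller value to get faster cleanup.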
          regards,
-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************