[mvapich-discuss] Process Termination Detection with mpirun_rsh

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Sep 3 12:22:52 EDT 2008


Hi Fred and Tom,

This is to let you know that we have come up with a solution for the
"mpirun_rsh -rsh" problem. This solution also solves the hanging process
issue.

This solution has been applied to the following versions:

MVAPICH 1.0 branch and trunk

MVAPICH2 1.2 trunk

This solution will be reflected in tonight's nightly tarballs. Please
try these latest versions (tarballs or directly from SVN) and let us know
whether they solve all of your problems.

Thanks,

DK


On Wed, 20 Aug 2008, Stecher, Fred wrote:

> Tom,
> We use MVAPICH-1.0, which comes with mpirun_rsh. It has the same problem,
> and we do not use a scheduler. We have to check the nodes whenever a run
> is aborted by the application. On any node that still has processes
> running even though they should have been aborted, we have to kill the
> processes one at a time to clear the node. I would think that this is a
> known problem and should be corrected soon.
>
>
> Fred
>
>
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Tom
> Crockett
> Sent: Tuesday, August 19, 2008 6:20 PM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh
>
> Hi,
>
> I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been
> experimenting with the new mpirun_rsh job launcher.  In general, I much
> prefer this simpler approach, and have found it to be faster and more
> reliable than MPD.  However, I'm having one fairly serious problem
> relating to termination detection when processes abort.
>
> Here's the scenario:
>
> 1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically
> with multiple processes per node (multi-process, multi-core).
>
> 2. One process dies, e.g., with a segmentation violation, on some random
> node (a minimal reproducer is sketched just after this list).
>
> 3. The node with the offending process seems to notice this locally; all
> the sibling processes and the local mpispawn process terminate.
> However, the remaining nodes (including the master) don't seem to
> notice; their processes continue to run (or more likely stall, waiting
> on communication which will never arrive).
>
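> (For reference, here is a minimal reproducer in the spirit of what I've
> been running.  It is not our real application -- the file name, rank
> number, and iteration count are made up -- it simply segfaults on one
> rank a few seconds into the run, which is enough to trigger the hang
> described above.)
>
>   /* crash_one_rank.c: every rank loops on a barrier, but rank 1
>    * dereferences a NULL pointer after a few iterations and dies
>    * with SIGSEGV.  The surviving ranks then block in MPI_Barrier,
>    * waiting on a participant that no longer exists.
>    */
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size, iter;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>       if (rank == 0)
>           printf("running with %d processes\n", size);
>
>       for (iter = 0; ; iter++) {
>           if (rank == 1 && iter == 5) {
>               volatile int *p = NULL;
>               *p = 42;                     /* deliberate segfault */
>           }
>           sleep(1);
>           MPI_Barrier(MPI_COMM_WORLD);     /* survivors stall here */
>       }
>
>       MPI_Finalize();                      /* never reached */
>       return 0;
>   }
>
> Launched with something like "mpirun_rsh -rsh -np 8 -hostfile hosts
> ./crash_one_rank" (I'm quoting those options from memory), the processes
> on the node hosting rank 1 exit, while the rest of the job just sits in
> the barrier.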
>
> If I run this experiment on two nodes (for example) and look at the
> process state on the master node before the process dies on the remote
> node, I see two sets of "rsh" processes, with one active process and one
> defunct process in each set.  "ps" shows that each defunct "rsh" is a
> child of an active process.
>
> Following abnormal process termination on the remote node, there will be
> only one active rsh process and one defunct rsh process, confirming that
> the remote processes have cleaned up and exited.  So it seems that
> mpirun_rsh is not responding properly to the death of a child process.
>
> Here's a concrete example showing the process state on the master node
> following termination of the processes on the remote node:
>
> 11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
> USER       PID  PPID S  NI    VSZ   RSS %MEM     TIME COMMAND
> tom       6218 14345 S   0   9000  1984  0.0 00:00:00 tcsh
> tom       6219  6218 S   0   1772   428  0.0 00:00:00 pbs_demu
> tom       6251  6218 S   0   9368  1588  0.0 00:00:00 28027.ty
> tom       6252  6251 S   0  12216  3072  0.0 00:00:00 pbsmvp2
> tom       6257  6252 S   0   5288   676  0.0 00:00:00 mpirun_r
> tom       6258  6257 S   0   6396   692  0.0 00:00:00 rsh
> tom       6261  6260 S   0   9784  2096  0.0 00:00:00 tcsh
> tom       6262  6258 Z   0      0     0  0.0 00:00:00 rsh
> tom       6307  6261 S   0   5492   712  0.0 00:00:00 mpispawn
> tom       6308  6307 R   0 8038032 19576  0.2 00:04:48 rand4
> tom       6309  6307 R   0 8038036 14416  0.1 00:05:06 rand4
> tom       6310  6307 R   0 8037904 14264  0.1 00:05:07 rand4
> tom       6311  6307 R   0 8038032 14328  0.1 00:05:06 rand4
>
> Interestingly, whether the master node detects the remote process
> termination seems to depend on how the remote process dies.  If I hit
> the remote process with a SIGTERM, mpirun_rsh seems to notice and things
> get cleaned up after a minute or two.  If it terminates with something
> else (e.g., a SIGSEGV), the job will sit there forever.
>
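> (Speculating a little, and without having read the mpirun_rsh sources:
> whatever its actual notification path is, the usual way for a launcher
> to catch a child that dies on a signal is to check WIFSIGNALED as well
> as the normal exit status when it reaps the child.  The sketch below
> illustrates that general technique only; the function names are
> hypothetical and none of this is taken from MVAPICH code.)
>
>   #include <stdio.h>
>   #include <signal.h>
>   #include <unistd.h>
>   #include <sys/types.h>
>   #include <sys/wait.h>
>
>   /* Reap any exited children (e.g., from a SIGCHLD handler or a
>    * polling loop) and treat both a nonzero exit code and death by
>    * signal as job failure.
>    */
>   static void reap_children(void)
>   {
>       int status;
>       pid_t pid;
>
>       while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
>           if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
>               fprintf(stderr, "child %d exited with code %d\n",
>                       (int)pid, WEXITSTATUS(status));
>               /* tear_down_job();  hypothetical cleanup hook */
>           } else if (WIFSIGNALED(status)) {
>               fprintf(stderr, "child %d killed by signal %d\n",
>                       (int)pid, WTERMSIG(status));
>               /* tear_down_job();  same cleanup path */
>           }
>       }
>   }
>
>   /* Tiny demo driver: fork a child that dies on SIGSEGV, then reap it. */
>   int main(void)
>   {
>       pid_t child = fork();
>       if (child == 0) {
>           raise(SIGSEGV);                  /* child dies on a signal */
>           _exit(1);
>       }
>       sleep(1);                            /* let the child die first */
>       reap_children();
>       return 0;
>   }
>
> If a launcher acted on only one of those two branches, a child killed by
> a signal could go unnoticed, which would look a lot like the hang
> described above.
>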
> Finally, it's not just remote nodes that suffer from this problem.  The
> behavior is the same if it's a local process on the master node that
> aborts -- the local rsh and its descendants disappear, but mpirun_rsh
> and processes on remote nodes persist.
>
> Now for a few more specifics about our environment:
>
> OS: SuSE Linux Enterprise Server 10 SP1
> Compiler:  PGI 7.1-4
> InfiniBand:  OFED 1.3
> Scheduler:  TORQUE 2.2.1
> Hardware Platform:  Dell SC1435 (Opteron 2218)
>
> Eventually, of course, the job scheduler will time out the job and kill
> the master mpirun_rsh process, which seems to clean everything up OK.
> (In general, top-down kills by the scheduler seem to work fine.  It's
> bottom-up termination that's problematic.)  But much of our workload has
> very long runtimes (on the order of days to weeks), and my users don't
> want to wait that long only to find out that their job actually bombed
> with a segfault several days earlier.
>
> Any thoughts on what might be causing this and how to fix it?
>
> -Tom
>
> --
> Tom Crockett
>
> College of William and Mary               email:  twcroc at wm.edu
> IT/High Performance Computing Group       phone:  (757) 221-2762
> Savage House                              fax:    (757) 221-2023
> P.O. Box 8795
> Williamsburg, VA  23187-8795
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


