[mvapich-discuss] Process Termination Detection with mpirun_rsh

Stecher, Fred Fred.Stecher at atk.com
Wed Aug 20 11:26:57 EDT 2008


Tom,
We use MVAPICH 1.0, which comes with mpirun_rsh, and it has the same
problem; we do not use a scheduler.  When the application aborts a run,
we have to check the nodes by hand.  On any node that still has
processes running after they should have been killed, we have to kill
them one at a time to clear the node.  I would think this is a known
problem and hope it will be corrected soon.


Fred
 

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Tom
Crockett
Sent: Tuesday, August 19, 2008 6:20 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh

Hi,

I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been
experimenting with the new mpirun_rsh job launcher.  In general, I much
prefer this simpler approach, and have found it to be faster and more
reliable than MPD.  However, I'm having one fairly serious problem
relating to termination detection when processes abort.

Here's the scenario:

1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically
with multiple processes per node (multi-process, multi-core).

2. One process dies, e.g., with a segmentation violation, on some random
node (a minimal reproducer sketch appears below, after step 3).

3. The node with the offending process seems to notice this locally; all
the sibling processes and the local mpispawn process terminate. 
However, the remaining nodes (including the master) don't seem to
notice; their processes continue to run (or more likely stall, waiting
on communication which will never arrive).
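
In case it helps with reproducing this, a minimal test case would look
something like the sketch below.  Our real application is the "rand4"
code that shows up in the ps listing further down; this program is just
a hypothetical stand-in in which one rank deliberately crashes with a
SIGSEGV while the others keep communicating.

/* crash_one_rank.c -- hypothetical stand-in for the real application:
 * all ranks loop on MPI_Allreduce, but rank 1 dereferences NULL after a
 * few iterations and dies with SIGSEGV.  Build with mpicc and launch
 * across two or more nodes with "mpirun_rsh -rsh". */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        printf("running on %d ranks\n", size);

    for (i = 0; ; i++) {
        double in = (double) rank, out = 0.0;
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 1 && i == 3) {
            int *p = NULL;
            *p = 42;            /* SIGSEGV on one (remote) rank */
        }
        sleep(1);               /* keep the survivors visibly running */
    }

    /* never reached */
    MPI_Finalize();
    return 0;
}

After rank 1 crashes, the surviving ranks should block in the next
MPI_Allreduce, which matches the "stall, waiting on communication"
behavior described in step 3.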


If I run this experiment on two nodes (for example) and look at the
process state on the master node before the process dies on the remote
node, I see two sets of "rsh" processes, with one active process and one
defunct process in each set.  "ps" shows that each defunct "rsh" is a
child of an active process.

Following abnormal process termination on the remote node, there will be
only one active rsh process and one defunct rsh process, confirming that
the remote processes have cleaned up and exited.  So it seems that
mpirun_rsh is not responding properly to the death of a child process.
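
To spell out what I mean by "responding to the death of a child": I
would expect the launcher to reap its rsh children and to treat an
abnormal exit as fatal for the whole job.  The fragment below is just my
sketch of the usual SIGCHLD/waitpid() idiom to make that expectation
concrete; it is not MVAPICH2's actual code.

/* Sketch of the generic SIGCHLD/waitpid() idiom, not MVAPICH2 code.
 * The handler only sets a flag; the launcher's main loop does the
 * reaping and decides whether to tear the job down. */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static volatile sig_atomic_t child_exited = 0;

static void sigchld_handler(int sig)
{
    (void) sig;
    child_exited = 1;           /* do the real work outside the handler */
}

static void reap_children(void)
{
    int status;
    pid_t pid;

    /* Reap every child that changed state; this removes the zombies. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status)) {
            /* An rsh child died on a signal: the whole job should be
             * torn down instead of leaving surviving ranks spinning. */
            fprintf(stderr, "child %d killed by signal %d\n",
                    (int) pid, WTERMSIG(status));
            exit(1);            /* real code would first kill the others */
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
            fprintf(stderr, "child %d exited with status %d\n",
                    (int) pid, WEXITSTATUS(status));
            exit(1);
        }
    }
}

int main(void)
{
    signal(SIGCHLD, sigchld_handler);
    /* ... fork/exec one rsh per remote node, then in the main loop:
     *     if (child_exited) { child_exited = 0; reap_children(); } */
    return 0;
}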

Here's a concrete example showing the process state on the master node
following termination of the processes on the remote node:

11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
USER       PID  PPID S  NI    VSZ   RSS %MEM     TIME COMMAND
tom       6218 14345 S   0   9000  1984  0.0 00:00:00 tcsh
tom       6219  6218 S   0   1772   428  0.0 00:00:00 pbs_demu
tom       6251  6218 S   0   9368  1588  0.0 00:00:00 28027.ty
tom       6252  6251 S   0  12216  3072  0.0 00:00:00 pbsmvp2
tom       6257  6252 S   0   5288   676  0.0 00:00:00 mpirun_r
tom       6258  6257 S   0   6396   692  0.0 00:00:00 rsh
tom       6261  6260 S   0   9784  2096  0.0 00:00:00 tcsh
tom       6262  6258 Z   0      0     0  0.0 00:00:00 rsh
tom       6307  6261 S   0   5492   712  0.0 00:00:00 mpispawn
tom       6308  6307 R   0 8038032 19576  0.2 00:04:48 rand4
tom       6309  6307 R   0 8038036 14416  0.1 00:05:06 rand4
tom       6310  6307 R   0 8037904 14264  0.1 00:05:07 rand4
tom       6311  6307 R   0 8038032 14328  0.1 00:05:06 rand4

Interestingly, whether the master node detects the remote process
termination seems to depend on how the remote process dies.  If I hit
the remote process with a SIGTERM, mpirun_rsh seems to notice and things
get cleaned up after a minute or two.  If it terminates with something
else (e.g., a SIGSEGV), the job will sit there forever.
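
Given that SIGTERM is the one case that does get noticed, I have been
wondering about a stopgap on the application side: catch the fatal
signal and re-raise it as SIGTERM, so that the parent sees the one
termination mode that already triggers cleanup.  This is untested
speculation on my part (and a SIGSEGV handler is a blunt instrument),
so I mention it only as a possible diagnostic aid, not as a fix:

/* Untested stopgap sketch (my own idea, nothing from MVAPICH2): turn a
 * crash into a SIGTERM-style exit, since that is the case mpirun_rsh
 * appears to handle.  signal() and raise() are async-signal-safe. */
#include <signal.h>

static void crash_to_term(int sig)
{
    (void) sig;
    signal(SIGTERM, SIG_DFL);   /* make sure SIGTERM still terminates us */
    raise(SIGTERM);             /* parent then sees "killed by SIGTERM"  */
}

/* Install early in main(), e.g. right after MPI_Init():
 *
 *     signal(SIGSEGV, crash_to_term);
 *     signal(SIGBUS,  crash_to_term);
 *     signal(SIGFPE,  crash_to_term);
 */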

Finally, it's not just remote nodes that suffer from this problem.  The
behavior is the same if it's a local process on the master node that
aborts -- the local rsh and its descendants disappear, but mpirun_rsh
and processes on remote nodes persist.

Now for a few more specifics about our environment:

OS: SuSE Linux Enterprise Server 10 SP1
Compiler:  PGI 7.1-4
InfiniBand:  OFED 1.3
Scheduler:  TORQUE 2.2.1
Hardware Platform:  Dell SC1435 (Opteron 2218)

Eventually, of course, the job scheduler will time out the job and kill
the master mpirun_rsh process, which seems to clean everything up OK. 
(In general, top-down kills by the scheduler seem to work fine.  It's
bottom-up termination that's problematic.)  But much of our workload has
very long runtimes (on the order of days to weeks), and my users don't
want to wait that long only to find out that their job actually bombed
with a segfault several days earlier.

Any thoughts on what might be causing this and how to fix it?

-Tom

--
Tom Crockett

College of William and Mary               email:  twcroc at wm.edu
IT/High Performance Computing Group       phone:  (757) 221-2762
Savage House                              fax:    (757) 221-2023
P.O. Box 8795
Williamsburg, VA  23187-8795


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


