[mvapich-discuss] Process Termination Detection with mpirun_rsh

Tom Crockett twcroc at wm.edu
Tue Aug 19 19:19:31 EDT 2008


Hi,

I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been 
experimenting with the new mpirun_rsh job launcher.  In general, I much 
prefer this simpler approach, and have found it to be faster and more 
reliable than MPD.  However, I'm having one fairly serious problem 
relating to termination detection when processes abort.

Here's the scenario:

1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically 
with multiple processes per node (multi-process, multi-core).

2. One process dies, e.g., with a segmentation violation, on some random 
node.  (A stripped-down example of this kind of failure follows the list.)

3. The node with the offending process seems to notice this locally; all 
the sibling processes and the local mpispawn process terminate. 
However, the remaining nodes (including the master) don't seem to 
notice; their processes continue to run (or more likely stall, waiting 
on communication which will never arrive).

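For concreteness, here's a stripped-down program of the sort I have in 
mind (purely illustrative -- our real codes are much larger): one rank 
dereferences a null pointer while the others block on communication 
that can never complete.

#include <mpi.h>
#include <stddef.h>

/* Illustrative reproducer: rank 1 segfaults; the other ranks block
 * forever waiting on communication that will never arrive. */
int main(int argc, char **argv)
{
    int rank, dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        volatile int *p = NULL;
        *p = 42;                        /* segmentation violation */
    } else if (rank == 0) {
        /* rank 0 waits for a message from rank 1 that never comes */
        MPI_Recv(&dummy, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);    /* remaining ranks stall here */
    }

    MPI_Finalize();
    return 0;
}

Launched with "mpirun_rsh -rsh" across two nodes, a program like this 
should end up in exactly the stalled state described above (assuming 
the failing rank lands on a remote node).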

If I run this experiment on two nodes (for example) and look at the 
process state on the master node before the process dies on the remote 
node, I see two sets of "rsh" processes, with one active process and one 
defunct process in each set.  "ps" shows that each defunct "rsh" is a 
child of the corresponding active one.

Following abnormal process termination on the remote node, there will be 
only one active rsh process and one defunct rsh process, confirming that 
the remote processes have cleaned up and exited.  So it seems that 
mpirun_rsh is not responding properly to the death of a child process.
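
I haven't looked at the mpirun_rsh sources, so I don't know whether it 
watches its rsh children this way or learns about remote failures over 
a socket from mpispawn.  Just to make explicit what I expected to 
happen, here is a sketch of the usual parent-side pattern; nothing in 
it is taken from MVAPICH2, and the names and structure are my own:

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch only -- not MVAPICH2 code.  A launcher that forks one rsh per
 * remote node can sit in wait() and tear the job down as soon as any
 * child exits abnormally. */
int main(void)
{
    pid_t pid;
    int status;

    /* ... fork/exec one rsh per remote node here ... */

    /* Block until a child changes state; repeat until none are left. */
    while ((pid = wait(&status)) > 0) {
        if (WIFSIGNALED(status) ||
            (WIFEXITED(status) && WEXITSTATUS(status) != 0)) {
            fprintf(stderr, "child %d died abnormally; cleaning up\n",
                    (int)pid);
            signal(SIGTERM, SIG_IGN);  /* don't take ourselves down too */
            kill(0, SIGTERM);          /* signal the rest of the group */
            return 1;
        }
    }
    return 0;
}

In other words, I expected any abnormal child exit -- local or remote 
-- to take the whole job down promptly.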

Here's a concrete example showing the process state on the master node 
following termination of the processes on the remote node:

11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
USER       PID  PPID S  NI    VSZ   RSS %MEM     TIME COMMAND
tom       6218 14345 S   0   9000  1984  0.0 00:00:00 tcsh
tom       6219  6218 S   0   1772   428  0.0 00:00:00 pbs_demu
tom       6251  6218 S   0   9368  1588  0.0 00:00:00 28027.ty
tom       6252  6251 S   0  12216  3072  0.0 00:00:00 pbsmvp2
tom       6257  6252 S   0   5288   676  0.0 00:00:00 mpirun_r
tom       6258  6257 S   0   6396   692  0.0 00:00:00 rsh
tom       6261  6260 S   0   9784  2096  0.0 00:00:00 tcsh
tom       6262  6258 Z   0      0     0  0.0 00:00:00 rsh
tom       6307  6261 S   0   5492   712  0.0 00:00:00 mpispawn
tom       6308  6307 R   0 8038032 19576  0.2 00:04:48 rand4
tom       6309  6307 R   0 8038036 14416  0.1 00:05:06 rand4
tom       6310  6307 R   0 8037904 14264  0.1 00:05:07 rand4
tom       6311  6307 R   0 8038032 14328  0.1 00:05:06 rand4

Interestingly, whether the master node detects the remote process 
termination seems to depend on how the remote process dies.  If I hit 
the remote process with a SIGTERM, mpirun_rsh seems to notice and things 
get cleaned up after a minute or two.  If it terminates with something 
else (e.g., a SIGSEGV), the job will sit there forever.
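
That asymmetry puzzles me, because whichever local parent reaps the 
dead process (mpispawn, presumably) sees a SIGTERM death and a SIGSEGV 
death the same way through waitpid(): WIFSIGNALED() is true in both 
cases and only WTERMSIG() differs.  So I'd naively expect whatever gets 
reported back to the master to be identical regardless of the signal. 
A quick standalone check (nothing MVAPICH2-specific) illustrates the 
point:

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, kill it with the given signal, and print what the
 * parent sees via waitpid().  SIGTERM and SIGSEGV are reported
 * identically, apart from the signal number itself. */
static void try_signal(int sig)
{
    int status;
    pid_t pid = fork();

    if (pid == 0) {             /* child: just wait to be killed */
        pause();
        _exit(0);
    }

    sleep(1);                   /* give the child time to reach pause() */
    kill(pid, sig);
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status))
        printf("child %d killed by signal %d\n",
               (int)pid, WTERMSIG(status));
}

int main(void)
{
    try_signal(SIGTERM);
    try_signal(SIGSEGV);
    return 0;
}

Since the parent-side reaping looks identical in both cases, I assume 
the difference lies in how (or whether) the failure gets propagated 
back to mpirun_rsh, but that's just a guess.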

Finally, it's not just remote nodes that suffer from this problem.  The 
behavior is the same if it's a local process on the master node that 
aborts -- the local rsh and its descendants disappear, but mpirun_rsh 
and processes on remote nodes persist.

Now for a few more specifics about our environment:

OS: SuSE Linux Enterprise Server 10 SP1
Compiler:  PGI 7.1-4
InfiniBand:  OFED 1.3
Scheduler:  TORQUE 2.2.1
Hardware Platform:  Dell SC1435 (Opteron 2218)

Eventually, of course, the job scheduler will time out the job and kill 
the master mpirun_rsh process, which seems to clean everything up OK. 
(In general, top-down kills by the scheduler seem to work fine.  It's 
bottom-up termination that's problematic.)  But much of our workload has 
very long runtimes (on the order of days to weeks), and my users don't 
want to wait that long only to find out that their job actually bombed 
with a segfault several days earlier.

Any thoughts on what might be causing this and how to fix it?

-Tom

-- 
Tom Crockett

College of William and Mary               email:  twcroc at wm.edu
IT/High Performance Computing Group       phone:  (757) 221-2762
Savage House                              fax:    (757) 221-2023
P.O. Box 8795
Williamsburg, VA  23187-8795



