[mvapich-discuss] Process Termination Detection with mpirun_rsh
Tom Crockett
twcroc at wm.edu
Tue Aug 19 19:19:31 EDT 2008
Hi,
I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been
experimenting with the new mpirun_rsh job launcher. In general, I much
prefer this simpler approach, and have found it to be faster and more
reliable than MPD. However, I'm having one fairly serious problem
relating to termination detection when processes abort.
Here's the scenario:
1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically
with multiple processes per node (multi-process, multi-core).
2. One process dies, e.g., with a segmentation violation, on some random
node.
3. The node with the offending process seems to notice this locally; all
the sibling processes and the local mpispawn process terminate.
However, the remaining nodes (including the master) don't seem to
notice; their processes continue to run or, more likely, stall waiting
on communication that will never arrive. (A minimal reproducer is
sketched below.)
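For reference, the kind of test case I've been using boils down to
something like the following (my own reproducer, not anything shipped
with MVAPICH2): rank 1 dereferences a null pointer while every other
rank blocks in a receive that can never complete.

    /* abort_test.c -- minimal reproducer (hypothetical name).  Needs
     * at least two ranks: rank 1 segfaults shortly after startup, the
     * rest hang in MPI_Recv waiting for a message that never comes. */
    #include <stddef.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            volatile int *p = NULL;
            *p = 42;                        /* deliberate SIGSEGV */
        } else {
            /* No one ever sends, so this receive never completes. */
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

launched with something along the lines of

    mpirun_rsh -rsh -np 8 -hostfile $PBS_NODEFILE ./abort_test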
If I run this experiment on two nodes (for example) and look at the
process state on the master node before the process dies on the remote
node, I see two sets of "rsh" processes, with one active process and one
defunct process in each set. "ps" shows that each defunct "rsh" is a
child of the corresponding active "rsh".
Following abnormal process termination on the remote node, there will be
only one active rsh process and one defunct rsh process, confirming that
the remote processes have cleaned up and exited. So it seems that
mpirun_rsh is not responding properly to the death of a child process.
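Here's the sort of reaping logic I'd naively expect the launcher to
need -- just a generic sketch of the standard SIGCHLD/waitpid(WNOHANG)
pattern, not MVAPICH2's actual code:

    #include <errno.h>
    #include <signal.h>
    #include <stddef.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile sig_atomic_t child_failed = 0;

    static void on_sigchld(int sig)
    {
        int saved_errno = errno;        /* a handler must preserve errno */
        int status;

        /* Reap *every* dead child so none is left defunct, and flag
         * any abnormal death -- killed by a signal or exited nonzero. */
        while (waitpid(-1, &status, WNOHANG) > 0) {
            if (WIFSIGNALED(status) ||
                (WIFEXITED(status) && WEXITSTATUS(status) != 0))
                child_failed = 1;
        }

        (void)sig;
        errno = saved_errno;
    }

    int main(void)
    {
        struct sigaction sa;
        sigset_t block, oldmask;

        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGCHLD, &sa, NULL);  /* install before forking */

        sigemptyset(&block);
        sigaddset(&block, SIGCHLD);
        sigprocmask(SIG_BLOCK, &block, &oldmask);

        if (fork() == 0) {              /* stand-in for one rsh child... */
            volatile int *p = NULL;
            *p = 42;                    /* ...which dies with SIGSEGV */
        }

        /* sigsuspend() atomically unblocks SIGCHLD and waits, so the
         * notification can't slip through between test and wait. */
        while (!child_failed)
            sigsuspend(&oldmask);

        return 1;   /* a real launcher would tear down the job here */
    }

The WNOHANG loop matters because a single SIGCHLD can stand in for
several dead children; reaping only one per signal would leave exactly
the sort of defunct "rsh" shown below.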
Here's a concrete example showing the process state on the master node
following termination of the processes on the remote node:
11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
USER  PID  PPID S NI     VSZ   RSS %MEM     TIME COMMAND
tom  6218 14345 S  0    9000  1984  0.0 00:00:00 tcsh
tom  6219  6218 S  0    1772   428  0.0 00:00:00 pbs_demu
tom  6251  6218 S  0    9368  1588  0.0 00:00:00 28027.ty
tom  6252  6251 S  0   12216  3072  0.0 00:00:00 pbsmvp2
tom  6257  6252 S  0    5288   676  0.0 00:00:00 mpirun_r
tom  6258  6257 S  0    6396   692  0.0 00:00:00 rsh
tom  6261  6260 S  0    9784  2096  0.0 00:00:00 tcsh
tom  6262  6258 Z  0       0     0  0.0 00:00:00 rsh
tom  6307  6261 S  0    5492   712  0.0 00:00:00 mpispawn
tom  6308  6307 R  0 8038032 19576  0.2 00:04:48 rand4
tom  6309  6307 R  0 8038036 14416  0.1 00:05:06 rand4
tom  6310  6307 R  0 8037904 14264  0.1 00:05:07 rand4
tom  6311  6307 R  0 8038032 14328  0.1 00:05:06 rand4
Interestingly, whether the master node detects the remote process
termination seems to depend on how the remote process dies. If I hit
the remote process with a SIGTERM, mpirun_rsh seems to notice and things
get cleaned up after a minute or two. If it terminates with something
else (e.g., a SIGSEGV), the job will sit there forever.
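What puzzles me about this is that whichever parent directly owns the
dying process (mpispawn, presumably) should see both kinds of death
the same way: waitpid() reports WIFSIGNALED whether the child took a
SIGTERM or a SIGSEGV. A standalone scaffold (again mine, unrelated to
MVAPICH2) confirms it:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int status;
        pid_t pid = fork();

        if (pid == 0) {         /* child just waits to be killed */
            pause();
            _exit(0);
        }

        kill(pid, SIGSEGV);     /* swap in SIGTERM to compare */
        waitpid(pid, &status, 0);

        if (WIFSIGNALED(status))
            printf("child killed by signal %d (%s)\n",
                   WTERMSIG(status), strsignal(WTERMSIG(status)));
        else if (WIFEXITED(status))
            printf("child exited with status %d\n",
                   WEXITSTATUS(status));
        return 0;
    }

This prints "killed by signal 11 (Segmentation fault)"; with SIGTERM
swapped in it reports signal 15. Either way the death is visible to
the parent, so I don't see why only one of them propagates.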
Finally, it's not just remote nodes that suffer from this problem. The
behavior is the same if it's a local process on the master node that
aborts -- the local rsh and its descendants disappear, but mpirun_rsh
and processes on remote nodes persist.
Now for a few more specifics about our environment:
OS: SuSE Linux Enterprise Server 10 SP1
Compiler: PGI 7.1-4
InfiniBand: OFED 1.3
Scheduler: TORQUE 2.2.1
Hardware Platform: Dell SC1435 (Opteron 2218)
Eventually, of course, the job scheduler will time out the job and kill
the master mpirun_rsh process, which seems to clean everything up OK.
(In general, top-down kills by the scheduler seem to work fine. It's
bottom-up termination that's problematic.) But much of our workload has
very long runtimes (on the order of days to weeks), and my users don't
want to wait that long only to find out that their job actually bombed
with a segfault several days earlier.
Any thoughts on what might be causing this and how to fix it?
-Tom
--
Tom Crockett
College of William and Mary email: twcroc at wm.edu
IT/High Performance Computing Group phone: (757) 221-2762
Savage House fax: (757) 221-2023
P.O. Box 8795
Williamsburg, VA 23187-8795