[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2

Gregory Bauer gbauer at ncsa.uiuc.edu
Thu Jul 19 16:21:37 EDT 2007


We are using mvapich2-0.9.8p2 (with the patch that addresses a start-up
scalability issue applied), built with the make.mvapich2.ofa script
(--with-device=osu_ch3:mrail --with-rdma=gen2) and with ofed-1.2 and
python-2.3.4.

I recently ran a series of 1024-task jobs (128 nodes, 8 cores per node)
via PBS. Two of the 8 jobs were left in a state where the application
had exited but the mpds for each task still remained (the launch
process was still in mpiexec).

I have attached output from ps and from gdb for the backtrace.
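
For anyone who wants to collect the same information on other nodes, here is a minimal sketch of picking the lingering mpd.py daemons out of `ps -f` output so that gdb can be attached to each one. It is not part of MVAPICH2; the names `find_mpd_pids` and `SAMPLE` are hypothetical, and the sample lines are copied from the attached ps listing.

```python
def find_mpd_pids(ps_output):
    """Return (pid, ppid) pairs for mpd.py daemons listed in `ps -f` output."""
    found = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header and any malformed lines
        cmd = fields[7]
        # Keep the python mpd.py daemons, not the ssh processes that launched them
        if "mpd.py" in cmd and not cmd.startswith("ssh "):
            found.append((int(fields[1]), int(fields[2])))
    return found


SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
gbauer   23017     1  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23018     1  0 11:46 ?        00:00:00 ssh -x -n abe0572 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0573 -p 41562  --ncpus=1 -e -d
gbauer   23409 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
"""

# Print one gdb attach command per lingering daemon
for pid, ppid in find_mpd_pids(SAMPLE):
    print("gdb -p %d   # parent %d" % (pid, ppid))
```

In practice the input would come from running `ps -fu <user>` on each node rather than from a pasted sample.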

The application output indicates that it exited correctly. The only
problem is that mpiexec doesn't return, and PBS eventually kills the
job once it exceeds the job wallclock time.

Any ideas?

-Greg


-------------- next part --------------
[gbauer@abe0573 ~]$ !ps
ps -fugbauer
UID        PID  PPID  C STIME TTY          TIME CMD
gbauer   22744  5389  0 11:46 ?        00:00:00 -tcsh
gbauer   22772 22744  0 11:46 ?        00:00:00 pbs_demux
gbauer   22885 22744  0 11:46 ?        00:00:00 /bin/csh /var/spool/torque/mom_priv/jobs/18699.abem5.SC
gbauer   23017     1  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23018     1  0 11:46 ?        00:00:00 ssh -x -n abe0572 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0573 -p 41562  --ncpus=1 -e -d
gbauer   23019     1  0 11:46 ?        00:00:00 ssh -x -n abe0571 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0573 -p 41562  --ncpus=1 -e -d
gbauer   23020     1  0 11:46 ?        00:00:00 ssh -x -n abe0570 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0573 -p 41562  --ncpus=1 -e -d
gbauer   23021     1  0 11:46 ?        00:00:00 ssh -x -n abe0569 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0573 -p 41562  --ncpus=1 -e -d
gbauer   23031     1  0 11:46 ?        00:00:00 ssh -x -n abe0567 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0572 -p 54313  --ncpus=1 -e -d
gbauer   23078     1  0 11:46 ?        00:00:00 ssh -x -n abe0536 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0564 -p 54926  --ncpus=1 -e -d
gbauer   23085     1  0 11:46 ?        00:00:00 ssh -x -n abe0529 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0561 -p 48609  --ncpus=1 -e -d
gbauer   23090     1  0 11:46 ?        00:00:00 ssh -x -n abe0524 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0537 -p 60740  --ncpus=1 -e -d
gbauer   23091     1  0 11:46 ?        00:00:00 ssh -x -n abe0523 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0537 -p 60740  --ncpus=1 -e -d
gbauer   23099     1  0 11:46 ?        00:00:00 ssh -x -n abe0515 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0539 -p 38092  --ncpus=1 -e -d
gbauer   23102     1  0 11:46 ?        00:00:00 ssh -x -n abe0512 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0540 -p 38364  --ncpus=1 -e -d
gbauer   23106     1  0 11:46 ?        00:00:00 ssh -x -n abe0510 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0540 -p 38364  --ncpus=1 -e -d
gbauer   23109     1  0 11:46 ?        00:00:00 ssh -x -n abe0507 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0547 -p 49529  --ncpus=1 -e -d
gbauer   23127     1  0 11:46 ?        00:00:00 ssh -x -n abe0489 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0563 -p 52970  --ncpus=1 -e -d
gbauer   23129     1  0 11:46 ?        00:00:00 ssh -x -n abe0487 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0532 -p 60282  --ncpus=1 -e -d
gbauer   23132     1  0 11:46 ?        00:00:00 ssh -x -n abe0484 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0549 -p 59222  --ncpus=1 -e -d
gbauer   23135     1  0 11:46 ?        00:00:00 ssh -x -n abe0482 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0549 -p 59222  --ncpus=1 -e -d
gbauer   23136     1  0 11:46 ?        00:00:00 ssh -x -n abe0481 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0549 -p 59222  --ncpus=1 -e -d
gbauer   23141     1  0 11:46 ?        00:00:00 ssh -x -n abe0477 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0568 -p 59358  --ncpus=1 -e -d
gbauer   23142     1  0 11:46 ?        00:00:00 ssh -x -n abe0476 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0566 -p 51182  --ncpus=1 -e -d
gbauer   23144     1  0 11:46 ?        00:00:00 ssh -x -n abe0474 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0566 -p 51182  --ncpus=1 -e -d
gbauer   23147     1  0 11:46 ?        00:00:00 ssh -x -n abe0471 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0565 -p 34331  --ncpus=1 -e -d
gbauer   23148     1  0 11:46 ?        00:00:00 ssh -x -n abe0470 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0565 -p 34331  --ncpus=1 -e -d
gbauer   23152     1  0 11:46 ?        00:00:00 ssh -x -n abe0468 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0536 -p 47656  --ncpus=1 -e -d
gbauer   23229     1  0 11:46 ?        00:00:00 ssh -x -n abe0447 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0508 -p 50870  --ncpus=1 -e -d
gbauer   23231     1  0 11:46 ?        00:00:00 ssh -x -n abe0446 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0508 -p 50870  --ncpus=1 -e -d
gbauer   23232     1  0 11:46 ?        00:00:00 ssh -x -n abe0445 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0508 -p 50870  --ncpus=1 -e -d
gbauer   23233     1  0 11:46 ?        00:00:00 ssh -x -n abe0444 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0533 -p 35859  --ncpus=1 -e -d
gbauer   23234     1  0 11:46 ?        00:00:00 ssh -x -n abe0443 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0533 -p 35859  --ncpus=1 -e -d
gbauer   23235     1  0 11:46 ?        00:00:00 ssh -x -n abe0442 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0533 -p 35859  --ncpus=1 -e -d
gbauer   23367     1  0 11:46 ?        00:00:00 ssh -x -n abe0439 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0526 -p 42820  --ncpus=1 -e -d
gbauer   23368     1  0 11:46 ?        00:00:00 ssh -x -n abe0438 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0526 -p 42820  --ncpus=1 -e -d
gbauer   23369     1  0 11:46 ?        00:00:00 ssh -x -n abe0437 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0526 -p 42820  --ncpus=1 -e -d
gbauer   23370     1  0 11:46 ?        00:00:00 ssh -x -n abe0436 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0521 -p 59167  --ncpus=1 -e -d
gbauer   23371     1  0 11:46 ?        00:00:00 ssh -x -n abe0435 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0521 -p 59167  --ncpus=1 -e -d
gbauer   23372     1  0 11:46 ?        00:00:00 ssh -x -n abe0434 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0521 -p 59167  --ncpus=1 -e -d
gbauer   23373     1  0 11:46 ?        00:00:00 ssh -x -n abe0432 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0521 -p 59167  --ncpus=1 -e -d
gbauer   23382     1  0 11:46 ?        00:00:00 ssh -x -n abe0431 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0518 -p 35191  --ncpus=1 -e -d
gbauer   23383     1  0 11:46 ?        00:00:00 ssh -x -n abe0430 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py  -h abe0518 -p 35191  --ncpus=1 -e -d
gbauer   23401 22885  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpiexec -machinefile /var/spool/torque/aux//18699.ab
gbauer   23402 23017  0 11:46 ?        00:00:01 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23403 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23404 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23405 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23406 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23407 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23408 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23409 23017  0 11:46 ?        00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer   23554 23552  0 12:12 ?        00:00:00 sshd: gbauer@pts/0
gbauer   23555 23554  0 12:12 pts/0    00:00:00 -tcsh
gbauer   23675 23555  0 12:13 pts/0    00:00:00 ps -fugbauer
[gbauer@abe0573 ~]$ gdb -p 23409
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 23409
Reading symbols from /usr/bin/python2.3...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /usr/lib64/libpython2.3.so.1.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpython2.3.so.1.0
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182904536224 (LWP 23409)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/python2.3/lib-dynload/timemodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/timemodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_socketmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_socketmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_ssl.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_ssl.so
Reading symbols from /lib64/libssl.so.4...(no debugging symbols found)...done.
Loaded symbols for /lib64/libssl.so.4
Reading symbols from /lib64/libcrypto.so.4...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypto.so.4
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgssapi_krb5.so.2
Reading symbols from /usr/lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /usr/lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libk5crypto.so.3
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /usr/lib64/python2.3/lib-dynload/selectmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/selectmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/strop.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/strop.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/cPickle.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/cPickle.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/cStringIO.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/cStringIO.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/md5module.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/md5module.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/mathmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/mathmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_random.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_random.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/pwdmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/pwdmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/grpmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/grpmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/syslogmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/syslogmodule.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2

0x0000002a95de68f5 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) where
#0  0x0000002a95de68f5 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a971b1b4e in ?? () from /usr/lib64/python2.3/lib-dynload/selectmodule.so
#2  0x0000002a956f563f in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#3  0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#4  0x0000002a956f598a in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#5  0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#6  0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#7  0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#8  0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#9  0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#10 0x0000002a956b3d9d in PyFunction_SetClosure () from /usr/lib64/libpython2.3.so.1.0
#11 0x0000002a956a1390 in PyObject_Call () from /usr/lib64/libpython2.3.so.1.0
#12 0x0000002a956f499f in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#13 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#14 0x0000002a956f598a in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#15 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#16 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#17 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#18 0x0000002a956f7412 in PyEval_EvalCode () from /usr/lib64/libpython2.3.so.1.0
#19 0x0000002a95710039 in PyErr_Display () from /usr/lib64/libpython2.3.so.1.0
#20 0x0000002a9571101d in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.3.so.1.0
#21 0x0000002a95716718 in Py_Main () from /usr/lib64/libpython2.3.so.1.0
#22 0x0000002a95d433fb in __libc_start_main () from /lib64/tls/libc.so.6
#23 0x00000000004006ba in _start ()
#24 0x0000007fbfffe528 in ?? ()
#25 0x000000000000001c in ?? ()
#26 0x0000000000000005 in ?? ()
#27 0x0000007fbfffe920 in ?? ()
#28 0x0000007fbfffe92a in ?? ()
#29 0x0000007fbfffe967 in ?? ()
#30 0x0000007fbfffe971 in ?? ()
#31 0x0000007fbfffe974 in ?? ()
#32 0x0000000000000000 in ?? ()
(gdb)
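
Frame #0 of the backtrace is in select(), reached through Python's selectmodule.so, which suggests the process is idle and waiting on its descriptors rather than spinning. The following is a minimal sketch of that behaviour, not mpd's actual code: a socket pair stands in for a daemon and its peer, and the names `daemon_end` and `peer_end` are assumptions made for illustration.

```python
import select
import socket

# Two connected sockets stand in for a daemon and its peer
# (an assumption; this is not how mpd sets up its connections).
daemon_end, peer_end = socket.socketpair()

# Nothing has been sent yet: with a timeout, select() gives up and
# returns empty lists instead of waiting.
readable, _, _ = select.select([daemon_end], [], [], 0.1)
idle = (readable == [])

# Once the peer writes (or closes), the descriptor becomes readable.
peer_end.sendall(b"exit")
readable, _, _ = select.select([daemon_end], [], [], 0.1)
woke = (readable == [daemon_end])

print(idle, woke)

# Without the timeout argument, select.select([daemon_end], [], [])
# blocks until the peer writes or closes. If neither ever happens,
# the call never returns.
daemon_end.close()
peer_end.close()
```

If the real daemons wait in a similar way with no timeout, a shutdown message that is lost or never sent would leave them in exactly this state, but that is a guess and not something the backtrace alone confirms.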

