[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2
Gregory Bauer
gbauer at ncsa.uiuc.edu
Thu Jul 19 16:21:37 EDT 2007
We are using mvapich2-0.9.8p2 (with the patch that addresses a start-up
scalability issue applied), built with the make.mvapich2.ofa script
(--with-device=osu_ch3:mrail --with-rdma=gen2) against ofed-1.2, with
python-2.3.4.
I recently ran a series of 1024-task jobs (128 nodes, 8 cores per node)
via PBS. Of 8 jobs, two were left in a state where the application had
exited but the mpd processes for each task remained (the launching
process was still in mpiexec).
I have attached output from ps and from gdb for the backtrace.
The application output indicates that it exited correctly. The only
problem is that mpiexec does not return, so PBS eventually kills the job
once it exceeds its wallclock limit.
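One way to keep a hung launcher from consuming the rest of the wallclock
allocation is to run it under a deadline. The following is a minimal
sketch, not part of the setup described above: it assumes a modern
Python 3 interpreter (not the python-2.3.4 used here) on a POSIX system,
and run_with_deadline is a hypothetical helper name. It only bounds the
wait on the launcher. It does not explain the hang, and it would not
remove mpd daemons that have already detached from the launcher's
process group.

```python
import os
import signal
import subprocess
import sys


def run_with_deadline(cmd, deadline_s, grace_s=5):
    """Run cmd and return its exit code, or None if it hit the deadline.

    Hypothetical helper for illustration only; cmd is an argument list
    such as the job's launcher command line.
    """
    # start_new_session=True gives the child its own session and process
    # group, so the whole group can be signalled if it never returns.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Ask the group to terminate first, then force it if needed.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        return None


if __name__ == "__main__":
    # Stand-in command: a child that exits immediately.
    print(run_with_deadline([sys.executable, "-c", "pass"], deadline_s=10))
```

In a job script, cmd would be the job's actual mpiexec command line and
deadline_s a value chosen below the PBS wallclock limit.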
Any ideas?
-Greg
-------------- next part --------------
[gbauer@abe0573 ~]$ !ps
ps -fugbauer
UID PID PPID C STIME TTY TIME CMD
gbauer 22744 5389 0 11:46 ? 00:00:00 -tcsh
gbauer 22772 22744 0 11:46 ? 00:00:00 pbs_demux
gbauer 22885 22744 0 11:46 ? 00:00:00 /bin/csh /var/spool/torque/mom_priv/jobs/18699.abem5.SC
gbauer 23017 1 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23018 1 0 11:46 ? 00:00:00 ssh -x -n abe0572 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0573 -p 41562 --ncpus=1 -e -d
gbauer 23019 1 0 11:46 ? 00:00:00 ssh -x -n abe0571 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0573 -p 41562 --ncpus=1 -e -d
gbauer 23020 1 0 11:46 ? 00:00:00 ssh -x -n abe0570 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0573 -p 41562 --ncpus=1 -e -d
gbauer 23021 1 0 11:46 ? 00:00:00 ssh -x -n abe0569 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0573 -p 41562 --ncpus=1 -e -d
gbauer 23031 1 0 11:46 ? 00:00:00 ssh -x -n abe0567 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0572 -p 54313 --ncpus=1 -e -d
gbauer 23078 1 0 11:46 ? 00:00:00 ssh -x -n abe0536 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0564 -p 54926 --ncpus=1 -e -d
gbauer 23085 1 0 11:46 ? 00:00:00 ssh -x -n abe0529 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0561 -p 48609 --ncpus=1 -e -d
gbauer 23090 1 0 11:46 ? 00:00:00 ssh -x -n abe0524 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0537 -p 60740 --ncpus=1 -e -d
gbauer 23091 1 0 11:46 ? 00:00:00 ssh -x -n abe0523 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0537 -p 60740 --ncpus=1 -e -d
gbauer 23099 1 0 11:46 ? 00:00:00 ssh -x -n abe0515 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0539 -p 38092 --ncpus=1 -e -d
gbauer 23102 1 0 11:46 ? 00:00:00 ssh -x -n abe0512 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0540 -p 38364 --ncpus=1 -e -d
gbauer 23106 1 0 11:46 ? 00:00:00 ssh -x -n abe0510 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0540 -p 38364 --ncpus=1 -e -d
gbauer 23109 1 0 11:46 ? 00:00:00 ssh -x -n abe0507 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0547 -p 49529 --ncpus=1 -e -d
gbauer 23127 1 0 11:46 ? 00:00:00 ssh -x -n abe0489 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0563 -p 52970 --ncpus=1 -e -d
gbauer 23129 1 0 11:46 ? 00:00:00 ssh -x -n abe0487 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0532 -p 60282 --ncpus=1 -e -d
gbauer 23132 1 0 11:46 ? 00:00:00 ssh -x -n abe0484 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0549 -p 59222 --ncpus=1 -e -d
gbauer 23135 1 0 11:46 ? 00:00:00 ssh -x -n abe0482 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0549 -p 59222 --ncpus=1 -e -d
gbauer 23136 1 0 11:46 ? 00:00:00 ssh -x -n abe0481 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0549 -p 59222 --ncpus=1 -e -d
gbauer 23141 1 0 11:46 ? 00:00:00 ssh -x -n abe0477 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0568 -p 59358 --ncpus=1 -e -d
gbauer 23142 1 0 11:46 ? 00:00:00 ssh -x -n abe0476 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0566 -p 51182 --ncpus=1 -e -d
gbauer 23144 1 0 11:46 ? 00:00:00 ssh -x -n abe0474 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0566 -p 51182 --ncpus=1 -e -d
gbauer 23147 1 0 11:46 ? 00:00:00 ssh -x -n abe0471 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0565 -p 34331 --ncpus=1 -e -d
gbauer 23148 1 0 11:46 ? 00:00:00 ssh -x -n abe0470 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0565 -p 34331 --ncpus=1 -e -d
gbauer 23152 1 0 11:46 ? 00:00:00 ssh -x -n abe0468 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0536 -p 47656 --ncpus=1 -e -d
gbauer 23229 1 0 11:46 ? 00:00:00 ssh -x -n abe0447 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0508 -p 50870 --ncpus=1 -e -d
gbauer 23231 1 0 11:46 ? 00:00:00 ssh -x -n abe0446 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0508 -p 50870 --ncpus=1 -e -d
gbauer 23232 1 0 11:46 ? 00:00:00 ssh -x -n abe0445 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0508 -p 50870 --ncpus=1 -e -d
gbauer 23233 1 0 11:46 ? 00:00:00 ssh -x -n abe0444 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0533 -p 35859 --ncpus=1 -e -d
gbauer 23234 1 0 11:46 ? 00:00:00 ssh -x -n abe0443 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0533 -p 35859 --ncpus=1 -e -d
gbauer 23235 1 0 11:46 ? 00:00:00 ssh -x -n abe0442 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0533 -p 35859 --ncpus=1 -e -d
gbauer 23367 1 0 11:46 ? 00:00:00 ssh -x -n abe0439 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0526 -p 42820 --ncpus=1 -e -d
gbauer 23368 1 0 11:46 ? 00:00:00 ssh -x -n abe0438 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0526 -p 42820 --ncpus=1 -e -d
gbauer 23369 1 0 11:46 ? 00:00:00 ssh -x -n abe0437 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0526 -p 42820 --ncpus=1 -e -d
gbauer 23370 1 0 11:46 ? 00:00:00 ssh -x -n abe0436 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0521 -p 59167 --ncpus=1 -e -d
gbauer 23371 1 0 11:46 ? 00:00:00 ssh -x -n abe0435 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0521 -p 59167 --ncpus=1 -e -d
gbauer 23372 1 0 11:46 ? 00:00:00 ssh -x -n abe0434 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0521 -p 59167 --ncpus=1 -e -d
gbauer 23373 1 0 11:46 ? 00:00:00 ssh -x -n abe0432 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0521 -p 59167 --ncpus=1 -e -d
gbauer 23382 1 0 11:46 ? 00:00:00 ssh -x -n abe0431 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0518 -p 35191 --ncpus=1 -e -d
gbauer 23383 1 0 11:46 ? 00:00:00 ssh -x -n abe0430 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py -h abe0518 -p 35191 --ncpus=1 -e -d
gbauer 23401 22885 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpiexec -machinefile /var/spool/torque/aux//18699.ab
gbauer 23402 23017 0 11:46 ? 00:00:01 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23403 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23404 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23405 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23406 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23407 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23408 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23409 23017 0 11:46 ? 00:00:00 python2.3 /usr/local/mvapich2-0.9.8p2patched-intel-ofed-1.2/bin/mpd.py --ncpus=1 -e -d
gbauer 23554 23552 0 12:12 ? 00:00:00 sshd: gbauer@pts/0
gbauer 23555 23554 0 12:12 pts/0 00:00:00 -tcsh
gbauer 23675 23555 0 12:13 pts/0 00:00:00 ps -fugbauer
[gbauer@abe0573 ~]$ gdb -p 23409
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 23409
Reading symbols from /usr/bin/python2.3...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /usr/lib64/libpython2.3.so.1.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpython2.3.so.1.0
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182904536224 (LWP 23409)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/tls/libm.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/python2.3/lib-dynload/timemodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/timemodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_socketmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_socketmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_ssl.so...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_ssl.so
Reading symbols from /lib64/libssl.so.4...(no debugging symbols found)...done.
Loaded symbols for /lib64/libssl.so.4
Reading symbols from /lib64/libcrypto.so.4...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypto.so.4
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgssapi_krb5.so.2
Reading symbols from /usr/lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /usr/lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libk5crypto.so.3
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /usr/lib64/python2.3/lib-dynload/selectmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/selectmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/strop.so...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/strop.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/cPickle.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/cPickle.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/cStringIO.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/cStringIO.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/md5module.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/md5module.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/mathmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/mathmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/_random.so...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/_random.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/pwdmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/pwdmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/grpmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/grpmodule.so
Reading symbols from /usr/lib64/python2.3/lib-dynload/syslogmodule.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/python2.3/lib-dynload/syslogmodule.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000002a95de68f5 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) where
#0 0x0000002a95de68f5 in __select_nocancel () from /lib64/tls/libc.so.6
#1 0x0000002a971b1b4e in ?? () from /usr/lib64/python2.3/lib-dynload/selectmodule.so
#2 0x0000002a956f563f in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#3 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#4 0x0000002a956f598a in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#5 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#6 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#7 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#8 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#9 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#10 0x0000002a956b3d9d in PyFunction_SetClosure () from /usr/lib64/libpython2.3.so.1.0
#11 0x0000002a956a1390 in PyObject_Call () from /usr/lib64/libpython2.3.so.1.0
#12 0x0000002a956f499f in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#13 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#14 0x0000002a956f598a in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#15 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#16 0x0000002a956f67de in _PyEval_SliceIndex () from /usr/lib64/libpython2.3.so.1.0
#17 0x0000002a956f71ae in PyEval_EvalCodeEx () from /usr/lib64/libpython2.3.so.1.0
#18 0x0000002a956f7412 in PyEval_EvalCode () from /usr/lib64/libpython2.3.so.1.0
#19 0x0000002a95710039 in PyErr_Display () from /usr/lib64/libpython2.3.so.1.0
#20 0x0000002a9571101d in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.3.so.1.0
#21 0x0000002a95716718 in Py_Main () from /usr/lib64/libpython2.3.so.1.0
#22 0x0000002a95d433fb in __libc_start_main () from /lib64/tls/libc.so.6
#23 0x00000000004006ba in _start ()
#24 0x0000007fbfffe528 in ?? ()
#25 0x000000000000001c in ?? ()
#26 0x0000000000000005 in ?? ()
#27 0x0000007fbfffe920 in ?? ()
#28 0x0000007fbfffe92a in ?? ()
#29 0x0000007fbfffe967 in ?? ()
#30 0x0000007fbfffe971 in ?? ()
#31 0x0000007fbfffe974 in ?? ()
#32 0x0000000000000000 in ?? ()
(gdb)
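A listing like the ps output above can be summarised with a short
script when comparing hung and clean jobs. This is a minimal sketch
under stated assumptions: it expects the eight-column `ps -f` layout
shown above (UID PID PPID C STIME TTY TIME CMD), it uses Python 3 rather
than python-2.3.4, and summarize_ps and its category names are
hypothetical. It separates the ssh commands that start remote mpd.py
instances, mpd.py processes whose parent is PID 1, and mpd.py processes
with any other parent (in the listing above, the eight children of PID
23017, which presumably correspond to the eight tasks on this node).

```python
def summarize_ps(ps_text):
    """Count mpd-related processes in `ps -f` style text.

    Hypothetical helper; assumes the columns
    UID PID PPID C STIME TTY TIME CMD.
    """
    counts = {"ssh_launchers": 0, "mpd_daemons": 0, "mpd_children": 0}
    for line in ps_text.splitlines():
        # Split into at most 8 fields so CMD keeps its internal spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header, shell prompt, or malformed line
        ppid, cmd = fields[2], fields[7]
        if "mpd.py" not in cmd:
            continue
        if cmd.startswith("ssh "):
            counts["ssh_launchers"] += 1
        elif ppid == "1":
            counts["mpd_daemons"] += 1
        else:
            counts["mpd_children"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: ps -fu <user> | python3 summarize_ps.py
    print(summarize_ps(sys.stdin.read()))
```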