[mvapich-discuss] MPI process not terminating
Kin Fai Tse
kftse20031207 at gmail.com
Sat Oct 25 11:46:24 EDT 2014
Dear all,
On our cluster, MPI programs intermittently fail to terminate after being
sent a termination signal.
We have tried mvapich versions 1.9 and 2.1a, and both show the same
issue.
After a user presses Ctrl+C during an MPI program run, or uses qdel from
Torque PBS to terminate a running MPI program, there is a chance that the
termination never completes.
Here is one instance, produced while running the VASP package:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp 00000000005F3D2B Unknown Unknown Unknown
vasp 0000000000A95BC4 Unknown Unknown Unknown
vasp 0000000000AB1612 Unknown Unknown Unknown
vasp 0000000000437505 Unknown Unknown Unknown
vasp 00000000004182CC Unknown Unknown Unknown
libc.so.6 0000003DA2A1ECDD Unknown Unknown Unknown
vasp 00000000004181C9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp 00000000005F3D51 Unknown Unknown Unknown
vasp 0000000000A95BC4 Unknown Unknown Unknown
vasp 0000000000AB1612 Unknown Unknown Unknown
vasp 0000000000437505 Unknown Unknown Unknown
vasp 00000000004182CC Unknown Unknown Unknown
libc.so.6 0000003DA2A1ECDD Unknown Unknown Unknown
vasp 00000000004181C9 Unknown Unknown Unknown
(After waiting for an hour, the termination had still not completed.)
^C
[mpiexec at z0-14] Sending Ctrl-C to processes as requested
[mpiexec at z0-14] Press Ctrl-C again to force abort
^C
Ctrl-C caught... cleaning up processes
[proxy:0:0 at z0-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at z0-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at z0-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
After this, the process is unresponsive again.
I currently have no idea what causes the bad termination. Can anyone
point me toward a way to debug this issue?
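To make the symptom concrete, here is a minimal sketch of the behaviour one would expect from a clean termination: send SIGTERM, wait for a grace period, and only then fall back to SIGKILL. It is a generic POSIX illustration, not MVAPICH or Hydra code. The function name terminate_with_timeout, the grace_seconds parameter, and the stand-in child that ignores SIGTERM are all hypothetical and chosen only for this example.

```python
import signal
import subprocess
import sys


def terminate_with_timeout(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns a (stage, returncode) tuple, where stage is "SIGTERM" if the
    process exited within the grace period and "SIGKILL" otherwise.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return "SIGTERM", proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process did not exit in time; force it down.
        proc.kill()
        return "SIGKILL", proc.wait()


if __name__ == "__main__":
    # Hypothetical stand-in child that ignores SIGTERM, imitating a
    # process that never finishes terminating.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    stage, code = terminate_with_timeout(proc, grace_seconds=1.0)
    print(stage, code)
```

Running the same check against the PID of a stuck process could show whether that process ignores SIGTERM outright or only hangs when the signal arrives through mpiexec, although this is only a guess at where to start looking.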
Best regards,
Tse Kin Fai