[mvapich-discuss] MPI process not terminating

Kin Fai Tse kftse20031207 at gmail.com
Sat Oct 25 11:46:24 EDT 2014


Dear all,

On our cluster, MPI programs randomly fail to terminate after they are sent
a termination signal.

We tried mvapich versions 1.9 and 2.1a, and both show the same issue.

When a user presses Ctrl+C during an MPI program run, or uses qdel from
Torque PBS to terminate a running MPI program, there is a chance that the
termination never completes.
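
To make the symptom concrete, here is a minimal Python sketch of a wrapper that sends SIGTERM to a process, waits for a grace period, and falls back to SIGKILL if the process has not exited. The function name terminate_with_timeout, the grace period, and the sleeping child used as a stand-in for an MPI job are all hypothetical and are not part of mvapich, Hydra, or Torque. It only reports whether the one local process it is given hangs on SIGTERM; it does not clean up ranks on other nodes or explain the cause.

```python
import signal
import subprocess
import sys


def terminate_with_timeout(proc, grace_seconds=10.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "terminated" if the process exited within the grace period,
    or "killed" if SIGKILL was needed because it did not.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # The process did not exit after SIGTERM: this is the hang case.
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Hypothetical stand-in for a job launcher; replace with the real command.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(terminate_with_timeout(child, grace_seconds=5.0))
```

A result of "killed" for a given process would indicate that it is the one ignoring or blocking on the termination signal, which may help narrow down where the hang occurs.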

Here is one instance, produced while running the VASP package:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
vasp               00000000005F3D2B  Unknown               Unknown  Unknown
vasp               0000000000A95BC4  Unknown               Unknown  Unknown
vasp               0000000000AB1612  Unknown               Unknown  Unknown
vasp               0000000000437505  Unknown               Unknown  Unknown
vasp               00000000004182CC  Unknown               Unknown  Unknown
libc.so.6          0000003DA2A1ECDD  Unknown               Unknown  Unknown
vasp               00000000004181C9  Unknown               Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
vasp               00000000005F3D51  Unknown               Unknown  Unknown
vasp               0000000000A95BC4  Unknown               Unknown  Unknown
vasp               0000000000AB1612  Unknown               Unknown  Unknown
vasp               0000000000437505  Unknown               Unknown  Unknown
vasp               00000000004182CC  Unknown               Unknown  Unknown
libc.so.6          0000003DA2A1ECDD  Unknown               Unknown  Unknown
vasp               00000000004181C9  Unknown               Unknown  Unknown


(I waited for an hour and the termination still had not completed.)

^C
[mpiexec at z0-14] Sending Ctrl-C to processes as requested
[mpiexec at z0-14] Press Ctrl-C again to force abort
^C
Ctrl-C caught... cleaning up processes
[proxy:0:0 at z0-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at z0-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at z0-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event

After this, the process becomes unresponsive again.

I currently have no idea what causes the bad termination. Can anyone
point me toward how to debug this issue?

Best regards,
Tse Kin Fai