[mvapich-discuss] MPI process not terminating
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Sun Oct 26 09:41:26 EDT 2014
On Sat, Oct 25, 2014 at 11:46:24PM +0800, Kin Fai Tse wrote:
> Dear all,
>
> On our cluster, MPI programs randomly fail to terminate after being sent a
> termination signal.
>
> We have tried mvapich versions 1.9 and 2.1a; both show the same issue.
>
> When a user presses Ctrl+C during an MPI run, or uses qdel from Torque PBS
> to terminate a running MPI program, there is a chance that the termination
> never completes.
>
> Here is one instance, produced while running the VASP package:
>
> forrtl: error (78): process killed (SIGTERM)
> Image      PC                Routine  Line     Source
> vasp       00000000005F3D2B  Unknown  Unknown  Unknown
> vasp       0000000000A95BC4  Unknown  Unknown  Unknown
> vasp       0000000000AB1612  Unknown  Unknown  Unknown
> vasp       0000000000437505  Unknown  Unknown  Unknown
> vasp       00000000004182CC  Unknown  Unknown  Unknown
> libc.so.6  0000003DA2A1ECDD  Unknown  Unknown  Unknown
> vasp       00000000004181C9  Unknown  Unknown  Unknown
>
> forrtl: error (78): process killed (SIGTERM)
> Image      PC                Routine  Line     Source
> vasp       00000000005F3D51  Unknown  Unknown  Unknown
> vasp       0000000000A95BC4  Unknown  Unknown  Unknown
> vasp       0000000000AB1612  Unknown  Unknown  Unknown
> vasp       0000000000437505  Unknown  Unknown  Unknown
> vasp       00000000004182CC  Unknown  Unknown  Unknown
> libc.so.6  0000003DA2A1ECDD  Unknown  Unknown  Unknown
> vasp       00000000004181C9  Unknown  Unknown  Unknown
>
> (I waited for an hour and the termination still had not completed.)
>
> ^C
> [mpiexec at z0-14] Sending Ctrl-C to processes as requested
>
> [mpiexec at z0-14] Press Ctrl-C again to force abort
>
> ^C
> Ctrl-C caught... cleaning up processes
>
> [proxy:0:0 at z0-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>
> [proxy:0:0 at z0-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>
> [proxy:0:0 at z0-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>
> After that, the process is unresponsive again.
>
> I currently have no idea what causes the bad termination. Can anyone
> point me toward a way to debug this issue?
Hello Tse Kin Fai, is this problem specific to one application, or does it
happen for all applications?
We've seen in certain cases that the application/library can get into a
state where both mpirun_rsh and hydra are unable to kill the processes
cleanly and have to send signal 9 (SIGKILL) instead of signal 15 (SIGTERM)
in order to interrupt and kill them.
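
To illustrate the signal 15 then signal 9 escalation described above, here
is a rough Python sketch. It is not how mpirun_rsh or hydra implement their
cleanup; the function name terminate_with_escalation and the grace_seconds
parameter are made up for this example, and it only handles a single local
child process on a POSIX system, not ranks spread across nodes.

```python
import signal
import subprocess
import sys


def terminate_with_escalation(proc, grace_seconds=10.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Hypothetical helper for illustration. Returns the name of the signal
    that was sent last.
    """
    proc.send_signal(signal.SIGTERM)  # signal 15: can be caught or ignored
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # signal 9 on POSIX: cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for a stuck process: a child that ignores SIGTERM.
    stuck = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", stuck],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the child has installed its handler
    print(terminate_with_escalation(child, grace_seconds=2.0))
```

In a real job the equivalent would have to reach every rank on every node,
which is the launcher's job; the sketch only shows the ordering of the two
signals and the grace period between them.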
--
Jonathan Perkins