[mvapich-discuss] MPI process not terminating

Sun Oct 26 09:41:26 EDT 2014

On Sat, Oct 25, 2014 at 11:46:24PM +0800, Kin Fai Tse wrote:
> Dear all,
> 
> On our cluster, MPI programs randomly fail to terminate after being sent a
> termination signal.
> 
> We have tried mvapich versions 1.9 and 2.1a, and both show the same
> issue.
> 
> After a user presses Ctrl+C during an MPI program run, or uses qdel from
> Torque PBS to terminate a running MPI program, there is a chance that the
> termination never completes.
> 
> Here is one instance, produced by running the VASP package:
> 
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> vasp               00000000005F3D2B  Unknown               Unknown  Unknown
> vasp               0000000000A95BC4  Unknown               Unknown  Unknown
> vasp               0000000000AB1612  Unknown               Unknown  Unknown
> vasp               0000000000437505  Unknown               Unknown  Unknown
> vasp               00000000004182CC  Unknown               Unknown  Unknown
> libc.so.6          0000003DA2A1ECDD  Unknown               Unknown  Unknown
> vasp               00000000004181C9  Unknown               Unknown  Unknown
> 
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> vasp               00000000005F3D51  Unknown               Unknown  Unknown
> vasp               0000000000A95BC4  Unknown               Unknown  Unknown
> vasp               0000000000AB1612  Unknown               Unknown  Unknown
> vasp               0000000000437505  Unknown               Unknown  Unknown
> vasp               00000000004182CC  Unknown               Unknown  Unknown
> libc.so.6          0000003DA2A1ECDD  Unknown               Unknown  Unknown
> vasp               00000000004181C9  Unknown               Unknown  Unknown
> 
> 
> (After waiting for an hour, the termination had still not completed.)
> 
> ^C
> [mpiexec at z0-14] Sending Ctrl-C to processes as requested
> [mpiexec at z0-14] Press Ctrl-C again to force abort
> ^C
> Ctrl-C caught... cleaning up processes
> [proxy:0:0 at z0-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at z0-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at z0-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> 
> After that, the process remains unresponsive.
> 
> I currently have no idea what causes the bad termination. Can anyone
> point me toward a way to debug this issue?

Hello Tse Kin Fai, is this problem specific to one application, or does it
happen for all applications?

We've seen in certain cases that the application/library can get into a
state where both mpirun_rsh and hydra are unable to kill the processes
cleanly and have to send signal 9 (SIGKILL) instead of signal 15 (SIGTERM)
in order to interrupt and kill them.
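
As an illustration of that escalation only (this is not how mpirun_rsh or
hydra implement it), here is a minimal Python sketch that sends signal 15
first and falls back to signal 9 after a grace period. The child process,
the helper name terminate_with_escalation, and the grace period are all
hypothetical; the child simply ignores SIGTERM to stand in for a process
that cannot be killed cleanly. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a stuck process: a child that ignores SIGTERM.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def terminate_with_escalation(proc, grace_seconds=2.0):
    """Send SIGTERM (15); if proc is still alive after the grace period,
    send SIGKILL (9). Returns the name of the signal that ended the
    process, or None if it exited on its own."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    if proc.returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return signal.Signals(-proc.returncode).name
    return None


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    used = terminate_with_escalation(proc, grace_seconds=1.0)
    proc.stdout.close()
    return used, proc.returncode


if __name__ == "__main__":
    print(demo())
```

In this sketch the child survives the SIGTERM, so the helper has to fall
back to SIGKILL once the grace period expires.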

-- 
Jonathan Perkins