[mvapich-discuss] [kftse20031207 at gmail.com: Re: MPI process not terminating]

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Oct 29 07:28:10 EDT 2014


Thanks for the update.  Let us know if you run into any further issues.

----- Forwarded message from Kin Fai Tse <kftse20031207 at gmail.com> -----

Date: Wed, 29 Oct 2014 11:00:35 +0800
From: Kin Fai Tse <kftse20031207 at gmail.com>
To: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
Subject: Re: MPI process not terminating

After ~200 terminations recorded so far, mpirun_rsh has always been able to
kill the program.
We will therefore switch to mpirun_rsh instead of the mpirun we used before.

Thanks for your advice
Kin Fai

2014-10-26 21:41 GMT+08:00 Jonathan Perkins <perkinjo at cse.ohio-state.edu>:

> On Sat, Oct 25, 2014 at 11:46:24PM +0800, Kin Fai Tse wrote:
> > Dear all,
> >
> > In our cluster, we are seeing MPI programs randomly fail to terminate
> > when a termination signal is issued.
> >
> > The MVAPICH versions we tried are 1.9 and 2.1a; both give the same issue.
> >
> > After a user presses Ctrl+C during an MPI run, or uses qdel from Torque
> > PBS to terminate a running MPI program, there is a chance that the
> > termination never completes.
> >
> > Here is one instance produced by running the VASP package:
> >
> > forrtl: error (78): process killed (SIGTERM)
> >
> > Image              PC                Routine            Line        Source
> > vasp               00000000005F3D2B  Unknown            Unknown     Unknown
> > vasp               0000000000A95BC4  Unknown            Unknown     Unknown
> > vasp               0000000000AB1612  Unknown            Unknown     Unknown
> > vasp               0000000000437505  Unknown            Unknown     Unknown
> > vasp               00000000004182CC  Unknown            Unknown     Unknown
> > libc.so.6          0000003DA2A1ECDD  Unknown            Unknown     Unknown
> > vasp               00000000004181C9  Unknown            Unknown     Unknown
> >
> > forrtl: error (78): process killed (SIGTERM)
> >
> > Image              PC                Routine            Line        Source
> > vasp               00000000005F3D51  Unknown            Unknown     Unknown
> > vasp               0000000000A95BC4  Unknown            Unknown     Unknown
> > vasp               0000000000AB1612  Unknown            Unknown     Unknown
> > vasp               0000000000437505  Unknown            Unknown     Unknown
> > vasp               00000000004182CC  Unknown            Unknown     Unknown
> > libc.so.6          0000003DA2A1ECDD  Unknown            Unknown     Unknown
> > vasp               00000000004181C9  Unknown            Unknown     Unknown
> >
> >
> > (We had been waiting for an hour and the termination had still not completed.)
> >
> > ^C
> > [mpiexec at z0-14] Sending Ctrl-C to processes as requested
> >
> > [mpiexec at z0-14] Press Ctrl-C again to force abort
> >
> > ^C
> > Ctrl-C caught... cleaning up processes
> >
> > [proxy:0:0 at z0-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> >
> > [proxy:0:0 at z0-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> >
> > [proxy:0:0 at z0-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> >
> > And the process is once again unresponsive.
> >
> > I currently have no idea what is causing the bad termination; can anyone
> > point me toward debugging this issue?
>
> Hello Tse Kin Fai, is this problem specific to one application, or does it
> happen for all applications?
>
> We've seen certain cases where the application/library gets into a state
> where both mpirun_rsh and hydra are unable to kill the processes cleanly,
> and signal 9 has to be sent instead of signal 15 in order to interrupt
> and kill them.
>
> --
> Jonathan Perkins
>
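
To make the behaviour described above concrete (a process that ignores signal 15
and only dies on signal 9), here is a minimal escalation sketch in Python. It is
an illustration only, not MVAPICH code; the 30-second grace period and the
standalone helper function are assumptions for the example.

    import os
    import signal
    import time

    def kill_with_escalation(pid, grace_seconds=30):
        # Send SIGTERM (signal 15) first; if the process is still alive after
        # the grace period, follow up with SIGKILL (signal 9).  The 30 s grace
        # period is an assumed value for illustration, not an MVAPICH default.
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            return                              # process already gone
        deadline = time.time() + grace_seconds
        while time.time() < deadline:
            try:
                os.kill(pid, 0)                 # signal 0: probe whether pid still exists
            except ProcessLookupError:
                return                          # terminated cleanly on SIGTERM
            time.sleep(1)
        try:
            os.kill(pid, signal.SIGKILL)        # SIGTERM was ignored; force kill
        except ProcessLookupError:
            pass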


2014-10-28 2:10 GMT+08:00 Jonathan Perkins <perkinjo at cse.ohio-state.edu>:

> Hi, our launcher mpirun_rsh has this support built in: it sends SIGTERM
> first and then, if it detects after a few seconds that the processes did
> not terminate properly, follows up with SIGKILL.
>
> Can you try using mpirun_rsh instead of mpiexec to see if you have the
> same problem?
>
> P.S. I've cc'd an internal developer list.
>
> On Mon, Oct 27, 2014 at 11:44:56AM +0800, Kin Fai Tse wrote:
> > We currently suspect it is universal across all programs, but we cannot
> > exclude the possibility that the affected programs all use the same
> > library. The situation occurs rather randomly: roughly 1 in 10-30 jobs
> > that are killed keeps running indefinitely, so it would be difficult to
> > identify such a library.
> >
> > As a temporary fix, I am thinking of sending SIGKILL ~30 s after SIGTERM
> > if the program is still running; is there an official way in MVAPICH for
> > us to do this?
> >
> > On Sunday, October 26, 2014, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
> >
> > > [...]
>
> --
> Jonathan Perkins
>
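
The temporary fix proposed above (sending SIGKILL roughly 30 s after SIGTERM if
the job is still running) could also be approximated with a small wrapper between
the batch system and the launcher. The sketch below is an illustration under
assumptions, not an official MVAPICH mechanism: the mpirun_rsh command line,
hostfile name, process count, and 30-second timeout are all placeholders, and
according to Jonathan's reply mpirun_rsh already performs this escalation itself.

    import signal
    import subprocess
    import sys

    GRACE_SECONDS = 30   # assumed grace period before escalating to SIGKILL

    def main():
        # Placeholder launch command; adjust -np, the hostfile, and the binary.
        child = subprocess.Popen(
            ["mpirun_rsh", "-np", "8", "-hostfile", "hosts", "./vasp"])

        def forward_term(signum, frame):
            # Forward the termination request (e.g. from qdel or Ctrl+C) to the
            # launcher, then escalate to SIGKILL if it does not exit in time.
            child.terminate()                    # SIGTERM
            try:
                child.wait(timeout=GRACE_SECONDS)
            except subprocess.TimeoutExpired:
                child.kill()                     # SIGKILL
            sys.exit(child.wait())

        signal.signal(signal.SIGTERM, forward_term)
        signal.signal(signal.SIGINT, forward_term)
        sys.exit(child.wait())

    if __name__ == "__main__":
        main()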

----- End forwarded message -----

-- 
Jonathan Perkins


More information about the mvapich-discuss mailing list