[mvapich-discuss] qdel doesn't work with grid engine & mvapich2

Karl Schulz karl at tacc.utexas.edu
Fri Oct 5 07:23:44 EDT 2012


Hello Xing,

You can configure an epilog script at the queue level.  For example:

$ qconf -sq normal | grep epilog
epilog                /share/sge6.2/default/pe_scripts/epilog.sh

This is just a shell script. Note that it will run at the end of each job on the master compute host associated with the job.  Assuming you are using an ssh-based job-launch mechanism, you can use the $PE_HOSTFILE environment variable to identify all the hosts associated with a job, ssh to them, and terminate any remaining user processes.
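
For illustration only (untested, and assuming password-less ssh between hosts and that any stray processes on the nodes belong to the job owner), such an epilog could look roughly like this:

#!/bin/bash
# epilog sketch: clean up left-over user processes on every host of the job.
# Each line of $PE_HOSTFILE is: <hostname> <slots> <queue> <processor range>
[ -n "$PE_HOSTFILE" ] || exit 0     # nothing to do for serial jobs
for host in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
    # kill anything still running under the job owner's account on that host
    ssh -o BatchMode=yes "$host" "pkill -9 -u $SGE_O_LOGNAME" < /dev/null
done
exit 0

You would point the queue's epilog attribute at the script (e.g. via "qconf -mq normal") and adjust the kill criteria if your nodes can be shared by several jobs of the same user.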

The other suggestion I mentioned to avoid scheduling to hosts which have zombie user processes is to add a load-threshold to your queue.  For example:

$ qconf -sq normal | grep load
load_thresholds       load_short=0.1

In this case, the normal queue will not schedule new jobs to hosts whose short-term load is > 0.1.  See "man queue_conf" for more details, but this effectively ensures you are scheduling only to quiescent hosts.  Of course, if you don't also include a prolog mechanism that cleans up any left-over user processes, adding a load_threshold like this can shrink your pool of eligible hosts whenever left-over processes are generating load.  But you can easily catch this condition by periodically scanning for hosts which have no SGE slots taken but continue to show load.
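
For reference, that threshold can be added either by editing the queue interactively with "qconf -mq normal" or non-interactively with something along these lines (adjust the value to whichever load metric you prefer, e.g. np_load_avg):

$ qconf -mattr queue load_thresholds load_short=0.1 normal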

As an example, assuming that you have compute hosts with 16 cores, a "qhost -l slots=16" will show all hosts which do not have any jobs actively scheduled (and their load).  If the load remains high on these hosts, they likely have zombie user processes and need to be cleaned up before the next job is scheduled.  A quick one-liner to look for such hosts with a load greater than 1.0 might be:

$ qhost -l slots=16 | awk '{ if ($4 > 1.0) print $4" "$1}' | sort -n
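
If the qhost header lines show up in that output, a slightly more defensive variant (adjust as needed for your qhost version) would be:

$ qhost -l slots=16 | awk 'NR>3 && $4+0 > 1.0 {print $4, $1}' | sort -n

which skips the first three header/global lines and forces a numeric comparison on the load column.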

Hope that helps,

Karl

On Oct 3, 2012, at 6:41 PM, Xing Wang wrote:

> Hi Karl,
> 
> Thanks for this helpful advice!
> 
> I also believe we're running mvapich2 with loose integration under SGE, since I just followed the standard three steps ("configure + make + make install") during installation. I would also go for the first option, since using "qdel" is inevitable in our group and we hope "qdel" cleans up all the processes on the corresponding nodes.
> 
> So could you give me more details about how to implement this mechanism in the epilog process? Where and what should I add to SGE? If it's convenient, could you give me some script examples that work in your setup? We just need to log in to the compute nodes where the deleted job was running and clean up all the processes belonging to that job.
> 
> Here is my current parallel environment settings. 
> 
> [root@turnbull ~]# qconf -sp mvapich2
> pe_name             mvapich2
> slots               9999
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /opt/gridengine/mpi/startmpi.sh $pe_hostfile
> stop_proc_args      NONE
> allocation_rule     $fill_up
> control_slaves      TRUE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  TRUE
> 
> I really appreciate your kind help!
> Sincerely,
> Xing
> 
> On 12/09/28, Karl Schulz wrote:
> > Hello Xing,
> > 
> > It looks like you are most likely running with loose integration under SGE which means that a qdel will only terminate the master process on the first compute node associated with your job. 
> > 
> > A common practice in these situations is to implement a cleanup mechanism in the SGE epilog: a utility of your own creation which you configure to run at the end of each SGE job and which, in this case, would ensure that all remaining user processes are terminated, /tmp is cleaned up, etc.
> > 
> > Another option to avoid scheduling new jobs on nodes with left-over processes is to add a load threshold to the queue in question. With this, you can prevent the scheduling of new jobs on hosts which have a runtime load over a value you prescribe. 
> > 
> > Hope that helps,
> > 
> > Karl
> > 
> > 
> > On Sep 28, 2012, at 4:07 PM, Xing Wang wrote:
> > 
> > > Hi all,
> > > 
> > > Thanks for reading the email. 
> > > I'm running Grid Engine 6.2u5 with mvapich2_1.6.1-p1. We have run into a problem with "qdel" and would sincerely appreciate your help!
> > > 
> > > The "qdel" command could only delete the jobs ID in the queue, but couldn't clean up the process in the nodes, which means the "deleted" jobs would keep on running in the compute nodes and finally slow down the calculation speed. However, if the jobs finish by themselves without "qdel", there is no such problems.
> > > 
> > > I noticed there might be an issue with "tight" versus "loose" integration of mvapich2. Could that be the reason here? Your comments/advice/help would be highly appreciated.
> > > (We tried mvapich2_1.8; however, that version has some problems assigning jobs to multiple nodes, so we have to use mvapich2_1.6 here.)
> > > 
> > > Here are some technical details:
> > > I. Hardware: 
> > > 1. Xeon(R) CPU E5-2620 0 @ 2.00GHz (model: 45) 
> > > 2. IB adapter: Mellanox Technologies MT26428.
> > > 
> > > II. Software
> > > 1. OS: Rocks 6.0.2 (CentOS6.2)
> > > 2. Compiler: Intel Fortran & C++ Composer XE 2011
> > > 3. MPI: mvapich2_1.6.1-p1
> > > 4. Queue: Grid Engine 6.2u5
> > > 
> > > III. Scripts:
> > > 
> > > #!/bin/bash
> > > #$ -N your_jobname
> > > #$ -q <queue_name>
> > > #$ -pe mvapich2 <process_num>
> > > #$ -l h_rt=48:00:00
> > > #$ -cwd
> > > # combine SGE standard output and error files
> > > #$ -o $JOB_NAME.o$JOB_ID
> > > #$ -e $JOB_NAME.e$JOB_ID
> > > #$ -V
> > > echo "Got $NSLOTS processors."
> > > MPI_HOME=/share/apps/mvapich2/1.6.1-p1/bin
> > > $MPI_HOME/mpirun_rsh -hostfile $TMPDIR/machines -n $NSLOTS <command name> <command args>
> > > 
> > > Thanks for the help!
> > > --
> > > Sincerely, 
> > > WANG, Xing
> > > 
> > > Graduate Student 
> > > Department of Engineering Physics & 
> > > Nuclear Engineering, UW-Madison
> > > 1509 University Ave.
> > > Madison, WI, 53706 
> > > 
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> --
> Sincerely, 
> WANG, Xing
> 
> Graduate Student 
> Department of Engineering Physics & 
> Nuclear Engineering, UW-Madison
> Room 137, 1509 University Ave.
> Madison, WI, 53706 
> (Cell)608-320-7086



