[mvapich-discuss] qdel doesn't work with grid engine & mvapich2

Xing Wang xwang348 at wisc.edu
Wed Oct 3 19:41:25 EDT 2012


Hi Karl,

Thanks for this helpful advice!


I also believe we're running mvapich2 with loose integration under SGE, since I just followed the standard three steps ("configure + make + make install") during installation. I would also go for the first option, since using "qdel" is unavoidable in our group and we want "qdel" to clean up all the processes on the corresponding nodes. 


So could you give me more details about how to implement this mechanism in the epilogue? Where and what should I add to SGE? If it's convenient, could you share some script examples that work in your setup? We just need to log in to the compute nodes where the qdel-ed job was running and clean up all the processes belonging to that job. 
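
To make sure I'm asking the right thing, here is the kind of epilogue script I have in mind (just a rough, untested sketch on my side; the script name, its path, and the pkill-based cleanup are all my own guesses):

#!/bin/bash
# sge_epilogue.sh -- hypothetical cleanup script, untested sketch.
# Relies on the machine file written by startmpi.sh still being present
# in $TMPDIR when the epilogue runs, and on passwordless ssh between nodes.

MACHINES="$TMPDIR/machines"
[ -r "$MACHINES" ] || exit 0

for host in $(sort -u "$MACHINES"); do
    # skip the node the epilogue itself runs on, so we don't kill ourselves
    # (hostnames in the machine file may need normalizing on some clusters)
    [ "$host" = "$(hostname)" ] && continue
    # crude but effective: kill everything still owned by the job owner on
    # that node; only safe if the user has no other job running there
    ssh "$host" "pkill -9 -u $USER" < /dev/null
done

exit 0

Is that roughly what your epilogue does, or do you handle the per-node cleanup differently?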


Here are my current parallel environment settings. 


[root@turnbull ~]# qconf -sp mvapich2
pe_name mvapich2
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE 
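
My guess is that, once such a script exists, it gets hooked in either as the queue's epilog or in place of the stop_proc_args NONE above, roughly like this (the script path is just a placeholder of mine):

# option 1: set it as the queue epilog (edit with: qconf -mq <queue_name>)
epilog            /opt/gridengine/mpi/sge_epilogue.sh

# option 2: run it as the PE stop procedure (edit with: qconf -mp mvapich2)
stop_proc_args    /opt/gridengine/mpi/sge_epilogue.sh $pe_hostfile

Please correct me if either of these is the wrong place to attach it.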


I really appreciate your kind help!
Sincerely,
Xing

On 12/09/28, Karl Schulz  wrote:
> Hello Xing,
> 
> It looks like you are most likely running with loose integration under SGE which means that a qdel will only terminate the master process on the first compute node associated with your job. 
> 
> A common practice in these situations is to implement a cleanup mechanism in the SGE epilogue: a utility of your own creation that you configure to run at the end of each SGE job, which in this case would terminate any remaining user processes, clean up /tmp, and so on.
> 
> Another option to avoid scheduling new jobs on nodes with left-over processes is to add a load threshold to the queue in question. With this, you can prevent the scheduling of new jobs on hosts which have a runtime load over a value you prescribe. 
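> 
> For example, in the queue configuration (qconf -mq <queue_name>) this is just a matter of setting something along these lines, with a threshold that suits your hosts:
> 
>     load_thresholds    np_load_avg=1.10
> 
> Once a host's normalized load average goes above that value, the scheduler stops dispatching new jobs to it until the load drops again.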
> 
> Hope that helps,
> 
> Karl
> 
> 
> On Sep 28, 2012, at 4:07 PM, Xing Wang wrote:
> 
> > Hi all,
> > 
> > Thanks for reading the email. 
> > I'm running Grid Engine 6.2u5 with mvapich2_1.6.1-p1. We have run into a problem with "qdel" and would sincerely appreciate your help!
> > 
> > The "qdel" command could only delete the jobs ID in the queue, but couldn't clean up the process in the nodes, which means the "deleted" jobs would keep on running in the compute nodes and finally slow down the calculation speed. However, if the jobs finish by themselves without "qdel", there is no such problems.
> > 
> > I noticed this might be related to the "tight" versus "loose" integration of mvapich2 with SGE. Could that be the reason here? Your comments/advice/help would be highly appreciated.
> > (We tried mvapich2_1.8; however, that version had some problems assigning jobs to multiple nodes, so we have to use mvapich2_1.6 here.)
> > 
> > Here are some technical details:
> > I. Hardware: 
> > 1. Xeon(R) CPU E5-2620 0 @ 2.00GHz (Module#: 45) 
> > 2. IB adapter: Mellanox Technologies MT 26428.
> > 
> > II. Software
> > 1. OS: Rocks 6.0.2 (CentOS6.2)
> > 2. Compiler: Intel Fortran & C++ Composer XE 2011
> > 3. MPI: mvapich2_1.6.1-p1
> > 4. Queue: Grid Engine 6.2u5
> > 
> > III. Scripts:
> > 
> > #!/bin/bash
> > #$ -N your_jobname
> > #$ -q <queue_name>
> > #$ -pe mvapich2 <process_num>
> > #$ -l h_rt=48:00:00
> > #$ -cwd
> > # combine SGE standard output and error files
> > #$ -o $JOB_NAME.o$JOB_ID
> > #$ -e $JOB_NAME.e$JOB_ID
> > #$ -V
> > echo "Got $NSLOTS processors."
> > MPI_HOME=/share/apps/mvapich2/1.6.1-p1/bin
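> > # launch the parallel job with mvapich2's mpirun_rsh, using the machine file created by startmpi.sh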
> > $MPI_HOME/mpirun_rsh -hostfile $TMPDIR/machines -n $NSLOTS <command name> <command args>
> > 
> > Thanks for the help!
> > --
> > Sincerely, 
> > WANG, Xing
> > 
> > Graduate Student 
> > Department of Engineering Physics & 
> > Nuclear Engineering, UW-Madison
> > 1509 University Ave.
> > Madison, WI, 53706 
> > 
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Sincerely, 
WANG, Xing

Graduate Student 
Department of Engineering Physics & 
Nuclear Engineering, UW-Madison
Room 137, 1509 University Ave.
Madison, WI, 53706 
(Cell)608-320-7086