[mvapich-discuss] How to run mvapich2.0 with PBS

Terrence.LIAO at total.com Terrence.LIAO at total.com
Thu Mar 13 07:38:37 EDT 2008


Hi, Joshua and mvapich-discuss readers,
 
My problem is solved.  The cause was the lockable-memory limit on the PBS head node: when mpdboot runs inside the PBS submission script, pbs_mom resets the limit back to 32 KB, even though limits.conf explicitly sets it to 65 MB.  The output of "ulimit -l" in the submitted job showed 32 KB on the head node and 65 MB on the rest of the nodes.  We now set the lockable memory for pbs_mom to unlimited in its configuration, and this solved the problem.
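
For reference, a quick way to see the limit that each node's pbs_mom actually hands to the job is to print it at the top of the job script.  The lines below are only a sketch; the second part assumes pbs_mom is started from an init script, which may differ on other TORQUE installations:

    # inside the PBS job script: show the memlock limit inherited from pbs_mom
    echo "memlock on $(uname -n): $(ulimit -l)"

    # one common fix (assumption: pbs_mom is launched from /etc/init.d/pbs_mom):
    # raise the limit in the daemon's startup environment, then restart pbs_mom
    # so that jobs inherit the new limit
    ulimit -l unlimited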

Thank you very much.

-- Terrence



-----Terrence LIAO/HOU/US/EP/Corp wrote: -----


To: Joshua Bernstein <jbernstein at penguincomputing.com>
From: Terrence LIAO/HOU/US/EP/Corp
Date: 03/11/2008 04:24PM
cc: Terrence.LIAO at total.com, mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] How to run mvapich2.0 with PBS


Hi, Joshua, 
I have already tried that.  The same problem though. 

Thanks. 

-- Terrence 
-------------------------------------------------------- 
Terrence Liao, Ph.D. 
Research Computer Scientist 
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC 
1201 Louisiana, Suite 1800, Houston, TX 77002  
Tel: 713.647.3498  Fax: 713.647.3638 
Email:  terrence.liao at total.com 



-----Joshua Bernstein <jbernstein at penguincomputing.com> wrote: ----- 


To: Terrence.LIAO at total.com 
From: Joshua Bernstein <jbernstein at penguincomputing.com> 
Date: 03/11/2008 02:15PM 
cc: mvapich-discuss at cse.ohio-state.edu 
Subject: Re: [mvapich-discuss] How to run mvapich2.0 with PBS 

Hi Terrence, 

     You'll want to use the mpiexec from OSC, rather than the mpiexec 
that comes with MVAPICH2. 

http://www.osc.edu/~pw/mpiexec/index.php 

This version of mpiexec knows how to start processes using the "tm" (Task 
Management) interface of PBS.  It honors the host list from $PBS_NODEFILE 
and lets PBS manage and monitor the child MPI processes directly, so no 
mpdboot is needed. 
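
As a rough sketch (assuming an allocation of four nodes with four slots each, 
matching the 16-rank run in your log; depending on how that mpiexec was built, 
a -comm option matching MVAPICH2 may also be needed), the job script then 
reduces to something like: 

    #PBS -l nodes=4:ppn=4
    cd $PBS_O_WORKDIR
    # host list and process placement come from PBS through the tm interface,
    # so no mpdboot and no -machinefile are required
    mpiexec -np 16 ./mpi_oneway_bandwidth.exeV2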

-Joshua Bernstein 
Software Engineer 
Penguin Computing 

Terrence.LIAO at total.com wrote: 
> Dear MVAPICH, 
>   
> I am trying to use MOAB/TORQUE with mvapich2-1.0.1, but I have a problem.   
> My PBS script runs fine when the executable does NOT use MPI, but I get 
> this kind of error from the MPI executable: 
> 
> rank 3 in job 1 nod284_55165 caused collective abort of all ranks 
> exit status of rank 3: return code 1 
> 
> Also, there is no problem running the MPI executable interactively, and 
> no problem running the job under PBS if I do mpdboot outside the PBS script. 
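> 
> For context, the MPI part of my PBS script does roughly the following 
> (only a sketch; the generated file names and full paths appear in the 
> log below): 
> 
>     # build a deduplicated mpd host list from the nodes PBS assigned
>     sort -u $PBS_NODEFILE > ./mpdhosts
>     # start one mpd per node, then run 16 ranks across the 4 nodes
>     mpdboot -n 4 -f ./mpdhosts --verbose
>     mpdtrace
>     mpiexec -machinefile $PBS_NODEFILE -np 16 ./mpi_oneway_bandwidth.exeV2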
> 
> Below is my output log from this qsub: 
> 
> [t02871 at master1 tmp]$ 
> Thu Feb 28 10:28:15 CST 2008 
> uname -n = nod284 
> ..... PBS_O_HOST = master1 
> ..... PBS_O_QUEU = batch 
> ..... PBS_O_WORKDIR = /home/t02871/codes/tmp 
> ..... PBS_ENVIRONMENT = PBS_BATCH 
> ..... PBS_JOBID = 403.master1 
> ..... PBS_JOBNAME = t02871_102814 
> ..... PBS_NODEFILE = /var/spool/torque/aux//403.master1 
> ..... PBS_QUEUE = batch 
> ..... PBS_O_SHELL = /bin/bash 
> ..... cp -f /var/spool/torque/aux//403.master1 ./t02871.102814.hosts 
> ..... create mpd node list from /var/spool/torque/aux//403.master1 to 
> ./t02871.102814.mpdhosts 
> cat ./t02871.102814.mpdhosts 
> nod284 
> nod277 
> nod283 
> nod291 
> 
> ..... /home/t02871/mvapich2-1.0.1/bin/mpdboot -n 4 -f 
> ./t02871.102814.mpdhosts --verbose 
> running mpdallexit on nod284 
> LAUNCHED mpd on nod284 via 
> RUNNING: mpd on nod284 
> LAUNCHED mpd on nod277 via nod284 
> LAUNCHED mpd on nod283 via nod284 
> LAUNCHED mpd on nod291 via nod284 
> RUNNING: mpd on nod283 
> RUNNING: mpd on nod291 
> RUNNING: mpd on nod277 
> ..... /home/t02871/mvapich2-1.0.1/bin/mpdtrace 
> nod284 
> nod277 
> nod291 
> nod283 
> ..... /home/t02871/mvapich2-1.0.1/bin/mpiexec -machinefile 
> ./t02871.102814.hosts -np 16 /home/t02871/codes/mpi_oneway_bandwidth.exeV2 S 
> rank 3 in job 1 nod284_55165 caused collective abort of all ranks 
> exit status of rank 3: return code 1 
> rank 2 in job 1 nod284_55165 caused collective abort of all ranks 
> exit status of rank 2: killed by signal 9 
> rank 1 in job 1 nod284_55165 caused collective abort of all ranks 
> exit status of rank 1: killed by signal 9 
> ..... /home/t02871/mvapich2-1.0. 
>   
>   
> Thank you very much. 
>   
> -- Terrence 
> 
> 