[mvapich-discuss] How to run mvapich2.0 with PBS

Terrence.LIAO at total.com Terrence.LIAO at total.com
Tue Mar 11 09:20:46 EDT 2008


Dear MVAPICH,
 
I am trying to use MOAB/TORQUE with mvapich2-1.0.1, but I am having a problem. My PBS script runs fine when the executable does NOT use MPI, but I get this kind of error with the MPI executable:

rank 3 in job 1 nod284_55165 caused collective abort of all ranks
exit status of rank 3: return code 1

Also, there is no problem running the MPI executable interactively, and no problem running the job under PBS if I do mpdboot outside the PBS script.
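For context, the steps my PBS script performs (build an mpd node list from PBS_NODEFILE, boot the mpd ring, then mpiexec) look roughly like the sketch below. The resource request, binary name, and install path are placeholders, not the exact script:

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=4
#PBS -q batch
# Sketch only: paths and the executable name are placeholders.
cd "$PBS_O_WORKDIR"

MV2=$HOME/mvapich2-1.0.1/bin

# Total ranks = number of lines in PBS_NODEFILE (4 nodes x 4 ppn = 16 here).
NP=$(wc -l < "$PBS_NODEFILE")

# mpdboot wants one line per node, so de-duplicate the node file.
sort -u "$PBS_NODEFILE" > mpd.hosts
NNODES=$(wc -l < mpd.hosts)

"$MV2/mpdboot" -n "$NNODES" -f mpd.hosts --verbose
"$MV2/mpdtrace"
"$MV2/mpiexec" -machinefile "$PBS_NODEFILE" -np "$NP" ./my_mpi_program
"$MV2/mpdallexit"
```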

Below is my output log from this qsub:

[t02871 at master1 tmp]$ 
Thu Feb 28 10:28:15 CST 2008
uname -n = nod284
..... PBS_O_HOST = master1
..... PBS_O_QUEU = batch
..... PBS_O_WORKDIR = /home/t02871/codes/tmp
..... PBS_ENVIRONMENT = PBS_BATCH
..... PBS_JOBID = 403.master1
..... PBS_JOBNAME = t02871_102814
..... PBS_NODEFILE = /var/spool/torque/aux//403.master1
..... PBS_QUEUE = batch
..... PBS_O_SHELL = /bin/bash
..... cp -f /var/spool/torque/aux//403.master1 ./t02871.102814.hosts
..... create mpd node list from /var/spool/torque/aux//403.master1 to ./t02871.102814.mpdhosts
cat ./t02871.102814.mpdhosts
nod284
nod277
nod283
nod291

..... /home/t02871/mvapich2-1.0.1/bin/mpdboot -n 4 -f ./t02871.102814.mpdhosts --verbose
running mpdallexit on nod284
LAUNCHED mpd on nod284 via
RUNNING: mpd on nod284
LAUNCHED mpd on nod277 via nod284
LAUNCHED mpd on nod283 via nod284
LAUNCHED mpd on nod291 via nod284
RUNNING: mpd on nod283
RUNNING: mpd on nod291
RUNNING: mpd on nod277
..... /home/t02871/mvapich2-1.0.1/bin/mpdtrace
nod284
nod277
nod291
nod283
..... /home/t02871/mvapich2-1.0.1/bin/mpiexec -machinefile ./t02871.102814.hosts -np 16 /home/t02871/codes/mpi_oneway_bandwidth.exeV2 S
rank 3 in job 1 nod284_55165 caused collective abort of all ranks
exit status of rank 3: return code 1
rank 2 in job 1 nod284_55165 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
rank 1 in job 1 nod284_55165 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
..... /home/t02871/mvapich2-1.0.
 
 
Thank you very much.
 
-- Terrence