[mvapich-discuss] How to run mvapich2.0 with PBS
Terrence.LIAO at total.com
Terrence.LIAO at total.com
Tue Mar 11 17:24:38 EDT 2008
Hi, Joshua,
I have already tried that. The same problem though.
Thanks.
-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao at total.com
-----Joshua Bernstein <jbernstein at penguincomputing.com> wrote: -----
To: Terrence.LIAO at total.com
From: Joshua Bernstein <jbernstein at penguincomputing.com>
Date: 03/11/2008 02:15PM
cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] How to run mvapich2.0 with PBS
Hi Terrence,
You'll want to be using the mpiexec from OSU, rather than the mpiexec
that comes with MVAPICH2.
http://www.osc.edu/~pw/mpiexec/index.php
This version of mpiexec understands how to start up processes using the
"tm", or Task Management, interface of PBS. It will honor the host list
from $PBS_NODEFILE, and will correctly allow PBS to manage and
monitor the child MPI processes.
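For reference, a minimal PBS script using the OSU mpiexec might look like the sketch below. The install path and executable name are placeholders, not taken from this thread:

```shell
#!/bin/bash
#PBS -N mpi_test
#PBS -l nodes=4:ppn=4
#PBS -q batch

cd $PBS_O_WORKDIR

# The OSU mpiexec launches ranks through the PBS TM interface, so no
# mpdboot ring and no -machinefile are needed: it takes the node list
# from the job itself, and PBS can track and clean up the ranks.
# /opt/osu-mpiexec/bin is a hypothetical install location.
/opt/osu-mpiexec/bin/mpiexec -np 16 ./my_mpi_program
```

Because the ranks are children of pbs_mom rather than of a detached mpd daemon, qdel and walltime enforcement also work as expected.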
-Joshua Bernstein
Software Engineer
Penguin Computing
Terrence.LIAO at total.com wrote:
> Dear MVAPICH,
>
> I am trying to use MOAB/TORQUE with mvapich2-1.0.1, but am having a problem.
> My PBS script runs fine when the executable does NOT use MPI, but I get
> this kind of error with the MPI executable:
>
> rank 3 in job 1 nod284_55165 caused collective abort of all ranks
> exit status of rank 3: return code 1
>
> Also, there is no problem running the MPI executable interactively, and
> no problem running the job under PBS if I do mpdboot outside the PBS script.
>
> Below is my output log from this qsub:
>
> [t02871 at master1 tmp]$
> Thu Feb 28 10:28:15 CST 2008
> uname -n = nod284
> ..... PBS_O_HOST = master1
> ..... PBS_O_QUEU = batch
> ..... PBS_O_WORKDIR = /home/t02871/codes/tmp
> ..... PBS_ENVIRONMENT = PBS_BATCH
> ..... PBS_JOBID = 403.master1
> ..... PBS_JOBNAME = t02871_102814
> ..... PBS_NODEFILE = /var/spool/torque/aux//403.master1
> ..... PBS_QUEUE = batch
> ..... PBS_O_SHELL = /bin/bash
> ..... cp -f /var/spool/torque/aux//403.master1 ./t02871.102814.hosts
> ..... create mpd node list from /var/spool/torque/aux//403.master1 to
> ./t02871.102814.mpdhosts
> cat ./t02871.102814.mpdhosts
> nod284
> nod277
> nod283
> nod291
>
> ..... /home/t02871/mvapich2-1.0.1/bin/mpdboot -n 4 -f
> ./t02871.102814.mpdhosts --verbose
> running mpdallexit on nod284
> LAUNCHED mpd on nod284 via
> RUNNING: mpd on nod284
> LAUNCHED mpd on nod277 via nod284
> LAUNCHED mpd on nod283 via nod284
> LAUNCHED mpd on nod291 via nod284
> RUNNING: mpd on nod283
> RUNNING: mpd on nod291
> RUNNING: mpd on nod277
> ..... /home/t02871/mvapich2-1.0.1/bin/mpdtrace
> nod284
> nod277
> nod291
> nod283
> ..... /home/t02871/mvapich2-1.0.1/bin/mpiexec -machinefile
> ./t02871.102814.hosts -np 16 /home/t02871/codes/mpi_oneway_bandwidth.exeV2 S
> rank 3 in job 1 nod284_55165 caused collective abort of all ranks
> exit status of rank 3: return code 1
> rank 2 in job 1 nod284_55165 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> rank 1 in job 1 nod284_55165 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
> ..... /home/t02871/mvapich2-1.0.
>
>
> Thank you very much.
>
> -- Terrence
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss