[mvapich-discuss] run error when use pbs

Jaidev Sridhar sridharj at cse.ohio-state.edu
Wed Dec 17 15:15:23 EST 2008


Thanks for letting us know that it works now. We'll consider putting this
in the FAQ.

-Jaidev
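
For anyone who finds this thread with the same "Child exited abnormally!"
error under PBS, the fix described below can be sketched as a small
job-script helper. This is only a sketch under assumptions: it assumes
mpirun_rsh is on the PATH, that rsh between nodes is permitted at your
site, and that the PBS node file lists one line per allocated slot. The
function names count_slots and build_launch_cmd are made up for
illustration and are not part of MVAPICH or PBS.

```shell
#!/bin/sh
# Count the processor slots PBS allocated.
# Assumption: the node file has one line per slot.
count_slots() {
    # $1 = path to the node file (normally $PBS_NODEFILE)
    wc -l < "$1" | tr -d ' '
}

# Print the launch command. The -rsh flag asks mpirun_rsh to start
# remote processes with rsh instead of ssh.
build_launch_cmd() {
    # $1 = node file, $2 = path to the MPI application
    np=$(count_slots "$1")
    echo "mpirun_rsh -rsh -np $np -hostfile $1 $2"
}
```

Inside a job script, one would typically cd to $PBS_O_WORKDIR and then run
eval "$(build_launch_cmd "$PBS_NODEFILE" /path/to/app)". Printing the
command first makes it easy to check in the job's output file what was
actually launched.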

On Wed, Dec 17, 2008 at 01:38:16PM +0900, luxingjing wrote:
 > 
 >    Hi,
 > 
 >    I am sorry for not having informed you that the problem is resolved.
 > 
 >    There is nothing wrong with mvapich1.1; the problem was caused by PBS.
 >    PBS does not allow users to ssh to other nodes, so instead we have to
 >    launch with rsh, like below:
 > 
 >     mpirun_rsh -rsh -np ....
 > 
 >    Now it works.
 > 
 >    Thank you for your advice.
 > 
 >    -Eric
 > 
 >    -----Original Message-----
 >    From: 'Jaidev Sridhar' [mailto:sridharj at cse.ohio-state.edu]
 >    Sent: Wednesday, December 17, 2008 6:15 AM
 >    To: luxingjing
 >    Subject: Re: [mvapich-discuss] run error when use pbs
 > 
 > 
 >    Looks like the cpi application is crashing. Can you set
 >    'ulimit -c unlimited' in your bash profile and see if we get any
 >    core dumps?
 > 
 > 
 >    -Jaidev
 > 
 > 
 >    On Tue, Dec 16, 2008 at 11:13:35AM +0900, luxingjing wrote:
 > 
 >     >
 > 
 >     >    Hi,
 > 
 >     >
 > 
 >     >    Thank you for your reply, but that does not seem to be the problem.
 > 
 >     >
 > 
 >     >    Now my pbs script is:
 > 
 >     >
 > 
 >     >
 > 
 >     >    #!/bin/sh
 > 
 >     >
 > 
 >     >    #PBS -N cpi
 > 
 >     >
 > 
 >     >    #PBS -l nodes=1:ppn=1
 > 
 >     >
 > 
 >     >    #PBS -q dawning
 > 
 >     >
 > 
 >     >    #PBS -o cpi1
 > 
 >     >
 > 
 >     >    #PBS -e cpi1.e
 > 
 >     >
 > 
 >     >    cd $PBS_O_WORKDIR
 > 
 >     >
 > 
 >     >    declare -a no
 > 
 >     >
 > 
 >     >    count=0
 > 
 >     >
 > 
 >     >    for i in $( uniq $PBS_NODEFILE )
 > 
 >     >
 > 
 >     >    do
 > 
 >     >
 > 
 >     >       echo $i
 > 
 >     >
 > 
 >     >       echo $count
 > 
 >     >
 > 
 >     >       no[$count]=$i
 > 
 >     >
 > 
 >     >       count=$(($count + 1))
 > 
 >     >
 > 
 >     >    done
 > 
 >     >
 > 
 >     >    export UPC_NODES="${no[0]} ${no[1]} ${no[2]} ${no[3]}"
 > 
 >     >
 > 
 >     >    #PBS -V
 > 
 >     >
 > 
 >     >    exec 1>/home/paraorc/lxj/test/hosts
 > 
 >     >
 > 
 >     >    echo "${no[0]}"
 > 
 >     >
 > 
 >     >    exec 1<&-
 > 
 >     >
 > 
 >     >
 > 
 >     >    /home/paraorc/lxj/mvapich1.1/bin/mpirun_rsh -np 1 -hostfile \
 >     >        /home/paraorc/lxj/test/hosts /home/paraorc/lxj/test/cpi
 > 
 >     >
 > 
 >     >    Bash
 > 
 >     >
 > 
 >     >    But the error is still there. The error is:
 > 
 >     >
 > 
 >     >    Child exited abnormally!
 > 
 >     >
 > 
 >     >    Killing remote processes...DONE
 > 
 >     >
 > 
 >     >    The network is InfiniBand with openfabrics1.1, and mvapich is
 >     >    1.1 too. I wonder whether mvapich1.1 supports openfabrics-1.1.
 > 
 >     >
 > 
 >     >    Also, when I installed mvapich, I removed the CFLAG CDXRC because
 >     >    of the errors below. Does it matter?
 > 
 >     >
 > 
 >     >    viainit.c: In function `create_srq':
 >     >    viainit.c:427: warning: assignment makes pointer from integer without a cast
 >     >    viainit.c:428: error: structure has no member named `xrc_srq_num'
 >     >    viainit.c:428: error: structure has no member named `xrc_srq_num'
 >     >    viainit.c: In function `xrc_init':
 >     >    viainit.c:1144: error: `IBV_DEVICE_XRC' undeclared (first use in this function)
 >     >    viainit.c:1144: error: (Each undeclared identifier is reported only once
 >     >    viainit.c:1144: error: for each function it appears in.)
 >     >    viainit.c:1161: warning: assignment makes pointer from integer without a cast
 >     >    make[3]: *** [viainit.o] Error 1
 >     >    Exit status from make was 2
 >     >    make[2]: *** [mpilib] Error 1
 >     >    make[1]: *** [mpi-modules] Error 2
 >     >    make: *** [mpi] Error 2
 >     >    Failure in building MVAPICH.
 > 
 >     >
 > 
 >     >
 > 
 >     >    I have worked on this problem all day, but I have not resolved
 >     >    it yet. Thank you for your help.
 > 
 >     >
 > 
 >     >
 > 
 >     >    -Eric
 > 
 >     >
 > 
 >     >
 > 
 >     >    -----Original Message-----
 > 
 >     >    From: Jaidev Sridhar [mailto:sridharj at cse.ohio-state.edu]
 > 
 >     >    Sent: Tuesday, December 16, 2008 11:43 AM
 > 
 >     >    To: luxingjing
 > 
 >     >    Cc: mvapich-discuss at cse.ohio-state.edu
 > 
 >     >    Subject: Re: [mvapich-discuss] run error when use pbs
 > 
 >     >
 > 
 >     >
 > 
 >     >    Hi,
 > 
 >     >
 > 
 >     >
 > 
 >     >    Your command line is wrong. You should use -
 > 
 >     >
 > 
 >     >       mpirun_rsh -np x -hostfile /path/to/file /path/to/app
 > 
 >     >
 > 
 >     >
 > 
 >     >    -Jaidev
 > 
 >     >
 > 
 >     >
 > 
 >     >    On Monday 15 December 2008 06:42 AM, luxingjing wrote:
 > 
 >     >
 > 
 >     >    > Hi,
 > 
 >     >
 > 
 >     >    >
 > 
 >     >
 > 
 >     >    >   Recently, I installed mvapich1.1; the network is InfiniBand.
 >     >    > After that, I installed berkeley_upc-2.8, whose conduit is
 >     >    > infiniband-ibv, and upcrun uses mpirun (mvapich) to lay out the
 >     >    > threads.
 >     >    >
 >     >    >   I write the nodes from $PBS_NODEFILE to a file named hosts, and
 >     >    > MPIRUN_CMD is:
 >     >    >
 >     >    > MPIRUN_CMD="${MPIRUN_CMD:-/home/paraorc/lxj/mvapich1.1/bin/mpirun -machinefile /home/paraorc/lxj/test/hosts -np %N %C}"
 > 
 >     >
 > 
 >     >    >
 > 
 >     >    > But when I qsub hello.pb, the file hello.e contains these errors:
 > 
 >     >    >
 > 
 >     >    > Child exited abnormally!
 > 
 >     >
 > 
 >     >    >
 > 
 >     >
 > 
 >     >    > Killing remote processes...DONE
 > 
 >     >
 > 
 >     >    >
 > 
 >     >    > I would appreciate your help. Thank you!
 > 
 >     >
 > 
 >     >    >
 > 
 >     >
 > 
 >     >    > Eric
 > 
 > 
 >    --
 > 
 >    You can rent this space for only $5 a week.
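
Regarding the core-dump suggestion quoted above, here is a minimal sketch
of how it might be applied inside the job script itself rather than in
the bash profile. It assumes the shell supports "ulimit -c" (bash and
dash do) and that the hard limit on the compute node permits raising the
soft limit. The helper name enable_core_dumps is hypothetical.

```shell
#!/bin/sh
# Try to lift the core-file size limit for this shell and its children,
# then report the limit that is actually in effect.
enable_core_dumps() {
    if ulimit -c unlimited 2>/dev/null; then
        echo "core limit: $(ulimit -c)"
    else
        # Raising the limit was refused (for example, hard limit too low).
        echo "core limit unchanged: $(ulimit -c)"
    fi
}
```

Calling enable_core_dumps directly (not inside a command substitution)
before the mpirun_rsh line changes the limit for the launcher started
from that shell. Processes started on other nodes get their limits from
the remote shell's own startup files, which is presumably why the bash
profile was suggested.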
