[mvapich-discuss] mpirun_rsh and Chelsio cluster running RHEL5

Hari Subramoni subramon at cse.ohio-state.edu
Wed Nov 4 22:11:22 EST 2009


For mpirun_rsh, all environment variables must be passed on the
command line or through the parameters file. Unlike mpiexec/mpd, it will
not pick them up from the environment.
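
As a rough sketch of what this means in practice (not taken from the user
guide): a small shell helper can collect the MV2_* variables already set in
the environment and place them on the mpirun_rsh command line as NAME=VALUE
pairs, the same form used in the command quoted further down this thread.
The helper name mv2_env_args, the host names and the exported values are
only illustrative. The sketch assumes the values contain no whitespace, and
it echoes the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print every MV2_* variable from the current
# environment as space-separated NAME=VALUE pairs (trailing space included).
mv2_env_args() {
    env | grep '^MV2_' | sort | tr '\n' ' '
}

# Example settings for illustration; normally these are already exported.
export MV2_USE_IWARP_MODE=1
export MV2_USE_RDMA_CM=1

# Build the launch command with the variables placed before the executable.
# It is only printed here, not executed.
cmd="mpirun_rsh -np 2 host1 host2 $(mv2_env_args)./a.out"
echo "$cmd"
```

Running the printed command instead of echoing it should launch the job
with the same MV2_* settings that "env | grep MV2_" shows in the shell.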

Thx,
Hari.

On Wed, 4 Nov 2009, Bryan Putnam wrote:

> On Wed, 4 Nov 2009, Jonathan Perkins wrote:
>
> > On Wed, Nov 04, 2009 at 08:31:20PM -0500, Bryan Putnam wrote:
> > > On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:
> > >
> > > > Bryan - Thanks for the report. This seems to be an issue when PBS is being
> > > > used with mpirun_rsh of mvapich2.  Are you able to launch jobs using
> > > > mpirun_rsh directly as outlined in the mvapich2 user guide in the
> > > > following section:
> > > >
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-170005.2.1
> > >
> > > DK,
> > >
> > > Yes, I see the same problem if I do
> > >
> > > mpirun_rsh -np 2 host1 host2 ./a.out
> > >
> > > Note that PBS is actually not involved at the point I was running the
> > > example. PBS simply set up the file $PBS_NODEFILE (which is a list of
> > > hosts). If I do
> > >
> > > cat $PBS_NODEFILE > ./hostfile
> > > mpirun_rsh -hostfile ./hostfile -np 2 ./a.out
> > >
> > > I see the same problem. Note that mpirun_rsh does work as expected on our
> > > other cluster which is InfiniBand rather than iWARP. Both clusters are
> > > RHEL5. Please let me know if there is additional info you need.
> >
> > I didn't realize that you were using iWARP the last time you posted
> > this.  I think the issue is related to a variable not being set on the
> > mpirun_rsh command line.
> >
> > Try using...
> > mpirun_rsh -np 2 host1 host2 MV2_USE_IWARP_MODE=1 ./a.out
>
> OK, that's interesting. That did appear to fix the problem. I knew that I
> needed to have MV2_USE_IWARP_MODE=1, but I had it in the environment as
>
> coates-adm 1012% env | grep MV
> MVAPICH2_HOME=/apps/rhel5/mvapich2-1.4/64/ib-intel-11.1.038
> MV2_USE_SHMEM_COLL=0
> MV2_USE_RDMA_CM=1
> MV2_USE_IWARP_MODE=1
>
> So, it appears that it actually needs to be on the mpirun_rsh command line
> as well?
>
> Thanks,
> Bryan
>
>
> >
> > Please see the following link for more information:
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-220005.2.6
> >
> > >
> > > Thanks,
> > > Bryan
> > >
> > > >
> > > > Thanks,
> > > >
> > > > DK
> > > >
> > > >
> > > > On Wed, 4 Nov 2009, Bryan Putnam wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > When attempting to use "mpirun_rsh" (from mvapich2-1.4 or mvapich2-1.4rc2)
> > > > > on a Chelsio cluster, I get errors which are reproduced below. Note that I
> > > > > don't see these errors with mvapich2 when using MPD/mpiexec or the
> > > > > mpiexec.hydra launcher from the mpich2 distribution.
> > > > >
> > > > > Thanks,
> > > > > Bryan
> > > > > ===============================================
> > > > >
> > > > > coates-a012 1005% mpif90 hellof.f -o hellof
> > > > > coates-a012 1006% cat $PBS_NODEFILE
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a012 1007% mpirun_rsh -hostfile $PBS_NODEFILE -np 8 ./hellof
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > Exit code -5 signaled from coates-a040
> > > > > MPI process (rank: 4) terminated unexpectedly on
> > > > > coates-a040.rcac.purdue.edu
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > MPI process (rank: 0) terminated unexpectedly on
> > > > > coates-a012.rcac.purdue.edu
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image              PC                Routine            Line        Source
> > > > > libc.so.6          000000341045492F  Unknown               Unknown
> > > > > Unknown
> > > > > libc.so.6          0000003410463C15  Unknown               Unknown
> > > > > Unknown
> > > > > libc.so.6          000000341045EF68  Unknown               Unknown
> > > > > Unknown
> > > > > libmthca-rdmav2.s  00002B000ADB7C26  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180722C  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C76298  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C3C992  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C751DC  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009BFBB6D  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C4B715  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C46116  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C4566A  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B0009C45560  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400F61  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > > Unknown
> > > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400E09  Unknown               Unknown
> > > > > Unknown
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image              PC                Routine            Line        Source
> > > > > libpthread.so.0    0000003410C0D590  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    00000034118090D4  Unknown               Unknown
> > > > > Unknown
> > > > > libcxgb3-rdmav2.s  00002B7457929C78  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180722C  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565E1298  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565A7992  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565E01DC  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B7456566B6D  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565B6715  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565B1116  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565B066A  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002B74565B0560  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400F61  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > > Unknown
> > > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400E09  Unknown               Unknown
> > > > > Unknown
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image              PC                Routine            Line        Source
> > > > > ld-linux-x86-64.s  0000003410014BA6  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  000000341000610F  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  0000003410007D33  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  0000003410010C4D  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  000000341001064C  Unknown               Unknown
> > > > > Unknown
> > > > > libdl.so.2         0000003410800F9A  Unknown               Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown
> > > > > Unknown
> > > > > libdl.so.2         000000341080150D  Unknown               Unknown
> > > > > Unknown
> > > > > libdl.so.2         0000003410800F11  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    0000003411807151  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    0000003411807410  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180825F  Unknown               Unknown
> > > > > Unknown
> > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E8701298  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E86C7992  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E87001DC  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E8686B6D  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E86D6715  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E86D1116  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E86D066A  Unknown               Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1    00002BA2E86D0560  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400F61  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > > Unknown
> > > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > > Unknown
> > > > > hellof             0000000000400E09  Unknown               Unknown
> > > > > Unknown
> > > > > coates-a012 1008%
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > mvapich-discuss mailing list
> > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > >
> > > >
> > > >
> > >
> > >
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
> >
>
>
>


