[mvapich-discuss] mpirun_rsh and Chelsio cluster running RHEL5
Hari Subramoni
subramon at cse.ohio-state.edu
Wed Nov 4 22:11:22 EST 2009
For mpirun_rsh, all environment variables must be passed on the
command line or through the parameters file. It will not take them from the
environment the way mpiexec/mpd does.
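
As a rough sketch of what that means in practice, the snippet below gathers
every MV2_* variable already set in the shell and prints an mpirun_rsh command
line that passes them explicitly. The helper name mv2_args is hypothetical, and
host1, host2 and ./a.out are just the placeholders used elsewhere in this
thread. It assumes the variable values contain no whitespace, and it only
echoes the command (a dry run) rather than launching anything.

```shell
#!/bin/sh
# Hypothetical helper: list all MV2_* settings from the current environment
# as space-separated VAR=value pairs (each pair followed by one space).
# Assumption: values contain no whitespace or newlines.
mv2_args() {
    env | grep '^MV2_' | sort | tr '\n' ' '
}

# Dry run: print the command line instead of executing it.
# host1, host2 and ./a.out are placeholders, not real names.
echo "mpirun_rsh -np 2 host1 host2 $(mv2_args)./a.out"
```

If MV2_USE_IWARP_MODE=1 is exported in the shell, the printed line includes
MV2_USE_IWARP_MODE=1 just before ./a.out, which is the form mpirun_rsh expects.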
Thx,
Hari.
On Wed, 4 Nov 2009, Bryan Putnam wrote:
> On Wed, 4 Nov 2009, Jonathan Perkins wrote:
>
> > On Wed, Nov 04, 2009 at 08:31:20PM -0500, Bryan Putnam wrote:
> > > On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:
> > >
> > > > Bryan - Thanks for the report. This seems to be an issue when PBS is being
> > > > used with mpirun_rsh of mvapich2. Are you able to launch jobs using
> > > > mpirun_rsh directly as outlined in the mvapich2 user guide in the
> > > > following section:
> > > >
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-170005.2.1
> > >
> > > DK,
> > >
> > > Yes, I see the same problem if I do
> > >
> > > mpirun_rsh -np 2 host1 host2 ./a.out
> > >
> > > Note that PBS is actually not involved at the point I was running the
> > > example. PBS simply set up the file $PBS_NODEFILE (which is a list of
> > > hosts). If I do
> > >
> > > cat $PBS_NODEFILE > ./hostfile
> > > mpirun_rsh -hostfile ./hostfile -np 2 ./a.out
> > >
> > > I see the same problem. Note that mpirun_rsh does work as expected on our
> > > other cluster which is Infiniband rather than iWARP. Both clusters are
> > > RHEL5. Please let me know if there is additional info you need.
> >
> > I didn't realize that you were using iWARP the last time you posted
> > this. I think the issue is related to a variable not being set on the
> > mpirun_rsh command line.
> >
> > Try using...
> > mpirun_rsh -np 2 host1 host2 MV2_USE_IWARP_MODE=1 ./a.out
>
> OK that's interesting. That did appear to fix the problem. I knew that I
> needed to have MV2_USE_IWARP_MODE=1, but I had it in the environment as
>
> coates-adm 1012% env | grep MV
> MVAPICH2_HOME=/apps/rhel5/mvapich2-1.4/64/ib-intel-11.1.038
> MV2_USE_SHMEM_COLL=0
> MV2_USE_RDMA_CM=1
> MV2_USE_IWARP_MODE=1
>
> So, it appears that it actually needs to be on the mpirun_rsh command line
> as well?
>
> Thanks,
> Bryan
>
>
> >
> > Please see the following link for more information:
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-220005.2.6
> >
> > >
> > > Thanks,
> > > Bryan
> > >
> > > >
> > > > Thanks,
> > > >
> > > > DK
> > > >
> > > >
> > > > On Wed, 4 Nov 2009, Bryan Putnam wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > When attempting to use "mpirun_rsh" (from mvapich2-1.4 or mvapich2-1.4rc2)
> > > > > on a Chelsio cluster, I get errors which are reproduced below. Note that I
> > > > > don't see these errors with mvapich2 when using MPD/mpiexec or the
> > > > > mpiexec.hydra launcher from the mpich2 distribution.
> > > > >
> > > > > Thanks,
> > > > > Bryan
> > > > > ===============================================
> > > > >
> > > > > coates-a012 1005% mpif90 hellof.f -o hellof
> > > > > coates-a012 1006% cat $PBS_NODEFILE
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a012
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a040
> > > > > coates-a012 1007% mpirun_rsh -hostfile $PBS_NODEFILE -np 8 ./hellof
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > Exit code -5 signaled from coates-a040
> > > > > MPI process (rank: 4) terminated unexpectedly on
> > > > > coates-a040.rcac.purdue.edu
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > Fatal error in MPI_Init:
> > > > > Error message texts are not available
> > > > > MPI process (rank: 0) terminated unexpectedly on
> > > > > coates-a012.rcac.purdue.edu
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image PC Routine Line Source
> > > > > libc.so.6 000000341045492F Unknown Unknown
> > > > > Unknown
> > > > > libc.so.6 0000003410463C15 Unknown Unknown
> > > > > Unknown
> > > > > libc.so.6 000000341045EF68 Unknown Unknown
> > > > > Unknown
> > > > > libmthca-rdmav2.s 00002B000ADB7C26 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180722C Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 00000034118082A3 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180707B Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C76298 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C3C992 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C751DC Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009BFBB6D Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C4B715 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C46116 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C4566A Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B0009C45560 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400F61 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400EFC Unknown Unknown
> > > > > Unknown
> > > > > libc.so.6 000000341041D994 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400E09 Unknown Unknown
> > > > > Unknown
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image PC Routine Line Source
> > > > > libpthread.so.0 0000003410C0D590 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 00000034118090D4 Unknown Unknown
> > > > > Unknown
> > > > > libcxgb3-rdmav2.s 00002B7457929C78 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180722C Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 00000034118082A3 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180707B Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565E1298 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565A7992 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565E01DC Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B7456566B6D Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565B6715 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565B1116 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565B066A Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002B74565B0560 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400F61 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400EFC Unknown Unknown
> > > > > Unknown
> > > > > libc.so.6 000000341041D994 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400E09 Unknown Unknown
> > > > > Unknown
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > Image PC Routine Line Source
> > > > > ld-linux-x86-64.s 0000003410014BA6 Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 000000341000610F Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 0000003410007D33 Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 0000003410010C4D Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 000000341000CE96 Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 000000341001064C Unknown Unknown
> > > > > Unknown
> > > > > libdl.so.2 0000003410800F9A Unknown Unknown
> > > > > Unknown
> > > > > ld-linux-x86-64.s 000000341000CE96 Unknown Unknown
> > > > > Unknown
> > > > > libdl.so.2 000000341080150D Unknown Unknown
> > > > > Unknown
> > > > > libdl.so.2 0000003410800F11 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 0000003411807151 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 0000003411807410 Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180825F Unknown Unknown
> > > > > Unknown
> > > > > libibverbs.so.1 000000341180707B Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E8701298 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E86C7992 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E87001DC Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E8686B6D Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E86D6715 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E86D1116 Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E86D066A Unknown Unknown
> > > > > Unknown
> > > > > libmpich.so.1.1 00002BA2E86D0560 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400F61 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400EFC Unknown Unknown
> > > > > Unknown
> > > > > libc.so.6 000000341041D994 Unknown Unknown
> > > > > Unknown
> > > > > hellof 0000000000400E09 Unknown Unknown
> > > > > Unknown
> > > > > coates-a012 1008%
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > mvapich-discuss mailing list
> > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > >
> > > >
> > > >
> > >
> > >
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
> >
>
>
>