[mvapich-discuss] mpirun_rsh and Chelsio cluster running RHEL5
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Wed Nov 4 20:49:51 EST 2009
On Wed, Nov 04, 2009 at 08:31:20PM -0500, Bryan Putnam wrote:
> On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:
>
> > Bryan - Thanks for the report. This seems to be an issue when PBS is being
> > used with mpirun_rsh of mvapich2. Are you able to launch jobs using
> > mpirun_rsh directly as outlined in the mvapich2 user guide in the
> > following section:
> >
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-170005.2.1
>
> DK,
>
> Yes, I see the same problem if I do
>
> mpirun_rsh -np 2 host1 host2 ./a.out
>
> Note that PBS is actually not involved at the point I was running the
> example. PBS simply set up the file $PBS_NODEFILE (which is a list of
> hosts). If I do
>
> cat $PBS_NODEFILE > ./hostfile
> mpirun_rsh -hostfile ./hostfile -np 2 ./a.out
>
> I see the same problem. Note that mpirun_rsh does work as expected on our
> other cluster which is Infiniband rather than iWARP. Both clusters are
> RHEL5. Please let me know if there is additional info you need.

I didn't realize that you were using iWARP the last time you posted
this. I think the issue is related to a variable not being set on the
mpirun_rsh command line.

Try using...

    mpirun_rsh -np 2 host1 host2 MV2_USE_IWARP_MODE=1 ./a.out

Please see the following link for more information:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-220005.2.6
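
For the hostfile form used earlier in the thread, the variable should go in
the same place: after the launcher options and before the executable. Below
is a rough dry-run sketch that only prints the command line so the argument
order can be checked before launching anything. The helper name iwarp_cmd is
hypothetical (it is not part of mvapich2), and ./hostfile and ./hellof are
just the names from the examples in this thread.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the mpirun_rsh command line
# instead of executing it.
iwarp_cmd() {
    hostfile=$1   # list of hosts, e.g. a copy of $PBS_NODEFILE
    nprocs=$2     # number of MPI processes to start
    shift 2       # whatever remains is the executable and its arguments
    # NAME=VALUE settings sit between the launcher options and the executable.
    echo mpirun_rsh -hostfile "$hostfile" -np "$nprocs" MV2_USE_IWARP_MODE=1 "$@"
}

iwarp_cmd ./hostfile 8 ./hellof
```

If the printed line looks right, running it without the echo should launch
the job with iWARP mode enabled.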
>
> Thanks,
> Bryan
>
> >
> > Thanks,
> >
> > DK
> >
> >
> > On Wed, 4 Nov 2009, Bryan Putnam wrote:
> >
> > > Hi All,
> > >
> > > When attempting to use "mpirun_rsh" (from mvapich2-1.4 or mvapich2-1.4rc2)
> > > on a Chelsio cluster, I get errors which are reproduced below. Note that I
> > > don't see these errors with mvapich2 when using MPD/mpiexec or the
> > > mpiexec.hydra launcher from the mpich2 distribution.
> > >
> > > Thanks,
> > > Bryan
> > > ===============================================
> > >
> > > coates-a012 1005% mpif90 hellof.f -o hellof
> > > coates-a012 1006% cat $PBS_NODEFILE
> > > coates-a012
> > > coates-a012
> > > coates-a012
> > > coates-a012
> > > coates-a040
> > > coates-a040
> > > coates-a040
> > > coates-a040
> > > coates-a012 1007% mpirun_rsh -hostfile $PBS_NODEFILE -np 8 ./hellof
> > > Fatal error in MPI_Init:
> > > Error message texts are not available
> > > Exit code -5 signaled from coates-a040
> > > MPI process (rank: 4) terminated unexpectedly on coates-a040.rcac.purdue.edu
> > > Fatal error in MPI_Init:
> > > Error message texts are not available
> > > Fatal error in MPI_Init:
> > > Error message texts are not available
> > > MPI process (rank: 0) terminated unexpectedly on coates-a012.rcac.purdue.edu
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image PC Routine Line Source
> > > libc.so.6 000000341045492F Unknown Unknown Unknown
> > > libc.so.6 0000003410463C15 Unknown Unknown Unknown
> > > libc.so.6 000000341045EF68 Unknown Unknown Unknown
> > > libmthca-rdmav2.s 00002B000ADB7C26 Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180722C Unknown Unknown Unknown
> > > libibverbs.so.1 00000034118082A3 Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180707B Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C76298 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C3C992 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C751DC Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009BFBB6D Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C4B715 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C46116 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C4566A Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B0009C45560 Unknown Unknown Unknown
> > > hellof 0000000000400F61 Unknown Unknown Unknown
> > > hellof 0000000000400EFC Unknown Unknown Unknown
> > > libc.so.6 000000341041D994 Unknown Unknown Unknown
> > > hellof 0000000000400E09 Unknown Unknown Unknown
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image PC Routine Line Source
> > > libpthread.so.0 0000003410C0D590 Unknown Unknown Unknown
> > > libibverbs.so.1 00000034118090D4 Unknown Unknown Unknown
> > > libcxgb3-rdmav2.s 00002B7457929C78 Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180722C Unknown Unknown Unknown
> > > libibverbs.so.1 00000034118082A3 Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180707B Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565E1298 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565A7992 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565E01DC Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B7456566B6D Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565B6715 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565B1116 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565B066A Unknown Unknown Unknown
> > > libmpich.so.1.1 00002B74565B0560 Unknown Unknown Unknown
> > > hellof 0000000000400F61 Unknown Unknown Unknown
> > > hellof 0000000000400EFC Unknown Unknown Unknown
> > > libc.so.6 000000341041D994 Unknown Unknown Unknown
> > > hellof 0000000000400E09 Unknown Unknown Unknown
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image PC Routine Line Source
> > > ld-linux-x86-64.s 0000003410014BA6 Unknown Unknown Unknown
> > > ld-linux-x86-64.s 000000341000610F Unknown Unknown Unknown
> > > ld-linux-x86-64.s 0000003410007D33 Unknown Unknown Unknown
> > > ld-linux-x86-64.s 0000003410010C4D Unknown Unknown Unknown
> > > ld-linux-x86-64.s 000000341000CE96 Unknown Unknown Unknown
> > > ld-linux-x86-64.s 000000341001064C Unknown Unknown Unknown
> > > libdl.so.2 0000003410800F9A Unknown Unknown Unknown
> > > ld-linux-x86-64.s 000000341000CE96 Unknown Unknown Unknown
> > > libdl.so.2 000000341080150D Unknown Unknown Unknown
> > > libdl.so.2 0000003410800F11 Unknown Unknown Unknown
> > > libibverbs.so.1 0000003411807151 Unknown Unknown Unknown
> > > libibverbs.so.1 0000003411807410 Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180825F Unknown Unknown Unknown
> > > libibverbs.so.1 000000341180707B Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E8701298 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E86C7992 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E87001DC Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E8686B6D Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E86D6715 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E86D1116 Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E86D066A Unknown Unknown Unknown
> > > libmpich.so.1.1 00002BA2E86D0560 Unknown Unknown Unknown
> > > hellof 0000000000400F61 Unknown Unknown Unknown
> > > hellof 0000000000400EFC Unknown Unknown Unknown
> > > libc.so.6 000000341041D994 Unknown Unknown Unknown
> > > hellof 0000000000400E09 Unknown Unknown Unknown
> > > coates-a012 1008%
> > >
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > >
> >
> >
>
>
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo