[mvapich-discuss] mpirun_rsh and Chelsio cluster running RHEL5

Bryan Putnam bfp at purdue.edu
Wed Nov 4 21:35:10 EST 2009


On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:

> > On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:
> >
> > > Bryan - Thanks for the report. This seems to be an issue when PBS is being
> > > used with mpirun_rsh of mvapich2.  Are you able to launch jobs using
> > > mpirun_rsh directly as outlined in the mvapich2 user guide in the
> > > following section:
> > >
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-170005.2.1
> >
> > DK,
> >
> > Yes, I see the same problem if I do
> >
> > mpirun_rsh -np 2 host1 host2 ./a.out
> >
> > Note that PBS is actually not involved at the point I was running the
> > example. PBS simply set up the file $PBS_NODEFILE (which is a list of
> > hosts). If I do
> >
> > cat $PBS_NODEFILE > ./hostfile
> > mpirun_rsh -hostfile ./hostfile -np 2 ./a.out
> >
> > I see the same problem. Note that mpirun_rsh does work as expected on our
> > other cluster which is InfiniBand rather than iWARP. Both clusters are
> > RHEL5. Please let me know if there is additional info you need.
> 
> You need to use MV2_USE_IWARP_MODE=1 in the mpirun_rsh command. You also
> need to create the file /etc/mv2.conf with the local IP address to be used
> by RDMA_CM.  Section 5.2.6 of MVAPICH2 1.4 user guide indicates how to use
> mpirun_rsh with iWARP devices.
> 
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-220005.2.6
> 
> Let us know whether the steps indicated in this section help you.

DK,

Thanks, we already have the /etc/mv2.conf file set up. The problem seems
to have been that MV2_USE_IWARP_MODE needs to be set on the mpirun_rsh
command line rather than in the environment.

Thanks,
Bryan
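
For reference, a minimal sketch of the two invocations being contrasted
above. The host names (host1, host2) and the binary (./a.out) are the
placeholders from the earlier example in this thread, not real node names,
and the script only builds and prints the command line rather than
launching anything. The placement of the variable between the host list
and the executable is assumed from the mpirun_rsh usage described in the
user guide section cited above.

```shell
#!/bin/sh
# Placeholder hosts and binary, reused from the earlier example in the thread.
HOSTS="host1 host2"
APP="./a.out"

# Reported as NOT sufficient on the iWARP cluster: setting the variable
# only in the environment of the launching shell.
#   export MV2_USE_IWARP_MODE=1
#   mpirun_rsh -np 2 $HOSTS $APP

# Reported as working: passing the variable on the mpirun_rsh command line,
# after the host list and before the executable.
CMD="mpirun_rsh -np 2 $HOSTS MV2_USE_IWARP_MODE=1 $APP"

# Print the command instead of running it, so the sketch is safe to try
# on a machine without mvapich2 installed.
echo "$CMD"
```

To launch for real, run the printed command on a node where mpirun_rsh is
in the PATH and /etc/mv2.conf is already in place.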

>
> Thanks,
> 
> DK
> 
> 
> > Thanks,
> > Bryan
> >
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > >
> > > On Wed, 4 Nov 2009, Bryan Putnam wrote:
> > >
> > > > Hi All,
> > > >
> > > > When attempting to use "mpirun_rsh" (from mvapich2-1.4 or mvapich2-1.4rc2)
> > > > on a Chelsio cluster, I get errors which are reproduced below. Note that I
> > > > don't see these errors with mvapich2 when using MPD/mpiexec or the
> > > > mpiexec.hydra launcher from the mpich2 distribution.
> > > >
> > > > Thanks,
> > > > Bryan
> > > > ===============================================
> > > >
> > > > coates-a012 1005% mpif90 hellof.f -o hellof
> > > > coates-a012 1006% cat $PBS_NODEFILE
> > > > coates-a012
> > > > coates-a012
> > > > coates-a012
> > > > coates-a012
> > > > coates-a040
> > > > coates-a040
> > > > coates-a040
> > > > coates-a040
> > > > coates-a012 1007% mpirun_rsh -hostfile $PBS_NODEFILE -np 8 ./hellof
> > > > Fatal error in MPI_Init:
> > > > Error message texts are not available
> > > > Exit code -5 signaled from coates-a040
> > > > MPI process (rank: 4) terminated unexpectedly on
> > > > coates-a040.rcac.purdue.edu
> > > > Fatal error in MPI_Init:
> > > > Error message texts are not available
> > > > Fatal error in MPI_Init:
> > > > Error message texts are not available
> > > > MPI process (rank: 0) terminated unexpectedly on
> > > > coates-a012.rcac.purdue.edu
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > Image              PC                Routine            Line        Source
> > > > libc.so.6          000000341045492F  Unknown               Unknown
> > > > Unknown
> > > > libc.so.6          0000003410463C15  Unknown               Unknown
> > > > Unknown
> > > > libc.so.6          000000341045EF68  Unknown               Unknown
> > > > Unknown
> > > > libmthca-rdmav2.s  00002B000ADB7C26  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180722C  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C76298  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C3C992  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C751DC  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009BFBB6D  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C4B715  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C46116  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C4566A  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B0009C45560  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400F61  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > Unknown
> > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400E09  Unknown               Unknown
> > > > Unknown
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > Image              PC                Routine            Line        Source
> > > > libpthread.so.0    0000003410C0D590  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    00000034118090D4  Unknown               Unknown
> > > > Unknown
> > > > libcxgb3-rdmav2.s  00002B7457929C78  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180722C  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565E1298  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565A7992  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565E01DC  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B7456566B6D  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565B6715  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565B1116  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565B066A  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002B74565B0560  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400F61  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > Unknown
> > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400E09  Unknown               Unknown
> > > > Unknown
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > Image              PC                Routine            Line        Source
> > > > ld-linux-x86-64.s  0000003410014BA6  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  000000341000610F  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  0000003410007D33  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  0000003410010C4D  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  000000341001064C  Unknown               Unknown
> > > > Unknown
> > > > libdl.so.2         0000003410800F9A  Unknown               Unknown
> > > > Unknown
> > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown
> > > > Unknown
> > > > libdl.so.2         000000341080150D  Unknown               Unknown
> > > > Unknown
> > > > libdl.so.2         0000003410800F11  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    0000003411807151  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    0000003411807410  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180825F  Unknown               Unknown
> > > > Unknown
> > > > libibverbs.so.1    000000341180707B  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E8701298  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E86C7992  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E87001DC  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E8686B6D  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E86D6715  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E86D1116  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E86D066A  Unknown               Unknown
> > > > Unknown
> > > > libmpich.so.1.1    00002BA2E86D0560  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400F61  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400EFC  Unknown               Unknown
> > > > Unknown
> > > > libc.so.6          000000341041D994  Unknown               Unknown
> > > > Unknown
> > > > hellof             0000000000400E09  Unknown               Unknown
> > > > Unknown
> > > > coates-a012 1008%
> > > >
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss at cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > >
> > >
> >
> >
> >
> 
> 



