[mvapich-discuss] mpirun_rsh and Chelsio cluster running RHEL5

Bryan Putnam bfp at purdue.edu
Thu Nov 5 08:36:26 EST 2009


On Wed, 4 Nov 2009, Dhabaleswar Panda wrote:

> > > > > Bryan - Thanks for the report. This seems to be an issue when PBS is being
> > > > > used with mpirun_rsh of mvapich2.  Are you able to launch jobs using
> > > > > mpirun_rsh directly as outlined in the mvapich2 user guide in the
> > > > > following section:
> > > > >
> > > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-170005.2.1
> > > >
> > > > DK,
> > > >
> > > > Yes, I see the same problem if I do
> > > >
> > > > mpirun_rsh -np 2 host1 host2 ./a.out
> > > >
> > > > Note that PBS is actually not involved at the point I was running the
> > > > example. PBS simply set up the file $PBS_NODEFILE (which is a list of
> > > > hosts). If I do
> > > >
> > > > cat $PBS_NODEFILE > ./hostfile
> > > > mpirun_rsh -hostfile ./hostfile -np 2 ./a.out
> > > >
> > > > I see the same problem. Note that mpirun_rsh does work as expected on our
> > > > other cluster which is InfiniBand rather than iWARP. Both clusters are
> > > > RHEL5. Please let me know if there is additional info you need.
> > >
> > > I didn't realize that you were using iWARP the last time you posted
> > > this.  I think the issue is related to a variable not being set on the
> > > mpirun_rsh command line.
> > >
> > > Try using...
> > > mpirun_rsh -np 2 host1 host2 MV2_USE_IWARP_MODE=1 ./a.out
> >
> > OK that's interesting. That did appear to fix the problem. I knew that I
> > needed to have MV2_USE_IWARP_MODE=1, but I had it in the environment as
> >
> > coates-adm 1012% env | grep MV
> > MVAPICH2_HOME=/apps/rhel5/mvapich2-1.4/64/ib-intel-11.1.038
> > MV2_USE_SHMEM_COLL=0
> > MV2_USE_RDMA_CM=1
> > MV2_USE_IWARP_MODE=1
> >
> > So, it appears that it actually needs to be on the mpirun_rsh command line
> > as well?
> 
> Bryan,
> 
> Very good to know that the problem got resolved. I would also suggest
> removing the `MV2_USE_SHMEM_COLL=0' restriction to get the best performance
> for collectives and applications on large-scale clusters. By default,
> SHMEM_COLL is on (1) and you should use it.
> 
> Let us know if you encounter any additional problems.

Thanks, DK et al., for all the help.

Actually, I had to set MV2_USE_SHMEM_COLL=0 in order to get the Intel MPI 
benchmarks to complete successfully (when using mpiexec/mpd and 
mpiexec.hydra). I found this to be the case on both our IB cluster and our 
iWARP cluster. Anyway, I'll do some experimenting with mpirun_rsh for wide 
jobs and let you know what I see.
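
A minimal sketch of the fix discussed above, assuming (as this thread suggests) that mpirun_rsh only sees MV2_* settings given as NAME=VALUE arguments before the executable. The helper names mv2_args and build_cmd are hypothetical, not part of MVAPICH2, and the script only prints the command line rather than launching anything. It does not handle values containing spaces.

```shell
#!/bin/sh
# Sketch: copy every MV2_* variable from the current environment onto the
# mpirun_rsh command line as NAME=VALUE arguments.
# mv2_args and build_cmd are hypothetical helper names for illustration.

mv2_args() {
    # Print "NAME=VALUE " for each environment variable starting with MV2_.
    # Sorted so the output order is stable.
    env | grep '^MV2_' | sort | tr '\n' ' '
}

build_cmd() {
    # $1 = number of processes, $2 = hostfile, $3 = executable
    echo "mpirun_rsh -np $1 -hostfile $2 $(mv2_args)$3"
}

# Dry run: print the command instead of executing it.
build_cmd 2 ./hostfile ./a.out
```

With MV2_USE_IWARP_MODE=1 and MV2_USE_RDMA_CM=1 exported, this prints an mpirun_rsh line carrying both settings ahead of ./a.out, which can then be reviewed and run by hand.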

Thanks!
Bryan

> 
> Thanks,
> 
> DK
> 
> > Thanks,
> > Bryan
> >
> >
> > >
> > > Please see the following link for more information:
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4.html#x1-220005.2.6
> > >
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > DK
> > > > >
> > > > >
> > > > > On Wed, 4 Nov 2009, Bryan Putnam wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > When attempting to use "mpirun_rsh" (from mvapich2-1.4 or mvapich2-1.4rc2)
> > > > > > on a Chelsio cluster, I get errors which are reproduced below. Note that I
> > > > > > don't see these errors with mvapich2 when using MPD/mpiexec or the
> > > > > > mpiexec.hydra launcher from the mpich2 distribution.
> > > > > >
> > > > > > Thanks,
> > > > > > Bryan
> > > > > > ===============================================
> > > > > >
> > > > > > coates-a012 1005% mpif90 hellof.f -o hellof
> > > > > > coates-a012 1006% cat $PBS_NODEFILE
> > > > > > coates-a012
> > > > > > coates-a012
> > > > > > coates-a012
> > > > > > coates-a012
> > > > > > coates-a040
> > > > > > coates-a040
> > > > > > coates-a040
> > > > > > coates-a040
> > > > > > coates-a012 1007% mpirun_rsh -hostfile $PBS_NODEFILE -np 8 ./hellof
> > > > > > Fatal error in MPI_Init:
> > > > > > Error message texts are not available
> > > > > > Exit code -5 signaled from coates-a040
> > > > > > > MPI process (rank: 4) terminated unexpectedly on coates-a040.rcac.purdue.edu
> > > > > > Fatal error in MPI_Init:
> > > > > > Error message texts are not available
> > > > > > Fatal error in MPI_Init:
> > > > > > Error message texts are not available
> > > > > > > MPI process (rank: 0) terminated unexpectedly on coates-a012.rcac.purdue.edu
> > > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > > Image              PC                Routine            Line        Source
> > > > > > > libc.so.6          000000341045492F  Unknown               Unknown  Unknown
> > > > > > > libc.so.6          0000003410463C15  Unknown               Unknown  Unknown
> > > > > > > libc.so.6          000000341045EF68  Unknown               Unknown  Unknown
> > > > > > > libmthca-rdmav2.s  00002B000ADB7C26  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180722C  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C76298  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C3C992  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C751DC  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009BFBB6D  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C4B715  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C46116  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C4566A  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B0009C45560  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400F61  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400EFC  Unknown               Unknown  Unknown
> > > > > > > libc.so.6          000000341041D994  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400E09  Unknown               Unknown  Unknown
> > > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > > Image              PC                Routine            Line        Source
> > > > > > > libpthread.so.0    0000003410C0D590  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    00000034118090D4  Unknown               Unknown  Unknown
> > > > > > > libcxgb3-rdmav2.s  00002B7457929C78  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180722C  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    00000034118082A3  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565E1298  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565A7992  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565E01DC  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B7456566B6D  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565B6715  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565B1116  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565B066A  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002B74565B0560  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400F61  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400EFC  Unknown               Unknown  Unknown
> > > > > > > libc.so.6          000000341041D994  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400E09  Unknown               Unknown  Unknown
> > > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > > Image              PC                Routine            Line        Source
> > > > > > > ld-linux-x86-64.s  0000003410014BA6  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  000000341000610F  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  0000003410007D33  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  0000003410010C4D  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  000000341001064C  Unknown               Unknown  Unknown
> > > > > > > libdl.so.2         0000003410800F9A  Unknown               Unknown  Unknown
> > > > > > > ld-linux-x86-64.s  000000341000CE96  Unknown               Unknown  Unknown
> > > > > > > libdl.so.2         000000341080150D  Unknown               Unknown  Unknown
> > > > > > > libdl.so.2         0000003410800F11  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    0000003411807151  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    0000003411807410  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180825F  Unknown               Unknown  Unknown
> > > > > > > libibverbs.so.1    000000341180707B  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E8701298  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E86C7992  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E87001DC  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E8686B6D  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E86D6715  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E86D1116  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E86D066A  Unknown               Unknown  Unknown
> > > > > > > libmpich.so.1.1    00002BA2E86D0560  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400F61  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400EFC  Unknown               Unknown  Unknown
> > > > > > > libc.so.6          000000341041D994  Unknown               Unknown  Unknown
> > > > > > > hellof             0000000000400E09  Unknown               Unknown  Unknown
> > > > > > coates-a012 1008%
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > mvapich-discuss mailing list
> > > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Jonathan Perkins
> > > http://www.cse.ohio-state.edu/~perkinjo
> > >
> >
> >
> >
> 
> 



