[mvapich-discuss] Fail to run MPI program using MVAPICH2-1.5.1

Ting-jen Yen yentj at infowrap.com.tw
Tue Sep 28 23:42:11 EDT 2010


Hello,

  Correction to my own previous post.  MVAPICH2 1.5.1p1 does
work with my simple "hello world" test program.  However, it still
has a problem with the LINPACK benchmark program that comes with
Intel MKL.  So it probably has nothing to do with the
"--enable-romio --with-file-system=lustre" parameters.

I ran the program using:
mpirun_rsh -np 4 hc86 hc86 hc87 hc87 ./xhpl
The program would just hang there.

Related processes on the first node (using "ps auxw"):
------------------------------------------------------
test001 29529  0.0  0.0  21112   760 pts/4    S+   10:56   0:00
mpirun_rsh -np 4 hc86 hc86 hc87 hc87 ./xhpl
test001 29531  0.0  0.0  63832  1092 pts/4    S+   10:56
0:00 /bin/bash -c cd /home/test001/pbs-test/mpi/linpack; /usr/bin
test001 29532  0.0  0.0  58372  3224 pts/4    S+   10:56
0:00 /usr/bin/ssh -q hc87 cd /home/test001/pbs-test/mpi/linpack;
test001 29533  0.0  0.0  23180   904 pts/4    S+   10:56
0:00 /opt/mvapich2-1.5.1p1/bin/mpispawn 0
test001 29536 99.6  0.5  81352 41664 pts/4    RLl+ 10:56   0:59 ./xhpl
test001 29537  100  0.4  80824 37000 pts/4    RLl+ 10:56   1:00 ./xhpl
-------------------------------------------------------
Related processes on the second node:
----------------------------------------------------
test001 12578  0.0  0.0  63832  1096 ?        Ss   10:56   0:00 bash -c
cd /home/test001/pbs-test/mpi/linpack; /usr/bin/env 
test001 12579  0.1  0.0  23180   896 ?        S    10:56
0:00 /opt/mvapich2-1.5.1p1/bin/mpispawn 0
test001 12580  0.0  0.4  74108 37052 ?        SLl  10:56   0:00 ./xhpl
test001 12581  0.0  0.4  74108 37052 ?        SLl  10:56   0:00 ./xhpl
----------------------------------------------------

If I switch to MVAPICH2 1.2p1, using the same configure parameters,
the resulting program does work.  The configuration for both MVAPICH2
versions is:
--prefix=/opt/mvapich2-version --with-rdma=gen2
--with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 CC=icc
CXX=icpc FC=ifort F77=ifort F90=ifort
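
Spelled out as a single ./configure command (the "version" part of the
prefix just stands for the respective version number), that is:

./configure --prefix=/opt/mvapich2-version --with-rdma=gen2 \
    --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
    CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort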

Any idea what might have caused this?
How do I produce a backtrace of the MPI processes?
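
I assume attaching gdb to one of the hung ranks and dumping its stacks
would do, roughly like this (PID taken from the "ps" listing above; this
is just a generic gdb attach, nothing MVAPICH2-specific), but please
correct me if there is a recommended way:

gdb -p 29536 -batch -ex "thread apply all bt"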

Thanks

-- Ting-jen
On Tue, 2010-09-28 at 23:45 +0800, Ting-jen Yen wrote:
> Hello,
> 
>    Thanks for your response.  I tried recompiling the whole thing
> later, but removed the "--enable-romio --with-file-system=lustre" options
> from the ./configure line (which I had added because I thought they might
> improve performance on our system, which uses a Lustre file system), and
> the MPI program seems to work now.  I guess I shouldn't touch options
> I don't understand very well.
> 
> Thanks.
> 
> Ting-jen
> 
> On 2010/9/28 at 10:12 PM, Jonathan Perkins wrote:
> > Hello, can you provide us with the backtrace of the mpi process(es)?
> > Also, I'd like to know how these are being launched (which launcher,
> > number of processes, etc...) and which processes you actually see
> > running on each machine.  Thanks.
> >
> > On Tue, Sep 28, 2010 at 2:48 AM, Ting-jen Yen<yentj at infowrap.com.tw>  wrote:
> >>
> >> We are setting up a cluster with InfiniBand interconnection.
> >> The OS we are using is CentOS 5.4, along with the OpenIB
> >> driver coming with it.
> >>
> >> We managed to compile MVAPICH2 1.5.1 without any problem.  But
> >> when we used this MVAPICH2 to compile a simple "hello world" MPI
> >> program and tried to run it, the program just hung there
> >> if we used more than one machine.  (It ran OK when using only
> >> one machine.)  When we checked the processes using 'ps', we noticed
> >> that the processes of the MPI program on the first machine were using
> >> almost 100% CPU time, while those on the remaining machines were
> >> using 0% CPU time.  It seems that the program stopped in the "MPI_Init"
> >> function.
> >>
> >> We tried MVAPICH 1.1 as well as an older version of MVAPICH2, 1.2p1.
> >> Neither of these has the same problem; both work fine.
> >>
> >> Any idea what may cause such a problem?
> >>
> >> (The compiler we used is Intel Compiler V11.1.  I do not have
> >> the details of the InfiniBand HCA right now, though according to the
> >> 'lspci' command, it uses a "Mellanox MT25208 InfiniHost III Ex" chip.)
> >>
> >> Thanks,
> >>
> >> Ting-jen
> >>
> 


