[mvapich-discuss] program hangs when using mvapich with a large number of processes

Dhabaleswar Panda panda at cse.ohio-state.edu
Sat Jan 23 09:18:14 EST 2010


You are using the uDAPL interface of the MVAPICH2 stack. All of our design and
development of the latest features takes place on the most commonly used
OpenFabrics-Gen2 (IB/iWARP) interface. You should start using this interface to
get the best performance and scalability on your cluster. Please switch to it
and let us know whether you still see the problem.
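
A rough sketch of such a rebuild, assuming the --with-rdma=gen2 configure option
of the 1.4 series and reusing the compiler settings and install prefix from the
build quoted below:

./configure CC=pgcc F77=pgf77 F90=pgf90 CXX=pgCC \
    --prefix=/data02/home/wmwang/test/mvapich2 \
    --with-rdma=gen2 --enable-romio    # Gen2 (OpenFabrics verbs) instead of udapl
make
make install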

Thanks,

DK

On Sat, 23 Jan 2010, Weimin Wang wrote:

> Hello, Dhabaleswar,
> Thank you for your suggestion.
>
> I have downloaded MVAPICH2 1.4 from the OSU site and compiled it with:
>
> ./configure CC=pgcc F77=pgf77 F90=pgf90 CXX=pgCC \
>     --prefix=/data02/home/wmwang/test/mvapich2 --with-rdma=udapl --enable-romio
> make
> make install
>
> When trying this build, I got the same behavior: the program hung with 32
> processes.
>
> I am contacting my system administrator about the InfiniBand adapter type and
> will let you know later.
>
> Thank you.
>
> Yours,
> Weimin
>
> On Sat, Jan 23, 2010 at 1:03 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
>
> > Can you try the latest nightly tarball of the bugfix branch of MVAPICH2 1.4
> > (from the following URL) and let us know whether the issue persists?
> >
> > http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.4
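
(For reference, a rough sketch of fetching and building a nightly tarball from
that URL; the tarball name and the separate install prefix below are placeholders,
not actual file names:)

wget http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.4/mvapich2-1.4-<date>.tar.gz
tar xzf mvapich2-1.4-<date>.tar.gz && cd mvapich2-1.4-<date>
./configure CC=pgcc F77=pgf77 F90=pgf90 CXX=pgCC --prefix=$HOME/test/mvapich2-nightly
make && make install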
> >
> > I am also assuming that you are using the OpenFabrics-Gen2 interface of
> > this release. Please confirm.
> >
> > Thanks,
> >
> > DK
> >
> > On Sat, 23 Jan 2010, Weimin Wang wrote:
> >
> > > Hello, Dhabaleswar,
> > >
> > > Thank you for your response.
> > >
> > > The version of MVAPICH2 I am using is 1.4. I do not know the IB adapter
> > > type of my cluster. When running ifconfig, I get:
> > >
> > > wmwang@node73:~/meteo/mvapich2-1.4> ifconfig -a
> > > ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
> > >           inet addr:10.10.10.73  Bcast:10.255.255.255  Mask:255.0.0.0
> > >           inet6 addr: fe80::202:c903:5:5271/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> > >           RX packets:15383132 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:12294382 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:256
> > >           RX bytes:15352949444 (14641.7 Mb)  TX bytes:130554397150 (124506.3 Mb)
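
(For reference, the adapter model is usually easier to identify with the
InfiniBand diagnostic tools than with ifconfig; assuming the standard OFED
utilities are installed on the node, something like the following reports it:)

ibv_devinfo                                  # HCA device name (hca_id), firmware, port state
ibstat                                       # CA type, ports, link rate
lspci | grep -i -e infiniband -e mellanox    # PCI device string for the adapter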
> > > Thank you.
> > >
> > > Bests,
> > > Weimin
> > >
> > > On Sat, Jan 23, 2010 at 5:52 AM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
> > >
> > > > Can you tell us the MVAPICH2 version you are using? Also, can you tell us
> > > > the IB adapter type used in your system?
> > > >
> > > > Thanks,
> > > >
> > > > DK
> > > >
> > > > On Fri, 22 Jan 2010, Weimin Wang wrote:
> > > >
> > > > > Hello, list,
> > > > >
> > > > > I have got a strange problem with mvapich2. For the cpi example, when I
> > > > > run it with a small number of processes, it is OK:
> > > > >
> > > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 2 -hostfile ./ma ./cpi
> > > > > Process 0 on node32
> > > > > Process 1 on node32
> > > > > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > > > > wall clock time = 0.000174
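
(For reference, the file given to -hostfile is plain text with one hostname per
line, and listing a host several times places several ranks on it; a hypothetical
two-node file, just to illustrate the format:)

node32
node32
node33
node33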
> > > > >
> > > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 10 -hostfile ./ma ./cpi
> > > > > Process 8 on node33
> > > > > pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> > > > > wall clock time = 0.000127
> > > > > Process 1 on node32
> > > > > Process 3 on node32
> > > > > Process 0 on node32
> > > > > Process 4 on node32
> > > > > Process 2 on node32
> > > > > Process 6 on node32
> > > > > Process 5 on node32
> > > > > Process 7 on node32
> > > > > Process 9 on node33
> > > > >
> > > > > However, when I run cpi with a large number of processes, the program
> > > > > hangs with no output:
> > > > >
> > > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 18 -hostfile ./ma ./cpi
> > > > >
> > > > > And the top command on node32 shows that:
> > > > >
> > > > >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > > > 14507 wmwang    15   0 60336  50m  676 S   56  0.2   0:03.86 mpispawn
> > > > >
> > > > > The system I am using is:
> > > > >
> > > > > wmwang@node33:~> uname -a
> > > > > Linux node33 2.6.16.60-0.42.4_lustre.1.8.1.1-smp #1 SMP Fri Aug 14 08:33:26 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> > > > >
> > > > > The compiler is PGI v10.0.
> > > > >
> > > > > Could you please give me a hint about this problem?
> > > > >
> > > >
> > > >
> > >
> >
> >
>


