[mvapich-discuss] program hangs using mvapich with a large number of processes

Weimin Wang wmwang at gmail.com
Sat Jan 23 01:02:07 EST 2010


Hello, Dhabaleswar,
Thank you for your suggestion.

I have downloaded this MVAPICH2 1.4 from the OSU site and compiled it with:

./configure CC=pgcc F77=pgf77 F90=pgf90 CXX=pgCC \
    --prefix=/data02/home/wmwang/test/mvapich2 --with-rdma=udapl --enable-romio
make
make install

When trying this MVAPICH2 build, I got the same behavior: the program hung
with 32 processes.
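
Since you also ask below about the OpenFabrics-Gen2 interface: the build above
uses the uDAPL transport (--with-rdma=udapl). If our cluster is in fact running
an OpenFabrics/Gen2 stack, I can try a rebuild against the Gen2 transport along
these lines (a sketch only; I am assuming the 1.4 configure script accepts
--with-rdma=gen2, and the install prefix here is just a hypothetical variant of
the one above):

# Hypothetical Gen2 (OpenFabrics verbs) build instead of uDAPL.
./configure CC=pgcc F77=pgf77 F90=pgf90 CXX=pgCC \
    --prefix=/data02/home/wmwang/test/mvapich2-gen2 --with-rdma=gen2 --enable-romio
make
make install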

I am contacting my system administrator about the InfiniBand adapter type and
will let you know later.
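
In the meantime, if the OFED diagnostic tools (ibstat from infiniband-diags and
ibv_devinfo from libibverbs-utils) happen to be installed on the compute nodes,
I will also try to read the adapter type directly. A rough sketch, assuming the
tools are present on our nodes:

# Standard OFED utilities; I am not yet sure they are installed on our nodes.
ibstat
ibv_devinfo | grep -E 'hca_id|board_id'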

Thank you.

Yours,
Weimin

On Sat, Jan 23, 2010 at 1:03 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:

> Can you try the latest nightly tarball of the bugfix branch version of
> MVAPICH2 1.4 (from the following URL) and let us know whether the issue
> persists.
>
> http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.4
>
> I am also assuming that you are using the OpenFabrics-Gen2 interface of
> this release. Please confirm.
>
> Thanks,
>
> DK
>
> On Sat, 23 Jan 2010, Weimin Wang wrote:
>
> > Hello, Dhabaleswar,
> >
> > Thank you for your response.
> >
> > The version of MVAPICH2 I am using is 1.4. I do not know the IB adapter
> > type of my cluster. When running ifconfig, I get:
> >
> > wmwang@node73:~/meteo/mvapich2-1.4> ifconfig -a
> > ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:10.10.10.73  Bcast:10.255.255.255  Mask:255.0.0.0
> >           inet6 addr: fe80::202:c903:5:5271/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:15383132 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:12294382 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:256
> >           RX bytes:15352949444 (14641.7 Mb)  TX bytes:130554397150 (124506.3 Mb)
> >
> > Thank you.
> >
> > Best regards,
> > Weimin
> >
> > On Sat, Jan 23, 2010 at 5:52 AM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
> >
> > > Can you tell us the MVAPICH2 version you are using? Also, can you tell us
> > > the IB adapter type used in your system?
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > > On Fri, 22 Jan 2010, Weimin Wang wrote:
> > >
> > > > Hello, list,
> > > >
> > > > I have got a strange problem with mvapich2. For the cpi example, when I
> > > > run it with a small number of processes, it is OK:
> > > >
> > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 2 -hostfile ./ma ./cpi
> > > > Process 0 on node32
> > > > Process 1 on node32
> > > > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > > > wall clock time = 0.000174
> > > >
> > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 10 -hostfile ./ma ./cpi
> > > > Process 8 on node33
> > > > pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> > > > wall clock time = 0.000127
> > > > Process 1 on node32
> > > > Process 3 on node32
> > > > Process 0 on node32
> > > > Process 4 on node32
> > > > Process 2 on node32
> > > > Process 6 on node32
> > > > Process 5 on node32
> > > > Process 7 on node32
> > > > Process 9 on node33
> > > >
> > > > However, when I run cpi with a large number of processes, the program
> > > > hangs with no output:
> > > >
> > > > wmwang@node32:~/test> mpirun_rsh -ssh -np 18 -hostfile ./ma ./cpi
> > > >
> > > > And the top command on node32 shows:
> > > >
> > > >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > > 14507 wmwang    15   0 60336  50m  676 S   56  0.2   0:03.86 mpispawn
> > > >
> > > > The system I am using is:
> > > >
> > > > wmwang@node33:~> uname -a
> > > > Linux node33 2.6.16.60-0.42.4_lustre.1.8.1.1-smp #1 SMP Fri Aug 14 08:33:26 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> > > >
> > > > The compiler is PGI v10.0.
> > > >
> > > > Would you please give me any hints on this problem?
> > > >
> > >
> > >
> >
>
>