[mvapich-discuss] program hanged using mvapich with large number of processes
Weimin Wang
wmwang at gmail.com
Fri Jan 22 23:14:43 EST 2010
Hello, Dhabaleswar,
Thank you for your response.
The version of MVAPICH2 I am using is 1.4. I do not know the IB adapter
type of my cluster. Running ifconfig gives:
wmwang at node73:~/meteo/mvapich2-1.4> ifconfig -a
ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:10.10.10.73 Bcast:10.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::202:c903:5:5271/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:15383132 errors:0 dropped:0 overruns:0 frame:0
TX packets:12294382 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:15352949444 (14641.7 Mb) TX bytes:130554397150 (124506.3 Mb)
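Since ifconfig does not report the adapter model, the usual way to query it is through the IB tools rather than the network stack. A hedged sketch, assuming the infiniband-diags and pciutils packages are installed (the `show_ib_adapter` helper name is mine, not a standard command):

```shell
# Hedged sketch: assumes infiniband-diags (ibstat, ibv_devinfo) and
# pciutils (lspci) are installed; some systems require root for these.
show_ib_adapter() {
    ibv_devinfo 2>/dev/null | grep -i -e hca_id -e board_id   # verbs-level device name
    ibstat 2>/dev/null | grep -i -e 'CA type' -e 'Firmware'   # per-port HCA type and firmware
    lspci 2>/dev/null | grep -i -e infiniband -e mellanox     # PCI-level adapter model
    return 0   # succeed even when no IB hardware is present
}
show_ib_adapter
```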
Thank you.
Best regards,
Weimin
On Sat, Jan 23, 2010 at 5:52 AM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
> Can you tell us the MVAPICH2 version you are using? Also, can you tell us
> the IB adapter type used in your system?
>
> Thanks,
>
> DK
>
> On Fri, 22 Jan 2010, Weimin Wang wrote:
>
> > Hello, list,
> >
> > I have run into a strange problem with MVAPICH2. When I run the cpi
> > example with a small number of processes, it works fine:
> >
> > wmwang at node32:~/test> mpirun_rsh -ssh -np 2 -hostfile ./ma ./cpi
> > Process 0 on node32
> > Process 1 on node32
> > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > wall clock time = 0.000174
> >
> > wmwang at node32:~/test> mpirun_rsh -ssh -np 10 -hostfile ./ma ./cpi
> > Process 8 on node33
> > pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> > wall clock time = 0.000127
> > Process 1 on node32
> > Process 3 on node32
> > Process 0 on node32
> > Process 4 on node32
> > Process 2 on node32
> > Process 6 on node32
> > Process 5 on node32
> > Process 7 on node32
> > Process 9 on node33
> > However, when I run cpi with a large number of processes, the program
> > hangs with no output:
> >
> > wmwang at node32:~/test> mpirun_rsh -ssh -np 18 -hostfile ./ma ./cpi
> >
> > And the top command on node32 shows:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 14507 wmwang 15 0 60336 50m 676 S 56 0.2 0:03.86 mpispawn
> > The system I am using is:
> >
> > wmwang at node33:~> uname -a
> > Linux node33 2.6.16.60-0.42.4_lustre.1.8.1.1-smp #1 SMP Fri Aug 14
> > 08:33:26 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> > The compiler is PGI v10.0.
> >
> > Could you please give me a hint about this problem?
> >
>
>