[mvapich-discuss] scaling problem and stray mpd daemon

wei huang huanwei at cse.ohio-state.edu
Sat Oct 28 13:30:15 EDT 2006


Hi Vishwas,

Thanks for using mvapich2.

Could you please give us more details about your setup? Which version of
mvapich2 are you using? Which CFLAGS did you use in our compilation
script, and did you change anything else in the default compilation
script for vapi?

It is normal to see many mpd processes while a job is active. Could you
please run some simple programs, say cpi or pallas, on more than 32
processes to make sure the setup is correct? Also, please try removing
--ncpus from your mpdboot command and see whether your application then
starts on more than 32 processes.
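
A minimal sketch of that check, in case it helps. The host file name
(mpd.hosts), the location of cpi (./cpi) and the process count (64) are
placeholders I am assuming, not values from your setup, and the dry-run
wrapper is only there for illustration. By default it prints the commands
without running them; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: every value below is an assumed placeholder, adjust to your site.
HOSTFILE="${HOSTFILE:-mpd.hosts}"   # assumed: one hostname per line
NODES="${NODES:-32}"                # number of mpds to start, one per node
NPROCS="${NPROCS:-64}"              # any count above 32 exercises the problem
CPI="${CPI:-./cpi}"                 # assumed path to the cpi example binary
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

scaling_check() {
    # Boot the ring without --ncpus, as suggested above.
    run mpdboot --totalnum="$NODES" --file="$HOSTFILE" --verbose
    run mpdtrace -l                    # list the mpds that joined the ring
    run mpiexec -n "$NPROCS" "$CPI"    # simple test on more than 32 processes
    run mpdallexit                     # shut the ring down afterwards
}

scaling_check
```

If cpi completes with 64 processes but your application still hangs, that
points at the application or its launch options rather than the mpd ring.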

Also, the whole InfiniBand community is moving towards the Gen2
(OpenFabrics) stack, so we suggest upgrading your system to Gen2. MVAPICH2
on the Gen2 stack generally has more features and better performance.
Detailed instructions for setting up the Gen2 stack can be found on the
OpenFabrics website:

http://www.openib.org/downloads.html

Thanks

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Sat, 28 Oct 2006, Vishwas wrote:

> Hi,
>
> I was not clear.
>
> a. I have a 128-core (32-node) machine on InfiniBand. I have used the VAPI
> stack of mvapich2. I used the following command to start the mpd daemons
> on the nodes:
>
>     mpdboot --totalnum=32 --file=< mpd.hosts file with path >  --mpd=< path to mpd on local machines >  --verbose  --ncpus=4 --ifhn=infinigj
>
>     The problem I am facing is that if I submit a job (a simple farming
> kind of job) with -np greater than 32, the job gets stuck and never ends
> (with fewer than 32 it runs).
>
>     Also, I see lots of mpd daemons start running on the nodes once a job
> is submitted.
>
>
>
> Vishwas
>
>
>
>   _____
>
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
> Sent: Saturday, October 28, 2006 2:08 PM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] scaling problem and stray mpd daemon
>
>
>
> Hello,
>
>
>
> I am using mvapich2 for the InfiniBand interconnect.
> a. I have a 128-core (32-node) machine on InfiniBand. I have used the VAPI
> stack of mvapich2. I used the following command to start the mpd daemons
> on the nodes:
>
>     mpdboot --totalnum=32 --file=< mpd.hosts file with path >  --mpd=< path to mpd on local machines >  --verbose  --ncpus=4 --ifhn=infinigj
>
>     The problem I am facing is that if I submit a job (a simple farming
> kind of job) using totalnum >= 32, the job gets stuck and never ends
> (with fewer than 32 it runs).
>
>     Also, I see lots of mpd daemons start running on the nodes once a job
> is submitted.
>
> b. If I am correct, mpdallexit causes all mpds in the ring to exit, and
> mpdcleanup removes the sockets on the local and remote machines.
> But even after I run these, I still see lots of mpd daemons running on
> the master (I check with ps -ef | grep mpd).
> How can I clean this up? (For now I am using kill <pid>.)
>
>
>
> Vishwas
>
>
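
On point (b) above, here is a minimal sketch for listing leftover mpd
processes before killing anything. The helper name stray_mpd_pids is
hypothetical, and I am assuming that mpd shows up in the process list
either as mpd or as a Python script named mpd.py. It only prints PIDs, so
you can inspect them first.

```shell
#!/bin/sh
# stray_mpd_pids USER
# Reads "PID USER COMMAND..." lines on stdin (the layout printed by
# `ps -eo pid,user,args`) and prints the PIDs of USER's mpd processes.
# Hypothetical helper; the mpd / mpd.py naming is an assumption.
stray_mpd_pids() {
    awk -v user="$1" '
        $2 == user && $3 !~ /grep$/ {
            # Look for an argument that is exactly mpd or mpd.py,
            # optionally preceded by a directory path.
            for (i = 3; i <= NF; i++) {
                if ($i ~ /(^|\/)mpd(\.py)?$/) { print $1; break }
            }
        }'
}

# List (do not kill) the current user's mpd processes on this node.
ps -eo pid,user,args | stray_mpd_pids "$(id -un)"
```

Once the list looks right, the PIDs can be passed to kill by hand, which is
the same thing you are doing now but without matching mpdboot, grep, or
other users' processes.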



