[mvapich-discuss] scaling problem and stray mpd daemon

Vishwas vvasisht at locuz.com
Sat Oct 28 04:38:07 EDT 2006


Hello,

I am using mvapich2 for the InfiniBand interconnect.
a. I have a 128-core (32-node) machine on InfiniBand, and I am using the
VAPI stack of mvapich2. I used the following command to start the mpd
daemons on the nodes:

    mpdboot --totalnum=32 --file=<mpd.hosts file with path> --mpd=<path to mpd on local machines> --verbose --ncpus=4 --ifhn=infinigj

    The problem is that if I submit a job (a simple farming-type job) with
totalnum >= 32, the job hangs and never finishes. With fewer than 32 it
runs fine.

    Also, once a job is submitted, I see many extra mpd daemons start
running on the nodes.

b. If I understand correctly, mpdallexit causes all mpds in the ring to
exit, and mpdcleanup removes the sockets on the local and remote machines.
But even after running both, I still see many mpd daemons running on the
master node (I check with ps -ef | grep mpd).
How can I clean these up? For now I am killing them with kill <pid>.
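As a stopgap, the manual "ps -ef | grep mpd, then kill <pid>" step can be scripted. The following is a minimal Python sketch, not part of mvapich2 and not a replacement for mpdallexit/mpdcleanup. It assumes Linux `ps -ef` column layout (UID PID PPID C STIME TTY TIME CMD) and that the daemon appears in the command line as `mpd` or `.../mpd.py` (MPD is a Python program); the helper names `is_mpd_command`, `find_mpd_pids` and `kill_pids` are hypothetical. Unlike a plain `grep mpd`, it skips the grep process itself and commands such as mpdboot.

```python
import os
import signal
from typing import Iterable, List

# Hypothetical helpers automating the manual "ps -ef | grep mpd" + "kill <pid>"
# cleanup. Assumption: Linux `ps -ef` output (UID PID PPID C STIME TTY TIME CMD).

MPD_NAMES = ("mpd", "mpd.py")  # assumption: daemon runs as "mpd" or "python .../mpd.py"


def is_mpd_command(cmd: str) -> bool:
    """True if the command line runs the mpd daemon itself (not mpdboot, grep, ...)."""
    tokens = cmd.split()
    if not tokens or os.path.basename(tokens[0]) == "grep":
        return False  # skip the `grep mpd` process that `ps -ef | grep mpd` also lists
    return any(os.path.basename(tok) in MPD_NAMES for tok in tokens)


def find_mpd_pids(ps_output: str) -> List[int]:
    """Parse `ps -ef` text and return the PIDs of running mpd daemons."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # CMD (8th column) may itself contain spaces
        if len(fields) == 8 and is_mpd_command(fields[7]):
            pids.append(int(fields[1]))
    return pids


def kill_pids(pids: Iterable[int], sig: int = signal.SIGTERM) -> None:
    """Send `sig` to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process is already gone
```

On the master node, after mpdallexit and mpdcleanup, one might feed it `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`, print the returned PIDs to check them, and only then pass them to `kill_pids`. This only removes the leftover processes; it does not explain why they are created in the first place.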

Vishwas


