[mvapich-discuss] problem running mpd as daemon with mvapich2 0.9.8

Rick Warner rick at microway.com
Wed Dec 20 17:05:50 EST 2006


On Wednesday 20 December 2006 15:50, Matthew Koop wrote:
> > [testm at master ~]$ mpirun -np 1 ./cpi
> > cannot create cq
> > Failed to Initialize HCA type
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread(230): Initialization failed
> > MPID_Init(81)........: channel initialization failed
> > (unknown)(): Other MPI error
> > rank 0 in job 1 master.cl.slac.stanford.edu_4268 caused collective abort of all ranks
> >   exit status of rank 0: return code 13
> >
> >
> > The strange thing is that running "service mpd stop" on the nodes,
> > then "service mpd restart" on the master and then "service mpd start" on
> > the nodes fixes the problem.  After restarting the mpd service, regular
> > users can successfully run jobs.  I've investigated the problem, but
> > can't seem to isolate it.  Any clues?
>
> Rick,
>
> The error about not being able to create the CQ suggests to me that when
> the MPD daemons are first started, the memory locking limits are not set
> high enough. My guess is that by the time you restart the daemons later,
> that limit has already been raised in your environment.
>
> Can you try adding 'ulimit -l unlimited' within your init script for MPD
> and see if that solves your issue?
>
> Thanks,
> Matt

That did it!  Thanks a lot!
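
For reference, the change amounts to something like the following near the
start of the init script, before mpd is launched. This is a minimal sketch,
assuming a bash init script that runs as root; raise_memlock is an
illustrative helper name and is not part of mpd or mvapich2.

```shell
#!/bin/bash
# Sketch: raise the locked-memory limit so that mpd, and the MPI processes
# it spawns, inherit it. raise_memlock is a hypothetical helper name.
raise_memlock() {
    # Raising the limit needs root or a high enough hard limit.
    # Warn instead of aborting if it cannot be raised.
    if ! ulimit -l unlimited 2>/dev/null; then
        echo "warning: could not raise memlock limit (now: $(ulimit -l))" >&2
    fi
    # Print the limit that child processes will actually inherit.
    ulimit -l
}

raise_memlock
```

Because the limit is inherited at process creation, it has to be raised in
the shell that starts mpd; changing it afterwards in a login shell does not
affect daemons that are already running.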

For anyone who's interested, I've attached the init script we use.  It reads
a file, /etc/sysconfig/mpd, containing two lines: MPDMODE=<master|slave> and
MPDMASTER=<hostname of cluster master>.  The attached script is for Red Hat
and Fedora, but it can easily be modified for other distros.  The one unique
thing about this script is that it prevents hangs on shutdown by forcibly
killing mpd if it doesn't terminate properly (we saw that problem with
mpich2; I don't know whether it occurs in mvapich2 as well).
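
Since the archive scrubbed the attachment, the forced-kill idea can be
sketched roughly as follows. This is not the attached script; it is a minimal
illustration assuming bash. The name stop_with_timeout, the 10-second
default, and the example hostname are all made up for the example.

```shell
#!/bin/bash
# Sketch only: not the attached script. stop_with_timeout is hypothetical.
#
# Example /etc/sysconfig/mpd contents (hostname is a placeholder):
#   MPDMODE=slave
#   MPDMASTER=master.example.com

stop_with_timeout() {
    local pid=$1 timeout=${2:-10} waited=0
    # Ask politely first; if the signal cannot be sent, there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$waited" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only checks that the process exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done
    # Still alive after the timeout: force it so shutdown cannot hang.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

An init script's stop action could call this with the mpd process ID, for
example stop_with_timeout "$pid" 10, where $pid comes from wherever the
script records it.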



-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpd.DEFANGED-2
Type: application/defanged-2
Size: 1943 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20061220/839ad470/mpd.bin
