[mvapich-discuss] problem running mpd as daemon with mvapich2 0.9.8

Rick Warner rick at microway.com
Wed Dec 20 13:11:44 EST 2006


Hello,

When working with mpich2, we normally set up mpd as a daemon and set the 
environment variable MPD_USE_ROOT_MPD=1 so that all users share this daemon.  
This allows users of a dedicated cluster to run MPI jobs without worrying 
about mpdboot, etc.
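
As a minimal sketch of that setup, the variable can be exported for all users 
from a login script.  The file location shown in the comment is only an 
assumed example, not necessarily the one used on this cluster:

```shell
# Assumed location: /etc/profile.d/mpd.sh (hypothetical; any login script works)
# Tell the mpd client commands (mpirun, mpiexec, ...) to use the mpd started
# by root instead of a per-user mpd.
export MPD_USE_ROOT_MPD=1
```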

I have set up mvapich2 0.9.8 the same way, reusing the working initscripts 
from our regular mpich2 setup.  However, on a fresh boot, running a job as 
a user gives this:

[testm@master ~]$ mpirun -np 1 ./cpi
cannot create cq
Failed to Initialize HCA type
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(230): Initialization failed
MPID_Init(81)........: channel initialization failed
(unknown)(): Other MPI errorrank 0 in job 1  master.cl.slac.stanford.edu_4268   
caused collective abort of all ranks
  exit status of rank 0: return code 13


The strange thing is that running "service mpd stop" on the nodes, 
then "service mpd restart" on the master and then "service mpd start" on the 
nodes fixes the problem.  After restarting the mpd service, regular users can 
successfully run jobs.  I've investigated the problem, but can't seem to 
isolate it.  Any clues?
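
The restart sequence described above can be sketched as a small script run on 
the master.  The node names, the use of ssh to reach the nodes, and the RUN 
dry-run switch are assumptions for illustration; only the "service mpd" 
commands come from the description:

```shell
#!/bin/sh
# Sketch of the workaround order: stop on nodes, restart on master, start on nodes.
# NODES is a hypothetical list of compute node hostnames.
NODES="${NODES:-node1 node2}"
# RUN defaults to "echo", so the commands are only printed (dry run).
# Set RUN="" to actually execute them.
RUN="${RUN-echo}"

restart_mpd_ring() {
    # 1. Stop mpd on every compute node.
    for n in $NODES; do
        $RUN ssh "$n" service mpd stop
    done
    # 2. Restart mpd on the master (this script is assumed to run there).
    $RUN service mpd restart
    # 3. Start mpd on the compute nodes again.
    for n in $NODES; do
        $RUN ssh "$n" service mpd start
    done
}

restart_mpd_ring
```

Run as is, the script only prints the commands in the order they would be 
issued; running it with RUN="" and a real NODES list executes them.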

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517

