[mvapich-discuss] problem running mpd as daemon with mvapich2 0.9.8
Rick Warner
rick at microway.com
Wed Dec 20 13:11:44 EST 2006
Hello,
When working with mpich2, we normally set up mpd as a daemon and use the
environment variable MPD_USE_ROOT_MPD=1 for users to all share this daemon.
This allows users of a dedicated cluster to run mpi jobs without worrying
about mpdboot, etc.
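
For reference, a minimal sketch of the user-side setting described above. The variable name comes from the message itself; putting it in a system-wide profile snippet such as /etc/profile.d/mpd.sh is only an assumption for illustration, not something the post specifies.

```shell
# Assumption: sourced from a system-wide profile snippet
# (e.g. /etc/profile.d/mpd.sh, a hypothetical path) so every
# login shell picks it up.
# Tells the mpd client tools to use the root-owned mpd daemon
# instead of expecting a per-user ring started with mpdboot.
export MPD_USE_ROOT_MPD=1
```

With this set, a user can run "mpirun -np 1 ./cpi" directly, as in the session shown below.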
I have set up this same method for mvapich2 0.9.8, using the same working
initscripts from our regular mpich2 setup. However, on a fresh boot,
running a job as a user gives this:
[testm@master ~]$ mpirun -np 1 ./cpi
cannot create cq
Failed to Initialize HCA type
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(230): Initialization failed
MPID_Init(81)........: channel initialization failed
(unknown)(): Other MPI error
rank 0 in job 1 master.cl.slac.stanford.edu_4268 caused collective abort of all ranks
exit status of rank 0: return code 13
The strange thing is that running "service mpd stop" on the nodes,
then "service mpd restart" on the master and then "service mpd start" on the
nodes fixes the problem. After restarting the mpd service, regular users can
successfully run jobs. I've investigated the problem but haven't been able to
isolate the cause. Any clues?
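
The restart order described above can be sketched as a small script run on the master. This is only an illustration under stated assumptions: the node names are hypothetical, reaching the nodes over ssh is an assumption (the message does not say how the commands were run on them), and by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical compute node names; replace with the real ones.
NODES="${NODES:-node1 node2}"

# Dry run by default: RUN=echo only prints each command.
# Set RUN="" in the environment to actually execute them.
RUN="${RUN-echo}"

restart_mpd_ring() {
    # 1. Stop mpd on every compute node (ssh access is assumed).
    for n in $NODES; do
        $RUN ssh "$n" service mpd stop
    done

    # 2. Restart mpd on the master, where this script runs.
    $RUN service mpd restart

    # 3. Start mpd again on every compute node.
    for n in $NODES; do
        $RUN ssh "$n" service mpd start
    done
}

restart_mpd_ring
```

This reproduces only the workaround; it does not explain why the ring started at boot fails to initialize the HCA.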
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517