[Mvapich-discuss] Diagnosing sporadic failures in mpirun_rsh
Martin Pokorny
mpokorny at nrao.edu
Thu Feb 11 11:09:35 EST 2021
Hello all,
I work on a system for real-time data processing using
mvapich2. This system has been in place for several years now, and
works fairly reliably. There are, however, instances when
mpirun_rsh appears to fail to start an MPI job. Given the
real-time, event-driven nature of the application, it is generally
very difficult to provide a reproducer, and this apparent
mpirun_rsh failure is no different. The cluster has only nine
nodes, and the MPI jobs generally comprise no more than ~90
processes. The error message that we see is the following:
[cbe:mpirun_rsh][child_handler] Error in init phase, aborting!
(1/87 mpispawn connections)
The current version of mvapich2 we're running in production is
v2.2, but we are able to reproduce this issue using v2.3.5 as
well. Note that this system runs continuously and, as I said,
fairly reliably, so I have no reason to believe that any of the
usual suspects, such as an incorrect hostfile or incorrect ssh
configuration, could be the cause of the problem. My hunch is that
we are occasionally hitting a performance or reliability issue in
the cluster configuration, its network (IB), or its shared file
systems (Lustre and NFS). My question is whether you have any
suggestions for tracking down the cause of these failures, or any
tweaks to mpirun_rsh that I ought to try. I'd be happy to hack
and/or rebuild mvapich2 however you suggest for testing purposes.
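In case it helps frame suggestions, this is the kind of launch-side
change I am able to try. It is only a sketch: the hostfile name,
process count, and binary below are placeholders for our setup, and
I am assuming the MV2_MPIRUN_TIMEOUT and MV2_DEBUG_SHOW_BACKTRACE
parameters described in the MVAPICH2 user guide apply to mpirun_rsh
launches like ours:

```shell
# Sketch only; "hosts" and "./a.out" stand in for our actual
# hostfile and application binary.
# MV2_MPIRUN_TIMEOUT (seconds) governs how long mpirun_rsh waits
# during job launch; raising it might reveal whether the failed
# mpispawn connections are a timeout symptom rather than a hard
# failure. MV2_DEBUG_SHOW_BACKTRACE asks for a backtrace on error.
MV2_MPIRUN_TIMEOUT=60 \
MV2_DEBUG_SHOW_BACKTRACE=1 \
mpirun_rsh -ssh -np 87 -hostfile hosts ./a.out
```

If there are other environment variables or build-time options that
would produce more detail from mpispawn during the init phase, I
would be glad to test those as well.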
--
Martin Pokorny
National Radio Astronomy Observatory
Socorro, NM