[Mvapich-discuss] Diagnosing sporadic failures in mpirun_rsh
Martin Pokorny
mpokorny at nrao.edu
Thu Feb 11 11:09:35 EST 2021
Hello all,
I work on a system for real-time data processing using
mvapich2. This system has been in place for several years now, and
works fairly reliably. There are, however, instances when
mpirun_rsh appears to fail to start an MPI job. Given the
real-time, event-driven nature of the application, it is generally
very difficult to provide a reproducer, and this apparent
mpirun_rsh failure is no different. The cluster has only nine
nodes, and the MPI jobs generally comprise no more than ~90
processes. The error message that we see is the following:
[cbe:mpirun_rsh][child_handler] Error in init phase, aborting!
(1/87 mpispawn connections)
The current version of mvapich2 we're running in production is
v2.2, but we are able to reproduce this issue using v2.3.5 as
well. Note that this system runs continuously and, as I said,
fairly reliably, so I have no reason to believe that any of the
usual suspects, such as an incorrect hostfile or incorrect ssh
configuration, could be the cause of the problem. My hunch is that
we are occasionally hitting a performance or reliability issue in
the cluster configuration, its network (IB), or its shared file
systems (Lustre and NFS). My question is whether you have any
suggestions for tracking down the cause of these failures, or any
tweaks to mpirun_rsh that I ought to try. I'd be happy to hack
and/or rebuild mvapich2 however you suggest for testing purposes.
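In case it helps frame suggestions, this is the kind of launch-side
change I am able to try. It is only a sketch: the hostfile name,
process count, and binary below are placeholders for our setup, and
I am assuming the MV2_MPIRUN_TIMEOUT and MV2_DEBUG_SHOW_BACKTRACE
parameters described in the MVAPICH2 user guide apply to mpirun_rsh
launches like ours:

```shell
# Sketch only; "hosts" and "./a.out" stand in for our actual
# hostfile and application binary.
# MV2_MPIRUN_TIMEOUT (seconds) governs how long mpirun_rsh waits
# during job launch; raising it might reveal whether the failed
# mpispawn connections are a timeout symptom rather than a hard
# failure. MV2_DEBUG_SHOW_BACKTRACE asks for a backtrace on error.
MV2_MPIRUN_TIMEOUT=60 \
MV2_DEBUG_SHOW_BACKTRACE=1 \
mpirun_rsh -ssh -np 87 -hostfile hosts ./a.out
```

If there are other environment variables or build-time options that
would produce more detail from mpispawn during the init phase, I
would be glad to test those as well.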
--
Martin Pokorny
National Radio Astronomy Observatory
Socorro, NM