[mvapich-discuss] Question on how to debug job start failures

Craig Tierney Craig.Tierney at noaa.gov
Wed Jul 8 18:21:32 EDT 2009


I am running mvapich2 1.2, built with OFED support (v1.3.1).
With large jobs, I am having problems where they fail to start.
I am using the mpirun_rsh launcher.  The problem shows up when I
try to start jobs with ~512 cores or more, but it doesn't happen
every time.

I can't rule out quirky hardware.  The IB tree seems to be
clean (as reported by ibdiagnet).  During my last hang, I checked
whether xhpl had started on all the nodes (there should be 8
processes per node on these dual-socket quad-core systems).  I found
that 7 of the 245 nodes (1960-core job) had no xhpl processes on them.
So either the launching mechanism hung, or something was wrong with
one of those nodes.
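
For reference, the per-node check described above could be scripted
roughly as follows.  This is only a sketch under some assumptions that
are not part of my actual setup: passwordless ssh to every node, pgrep
available on the nodes, and a hostfile with one hostname per line
(repeats allowed).  The function names and the limit of 32 parallel
ssh sessions are made up for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

EXPECTED_PER_NODE = 8  # dual-socket quad-core, one xhpl process per core


def read_hosts(lines):
    """Return unique hostnames in first-seen order, skipping blanks and comments."""
    hosts = (line.strip() for line in lines)
    return list(dict.fromkeys(h for h in hosts if h and not h.startswith("#")))


def count_procs(host, name="xhpl", timeout=15):
    """Count processes named `name` on `host`; None if the node cannot be reached."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "pgrep", "-x", name],
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode == 255:  # ssh itself failed (pgrep returns 0 or 1)
        return None
    # pgrep prints one PID per line; no output means no matching process
    return len(result.stdout.split())


def find_short_nodes(counts, expected=EXPECTED_PER_NODE):
    """Return {host: count} for hosts with fewer than `expected` processes."""
    return {h: c for h, c in counts.items() if c is None or c < expected}


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        hosts = read_hosts(f)
    # Query nodes in parallel so one hung node does not stall the whole scan
    with ThreadPoolExecutor(max_workers=32) as pool:
        counts = dict(zip(hosts, pool.map(count_procs, hosts)))
    for host, count in sorted(find_short_nodes(counts).items()):
        status = "unreachable" if count is None else f"{count} xhpl processes"
        print(f"{host}: {status}")
```

Run against the job's hostfile while the job is hung, this lists the
nodes that are unreachable or have fewer than 8 xhpl processes.  It
only narrows down where to look; it does not say whether the launcher
or the node is at fault.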

My question is, how should I start debugging this to find out
which process is hanging?

Thanks,
Craig


-- 
Craig Tierney (craig.tierney at noaa.gov)

