[mvapich-discuss] Question on how to debug job start failures

Craig Tierney Craig.Tierney at noaa.gov
Thu Jul 9 13:05:07 EDT 2009


Dhabaleswar Panda wrote:
> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
> tests on ~512 cores or more? This will help you find out whether
> there are any issues when launching jobs and isolate any nodes which might
> be having problems.
> 
> Thanks,
> 

I have been using HPL to test, but I have also used IMB and a user code.
I can't say for certain that 512 cores is the cut-off for the problem, but
the user who gets bitten most often runs on about 512 cores.  If it happened
more often, I am sure more users would complain.

I have used HPL to search for bad hardware.  It has been a good technique
in the past and I have used it to bring up several clusters.  This problem
seems so random that I hoped to do something better.  For example, I have
seen cases where all the processes have started, but the code doesn't get out
of MPI_Init.  In this case, I was wondering if there was a way to debug
one (or all) of the processes and see which process hadn't responded yet.
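
To make the idea concrete, here is a rough sketch of the kind of check I
have in mind.  It assumes a test program that writes one unbuffered line to
stderr just before MPI_Init and another just after it returns.  The
PRE_INIT/POST_INIT tags, the "tag host pid" line format, and the function
names are all hypothetical; nothing here comes from mvapich2 or mpirun_rsh.
The script only compares the two sets of markers, so it can point at the
hosts and pids worth attaching a debugger to, not at the cause of the hang.

```python
def parse_markers(lines):
    """Yield (tag, host, pid) from lines in the assumed 'TAG host pid' format."""
    for line in lines:
        fields = line.split()
        # Skip anything that is not one of our (hypothetical) marker lines.
        if len(fields) >= 3 and fields[0] in ("PRE_INIT", "POST_INIT"):
            yield fields[0], fields[1], fields[2]


def find_stuck_processes(lines):
    """Return sorted (host, pid) pairs that entered MPI_Init but never left it."""
    entered = set()
    finished = set()
    for tag, host, pid in parse_markers(lines):
        if tag == "PRE_INIT":
            entered.add((host, pid))
        else:
            finished.add((host, pid))
    return sorted(entered - finished)


def find_missing_hosts(lines, expected_hosts):
    """Return expected hosts that never logged PRE_INIT (nothing launched there)."""
    seen = {host for tag, host, _ in parse_markers(lines) if tag == "PRE_INIT"}
    return [host for host in expected_hosts if host not in seen]


if __name__ == "__main__":
    # Made-up log lines for illustration only.
    sample_log = [
        "PRE_INIT n001 4101",
        "POST_INIT n001 4101",
        "PRE_INIT n002 5210",
        "some unrelated output line",
    ]
    expected = ["n001", "n002", "n003"]

    for host, pid in find_stuck_processes(sample_log):
        print("still in MPI_Init: host=%s pid=%s" % (host, pid))
    for host in find_missing_hosts(sample_log, expected):
        print("no process started: host=%s" % host)
```

A host with no PRE_INIT line at all would match the earlier hang where some
nodes had no xhpl processes; a process with PRE_INIT but no POST_INIT would
match the case where everything started but never left MPI_Init.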

Craig




> DK
> 
> On Wed, 8 Jul 2009, Craig Tierney wrote:
> 
>> I am running mvapich2 1.2, built with OFED support (v1.3.1).
>> I am having problems where large jobs do not start.
>> I am using the mpirun_rsh launcher.  When I try to start jobs
>> with ~512 cores or larger, I can see the problem.  The problem
>> doesn't happen all the time.
>>
>> I can't rule out quirky hardware.  The IB tree seems to be
>> clean (as reported by ibdiagnet).  During my last hang, I looked to
>> see if xhpl had started on all the nodes (8 processes for each
>> node on dual-socket quad-core systems).  I found that 7 of
>> the 245 nodes (1960-core job) had no xhpl processes on them.
>> So either the launching mechanism hung, or something was wrong with one of
>> those nodes.
>>
>> My question is, how should I start debugging this to understand
>> what process is hanging?
>>
>> Thanks,
>> Craig
>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
>>
> 
> 


-- 
Craig Tierney (craig.tierney at noaa.gov)

