[mvapich-discuss] Question on how to debug job start failures

Craig Tierney Craig.Tierney at noaa.gov
Thu Jul 9 18:19:50 EDT 2009


Dhabaleswar Panda wrote:
> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
> tests using ~512 cores or larger?  This will help you find out whether
> there are any issues when launching jobs and isolate any nodes that might
> be having problems.
> 
> Thanks,
> 
> DK
> 

I dug in further today while the system was offline, and this
is what I found.  The mpispawn process is hanging, and when it
hangs it is on different nodes each time.  What I see is that
one side thinks the connection is closed, while the other side
keeps waiting.

At one end:

[root at h43 ~]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 h43:50797                   wms-sge:sge_qmaster         ESTABLISHED
tcp        0      0 h43:816                     jetsam1:nfs                 ESTABLISHED
tcp        0      0 h43:49730                   h6:56443                    ESTABLISHED
tcp    31245      0 h43:49730                   h4:41799                    CLOSE_WAIT
tcp        0      0 h43:ssh                     h1:35169                    ESTABLISHED
tcp        0      0 h43:ssh                     wfe7-eth2:51964             ESTABLISHED


(gdb) bt
#0  0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
#1  0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
#2  0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
#3  0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496

At the other end (node h4):

(gdb) bt
#0  0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
#1  0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
#2  0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525

The netstat on h4 does not show any connections back to h43.
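
So the mpispawn on h43 is blocked in read() inside read_socket()
(mpirun_util.c:97) waiting for 640 bytes of what looks like tree-setup
data, while the mpispawn on h4 sits in select() in the PMI loop and
never sends them.  I have not traced read_socket() itself, but I assume
it is the usual full-read loop, roughly like the sketch below; if so,
it will block forever as long as the peer keeps the socket open and
stays silent, which matches the two backtraces:

/* My sketch of a full-read helper along the lines of read_socket()
 * in mpirun_util.c -- not the actual mvapich source. */
#include <errno.h>
#include <unistd.h>

static int read_full(int fd, void *buffer, size_t bytes)
{
    char  *p = buffer;
    size_t remaining = bytes;

    while (remaining > 0) {
        ssize_t n = read(fd, p, remaining);
        if (n > 0) {            /* partial read: keep collecting */
            p += n;
            remaining -= n;
        } else if (n == 0) {    /* peer closed before sending it all */
            return -1;
        } else if (errno != EINTR) {
            return -1;          /* real error; EINTR is just retried */
        }
        /* If the peer holds the socket open but never writes, read()
         * blocks right here -- the __read_nocancel frame in the bt. */
    }
    return 0;
}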

I also tried the latest 1.4Beta from the website (not svn).  For
large jobs, mpirun_rsh will sometimes exit without running anything,
and the larger the job, the more likely it is to fail to start
properly.  The only difference is that it doesn't hang.  I turned on
debugging with MPISPAWN_DEBUG, but I didn't see anything interesting
from that.
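
On your suggestion of trying a simple MPI program at ~512 cores or
larger: the kind of thing I would use is just rank-plus-hostname, so
any node that never starts a process stands out in the output.
Something like this, launched with mpirun_rsh the same way as the
xhpl runs:

/* Minimal launch smoke test: every rank reports where it is running,
 * so nodes that never start a process are easy to spot. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int  rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}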

Craig




> On Wed, 8 Jul 2009, Craig Tierney wrote:
> 
>> I am running mvapich2 1.2, built with Ofed support (v1.3.1).
>> For large jobs, I am having problems where they do not start.
>> I am using the mpirun_rsh launcher.  When I try to start jobs
>> with ~512 cores or larger, I can see the problem.  The problem
>> doesn't happen all the time.
>>
>> I can't rule out quirky hardware.  The IB tree seems to be
>> clean (as reported by ibdiagnet).  My last hang, I looked to
>> see if xhpl had started on all the nodes (8 processes per
>> node on the dual-socket quad-core systems).  I found that 7 of
>> the 245 nodes (1960 core job) had no xhpl processes on them.
>> So either the launching mechanism hung, or something was up with one of
>> those nodes.
>>
>> My question is, how should I start debugging this to understand
>> what process is hanging?
>>
>> Thanks,
>> Craig
>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
> 
> 


-- 
Craig Tierney (craig.tierney at noaa.gov)

