[mvapich-discuss] Question on how to debug job start failures

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Jul 10 09:46:46 EDT 2009


Craig - Could you please tell us a little more about the details of your
system: node configuration (sockets and cores per socket, processor type),
OS version, etc.? What kind of Ethernet connectivity does your system have?
FYI, the mpirun_rsh framework launches the job using standard TCP/IP
calls.
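
To put that in concrete terms, here is a minimal sketch (not the actual
mpirun_rsh source; the names are illustrative) of the ordinary BSD-socket
connection setup that this kind of TCP/IP launch path depends on, which is
why a flaky Ethernet link, a firewall rule, or name-resolution trouble on
even one node can show up as a startup hang or failure:

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netdb.h>
  #include <string.h>
  #include <unistd.h>

  /* Illustrative only: open a TCP connection to a launcher daemon on a
   * remote node.  mpirun_rsh/mpispawn rely on standard calls like these,
   * so a node that cannot be reached over Ethernet can stall startup. */
  static int connect_to_node(const char *host, const char *port)
  {
      struct addrinfo hints, *res, *ai;
      int fd = -1;

      memset(&hints, 0, sizeof(hints));
      hints.ai_family   = AF_UNSPEC;
      hints.ai_socktype = SOCK_STREAM;

      if (getaddrinfo(host, port, &hints, &res) != 0)
          return -1;

      for (ai = res; ai != NULL; ai = ai->ai_next) {
          fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
          if (fd < 0)
              continue;
          if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
              break;              /* connected */
          close(fd);
          fd = -1;
      }
      freeaddrinfo(res);
      return fd;                  /* -1 if no address worked */
  }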

Thanks,

DK



On Thu, 9 Jul 2009, Craig Tierney wrote:

> Dhabaleswar Panda wrote:
> > Are you able to run simple MPI programs (say, MPI Hello World) or some IMB
> > tests using ~512 cores or more? This will help you find out whether there
> > are any issues when launching jobs and isolate any nodes which might be
> > having problems.
> >
> > Thanks,
> >
> > DK
> >
>
> I dug in further today while the system was offline, and this
> is what I found.  The mpispawn process is hanging.  When it hangs,
> it hangs on different nodes each time.  What I see is that
> one side thinks the connection is closed, and the other side waits.
>
> At one end:
>
> [root at h43 ~]# netstat
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address               Foreign Address             State
> tcp        0      0 h43:50797                   wms-sge:sge_qmaster         ESTABLISHED
> tcp        0      0 h43:816                     jetsam1:nfs                 ESTABLISHED
> tcp        0      0 h43:49730                   h6:56443                    ESTABLISHED
> tcp    31245      0 h43:49730                   h4:41799                    CLOSE_WAIT
> tcp        0      0 h43:ssh                     h1:35169                    ESTABLISHED
> tcp        0      0 h43:ssh                     wfe7-eth2:51964             ESTABLISHED
>
>
> (gdb) bt
> #0  0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
> #1  0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
> #2  0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
> #3  0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>
> At other end (node h4):
>
> (gdb) bt
> #0  0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
> #1  0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
> #2  0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>
> The netstat on h4 does not show any connections back to h43.
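>
> (To spell out how I read these backtraces -- this is just a sketch of
> the pattern, not the actual MVAPICH2 code: read_socket() in
> mpirun_util.c looks like a "read exactly N bytes" loop, so if the peer
> never sends the rest of its message the caller sits in __read_nocancel
> forever, while that peer is parked in its select() loop in pmi_tree.c
> waiting for traffic that never comes.)
>
>   #include <sys/types.h>
>   #include <unistd.h>
>
>   /* Sketch only, not the MVAPICH2 source: a blocking "read exactly
>    * n bytes" helper.  If the sender stops short of n bytes, read()
>    * simply blocks, which matches the h43 stack stuck in
>    * __read_nocancel. */
>   static ssize_t read_exact(int fd, void *buf, size_t n)
>   {
>       size_t got = 0;
>       while (got < n) {
>           ssize_t r = read(fd, (char *)buf + got, n - got);
>           if (r < 0)  return -1;      /* error */
>           if (r == 0) return got;     /* peer closed early */
>           got += (size_t)r;           /* otherwise keep waiting */
>       }
>       return (ssize_t)got;
>   }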
>
> I tried the latest 1.4Beta from the website (not svn) and found that
> for large jobs mpirun_rsh will sometimes exit without running anything.
> The larger the job, the more likely it is not to start properly.
> The only difference is that it doesn't hang.  I turned on debugging with
> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>
> Craig
>
>
>
>
> > On Wed, 8 Jul 2009, Craig Tierney wrote:
> >
> >> I am running mvapich2 1.2, built with OFED support (v1.3.1).
> >> For large jobs, I am having problems where they do not start.
> >> I am using the mpirun_rsh launcher.  When I try to start jobs
> >> with ~512 cores or larger, I can see the problem.  The problem
> >> doesn't happen all the time.
> >>
> >> I can't rule out quirky hardware.  The IB tree seems to be
> >> clean (as reported by ibdiagnet).  On my last hang, I looked to
> >> see if xhpl had started on all the nodes (8 processes per
> >> node for the dual-socket quad-core systems).  I found that 7 of
> >> the 245 nodes (1960-core job) had no xhpl processes on them.
> >> So either the launching mechanism hung, or something was up with one of
> >> those nodes.
> >>
> >> My question is, how should I start debugging this to understand
> >> which process is hanging?
> >>
> >> Thanks,
> >> Craig
> >>
> >>
> >> --
> >> Craig Tierney (craig.tierney at noaa.gov)
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
>
>
> --
> Craig Tierney (craig.tierney at noaa.gov)
>


