[mvapich-discuss] Question on how to debug job start failures

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Aug 10 13:53:41 EDT 2009


Craig,

> I have determined that my problem was related to hardware that
> was dropping or corrupting data that was being sent between mpispawn
> processes for jobs larger than 512 cores (64 nodes).  The magic
> number may have to do with the fact that transfers started to exceed
> 8192 bytes, which is the MTU setting on our network.

Thanks for the insights here.
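
In case it helps anyone reproduce this kind of problem outside of MPI: one
quick check is to push TCP payloads of various sizes (straddling the suspect
MTU boundary) between two nodes and checksum them on the far side. Below is a
minimal sketch -- the port and sizes are just examples, not anything from
Craig's setup. Drops would typically show up as the test stalling; corruption
would show up as a MISMATCH line.

    #!/usr/bin/env python
    # mtu_probe.py -- send TCP payloads that straddle a suspect MTU boundary
    # and verify them with a checksum on the receiving side.
    # Run "python mtu_probe.py server" on one node and
    # "python mtu_probe.py client <server-host>" on another.
    import hashlib, socket, struct, sys

    PORT = 9999                                       # example test port
    SIZES = [4096, 8000, 8192, 8193, 9000, 16384, 65536]  # around an 8192-byte MTU

    def serve():
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, peer = srv.accept()
        while True:
            hdr = conn.recv(4, socket.MSG_WAITALL)    # 4-byte length header
            if len(hdr) < 4:
                break
            (size,) = struct.unpack("!I", hdr)
            data = b""
            while len(data) < size:
                chunk = conn.recv(size - len(data))
                if not chunk:
                    break
                data += chunk
            conn.sendall(hashlib.md5(data).digest())  # echo a digest back
        conn.close()

    def client(host):
        s = socket.create_connection((host, PORT))
        for size in SIZES:
            payload = bytes(bytearray(i % 251 for i in range(size)))
            s.sendall(struct.pack("!I", size) + payload)
            remote = s.recv(16, socket.MSG_WAITALL)
            ok = (remote == hashlib.md5(payload).digest())
            print("%6d bytes: %s" % (size, "OK" if ok else "MISMATCH"))
        s.close()

    if __name__ == "__main__":
        client(sys.argv[2]) if sys.argv[1] == "client" else serve()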

> We are working with SMC to find a solution to the problem.  For now,
> we have hacked mpirun_rsh to launch the jobs over the IB fabric.

Glad to know that things are working now.
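
For anyone else who hits this and needs a stopgap without patching
mpirun_rsh: if IPoIB is configured on the nodes, it may be enough simply to
list the IPoIB hostnames (or addresses) in the host file, so that the ssh
connections and the mpispawn tree traffic ride over the IB fabric rather than
the GigE network. Something along these lines, with made-up names:

    h1-ib0
    h2-ib0
    h3-ib0
    ...

Whether this fully substitutes for Craig's change depends on the spawn tree
actually using the host-file names for its connections, so treat it as
something to try rather than a guaranteed fix.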

Thanks,

DK
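
P.S. For others who land on this thread with the original question below (how
to tell what a hung mpispawn is blocked on): attaching gdb, as Craig did, is
the most direct route. A quick complementary check is to map the process's
socket descriptors to their peer addresses through /proc, which tells you
which remote mpispawn a blocked read() is actually waiting on. A rough,
Linux-only, IPv4-only sketch (this helper is illustrative, not part of
MVAPICH):

    #!/usr/bin/env python
    # sockpeers.py <pid> -- list a process's TCP sockets with local/peer
    # address and state, by joining /proc/<pid>/fd against /proc/net/tcp.
    # Run it on the node with the hung mpispawn, as root or as the job owner.
    import os, re, sys

    STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT",
              "08": "CLOSE_WAIT", "0A": "LISTEN"}

    def hexaddr(h):
        # /proc/net/tcp stores "AABBCCDD:PPPP" with the IPv4 address in
        # little-endian hex, so the octets come out reversed.
        ip, port = h.split(":")
        octets = [str(int(ip[i:i + 2], 16)) for i in (6, 4, 2, 0)]
        return "%s:%d" % (".".join(octets), int(port, 16))

    def tcp_table():
        table = {}
        with open("/proc/net/tcp") as f:
            for line in f.readlines()[1:]:
                parts = line.split()
                # columns: sl local_addr rem_addr st ... uid timeout inode
                table[parts[9]] = (hexaddr(parts[1]), hexaddr(parts[2]),
                                   STATES.get(parts[3], parts[3]))
        return table

    def main(pid):
        table = tcp_table()
        fddir = "/proc/%s/fd" % pid
        for fd in sorted(os.listdir(fddir), key=int):
            try:
                target = os.readlink(os.path.join(fddir, fd))
            except OSError:
                continue                      # fd went away; ignore it
            m = re.match(r"socket:\[(\d+)\]", target)
            if m and m.group(1) in table:
                local, remote, state = table[m.group(1)]
                print("fd %-3s %-21s -> %-21s %s" % (fd, local, remote, state))

    if __name__ == "__main__":
        main(sys.argv[1])

Running it against the pid from the gdb session shows which peer the fd in the
read_socket() backtrace corresponds to, which can then be matched against the
netstat output on the other node.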

> Thanks for the help,
> Craig
>
>
> Dhabaleswar Panda wrote:
> > Craig,
> >
> >> A follow-up to my problem.  On the new Nehalem cluster (QDR, Centos 5.3,
> >> OFED-1.4.1, Mvapich-1.2p1), I am still having applications hang when using
> >> mpirun_rsh.  The problem seems to start around 512 cores, but it isn't exact.
> >> Not sure if this helps, but Open MPI does not have this issue (though I know
> >> it has a completely different launching mechanism).
> >
> > Does this happen with OFED 1.4? As you might have seen from the OFA
> > mailing lists, there have been some issues related to NFS traffic with
> > OFED 1.4.1.
> >
> >> The one similarity is that both systems use SMC TigerSwitch GigE switches
> >> within the racks that uplink to a Force10 GigE switch (although the behavior
> >> was the same when the core switch was a Cisco unit).
> >>
> >> I have tried messing with MV2_MT_DEGREE.  Setting it low (to 4) seems to help
> >> large jobs start, but it does not solve the problem.
> >
> > This is good to know. What happens if you reduce MV2_MT_DEGREE to 2? The
> > job start-up might be slower. However, we need to see whether it is able
> > to start the large-scale jobs.
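> >
> > For example (assuming the usual mpirun_rsh convention of setting MVAPICH2
> > environment variables on the command line; the host file and executable
> > below are placeholders):
> >
> >   mpirun_rsh -np 1024 -hostfile ./hosts MV2_MT_DEGREE=2 ./xhpl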
> >
> >> So the problem could be hardware or a race condition in the software.
> >> Any ideas on how to debug the software side (or both) would be appreciated.
> >
> > Thanks,
> >
> > DK
> >
> >> Thanks,
> >> Craig
> >>
> >>
> >>
> >>>
> >>>
> >>>> Thanks,
> >>>>
> >>>> DK
> >>>>
> >>>>
> >>>>
> >>>> On Thu, 9 Jul 2009, Craig Tierney wrote:
> >>>>
> >>>>> Dhabaleswar Panda wrote:
> >>>>>> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
> >>>>>> tests using ~512 cores or larger. This will help you to find out whether
> >>>>>> there are any issues when launching jobs and isolate any nodes which might
> >>>>>> be having problems.
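> >>>>>>
> >>>>>> For instance, something as small as the following is enough (shown with
> >>>>>> mpi4py purely for brevity -- an illustrative sketch; a plain C
> >>>>>> MPI_Init/MPI_Get_processor_name program works just as well):
> >>>>>>
> >>>>>>     # hello.py -- print rank and host so missing nodes stand out
> >>>>>>     from mpi4py import MPI
> >>>>>>     import socket
> >>>>>>     comm = MPI.COMM_WORLD
> >>>>>>     print("rank %d of %d on %s"
> >>>>>>           % (comm.Get_rank(), comm.Get_size(), socket.gethostname()))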
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> DK
> >>>>>>
> >>>>> I dug in further today while the system was offline, and this
> >>>>> is what I found.  The mpispawn process is hanging.  When it hangs,
> >>>>> it hangs on different nodes each time.  What I see is that
> >>>>> one side thinks the connection is closed, and the other side waits.
> >>>>>
> >>>>> At one end:
> >>>>>
> >>>>> [root at h43 ~]# netstat
> >>>>> Active Internet connections (w/o servers)
> >>>>> Proto Recv-Q Send-Q Local Address               Foreign Address             State
> >>>>> tcp        0      0 h43:50797                   wms-sge:sge_qmaster         ESTABLISHED
> >>>>> tcp        0      0 h43:816                     jetsam1:nfs                 ESTABLISHED
> >>>>> tcp        0      0 h43:49730                   h6:56443                    ESTABLISHED
> >>>>> tcp    31245      0 h43:49730                   h4:41799                    CLOSE_WAIT
> >>>>> tcp        0      0 h43:ssh                     h1:35169                    ESTABLISHED
> >>>>> tcp        0      0 h43:ssh                     wfe7-eth2:51964             ESTABLISHED
> >>>>>
> >>>>>
> >>>>> (gdb) bt
> >>>>> #0  0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
> >>>>> #1  0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
> >>>>> #2  0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
> >>>>> #3  0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
> >>>>>
> >>>>> At other end (node h4):
> >>>>>
> >>>>> (gdb) bt
> >>>>> #0  0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
> >>>>> #1  0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
> >>>>> #2  0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
> >>>>>
> >>>>> The netstat on h4 does not show any connections back to h43.
> >>>>>
> >>>>> I tried the latest 1.4Beta from the website (not svn), and found that
> >>>>> for large jobs mpirun_rsh will sometimes exit without running anything.
> >>>>> The larger the job, the more likely it is to fail to start properly.
> >>>>> The only difference is that it doesn't hang.  I turned on debugging with
> >>>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
> >>>>>
> >>>>> Craig
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
> >>>>>>
> >>>>>>> I am running mvapich2 1.2, built with OFED support (v1.3.1).
> >>>>>>> For large jobs, I am having problems where they do not start.
> >>>>>>> I am using the mpirun_rsh launcher.  When I try to start jobs
> >>>>>>> with ~512 cores or larger, I can see the problem.  The problem
> >>>>>>> doesn't happen all the time.
> >>>>>>>
> >>>>>>> I can't rule out quirky hardware.  The IB tree seems to be
> >>>>>>> clean (as reported by ibdiagnet).  During my last hang, I looked to
> >>>>>>> see whether xhpl had started on all the nodes (8 processes per
> >>>>>>> node on dual-socket quad-core systems).  I found that 7 of
> >>>>>>> the 245 nodes (a 1960-core job) had no xhpl processes on them.
> >>>>>>> So either the launching mechanism hung, or something was up with one of
> >>>>>>> those nodes.
> >>>>>>>
> >>>>>>> My question is, how should I start debugging this to understand
> >>>>>>> what process is hanging?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Craig
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Craig Tierney (craig.tierney at noaa.gov)
> >>>>>>>
> >>>>> --
> >>>>> Craig Tierney (craig.tierney at noaa.gov)
> >>>>>
> >>>>
> >>>
> >>
> >> --
> >> Craig Tierney (craig.tierney at noaa.gov)
> >>
> >
> >
>
>
> --
> Craig Tierney (craig.tierney at noaa.gov)
>


