[mvapich-discuss] Question on how to debug job start failures
Craig Tierney
Craig.Tierney at noaa.gov
Mon Jul 27 16:19:28 EDT 2009
Dhabaleswar Panda wrote:
> Craig,
>
>> A follow-up to my problem. On the new Nehalem cluster (QDR, Centos 5.3,
>> OFED-1.4.1, Mvapich-1.2p1), I am still having applications hang when using
>> mpirun_rsh. The problem seems to start around 512 cores, but the threshold isn't exact.
>> Not sure if this helps, but Open MPI does not have this issue (though I know it has
>> a completely different launching mechanism).
>
> Does this happen with OFED 1.4? As you might have seen from the OFA
> mailing lists, there have been some issues related to NFS traffic with
> OFED 1.4.1.
>
I haven't tested with OFED 1.4, but the other system (which generated
the original post) is running OFED 1.3.1.
>> The one similarity is that both systems are using SMC TigerSwitch GigE switches
>> within the racks, uplinked to a Force10 GigE switch (although the behavior
>> was repeated when the core switch was a Cisco unit).
>>
>> I have tried messing with MV2_MT_DEGREE. Setting it low (4) seems to help
>> large jobs start, but it does not solve the problem.
>
> This is good to know. What happens if you reduce MV2_MT_DEGREE to 2? The
> job start-up might be slower. However, we need to see whether it is able
> to start the large-scale jobs.
>
I will try it. For some reason I thought 4 was the smallest value.
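For reference, mpirun_rsh accepts runtime parameters as VAR=VALUE pairs before the executable; a minimal sketch, in which the hostfile path and binary name are placeholders:

```shell
# Sketch only: ./hosts and ./xhpl are placeholder names.
# MV2_MT_DEGREE controls the fan-out of the mpispawn startup tree;
# a smaller degree means a deeper but narrower tree.
mpirun_rsh -np 512 -hostfile ./hosts MV2_MT_DEGREE=2 ./xhpl
```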
Craig
>> So the problem could be hardware or a race condition in the software.
>> Any ideas on how to debug the software side (or both) would be appreciated.
>
> Thanks,
>
> DK
>
>> Thanks,
>> Craig
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>>> On Thu, 9 Jul 2009, Craig Tierney wrote:
>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
>>>>>> tests using ~512 cores or larger. This will help you to find out whether
>>>>>> there are any issues when launching jobs and isolate any nodes which might
>>>>>> be having problems.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> DK
>>>>>>
>>>>> I dug in further today while the system was offline, and this
>>>>> is what I found. The mpispawn process is hanging. When it hangs,
>>>>> it hangs on different nodes each time. What I see is that
>>>>> one side thinks the connection is closed, and the other side waits.
>>>>>
>>>>> At one end:
>>>>>
>>>>> [root at h43 ~]# netstat
>>>>> Active Internet connections (w/o servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address State
>>>>> tcp 0 0 h43:50797 wms-sge:sge_qmaster ESTABLISHED
>>>>> tcp 0 0 h43:816 jetsam1:nfs ESTABLISHED
>>>>> tcp 0 0 h43:49730 h6:56443 ESTABLISHED
>>>>> tcp 31245 0 h43:49730 h4:41799 CLOSE_WAIT
>>>>> tcp 0 0 h43:ssh h1:35169 ESTABLISHED
>>>>> tcp 0 0 h43:ssh wfe7-eth2:51964 ESTABLISHED
>>>>>
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>>>>> #1 0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>>>>> #2 0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>>>>> #3 0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>>>>
>>>>> At other end (node h4):
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>>>>> #1 0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>>>>> #2 0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>>>>
>>>>> The netstat on h4 does not show any connections back to h43.
>>>>>
>>>>> I tried the latest 1.4 beta from the website (not svn) and found that
>>>>> for large jobs mpirun_rsh will sometimes exit without running anything.
>>>>> The larger the job, the more likely it is not to start properly.
>>>>> The only difference is that it doesn't hang. I turned on debugging with
>>>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>>>>
>>>>> Craig
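Backtraces like the ones quoted above can be gathered non-interactively; a minimal sketch, assuming gdb and pgrep are installed on the compute nodes ("h43" is a placeholder for whichever node holds a hung mpispawn):

```shell
# Sketch: capture a backtrace from a hung mpispawn on a remote node.
# -batch exits gdb after running the commands; -ex bt prints the stack.
ssh h43 'gdb -batch -ex bt -p "$(pgrep -o mpispawn)"'
```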
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>>>>
>>>>>>> I am running mvapich2 1.2, built with Ofed support (v1.3.1).
>>>>>>> For large jobs, I am having problems where they do not start.
>>>>>>> I am using the mpirun_rsh launcher. When I try to start jobs
>>>>>>> with ~512 cores or larger, I can see the problem. The problem
>>>>>>> doesn't happen all the time.
>>>>>>>
>>>>>>> I can't rule out quirky hardware. The IB tree seems to be
>>>>>>> clean (as reported by ibdiagnet). My last hang, I looked to
>>>>>>> see if xhpl had started on all the nodes (8 processes for each
>>>>>>> node on the dual-socket quad-core systems). I found that 7 of
>>>>>>> the 245 nodes (1960-core job) had no xhpl processes on them.
>>>>>>> So either the launching mechanism hung, or something was up with one of
>>>>>>> those nodes.
>>>>>>>
>>>>>>> My question is, how should I start debugging this to understand
>>>>>>> what process is hanging?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Craig
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>
>>>>> --
>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>>
>>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
>
>
--
Craig Tierney (craig.tierney at noaa.gov)