[mvapich-discuss] Question on how to debug job start failures
Craig Tierney
Craig.Tierney at noaa.gov
Mon Jul 27 16:19:28 EDT 2009
Dhabaleswar Panda wrote:
> Craig,
>
>> A follow-up to my problem. On the new Nehalem cluster (QDR, Centos 5.3,
>> OFED-1.4.1, Mvapich-1.2p1), I am still having applications hang when using
>> mpirun_rsh. The problem seems to start around 512 cores, but the threshold isn't exact.
>> Not sure if this helps, but Open MPI does not have this issue (though I know it has
>> a completely different launching mechanism).
>
> Does this happen with OFED 1.4? As you might have seen from the OFA
> mailing lists, there have been some issues related to NFS traffic with
> OFED 1.4.1.
>
I haven't tested with OFED 1.4, but the other system (which generated
the original post) is running OFED 1.3.1.
>> The one similarity is that both systems are using SMC TigerSwitch GigE switches
>> within the racks, uplinked to a Force10 GigE switch (although the behavior
>> was repeated when the core switch was a Cisco unit).
>>
>> I have tried messing with MV2_MT_DEGREE. Setting it low (4) seems to help
>> large jobs start, but it does not solve the problem.
>
> This is good to know. What happens if you reduce MV2_MT_DEGREE to 2? The
> job start-up might be slower. However, we need to see whether it is able
> to start the large-scale jobs.
>
I will try it. For some reason I thought 4 was the smallest value.
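For reference, mpirun_rsh accepts runtime parameters as VAR=VALUE pairs before the executable; a minimal sketch, in which the hostfile path and binary name are placeholders:

```shell
# Sketch only: ./hosts and ./xhpl are placeholder names.
# MV2_MT_DEGREE controls the fan-out of the mpispawn startup tree;
# a smaller degree means a deeper but narrower tree.
mpirun_rsh -np 512 -hostfile ./hosts MV2_MT_DEGREE=2 ./xhpl
```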
Craig
>> So the problem could be hardware or a race condition in the software.
>> Any ideas on how to debug the software side (or both) would be appreciated.
>
> Thanks,
>
> DK
>
>> Thanks,
>> Craig
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>>> On Thu, 9 Jul 2009, Craig Tierney wrote:
>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
>>>>>> tests using ~512 cores or larger. This will help you to find out whether
>>>>>> there are any issues when launching jobs and isolate any nodes which might
>>>>>> be having problems.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> DK
>>>>>>
>>>>> I dug in further today while the system was offline, and this
>>>>> is what I found. The mpispawn process is hanging. When it hangs,
>>>>> it hangs on different nodes each time. What I see is that
>>>>> one side thinks the connection is closed, and the other side waits.
>>>>>
>>>>> At one end:
>>>>>
>>>>> [root at h43 ~]# netstat
>>>>> Active Internet connections (w/o servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address State
>>>>> tcp 0 0 h43:50797 wms-sge:sge_qmaster ESTABLISHED
>>>>> tcp 0 0 h43:816 jetsam1:nfs ESTABLISHED
>>>>> tcp 0 0 h43:49730 h6:56443 ESTABLISHED
>>>>> tcp 31245 0 h43:49730 h4:41799 CLOSE_WAIT
>>>>> tcp 0 0 h43:ssh h1:35169 ESTABLISHED
>>>>> tcp 0 0 h43:ssh wfe7-eth2:51964 ESTABLISHED
>>>>>
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>>>>> #1 0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>>>>> #2 0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>>>>> #3 0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>>>>
>>>>> At other end (node h4):
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>>>>> #1 0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>>>>> #2 0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>>>>
>>>>> The netstat on h4 does not show any connections back to h43.
>>>>>
>>>>> I tried the latest 1.4 beta from the website (not svn) and found that
>>>>> for large jobs mpirun_rsh will sometimes exit without running anything.
>>>>> The larger the job, the more likely it is not to start properly.
>>>>> The only difference is that it doesn't hang. I turned on debugging with
>>>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>>>>
>>>>> Craig
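Backtraces like the ones quoted above can be gathered non-interactively; a minimal sketch, assuming gdb and pgrep are installed on the compute nodes ("h43" is a placeholder for whichever node holds a hung mpispawn):

```shell
# Sketch: capture a backtrace from a hung mpispawn on a remote node.
# -batch exits gdb after running the commands; -ex bt prints the stack.
ssh h43 'gdb -batch -ex bt -p "$(pgrep -o mpispawn)"'
```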
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>>>>
>>>>>>> I am running mvapich2 1.2, built with Ofed support (v1.3.1).
>>>>>>> For large jobs, I am having problems where they do not start.
>>>>>>> I am using the mpirun_rsh launcher. When I try to start jobs
>>>>>>> with ~512 cores or larger, I can see the problem. The problem
>>>>>>> doesn't happen all the time.
>>>>>>>
>>>>>>> I can't rule out quirky hardware. The IB tree seems to be
>>>>>>> clean (as reported by ibdiagnet). My last hang, I looked to
>>>>>>> see if xhpl had started on all the nodes (8 processes for each
>>>>>>> node on the dual-socket quad-core systems). I found that 7 of
>>>>>>> the 245 nodes (1960-core job) had no xhpl processes on them.
>>>>>>> So either the launching mechanism hung, or something was up with one of
>>>>>>> those nodes.
>>>>>>>
>>>>>>> My question is, how should I start debugging this to understand
>>>>>>> what process is hanging?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Craig
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>
>>>>> --
>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>>
>>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
>
>
--
Craig Tierney (craig.tierney at noaa.gov)