[mvapich-discuss] Question on how to debug job start failures

Craig Tierney Craig.Tierney at noaa.gov
Mon Aug 10 13:40:41 EDT 2009


I have determined that my problem was related to hardware that
was dropping or corrupting data that was being sent between mpispawn
processes for jobs larger than 512 cores (64 nodes).  The magic
number may have to do with the fact that transfers started to exceed
8192 bytes, which is the MTU setting on our network.
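
In case it helps anyone else chasing something similar, a bare TCP transfer
test between two suspect nodes is a quick way to take MPI out of the picture.
The sketch below is illustrative only (the host, port, and 64 KB payload are
arbitrary, not anything taken from mpirun_rsh); it just pushes a buffer larger
than the 8192-byte MTU across a plain socket and prints a checksum on both ends:

/* tcp_payload_check.c -- illustrative sketch only, not part of MVAPICH.
 * Build:    gcc -std=gnu99 -o tcp_payload_check tcp_payload_check.c
 * Receiver: ./tcp_payload_check -l <port>
 * Sender:   ./tcp_payload_check <host> <port>
 * Pushes a 64 KB pseudo-random buffer (well past an 8192-byte MTU) over a
 * plain TCP socket and prints an additive checksum on both ends.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define PAYLOAD (64 * 1024)

static unsigned long checksum(const unsigned char *p, size_t n)
{
    unsigned long s = 0;
    while (n--)
        s += *p++;
    return s;
}

int main(int argc, char **argv)
{
    static unsigned char buf[PAYLOAD];
    struct sockaddr_in a;
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;

    if (argc == 3 && strcmp(argv[1], "-l") == 0) {          /* receiver */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons((unsigned short)atoi(argv[2]));
        if (bind(lfd, (struct sockaddr *)&a, sizeof(a)) < 0) { perror("bind"); return 1; }
        listen(lfd, 1);
        int fd = accept(lfd, NULL, NULL);
        if (fd < 0) { perror("accept"); return 1; }
        /* MSG_WAITALL: block until the whole payload arrives (or EOF/error). */
        ssize_t n = recv(fd, buf, PAYLOAD, MSG_WAITALL);
        if (n != PAYLOAD) {
            fprintf(stderr, "short read (%zd of %d bytes) -- data lost?\n", n, PAYLOAD);
            return 1;
        }
        printf("received %d bytes, checksum %lu\n", PAYLOAD, checksum(buf, PAYLOAD));
    } else if (argc == 3) {                                 /* sender */
        struct hostent *h = gethostbyname(argv[1]);
        if (!h) { fprintf(stderr, "unknown host %s\n", argv[1]); return 1; }
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        a.sin_port = htons((unsigned short)atoi(argv[2]));
        memcpy(&a.sin_addr, h->h_addr_list[0], h->h_length);
        if (connect(fd, (struct sockaddr *)&a, sizeof(a)) < 0) { perror("connect"); return 1; }
        srand(42);                                          /* same buffer every run */
        for (int i = 0; i < PAYLOAD; i++)
            buf[i] = (unsigned char)rand();
        size_t sent = 0;
        while (sent < PAYLOAD) {                            /* write() may be partial */
            ssize_t w = write(fd, buf + sent, PAYLOAD - sent);
            if (w <= 0) { perror("write"); return 1; }
            sent += (size_t)w;
        }
        printf("sent %d bytes, checksum %lu\n", PAYLOAD, checksum(buf, PAYLOAD));
        close(fd);
    } else {
        fprintf(stderr, "usage: %s -l <port> | %s <host> <port>\n", argv[0], argv[0]);
        return 2;
    }
    return 0;
}

Run the receiver on one node and the sender on another, then compare the two
checksums; a short read or a mismatch points at the network rather than the
MPI library.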

We are working with SMC to find a solution to the problem.  For now,
we hacked mpirun_rsh to launch the jobs over the IB.

Thanks for the help,
Craig


Dhabaleswar Panda wrote:
> Craig,
> 
>> A follow-up to my problem.  On the new Nehalem cluster (QDR, Centos 5.3,
>> OFED-1.4.1, Mvapich-1.2p1), I am still having applications hang when using
>> mpirun_rsh.  The problem seems to start around 512 cores, but it isn't exact.
>> Not sure if this helps, but Open MPI does not have an issue (but I know it has
>> a completely different launching mechanism).
> 
> Does this happen with OFED 1.4? As you might have seen from the OFA
> mailing lists, there have been some issues related to NFS traffic with
> OFED 1.4.1.
> 
>> The one similarity is that both systems use SMC TigerSwitch GigE switches
>> within the racks, uplinked to a Force10 GigE switch (although the behavior
>> was repeated when the core switch was a Cisco unit).
>>
>> I have tried messing with MV2_MT_DEGREE.  Setting it low (4) seems to help
>> large jobs start, but it does not solve the problem.
> 
> This is good to know. What happens if you reduce MV2_MT_DEGREE to 2? The
> job start-up might be slower. However, we need to see whether it is able
> to start the large-scale jobs.
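> 
> (For reference, the variable can go straight on the mpirun_rsh command line
> before the executable; the hostfile and binary names below are placeholders:
> 
>     mpirun_rsh -np 512 -hostfile ./hosts MV2_MT_DEGREE=2 ./your_app )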
> 
>> So the problem could be hardware or a race condition in the software.
>> Any ideas on how to debug the software side (or both) would be appreciated.
> 
> Thanks,
> 
> DK
> 
>> Thanks,
>> Craig
>>
>>
>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>>>
>>>>
>>>> On Thu, 9 Jul 2009, Craig Tierney wrote:
>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
>>>>>> tests using ~512 cores or larger? This will help you to find out whether
>>>>>> there are any issues when launching jobs and isolate any nodes which might
>>>>>> be having problems.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> DK
>>>>>>
>>>>> I dug in further today while the system was offline, and this
>>>>> is what I found.  The mpispawn process is hanging.  When it hangs,
>>>>> it hangs on different nodes each time.  What I see is that
>>>>> one side thinks the connection is closed, and the other side waits.
>>>>>
>>>>> At one end:
>>>>>
>>>>> [root at h43 ~]# netstat
>>>>> Active Internet connections (w/o servers)
>>>>> Proto Recv-Q Send-Q Local Address               Foreign Address             State
>>>>> tcp        0      0 h43:50797                   wms-sge:sge_qmaster         ESTABLISHED
>>>>> tcp        0      0 h43:816                     jetsam1:nfs                 ESTABLISHED
>>>>> tcp        0      0 h43:49730                   h6:56443                    ESTABLISHED
>>>>> tcp    31245      0 h43:49730                   h4:41799                    CLOSE_WAIT
>>>>> tcp        0      0 h43:ssh                     h1:35169                    ESTABLISHED
>>>>> tcp        0      0 h43:ssh                     wfe7-eth2:51964             ESTABLISHED
>>>>>
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>>>>> #1  0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>>>>> #2  0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>>>>> #3  0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>>>>
>>>>> At other end (node h4):
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>>>>> #1  0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>>>>> #2  0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>>>>
>>>>> The netstat on h4 does not show any connections back to h43.
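>>>>>
>>>>> As far as I can tell, read_socket() blocks because it insists on a fixed
>>>>> byte count before returning -- roughly something like this (my own sketch,
>>>>> not the actual mpirun_util.c code):
>>>>>
>>>>> #include <unistd.h>   /* read() */
>>>>>
>>>>> static int read_full(int fd, char *buf, size_t len)
>>>>> {
>>>>>     size_t got = 0;
>>>>>     while (got < len) {
>>>>>         ssize_t n = read(fd, buf + got, len - got);
>>>>>         if (n <= 0)           /* error, or peer closed early */
>>>>>             return -1;
>>>>>         got += (size_t)n;
>>>>>     }
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> So if the 640 bytes it expects on socket 5 never arrive, it sits in
>>>>> read() indefinitely, regardless of what happens on the other sockets.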
>>>>>
>>>>> I tried the latest 1.4Beta from the website (not svn) and found that
>>>>> for large jobs mpirun_rsh will sometimes exit without running anything.
>>>>> The larger the job, the more likely it is to fail to start properly.
>>>>> The only difference is that it doesn't hang.  I turned on debugging with
>>>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>>>>
>>>>> Craig
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>>>>
>>>>>>> I am running mvapich2 1.2, built with OFED support (v1.3.1).
>>>>>>> For large jobs, I am having problems where they do not start.
>>>>>>> I am using the mpirun_rsh launcher.  When I try to start jobs
>>>>>>> with ~512 cores or larger, I can see the problem.  The problem
>>>>>>> doesn't happen all the time.
>>>>>>>
>>>>>>> I can't rule out quirky hardware.  The IB tree seems to be
>>>>>>> clean (as reported by ibdiagnet).  On my last hang, I looked to
>>>>>>> see if xhpl had started on all the nodes (8 processes per
>>>>>>> node on the dual-socket quad-core systems).  I found that 7 of
>>>>>>> the 245 nodes (1960-core job) had no xhpl processes on them.
>>>>>>> So either the launching mechanism hung, or something was up with one of
>>>>>>> those nodes.
>>>>>>>
>>>>>>> My question is, how should I start debugging this to understand
>>>>>>> what process is hanging?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Craig
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>> --
>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>>
>>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
> 


-- 
Craig Tierney (craig.tierney at noaa.gov)

