[mvapich-discuss] Question on how to debug job start failures
Craig Tierney
Craig.Tierney at noaa.gov
Mon Jul 27 13:36:41 EDT 2009
Craig Tierney wrote:
> Dhabaleswar Panda wrote:
>> Craig - Could you please tell us a little more about the details of your
>> system: node configuration (sockets and cores/socket, processor type), OS
>> version, etc.? What kind of Ethernet connectivity does your system have?
>> FYI, mpirun_rsh framework launches the job using the standard TCP/IP
>> calls.
>>
>
> The system is a cluster based on Supermicro motherboards. Each node
> is a dual-socket, quad-core Harpertown at 2.8 GHz. Each node has 16 GB
> of RAM.
>
> We are running CentOS 5.1. We are using the 2.6.18-92.1.13.el5 kernel
> and the e1000e Intel GigE driver. The nodes boot over NFS; the entire
> OS image is available via NFS.
>
> About 30 nodes each attach to an SMC8150L2 GigE switch. The 9 switches
> each have 2 uplinks to a Force10 switch (I am not sure of the model
> number). The links are bonded via a port-channel. Spanning tree is disabled.
>
> We are testing a CentOS 5.3 image for a new Nehalem cluster, but I won't
> have the hardware up until the end of next week.
>
> Craig
>
>
A follow-up to my problem. On the new Nehalem cluster (QDR, CentOS 5.3,
OFED-1.4.1, MVAPICH-1.2p1), I am still having applications hang when using
mpirun_rsh. The problem seems to start around 512 cores, but it isn't exact.
Not sure if this helps, but Open MPI does not have this issue (though I know
it has a completely different launching mechanism).
The one similarity is that both systems use SMC TigerSwitch GigE switches
within the racks, uplinked to a Force10 GigE core switch (although the
behavior was repeated when the core switch was a Cisco unit).
I have tried messing with MV2_MT_DEGREE. Setting this low (e.g., to 4) seems
to help large jobs start, but it does not solve the problem.
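For reference, that fan-out setting is just passed through mpirun_rsh; a
minimal invocation might look like the following (the process count, hostfile
path, and binary name are placeholders, not taken from my actual job script):

```shell
# Placeholder invocation: cap the mpispawn launch-tree fan-out at 4.
mpirun_rsh -np 512 -hostfile ./hosts MV2_MT_DEGREE=4 ./xhpl
```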
So the problem could be hardware, or a race condition in the software.
Any ideas on how to debug the software side (or both) would be appreciated.
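One thing I have been sketching for the software side is a loop that attaches
gdb to every mpispawn left on the nodes of a hung job and dumps its backtrace,
along the lines below. The hostfile path and passwordless ssh are assumptions
on my part; adjust for your setup.

```shell
#!/bin/sh
# Gather a backtrace from every mpispawn process on each node of a job.
# Assumes passwordless ssh and a hostfile (./hosts) with one node per line.
while read -r node; do
  ssh "$node" '
    for pid in $(pgrep -x mpispawn); do
      echo "=== $(hostname) pid $pid ==="
      # -batch exits after the commands; -p attaches to a running process.
      gdb -batch -p "$pid" -ex "thread apply all bt" 2>/dev/null
    done
  ' </dev/null   # keep ssh from consuming the hostfile on stdin
done < ./hosts
```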
Thanks,
Craig
>
>
>
>> Thanks,
>>
>> DK
>>
>>
>>
>> On Thu, 9 Jul 2009, Craig Tierney wrote:
>>
>>> Dhabaleswar Panda wrote:
>>>> Are you able to run simple MPI programs (say, an MPI Hello World) or some
>>>> IMB tests using ~512 cores or larger? This will help you find out whether
>>>> there are any issues when launching jobs and isolate any nodes which might
>>>> be having problems.
>>>>
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>> I dug in further today while the system was offline, and this
>>> is what I found: the mpispawn process is hanging. When it hangs,
>>> it hangs on different nodes each time. What I see is that
>>> one side thinks the connection is closed, and the other side waits.
>>>
>>> At one end:
>>>
>>> [root at h43 ~]# netstat
>>> Active Internet connections (w/o servers)
>>> Proto Recv-Q Send-Q Local Address Foreign Address State
>>> tcp 0 0 h43:50797 wms-sge:sge_qmaster ESTABLISHED
>>> tcp 0 0 h43:816 jetsam1:nfs ESTABLISHED
>>> tcp 0 0 h43:49730 h6:56443 ESTABLISHED
>>> tcp 31245 0 h43:49730 h4:41799 CLOSE_WAIT
>>> tcp 0 0 h43:ssh h1:35169 ESTABLISHED
>>> tcp 0 0 h43:ssh wfe7-eth2:51964 ESTABLISHED
>>>
>>>
>>> (gdb) bt
>>> #0 0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>>> #1 0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>>> #2 0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>>> #3 0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>>
>>> At other end (node h4):
>>>
>>> (gdb) bt
>>> #0 0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>>> #1 0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>>> #2 0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>>
>>> The netstat on h4 does not show any connections back to h43.
>>>
>>> I tried the latest 1.4 beta from the website (not SVN). I found that
>>> for large jobs mpirun_rsh sometimes exits without running anything.
>>> The larger the job, the more likely it is not to start properly.
>>> The only difference is that it doesn't hang. I turned on debugging with
>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>>
>>> Craig
>>>
>>>
>>>
>>>
>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>>
>>>>> I am running MVAPICH2 1.2, built with OFED support (v1.3.1).
>>>>> For large jobs, I am having problems where they do not start.
>>>>> I am using the mpirun_rsh launcher. When I try to start jobs
>>>>> with ~512 cores or larger, I can see the problem. The problem
>>>>> doesn't happen every time.
>>>>>
>>>>> I can't rule out quirky hardware. The IB tree seems to be
>>>>> clean (as reported by ibdiagnet). On my last hang, I looked to
>>>>> see if xhpl had started on all the nodes (8 processes per
>>>>> node on these dual-socket quad-core systems). I found that 7 of
>>>>> the 245 nodes (a 1960-core job) had no xhpl processes on them.
>>>>> So either the launching mechanism hung, or something was up with one of
>>>>> those nodes.
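That per-node check can be scripted; something like the following is roughly
what I did by hand. The binary name (xhpl), hostfile path, and passwordless
ssh are assumptions specific to my job.

```shell
#!/bin/sh
# Report nodes where the application binary never started.
# Assumes passwordless ssh and a hostfile (./hosts) with one node per line.
while read -r node; do
  if ! ssh "$node" pgrep -x xhpl >/dev/null 2>&1 </dev/null; then
    echo "no xhpl running on $node"
  fi
done < ./hosts
```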
>>>>>
>>>>> My question is, how should I start debugging this to understand
>>>>> what process is hanging?
>>>>>
>>>>> Thanks,
>>>>> Craig
>>>>>
>>>>>
>>>>> --
>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>> --
>>> Craig Tierney (craig.tierney at noaa.gov)
>>>
>>
>
>
--
Craig Tierney (craig.tierney at noaa.gov)