[mvapich-discuss] Question on how to debug job start failures

Craig Tierney Craig.Tierney at noaa.gov
Fri Jul 10 12:37:18 EDT 2009


Dhabaleswar Panda wrote:
> Craig - Could you please tell us a little more about the details of your
> system: node configuration (sockets and cores/socket, processor type), OS
> version, etc. What kind of Ethernet connectivity does your system have?
> FYI, the mpirun_rsh framework launches the job using standard TCP/IP
> calls.
> 

The system is a cluster based on Supermicro motherboards.  Each node
is a dual-socket, quad-core Harpertown at 2.8 GHz.  Each node has 16 GB
of RAM.

We are running CentOS 5.1 with the 2.6.18-92.1.13.el5 kernel
and the Intel e1000e GigE driver.  The nodes boot over NFS; the entire
OS image is served via NFS.

About 30 nodes attach to each SMC8150L2 GigE switch.  The 9 switches
each have 2 uplinks to a Force10 switch (not sure of the model number).  The
uplinks are bonded via a port-channel.  Spanning tree is disabled.
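
Since mpirun_rsh launches over plain TCP/IP, one quick sanity check of the
Ethernet path is to confirm that every node in the hostfile accepts a TCP
connection.  The sketch below is only an illustration (the script name,
hostfile path, and the choice of port 22/sshd are my assumptions); it
exercises the same GigE/switch path the launcher uses:

#!/usr/bin/env python
# check_tcp.py -- verify basic TCP reachability to every host in a
# hostfile.  Hypothetical helper, not part of mvapich; connecting to
# sshd (port 22) exercises the same Ethernet path mpirun_rsh uses.
import socket
import sys

def tcp_ok(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.timeout, socket.error):
        return False

def main(hostfile):
    hosts = [l.split()[0] for l in open(hostfile) if l.strip()]
    bad = [h for h in hosts if not tcp_ok(h)]
    if bad:
        print("unreachable over TCP: %s" % " ".join(bad))
    else:
        print("all %d hosts reachable" % len(hosts))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "hosts.txt")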

We are testing a CentOS 5.3 image for a new Nehalem cluster, but I won't
have the hardware up until the end of next week.

Craig

> Thanks,
> 
> DK
> 
> 
> 
> On Thu, 9 Jul 2009, Craig Tierney wrote:
> 
>> Dhabaleswar Panda wrote:
>>> Are you able to run simple MPI programs (say MPI Hello World) or some IMB
>>> tests using ~512 cores or larger?  This will help you find out whether
>>> there are any issues when launching jobs and isolate any nodes which might
>>> be having problems.
>>>
>>> Thanks,
>>>
>>> DK
>>>
>> I dug in further today while the system was offline, and this
>> is what I found.  The mpispawn process is hanging.  When it hangs,
>> it hangs on different nodes each time.  What I see is that
>> one side thinks the connection is closed, and the other side waits.
>>
>> At one end:
>>
>> [root at h43 ~]# netstat
>> Active Internet connections (w/o servers)
>> Proto Recv-Q Send-Q Local Address               Foreign Address             State
>> tcp        0      0 h43:50797                   wms-sge:sge_qmaster         ESTABLISHED
>> tcp        0      0 h43:816                     jetsam1:nfs                 ESTABLISHED
>> tcp        0      0 h43:49730                   h6:56443                    ESTABLISHED
>> tcp    31245      0 h43:49730                   h4:41799                    CLOSE_WAIT
>> tcp        0      0 h43:ssh                     h1:35169                    ESTABLISHED
>> tcp        0      0 h43:ssh                     wfe7-eth2:51964             ESTABLISHED
>>
>>
>> (gdb) bt
>> #0  0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>> #1  0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>> #2  0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>> #3  0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>
>> At the other end (node h4):
>>
>> (gdb) bt
>> #0  0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>> #1  0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>> #2  0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>
>> The netstat on h4 does not show any connections back to h43.
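>>
>> To see whether other nodes show the same pattern, something like the
>> sketch below can be run against the hostfile: it ssh's to each node and
>> reports TCP sockets sitting in CLOSE_WAIT with unread data (non-zero
>> Recv-Q), which is what h43 shows above.  The script name, hostfile path,
>> and the reliance on passwordless ssh are placeholders of mine, not
>> anything that ships with mvapich:
>>
>> #!/usr/bin/env python
>> # scan_close_wait.py -- report, for every node in a hostfile, TCP
>> # sockets stuck in CLOSE_WAIT that still have unread data (non-zero
>> # Recv-Q), the pattern seen on h43 above.  Hypothetical helper script;
>> # assumes passwordless ssh to each node.
>> import subprocess
>> import sys
>>
>> def close_wait_lines(host):
>>     """Return 'netstat -tn' lines on `host` in CLOSE_WAIT with queued data."""
>>     out = subprocess.Popen(["ssh", host, "netstat", "-tn"],
>>                            stdout=subprocess.PIPE).communicate()[0]
>>     hits = []
>>     for line in out.decode().splitlines():
>>         fields = line.split()
>>         # netstat -tn columns: Proto Recv-Q Send-Q Local Foreign State
>>         if len(fields) >= 6 and fields[5] == "CLOSE_WAIT" and fields[1] != "0":
>>             hits.append(line)
>>     return hits
>>
>> def main(hostfile):
>>     for host in (l.split()[0] for l in open(hostfile) if l.strip()):
>>         for line in close_wait_lines(host):
>>             print("%s: %s" % (host, line))
>>
>> if __name__ == "__main__":
>>     main(sys.argv[1] if len(sys.argv) > 1 else "hosts.txt")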
>>
>> I tried the latest 1.4 beta from the website (not svn) and found that
>> for large jobs mpirun_rsh will sometimes exit without running anything.
>> The larger the job, the more likely it is not to start properly.
>> The only difference is that it doesn't hang.  I turned on debugging with
>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>
>> Craig
>>
>>
>>
>>
>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>
>>>> I am running mvapich2 1.2, built with OFED support (v1.3.1).
>>>> For large jobs, I am having problems where they do not start.
>>>> I am using the mpirun_rsh launcher.  When I try to start jobs
>>>> with ~512 cores or larger, I can see the problem.  The problem
>>>> doesn't happen all the time.
>>>>
>>>> I can't rule out quirky hardware.  The IB tree seems to be
>>>> clean (as reported by ibdiagnet).  On my last hang, I looked to
>>>> see if xhpl had started on all the nodes (8 processes per node
>>>> on the dual-socket, quad-core systems).  I found that 7 of
>>>> the 245 nodes (a 1960-core job) had no xhpl processes on them.
>>>> So either the launching mechanism hung, or something was up with one of
>>>> those nodes.
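>>>>
>>>> A quick way to spot such nodes is a check like the sketch below; it
>>>> just ssh's to every host in the hostfile and runs pgrep for the
>>>> application binary and for mpispawn.  The script name, hostfile path,
>>>> and the xhpl binary name are placeholders, and it assumes passwordless
>>>> ssh works everywhere:
>>>>
>>>> #!/usr/bin/env python
>>>> # find_missing.py -- report hostfile nodes that are not running the
>>>> # application binary (xhpl here) or the mpispawn launcher.
>>>> # Hypothetical helper; assumes passwordless ssh to every node.
>>>> import os
>>>> import subprocess
>>>> import sys
>>>>
>>>> def running(host, name):
>>>>     """Return True if a process named `name` exists on `host` (via pgrep -x)."""
>>>>     devnull = open(os.devnull, "w")
>>>>     rc = subprocess.call(["ssh", host, "pgrep", "-x", name],
>>>>                          stdout=devnull, stderr=devnull)
>>>>     devnull.close()
>>>>     return rc == 0
>>>>
>>>> def main(hostfile, binary="xhpl"):
>>>>     for host in (l.split()[0] for l in open(hostfile) if l.strip()):
>>>>         missing = [p for p in (binary, "mpispawn") if not running(host, p)]
>>>>         if missing:
>>>>             print("%s: missing %s" % (host, ", ".join(missing)))
>>>>
>>>> if __name__ == "__main__":
>>>>     main(sys.argv[1] if len(sys.argv) > 1 else "hosts.txt")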
>>>>
>>>> My question is, how should I start debugging this to understand
>>>> what process is hanging?
>>>>
>>>> Thanks,
>>>> Craig
>>>>
>>>>
>>>> --
>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>
>>>
>>
>> --
>> Craig Tierney (craig.tierney at noaa.gov)
>>
> 
> 


-- 
Craig Tierney (craig.tierney at noaa.gov)

