[mvapich-discuss] Question on how to debug job start failures
Craig Tierney
Craig.Tierney at noaa.gov
Mon Jul 27 13:36:41 EDT 2009
Craig Tierney wrote:
> Dhabaleswar Panda wrote:
>> Craig - Could you please tell us a little more about the details of your
>> system: node configuration (sockets and cores/socket, processor type), OS
>> version, etc.? What kind of Ethernet connectivity does your system have?
>> FYI, mpirun_rsh framework launches the job using the standard TCP/IP
>> calls.
>>
>
> The system is a cluster based on Supermicro motherboards. Each node
> is a dual-socket, quad-core Harpertown at 2.8 GHz. Each node has 16 GB
> of RAM.
>
> We are running CentOS 5.1. We are using the 2.6.18-92.1.13.el5 kernel
> and the e1000e Intel GigE driver. The nodes boot over NFS; the entire
> OS image is available via NFS.
>
> About 30 nodes each attach to an SMC8150L2 GigE switch. The 9 switches
> each have 2 uplinks to a Force10 switch (I am not sure of the model
> number). The links are bonded via a port-channel. Spanning tree is disabled.
>
> We are testing a CentOS 5.3 image for a new Nehalem cluster, but I won't
> have the hardware up until the end of next week.
>
> Craig
>
>
A follow-up to my problem. On the new Nehalem cluster (QDR, CentOS 5.3,
OFED-1.4.1, MVAPICH-1.2p1), I am still having applications hang when using
mpirun_rsh. The problem seems to start around 512 cores, but it isn't exact.
Not sure if this helps, but Open MPI does not have this issue (though I know
it has a completely different launching mechanism).
The one similarity is that both systems use SMC TigerSwitch GigE switches
within the racks, uplinked to a Force10 GigE core switch (although the
behavior was repeated when the core switch was a Cisco unit).
I have tried messing with MV2_MT_DEGREE. Setting this low (e.g., to 4) seems
to help large jobs start, but it does not solve the problem.
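For reference, that fan-out setting is just passed through mpirun_rsh; a
minimal invocation might look like the following (the process count, hostfile
path, and binary name are placeholders, not taken from my actual job script):

```shell
# Placeholder invocation: cap the mpispawn launch-tree fan-out at 4.
mpirun_rsh -np 512 -hostfile ./hosts MV2_MT_DEGREE=4 ./xhpl
```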
So the problem could be hardware, or a race condition in the software.
Any ideas on how to debug the software side (or both) would be appreciated.
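One thing I have been sketching for the software side is a loop that attaches
gdb to every mpispawn left on the nodes of a hung job and dumps its backtrace,
along the lines below. The hostfile path and passwordless ssh are assumptions
on my part; adjust for your setup.

```shell
#!/bin/sh
# Gather a backtrace from every mpispawn process on each node of a job.
# Assumes passwordless ssh and a hostfile (./hosts) with one node per line.
while read -r node; do
  ssh "$node" '
    for pid in $(pgrep -x mpispawn); do
      echo "=== $(hostname) pid $pid ==="
      # -batch exits after the commands; -p attaches to a running process.
      gdb -batch -p "$pid" -ex "thread apply all bt" 2>/dev/null
    done
  ' </dev/null   # keep ssh from consuming the hostfile on stdin
done < ./hosts
```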
Thanks,
Craig
>
>
>
>> Thanks,
>>
>> DK
>>
>>
>>
>> On Thu, 9 Jul 2009, Craig Tierney wrote:
>>
>>> Dhabaleswar Panda wrote:
>>>> Are you able to run simple MPI programs (say, an MPI Hello World) or some
>>>> IMB tests using ~512 cores or larger? This will help you find out whether
>>>> there are any issues when launching jobs and isolate any nodes which might
>>>> be having problems.
>>>>
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>> I dug in further today while the system was offline, and this
>>> is what I found: the mpispawn process is hanging. When it hangs,
>>> it hangs on different nodes each time. What I see is that
>>> one side thinks the connection is closed, and the other side waits.
>>>
>>> At one end:
>>>
>>> [root at h43 ~]# netstat
>>> Active Internet connections (w/o servers)
>>> Proto Recv-Q Send-Q Local Address Foreign Address State
>>> tcp 0 0 h43:50797 wms-sge:sge_qmaster ESTABLISHED
>>> tcp 0 0 h43:816 jetsam1:nfs ESTABLISHED
>>> tcp 0 0 h43:49730 h6:56443 ESTABLISHED
>>> tcp 31245 0 h43:49730 h4:41799 CLOSE_WAIT
>>> tcp 0 0 h43:ssh h1:35169 ESTABLISHED
>>> tcp 0 0 h43:ssh wfe7-eth2:51964 ESTABLISHED
>>>
>>>
>>> (gdb) bt
>>> #0 0x00002b1284f0e950 in __read_nocancel () from /lib64/libc.so.6
>>> #1 0x00000000004035ea in read_socket (socket=5, buffer=0x16dec8a0, bytes=640) at mpirun_util.c:97
>>> #2 0x000000000040402f in mpispawn_tree_init (me=5, req_socket=383699104) at mpispawn_tree.c:190
>>> #3 0x0000000000401a90 in main (argc=5, argv=0x16dec8a0) at mpispawn.c:496
>>>
>>> At other end (node h4):
>>>
>>> (gdb) bt
>>> #0 0x00002b95b77308d3 in __select_nocancel () from /lib64/libc.so.6
>>> #1 0x0000000000404379 in mtpmi_processops () at pmi_tree.c:754
>>> #2 0x0000000000401c32 in main (argc=1024, argv=0x6101a0) at mpispawn.c:525
>>>
>>> The netstat on h4 does not show any connections back to h43.
>>>
>>> I tried the latest 1.4 beta from the website (not SVN). I found that
>>> for large jobs mpirun_rsh sometimes exits without running anything.
>>> The larger the job, the more likely it is not to start properly.
>>> The only difference is that it doesn't hang. I turned on debugging with
>>> MPISPAWN_DEBUG, but I didn't see anything interesting from that.
>>>
>>> Craig
>>>
>>>
>>>
>>>
>>>> On Wed, 8 Jul 2009, Craig Tierney wrote:
>>>>
>>>>> I am running MVAPICH2 1.2, built with OFED support (v1.3.1).
>>>>> For large jobs, I am having problems where they do not start.
>>>>> I am using the mpirun_rsh launcher. When I try to start jobs
>>>>> with ~512 cores or larger, I can see the problem. The problem
>>>>> doesn't happen every time.
>>>>>
>>>>> I can't rule out quirky hardware. The IB tree seems to be
>>>>> clean (as reported by ibdiagnet). On my last hang, I looked to
>>>>> see if xhpl had started on all the nodes (8 processes per
>>>>> node on these dual-socket quad-core systems). I found that 7 of
>>>>> the 245 nodes (a 1960-core job) had no xhpl processes on them.
>>>>> So either the launching mechanism hung, or something was up with one of
>>>>> those nodes.
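That per-node check can be scripted; something like the following is roughly
what I did by hand. The binary name (xhpl), hostfile path, and passwordless
ssh are assumptions specific to my job.

```shell
#!/bin/sh
# Report nodes where the application binary never started.
# Assumes passwordless ssh and a hostfile (./hosts) with one node per line.
while read -r node; do
  if ! ssh "$node" pgrep -x xhpl >/dev/null 2>&1 </dev/null; then
    echo "no xhpl running on $node"
  fi
done < ./hosts
```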
>>>>>
>>>>> My question is, how should I start debugging this to understand
>>>>> what process is hanging?
>>>>>
>>>>> Thanks,
>>>>> Craig
>>>>>
>>>>>
>>>>> --
>>>>> Craig Tierney (craig.tierney at noaa.gov)
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>> --
>>> Craig Tierney (craig.tierney at noaa.gov)
>>>
>>
>
>
--
Craig Tierney (craig.tierney at noaa.gov)