[mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Mike Heinz michael.heinz at qlogic.com
Thu Jul 16 12:37:06 EDT 2009


mvapich-1.1.0-3355.src.rpm

mvapich2-1.2p1-1.src.rpm


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Krishna Chaitanya Kandalla [mailto:kandalla at cse.ohio-state.edu]
Sent: Thursday, July 16, 2009 12:34 PM
To: Mike Heinz
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Mike,
          Can you also let us know the version numbers of the mvapich2
and mvapich1 stacks that you are using?

Thanks,
Krishna

Mike Heinz wrote:
> Krishna,
>
> What I'm saying is that if I run the program between A & D or A & C it works, but if I run it between A & B it silently hangs, never making progress. Meanwhile, I can run the same program between C & B and C & A, but a run between C & D silently hangs without making progress. This problem only occurs with mvapich2, not with mvapich1 or openmpi. All other InfiniBand operations appear to be working normally.
>
> This behavior is repeatable for those two pairs of machines (A & B and C & D), but it has not been seen on any other machines on the fabric, and we have not seen it on any other fabric - if I had to guess, there's some kind of timing hole that's being exposed under very narrow conditions.
>
> The fabric in question is actually used to test software before we release it, so it contains a mix of Linux distros, but all machines are x86_64 architecture.
>
> For the stack traces I sent you, node 0 is an 8-way Xeon E5320 at 1.86 GHz and node 1 is a 2-way Opteron running at 2.4 GHz.
>
> I realize the symptoms are quite bizarre - we've had several InfiniBand coders and testers investigating this for a couple of weeks now - so I was just hoping you might be able to suggest a line of investigation.
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> -----Original Message-----
> From: Krishna Chaitanya Kandalla [mailto:kandalla at cse.ohio-state.edu]
> Sent: Wednesday, July 15, 2009 7:19 PM
> To: Mike Heinz
> Cc: mvapich-discuss at cse.ohio-state.edu; Todd Rimmer
> Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.
>
> Mike,
>         I guess I had mistakenly started the job on 3 processes earlier
> and it hung. When run with 2 processes (the way it is supposed to
> be run), it executes correctly on our machines. Can you give us some
> more information about your hardware? You mentioned reachability
> issues between two particular nodes. I am guessing that you are
> running the tests on either:
> 1. Nodes "A" and "D", or
> 2. Nodes "B" and "C"
>
>        Also,
>  >  "A" can't run mvapich2 programs with machine "B", and machine "C"
> can't run programs with machine "D"
>
>        What exactly is the error message that you see in this case?
>
> Thanks,
> Krishna
>
> Krishna Chaitanya Kandalla wrote:
>
>> Mike,
>>           Thank you for providing the source code. I am able to
>> reproduce the hang on our cluster, as well. I will look into the issue.
>>
>> Thanks,
>> Krishna
>>
>> Mike Heinz wrote:
>>
>>> I was wondering about that - I passed the parameter in a param file,
>>> using the -param argument to mpirun_rsh. I just tried passing it
>>> inline as well; here are the results:
>>>
>>> mpiexec -env MV2_USE_SHMEM_COLL 0 -np 2 /opt/iba/src/mpi_apps/bandwidth/bw 10 10
>>>
>>> node 0
>>>
>>> Loaded symbols for /lib64/libnss_files.so.2
>>> 0x00002aaaaaae5bf8 in MPIDI_CH3I_SMP_write_progress@plt ()
>>>    from /usr/mpi/gcc/mvapich2-1.2p1/lib/libmpich.so.1.1
>>> (gdb) where
>>> #0  0x00002aaaaaae5bf8 in MPIDI_CH3I_SMP_write_progress@plt ()
>>>    from /usr/mpi/gcc/mvapich2-1.2p1/lib/libmpich.so.1.1
>>> #1  0x00002aaaaab17536 in MPIDI_CH3I_Progress (is_blocking=1, state=0x1)
>>>     at ch3_progress.c:174
>>> #2  0x00002aaaaab98e14 in PMPI_Recv (buf=0xc50000, count=4,
>>>     datatype=1275068673, source=1, tag=101, comm=1140850688,
>>> status=0x601520)
>>>     at recv.c:156
>>> #3  0x0000000000400ea8 in main (argc=3, argv=0x7ffffe2de508) at bw.c:91
>>>
>>>
>>> node 1
>>>
>>> (gdb) where
>>> #0  0x00002b9af218cd80 in mthca_poll_cq (ibcq=0xf5de80, ne=1,
>>>     wc=0x7fffb9786a60) at src/cq.c:470
>>> #1  0x00002b9af14ee2a8 in MPIDI_CH3I_MRAILI_Cq_poll (
>>>     vbuf_handle=0x7fffb9786b78, vc_req=0xf55d00, receiving=0,
>>> is_blocking=1)
>>>     at /usr/include/infiniband/verbs.h:934
>>> #2  0x00002b9af14ef2e5 in MPIDI_CH3I_MRAILI_Waiting_msg (vc=0xf55d00,
>>>     vbuf_handle=0x7fffb9786b78, blocking=1) at ibv_channel_manager.c:468
>>> #3  0x00002b9af14a8304 in MPIDI_CH3I_read_progress
>>> (vc_pptr=0x7fffb9786b80,
>>>     v_ptr=0x7fffb9786b78, is_blocking=<value optimized out>)
>>>     at ch3_read_progress.c:158
>>> #4  0x00002b9af14a7f44 in MPIDI_CH3I_Progress (is_blocking=1,
>>>     state=<value optimized out>) at ch3_progress.c:202
>>> #5  0x00002b9af14ec60e in MPIC_Wait (request_ptr=0xfc7978) at
>>> helper_fns.c:269
>>> #6  0x00002b9af14eca03 in MPIC_Sendrecv (sendbuf=0x0, sendcount=0,
>>>     sendtype=1275068685, dest=0, sendtag=1, recvbuf=0x0, recvcount=0,
>>>     recvtype=1275068685, source=0, recvtag=1, comm=1140850688,
>>> status=0x1)
>>>     at helper_fns.c:125
>>> #7  0x00002b9af149b07a in MPIR_Barrier (comm_ptr=<value optimized out>)
>>>     at barrier.c:82
>>> #8  0x00002b9af149b698 in PMPI_Barrier (comm=1140850688) at
>>> barrier.c:446
>>> #9  0x0000000000400ea3 in main (argc=3, argv=0x7fffb9786e88) at bw.c:81
>>>
>>> bw.c is the old "bandwidth" benchmark. In this case it looks like the
>>> run actually gets out of MPI_Init(), but then one side is waiting at a
>>> barrier while the other has already gone past it. I've attached a copy
>>> of the program.
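>>>
>>> For reference, the structure is roughly like the following sketch (this
>>> is not the attached bw.c itself, just the same general shape, so the
>>> line numbers in the traces won't match it):
>>>
>>> /* sketch.c -- simplified shape of a barrier-then-transfer test */
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>     char buf[4] = {0};
>>>
>>>     MPI_Init(&argc, &argv);              /* both ranks get past this */
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     MPI_Barrier(MPI_COMM_WORLD);         /* node 1's trace is stuck here */
>>>
>>>     if (rank == 0)                       /* node 0's trace has passed the   */
>>>         MPI_Recv(buf, 4, MPI_CHAR, 1,    /* barrier and is blocked in this  */
>>>                  101, MPI_COMM_WORLD,    /* receive, waiting on rank 1      */
>>>                  MPI_STATUS_IGNORE);
>>>     else
>>>         MPI_Send(buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }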
>>>
>>>
>>> --
>>> Michael Heinz
>>> Principal Engineer, Qlogic Corporation
>>> King of Prussia, Pennsylvania
>>> -----Original Message-----
>>> From: Krishna Chaitanya Kandalla [mailto:kandalla at cse.ohio-state.edu]
>>> Sent: Wednesday, July 15, 2009 3:42 PM
>>> To: Mike Heinz
>>> Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging
>>> a problem that only affects a few machines in our cluster.
>>>
>>> Mike,
>>> That's a little surprising. Setting this variable to 0 ensures that a
>>> particular flag is cleared; that flag guards the piece of code that
>>> does the 2-level communicator creation. Just out of curiosity, can you
>>> also let me know the command that you are using to launch the job? The
>>> env variables need to be set before the executable is specified; if
>>> MV2_USE_SHMEM_COLL=0 appears after the executable name, the job
>>> launcher might not pick it up.
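>>>
>>> If it helps, one quick way to confirm the variable is actually reaching
>>> the remote MPI processes is to print it from a trivial test program on
>>> a pair of nodes that does work (just a sketch, not anything shipped
>>> with MVAPICH2):
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>     const char *val;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     /* Report what each rank actually sees in its environment. */
>>>     val = getenv("MV2_USE_SHMEM_COLL");
>>>     printf("rank %d: MV2_USE_SHMEM_COLL=%s\n", rank, val ? val : "(unset)");
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }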
>>>
>>> Thanks,
>>> Krishna
>>>
>>>
>>>
>>>
>>> Mike Heinz wrote:
>>>
>>>
>>>> Krishna, thanks for the suggestion - but setting MV2_USE_SHMEM_COLL
>>>> to zero did not seem to change the stack trace much:
>>>>
>>>> Node 0:
>>>>
>>>> 0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x7fffcb46d698,
>>>>     vc_req=0x0, receiving=0, is_blocking=1) at ibv_channel_manager.c:529
>>>> 529         for (; i < rdma_num_hcas; ++i) {
>>>> (gdb) where
>>>> #0  0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (
>>>>     vbuf_handle=0x7fffcb46d698, vc_req=0x0, receiving=0, is_blocking=1)
>>>>     at ibv_channel_manager.c:529
>>>> #1  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fffcb46d6a0,
>>>>     v_ptr=0x7fffcb46d698, is_blocking=1) at ch3_read_progress.c:143
>>>> #2  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
>>>>     state=<value optimized out>) at ch3_progress.c:202
>>>> #3  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
>>>>     at helper_fns.c:269
>>>> #4  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x10993a80, sendcount=2,
>>>>     sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x10993a88, recvcount=2,
>>>>     recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
>>>>     status=0x7fffcb46d820) at helper_fns.c:125
>>>> #5  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
>>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x10993a80,
>>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
>>>>     at allgather.c:192
>>>> #6  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>>     sendcount=2, sendtype=1275069445, recvbuf=0x10993a80, recvcount=2,
>>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>>> #7  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
>>>>     newcomm=0x2aaaaae1c2f4) at comm_split.c:196
>>>> #8  0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
>>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>>> #9  0x00002aaaaab6877d in PMPI_Init (argc=0x7fffcb46db3c, argv=0x7fffcb46db30)
>>>>     at init.c:146
>>>> #10 0x0000000000400b2f in main (argc=3, argv=0x7fffcb46dc78) at bw.c:27
>>>>
>>>> Node 1:
>>>>
>>>> MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
>>>>     is_blocking=1) at ch3_read_progress.c:143
>>>> 143         type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking);
>>>> (gdb) where
>>>> #0  MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
>>>>     is_blocking=1) at ch3_read_progress.c:143
>>>> #1  0x00002afc9fb21f44 in MPIDI_CH3I_Progress (is_blocking=1,
>>>>     state=<value optimized out>) at ch3_progress.c:202
>>>> #2  0x00002afc9fb6660e in MPIC_Wait (request_ptr=0x2afc9fd242a0)
>>>>     at helper_fns.c:269
>>>> #3  0x00002afc9fb66a03 in MPIC_Sendrecv (sendbuf=0xf77028, sendcount=2,
>>>>     sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf77020, recvcount=4,
>>>>     recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
>>>>     status=0x7fff0b10bcd0) at helper_fns.c:125
>>>> #4  0x00002afc9fb08ddb in MPIR_Allgather (sendbuf=<value optimized out>,
>>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf77020,
>>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2afc9fd26c80)
>>>>     at allgather.c:192
>>>> #5  0x00002afc9fb09a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>>     sendcount=2, sendtype=1275069445, recvbuf=0xf77020, recvcount=2,
>>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>>> #6  0x00002afc9fb4591b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
>>>>     newcomm=0x2afc9fd26d94) at comm_split.c:196
>>>> #7  0x00002afc9fb478f4 in create_2level_comm (comm=1140850688, size=2,
>>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>>> #8  0x00002afc9fb730a5 in PMPI_Init (argc=0x7fff0b10bfec, argv=0x7fff0b10bfe0)
>>>>     at init.c:146
>>>> #9  0x0000000000400bcf in main (argc=3, argv=0x7fff0b10c128) at bw.c:27
>>>>
>>>> Any suggestions would be appreciated.
>>>>
>>>> --
>>>>
>>>> Michael Heinz
>>>>
>>>> Principal Engineer, Qlogic Corporation
>>>>
>>>> King of Prussia, Pennsylvania
>>>>
>>>> From: kris.c1986 at gmail.com [mailto:kris.c1986 at gmail.com] On Behalf Of Krishna Chaitanya
>>>> Sent: Tuesday, July 14, 2009 6:39 PM
>>>> To: Mike Heinz
>>>> Cc: Todd Rimmer; mvapich-discuss at cse.ohio-state.edu; mpich2-dev at mcs.anl.gov
>>>> Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.
>>>>
>>>> Mike,
>>>> The hang seems to be occurring when the MPI library is trying to
>>>> create the 2-level communicator during the init phase. Can you try
>>>> running the test with MV2_USE_SHMEM_COLL=0 (see
>>>> <http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-16000011.74>)?
>>>> This will ensure that a flat communicator is used for the subsequent
>>>> MPI calls, which might help us isolate the problem.
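>>>>
>>>> For context, a 2-level communicator setup of this kind generally boils
>>>> down to splitting MPI_COMM_WORLD into a node-local communicator plus a
>>>> "leaders" communicator, and MPI_Comm_split gathers the color/key
>>>> information with an Allgather over the parent communicator, which is
>>>> consistent with your traces showing MPIR_Allgather under
>>>> PMPI_Comm_split inside MPI_Init. A rough sketch of the idea (not the
>>>> actual MVAPICH2 create_2level_comm code):
>>>>
>>>> #include <mpi.h>
>>>>
>>>> /* Sketch only: 'node_color' stands in for however the library decides
>>>>  * which node a rank is running on (not shown here). */
>>>> void make_2level(MPI_Comm comm, int node_color,
>>>>                  MPI_Comm *node_comm, MPI_Comm *leader_comm)
>>>> {
>>>>     int rank, local_rank;
>>>>     MPI_Comm_rank(comm, &rank);
>>>>
>>>>     /* Ranks on the same node share a color -> node-local communicator. */
>>>>     MPI_Comm_split(comm, node_color, rank, node_comm);
>>>>     MPI_Comm_rank(*node_comm, &local_rank);
>>>>
>>>>     /* Local rank 0 on each node joins the leader communicator;
>>>>      * everyone else gets MPI_COMM_NULL. */
>>>>     MPI_Comm_split(comm, (local_rank == 0) ? 0 : MPI_UNDEFINED,
>>>>                    rank, leader_comm);
>>>> }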
>>>>
>>>> Thanks,
>>>> Krishna
>>>>
>>>> On Tue, Jul 14, 2009 at 5:04 PM, Mike Heinz
>>>> <michael.heinz at qlogic.com> wrote:
>>>>
>>>> We're having a very odd problem with our fabric, where, out of the
>>>> entire cluster, machine "A" can't run mvapich2 programs with machine
>>>> "B", and machine "C" can't run programs with machine "D" - even
>>>> though "A" can run with "D" and "B" can run with "C" - and the rest
>>>> of the fabric works fine.
>>>>
>>>> 1) There are no IB errors anywhere on the fabric that I can find,
>>>> and the machines in question all work correctly with mvapich1 and
>>>> low-level IB tests.
>>>>
>>>> 2) The problem occurs whether using mpd or rsh.
>>>>
>>>> 3) If I attach to the running processes, both machines appear to be
>>>> waiting for a read operation to complete. (See below)
>>>>
>>>> Can anyone make a suggestion on how to debug this?
>>>>
>>>> Stack trace for node 0:
>>>>
>>>> #0  0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
>>>> #1  0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
>>>>     wc=0x7fff9d835900) at src/cq.c:468
>>>> #2  0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
>>>>     vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
>>>>     at /usr/include/infiniband/verbs.h:934
>>>> #3  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
>>>>     v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
>>>> #4  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
>>>>     state=<value optimized out>) at ch3_progress.c:202
>>>> #5  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
>>>>     at helper_fns.c:269
>>>> #6  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
>>>>     sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
>>>>     recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
>>>>     status=0x7fff9d835b60) at helper_fns.c:125
>>>> #7  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
>>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
>>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
>>>>     at allgather.c:192
>>>> #8  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>>     sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
>>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>>> #9  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
>>>>     newcomm=0x2aaaaae1c2f4) at comm_split.c:196
>>>> #10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
>>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>>> #11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
>>>>     at init.c:146
>>>> #12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27
>>>>
>>>> Stack trace for node 1:
>>>>
>>>> #0  0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
>>>>     v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
>>>> #1  0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
>>>>     state=<value optimized out>) at ch3_progress.c:202
>>>> #2  0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
>>>>     at helper_fns.c:269
>>>> #3  0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
>>>>     sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
>>>>     recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
>>>>     status=0x7fffdee811a0) at helper_fns.c:125
>>>> #4  0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
>>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
>>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
>>>>     at allgather.c:192
>>>> #5  0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>>     sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
>>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>>> #6  0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
>>>>     newcomm=0x2ac3cbfb0d94) at comm_split.c:196
>>>> #7  0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
>>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>>> #8  0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
>>>>     at init.c:146
>>>> #9  0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27
>>>>
>>>> --
>>>>
>>>> Michael Heinz
>>>>
>>>> Principal Engineer, Qlogic Corporation
>>>>
>>>> King of Prussia, Pennsylvania
>>>>
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> In the middle of difficulty, lies opportunity
>>>>
>>>>
>>>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
>
>


