[mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Krishna Chaitanya kandalla at cse.ohio-state.edu
Tue Jul 14 18:38:40 EDT 2009


Mike,
         The hang seems to be occurring while the MPI library is creating
the 2-level communicator during the init phase. Can you try running the test
with MV2_USE_SHMEM_COLL=0
(http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-16000011.74)?
This will ensure that a flat communicator is used for the subsequent MPI
calls, which might help us isolate the problem.
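
For reference, here is a minimal sketch of how the variable might be passed
with each launcher, since the problem shows up with both mpd and rsh. The host
names nodeA/nodeB and the ./bw path are placeholders and not taken from your
setup, and any arguments your bw run needs would have to be appended. The
script only prints the command lines so they can be checked before being run
by hand.

```shell
# Value suggested in this thread: 0 turns off the shared-memory collectives.
MV2_USE_SHMEM_COLL=0

# Hypothetical placeholders: substitute the failing host pair and your benchmark.
HOST_A=nodeA
HOST_B=nodeB
BENCH=./bw

# mpirun_rsh (rsh/ssh launcher): NAME=VALUE pairs go before the executable.
RSH_CMD="mpirun_rsh -np 2 $HOST_A $HOST_B MV2_USE_SHMEM_COLL=$MV2_USE_SHMEM_COLL $BENCH"

# mpd launcher: mpiexec passes the variable with -env NAME VALUE.
MPD_CMD="mpiexec -n 2 -env MV2_USE_SHMEM_COLL $MV2_USE_SHMEM_COLL $BENCH"

# Print both command lines for review; run whichever matches your launcher.
echo "$RSH_CMD"
echo "$MPD_CMD"
```

If the hang disappears with either command, that would point at the 2-level
communicator setup rather than at the fabric itself.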

Thanks,
Krishna


On Tue, Jul 14, 2009 at 5:04 PM, Mike Heinz <michael.heinz at qlogic.com> wrote:

> We’re having a very odd problem with our fabric: out of the entire
> cluster, machine “A” can’t run mvapich2 programs with machine “B”, and
> machine “C” can’t run programs with machine “D”, even though “A” can run
> with “D” and “B” can run with “C”, and the rest of the fabric works fine.
>
>
>
> 1)      There are no IB errors anywhere on the fabric that I can find, and
> the machines in question all work correctly with mvapich1 and low-level IB
> tests.
>
> 2)      The problem occurs whether using mpd or rsh.
>
> 3)      If I attach to the running processes, both machines appear to be
> waiting for a read operation to complete. (See below)
>
>
>
> Can anyone make a suggestion on how to debug this?
>
>
>
> Stack trace for node 0:
>
>
>
> #0  0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
>     wc=0x7fff9d835900) at src/cq.c:468
> #2  0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
>     vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
>     at /usr/include/infiniband/verbs.h:934
> #3  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
>     v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
> #4  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
>     state=<value optimized out>) at ch3_progress.c:202
> #5  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
>     at helper_fns.c:269
> #6  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
>     sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
>     recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
>     status=0x7fff9d835b60) at helper_fns.c:125
> #7  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
>     recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
>     at allgather.c:192
> #8  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>     sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
>     recvtype=1275069445, comm=1140850688) at allgather.c:866
> #9  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
>     newcomm=0x2aaaaae1c2f4) at comm_split.c:196
> #10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
>     my_rank=<value optimized out>) at create_2level_comm.c:142
> #11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
>     at init.c:146
> #12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27
>
>
>
> Stack trace for node 1:
>
>
>
> #0  0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
>     v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
> #1  0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
>     state=<value optimized out>) at ch3_progress.c:202
> #2  0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
>     at helper_fns.c:269
> #3  0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
>     sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
>     recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
>     status=0x7fffdee811a0) at helper_fns.c:125
> #4  0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
>     recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
>     at allgather.c:192
> #5  0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>     sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
>     recvtype=1275069445, comm=1140850688) at allgather.c:866
> #6  0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
>     newcomm=0x2ac3cbfb0d94) at comm_split.c:196
> #7  0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
>     my_rank=<value optimized out>) at create_2level_comm.c:142
> #8  0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
>     at init.c:146
> #9  0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27
>
> --
>
> Michael Heinz
>
> Principal Engineer, Qlogic Corporation
>
> King of Prussia, Pennsylvania
>


-- 
In the middle of difficulty, lies opportunity