[mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Wed Jul 15 19:19:07 EDT 2009


Mike,
        I guess I had mistakenly started the job on 3 processes earlier, 
and it hung. When run with 2 processes (the way it is supposed to be 
run), it executes correctly on our machines. Can you give us some more 
information about your hardware? You mentioned reachability issues 
between certain pairs of nodes. I am guessing that you are running the 
tests on either:
1. Nodes "A" and "D", or
2. Nodes "B" and "C"
       
       Also,
 >  "A" can't run mvapich2 programs with machine "B", and machine "C" 
can't run programs with machine "D"

       What exactly is the error message that you see in this case?

Thanks,
Krishna

Krishna Chaitanya Kandalla wrote:
> Mike,
>           Thank you for providing the source code. I am able to 
> reproduce the hang on our cluster, as well. I will look into the issue.
>
> Thanks,
> Krishna
>
> Mike Heinz wrote:
>> I was wondering about that - I passed the parameter in a param file, 
>> using the -param argument to mpirun_rsh. I just tried passing it 
>> inline as well; here are the results:
>>
>> mpiexec -env MV2_USE_SHMEM_COLL 0 -np 2 
>> /opt/iba/src/mpi_apps/bandwidth/bw 10 10
>>
>> node 0
>>
>> Loaded symbols for /lib64/libnss_files.so.2
>> 0x00002aaaaaae5bf8 in MPIDI_CH3I_SMP_write_progress@plt ()
>>    from /usr/mpi/gcc/mvapich2-1.2p1/lib/libmpich.so.1.1
>> (gdb) where
>> #0  0x00002aaaaaae5bf8 in MPIDI_CH3I_SMP_write_progress@plt ()
>>    from /usr/mpi/gcc/mvapich2-1.2p1/lib/libmpich.so.1.1
>> #1  0x00002aaaaab17536 in MPIDI_CH3I_Progress (is_blocking=1, state=0x1)
>>     at ch3_progress.c:174
>> #2  0x00002aaaaab98e14 in PMPI_Recv (buf=0xc50000, count=4,
>>     datatype=1275068673, source=1, tag=101, comm=1140850688, status=0x601520)
>>     at recv.c:156
>> #3  0x0000000000400ea8 in main (argc=3, argv=0x7ffffe2de508) at bw.c:91
>>
>>
>> node 1
>>
>> (gdb) where
>> #0  0x00002b9af218cd80 in mthca_poll_cq (ibcq=0xf5de80, ne=1,
>>     wc=0x7fffb9786a60) at src/cq.c:470
>> #1  0x00002b9af14ee2a8 in MPIDI_CH3I_MRAILI_Cq_poll (
>>     vbuf_handle=0x7fffb9786b78, vc_req=0xf55d00, receiving=0, is_blocking=1)
>>     at /usr/include/infiniband/verbs.h:934
>> #2  0x00002b9af14ef2e5 in MPIDI_CH3I_MRAILI_Waiting_msg (vc=0xf55d00,
>>     vbuf_handle=0x7fffb9786b78, blocking=1) at ibv_channel_manager.c:468
>> #3  0x00002b9af14a8304 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffb9786b80,
>>     v_ptr=0x7fffb9786b78, is_blocking=<value optimized out>)
>>     at ch3_read_progress.c:158
>> #4  0x00002b9af14a7f44 in MPIDI_CH3I_Progress (is_blocking=1,
>>     state=<value optimized out>) at ch3_progress.c:202
>> #5  0x00002b9af14ec60e in MPIC_Wait (request_ptr=0xfc7978) at helper_fns.c:269
>> #6  0x00002b9af14eca03 in MPIC_Sendrecv (sendbuf=0x0, sendcount=0,
>>     sendtype=1275068685, dest=0, sendtag=1, recvbuf=0x0, recvcount=0,
>>     recvtype=1275068685, source=0, recvtag=1, comm=1140850688, status=0x1)
>>     at helper_fns.c:125
>> #7  0x00002b9af149b07a in MPIR_Barrier (comm_ptr=<value optimized out>)
>>     at barrier.c:82
>> #8  0x00002b9af149b698 in PMPI_Barrier (comm=1140850688) at barrier.c:446
>> #9  0x0000000000400ea3 in main (argc=3, argv=0x7fffb9786e88) at bw.c:81
>>
>> bw.c is the old "bandwidth" benchmark. It looks like both processes 
>> actually get out of MPI_Init() in this case, but then one side is 
>> waiting at a barrier while the other has already gone past it. I've 
>> attached a copy of the program.
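>>
>> For reference, here is a rough sketch of the kind of pattern bw.c hits 
>> at those line numbers. This is an illustrative approximation only, not 
>> the attached program; the buffer sizes, tags, and overall structure 
>> are made up to mirror the two traces above (rank 0 blocked in MPI_Recv 
>> after the barrier, rank 1 still inside MPI_Barrier):
>>
>> /* sketch.c - illustrative only, not the real bw.c.
>>  * A barrier followed by a simple ping-pong.
>>  * Build with: mpicc sketch.c -o sketch */
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank;
>>     char buf[4] = {0};
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     MPI_Barrier(MPI_COMM_WORLD);      /* node 1's trace is stuck here */
>>
>>     if (rank == 0) {
>>         /* node 0's trace is stuck here, waiting on a message from rank 1 */
>>         MPI_Recv(buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
>>                  MPI_STATUS_IGNORE);
>>         printf("received\n");
>>     } else {
>>         MPI_Send(buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }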
>>
>>
>> -- 
>> Michael Heinz
>> Principal Engineer, Qlogic Corporation
>> King of Prussia, Pennsylvania
>> -----Original Message-----
>> From: Krishna Chaitanya Kandalla [mailto:kandalla at cse.ohio-state.edu] 
>> Sent: Wednesday, July 15, 2009 3:42 PM
>> To: Mike Heinz
>> Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging 
>> a problem that only affects a few machines in our cluster.
>>
>> Mike,
>> That's a little surprising. Setting this variable to 0 ensures that a 
>> particular internal flag is set to 0. That flag is supposed to guard 
>> the piece of code that does the 2-level communicator creation. Just 
>> out of curiosity, can you also let me know the command that you are 
>> using to launch the job? The env variables need to be set before the 
>> executable is specified. If MV2_USE_SHMEM_COLL=0 appears after the 
>> executable name, the job launcher might not pick it up.
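>>
>> For example (the hostnames below are just placeholders), the setting 
>> should appear before the executable on the command line:
>>
>>     mpirun_rsh -np 2 nodeA nodeB MV2_USE_SHMEM_COLL=0 /opt/iba/src/mpi_apps/bandwidth/bw 10 10
>>     mpiexec -env MV2_USE_SHMEM_COLL 0 -np 2 /opt/iba/src/mpi_apps/bandwidth/bw 10 10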
>>
>> Thanks,
>> Krishna
>>
>>
>>
>>
>> Mike Heinz wrote:
>>  
>>> Krishna, thanks for the suggestion - but setting MV2_USE_SHMEM_COLL 
>>> to zero did not seem to change the stack trace much:
>>>
>>> Node 0:
>>>
>>> 0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x7fffcb46d698,
>>>     vc_req=0x0, receiving=0, is_blocking=1) at ibv_channel_manager.c:529
>>> 529         for (; i < rdma_num_hcas; ++i) {
>>> (gdb) where
>>> #0  0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (
>>>     vbuf_handle=0x7fffcb46d698, vc_req=0x0, receiving=0, is_blocking=1)
>>>     at ibv_channel_manager.c:529
>>> #1  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fffcb46d6a0,
>>>     v_ptr=0x7fffcb46d698, is_blocking=1) at ch3_read_progress.c:143
>>> #2  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
>>>     state=<value optimized out>) at ch3_progress.c:202
>>> #3  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
>>>     at helper_fns.c:269
>>> #4  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x10993a80, sendcount=2,
>>>     sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x10993a88, recvcount=2,
>>>     recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
>>>     status=0x7fffcb46d820) at helper_fns.c:125
>>> #5  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x10993a80,
>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
>>>     at allgather.c:192
>>> #6  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>     sendcount=2, sendtype=1275069445, recvbuf=0x10993a80, recvcount=2,
>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>> #7  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
>>>     newcomm=0x2aaaaae1c2f4) at comm_split.c:196
>>> #8  0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>> #9  0x00002aaaaab6877d in PMPI_Init (argc=0x7fffcb46db3c, argv=0x7fffcb46db30)
>>>     at init.c:146
>>> #10 0x0000000000400b2f in main (argc=3, argv=0x7fffcb46dc78) at bw.c:27
>>>
>>> Node 1:
>>>
>>> MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
>>>     is_blocking=1) at ch3_read_progress.c:143
>>> 143         type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking);
>>> (gdb) where
>>> #0  MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
>>>     is_blocking=1) at ch3_read_progress.c:143
>>> #1  0x00002afc9fb21f44 in MPIDI_CH3I_Progress (is_blocking=1,
>>>     state=<value optimized out>) at ch3_progress.c:202
>>> #2  0x00002afc9fb6660e in MPIC_Wait (request_ptr=0x2afc9fd242a0)
>>>     at helper_fns.c:269
>>> #3  0x00002afc9fb66a03 in MPIC_Sendrecv (sendbuf=0xf77028, sendcount=2,
>>>     sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf77020, recvcount=4,
>>>     recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
>>>     status=0x7fff0b10bcd0) at helper_fns.c:125
>>> #4  0x00002afc9fb08ddb in MPIR_Allgather (sendbuf=<value optimized out>,
>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf77020,
>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2afc9fd26c80)
>>>     at allgather.c:192
>>> #5  0x00002afc9fb09a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>     sendcount=2, sendtype=1275069445, recvbuf=0xf77020, recvcount=2,
>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>> #6  0x00002afc9fb4591b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
>>>     newcomm=0x2afc9fd26d94) at comm_split.c:196
>>> #7  0x00002afc9fb478f4 in create_2level_comm (comm=1140850688, size=2,
>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>> #8  0x00002afc9fb730a5 in PMPI_Init (argc=0x7fff0b10bfec, argv=0x7fff0b10bfe0)
>>>     at init.c:146
>>> #9  0x0000000000400bcf in main (argc=3, argv=0x7fff0b10c128) at bw.c:27
>>>
>>> Any suggestions would be appreciated.
>>>
>>> -- 
>>>
>>> Michael Heinz
>>>
>>> Principal Engineer, Qlogic Corporation
>>>
>>> King of Prussia, Pennsylvania
>>>
>>> From: kris.c1986 at gmail.com [mailto:kris.c1986 at gmail.com] On 
>>> Behalf Of Krishna Chaitanya
>>> Sent: Tuesday, July 14, 2009 6:39 PM
>>> To: Mike Heinz
>>> Cc: Todd Rimmer; mvapich-discuss at cse.ohio-state.edu; 
>>> mpich2-dev at mcs.anl.gov
>>> Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in 
>>> debugging a problem that only affects a few machines in our cluster.
>>>
>>> Mike,
>>> The hang seems to be occurring while the MPI library is trying to 
>>> create the 2-level communicator during the init phase. Can you try 
>>> running the test with MV2_USE_SHMEM_COLL 
>>> <http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-16000011.74>=0? 
>>> This will ensure that a flat communicator is used for the subsequent 
>>> MPI calls, which might help us isolate the problem.
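>>>
>>> (For background: the "2-level" communicator is conceptually an 
>>> intra-node communicator plus a leader communicator, built with 
>>> MPI_Comm_split during init. The sketch below is only a generic 
>>> illustration of that idea; it is not MVAPICH2's internal 
>>> create_2level_comm, and the hostname-hash coloring is made up:)
>>>
>>> #include <mpi.h>
>>>
>>> /* Build a per-node communicator and a communicator of node leaders. */
>>> static void make_2level(MPI_Comm comm, MPI_Comm *intra, MPI_Comm *leaders)
>>> {
>>>     int rank, len, i, color = 0, local_rank;
>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>
>>>     MPI_Comm_rank(comm, &rank);
>>>     MPI_Get_processor_name(name, &len);
>>>
>>>     /* Crude node id: hash the hostname so ranks on the same host get
>>>      * the same color (real implementations exchange proper host ids). */
>>>     for (i = 0; i < len; i++)
>>>         color = (color * 31 + name[i]) & 0x7fffffff;
>>>
>>>     /* Level 1: one communicator per host. */
>>>     MPI_Comm_split(comm, color, rank, intra);
>>>
>>>     /* Level 2: local rank 0 of each host forms the leader communicator;
>>>      * everyone else gets MPI_COMM_NULL back. */
>>>     MPI_Comm_rank(*intra, &local_rank);
>>>     MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, leaders);
>>> }
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Comm intra, leaders;
>>>     MPI_Init(&argc, &argv);
>>>     make_2level(MPI_COMM_WORLD, &intra, &leaders);
>>>     if (leaders != MPI_COMM_NULL)
>>>         MPI_Comm_free(&leaders);
>>>     MPI_Comm_free(&intra);
>>>     MPI_Finalize();
>>>     return 0;
>>> }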
>>>
>>> Thanks,
>>> Krishna
>>>
>>> On Tue, Jul 14, 2009 at 5:04 PM, Mike Heinz 
>>> <michael.heinz at qlogic.com> wrote:
>>>
>>> We're having a very odd problem with our fabric, where, out of the 
>>> entire cluster, machine "A" can't run mvapich2 programs with machine 
>>> "B", and machine "C" can't run programs with machine "D" - even 
>>> though "A" can run with "D" and "B" can run with "C" - and the rest 
>>> of the fabric works fine.
>>>
>>> 1) There are no IB errors anywhere on the fabric that I can find, 
>>> and the machines in question all work correctly with mvapich1 and 
>>> low-level IB tests.
>>>
>>> 2) The problem occurs whether using mpd or rsh.
>>>
>>> 3) If I attach to the running processes, both machines appear to be 
>>> waiting for a read operation to complete. (See below)
>>>
>>> Can anyone make a suggestion on how to debug this?
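>>>
>>> (The traces below were captured by attaching gdb to the hung rank on 
>>> each node -- roughly "gdb -p <pid of bw>", then "where" at the (gdb) 
>>> prompt.)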
>>>
>>> Stack trace for node 0:
>>>
>>> #0  0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
>>> #1  0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
>>>     wc=0x7fff9d835900) at src/cq.c:468
>>> #2  0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
>>>     vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
>>>     at /usr/include/infiniband/verbs.h:934
>>> #3  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
>>>     v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
>>> #4  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
>>>     state=<value optimized out>) at ch3_progress.c:202
>>> #5  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
>>>     at helper_fns.c:269
>>> #6  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
>>>     sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
>>>     recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
>>>     status=0x7fff9d835b60) at helper_fns.c:125
>>> #7  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
>>>     at allgather.c:192
>>> #8  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>     sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>> #9  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
>>>     newcomm=0x2aaaaae1c2f4) at comm_split.c:196
>>> #10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>> #11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
>>>     at init.c:146
>>> #12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27
>>>
>>> Stack trace for node 1:
>>>
>>> #0  0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
>>>     v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
>>> #1  0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
>>>     state=<value optimized out>) at ch3_progress.c:202
>>> #2  0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
>>>     at helper_fns.c:269
>>> #3  0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
>>>     sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
>>>     recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
>>>     status=0x7fffdee811a0) at helper_fns.c:125
>>> #4  0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
>>>     sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
>>>     recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
>>>     at allgather.c:192
>>> #5  0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
>>>     sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
>>>     recvtype=1275069445, comm=1140850688) at allgather.c:866
>>> #6  0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
>>>     newcomm=0x2ac3cbfb0d94) at comm_split.c:196
>>> #7  0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
>>>     my_rank=<value optimized out>) at create_2level_comm.c:142
>>> #8  0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
>>>     at init.c:146
>>> #9  0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27
>>>
>>> -- 
>>>
>>> Michael Heinz
>>>
>>> Principal Engineer, Qlogic Corporation
>>>
>>> King of Prussia, Pennsylvania
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>>
>>>
>>> -- 
>>> In the middle of difficulty, lies opportunity
>>>
>>>     
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

