[mvapich-discuss] One-sided communication error on multiple nodes

Devendar Bureddy bureddy at cse.ohio-state.edu
Tue Aug 6 18:50:16 EDT 2013


Hi Van,

I tried your test program with a 1.9 build using the same configuration, and it
runs fine with 128 procs (16 nodes) on a QDR fabric. The error messages you
reported indicate a bad event from the hardware; I am not sure whether there is
a bad HCA in the fabric. Can you try the OSU benchmarks across multiple nodes
and see if they run fine? (An example launch is sketched after the output
below.)

$ install/bin/mpirun_rsh -np 128 -hostfile hostfile ./commtest
...
Total Time: 0.7618
Copy Buffer Time: 0.0177
Compute Time: 0.0262
Communication Time: 0.7179
Communication Setup Time: 0.5377
Communication Fence Time: 0.1802
Communication Other Time: 0.0000
Communication Time: 0.7182
Communication Setup Time: 0.5536
Communication Fence Time: 0.1646
Communication Other Time: 0.0000
...
$
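
In case it helps isolate a bad HCA or link: the OSU micro-benchmarks that ship
with MVAPICH2 (in the osu_benchmarks directory of the source tree) include
one-sided tests such as osu_put_latency, osu_get_latency, and osu_put_bw, in
addition to the two-sided osu_latency/osu_bw. A sketch of how you might launch
them across two nodes, assuming the binaries were built in place (adjust the
path to match your build or install):

$ install/bin/mpirun_rsh -np 2 -hostfile hostfile ./osu_benchmarks/osu_latency
$ install/bin/mpirun_rsh -np 2 -hostfile hostfile ./osu_benchmarks/osu_put_latency

Trying different node pairs in the hostfile should show whether the failure
follows a particular host.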

-Devendar

On Tue, Aug 6, 2013 at 5:28 PM, Van Bui <vbui at mcs.anl.gov> wrote:

> Attached is the test code.
>
> Van
>
> ----- Original Message -----
> From: "Van Bui" <vbui at mcs.anl.gov>
> To: mvapich-discuss at cse.ohio-state.edu
> Sent: Tuesday, August 6, 2013 4:38:23 PM
> Subject: [mvapich-discuss] One-sided communication error on multiple nodes
>
> Hi,
>
> I am getting the following runtime error when I run my code using the
> latest version of MVAPICH2 (1.9). The code runs fine on a single node; I get
> the error only when I run it on multiple nodes of a Sandy Bridge cluster (2
> sockets per node) with a QDR InfiniBand fabric. The code also runs fine with
> MPICH on multiple nodes.
>
> Here is my configure line for MVAPICH2: --with-device=ch3:nemesis:ib,tcp
> CC=icc F77=ifort FC=ifort CXX=icpc
>
> My code uses MPI one-sided communication. Here are some of the MPI calls
> in my code: MPI_Win_create_dynamic, MPI_Win_attach, MPI_Win_fence, and
> MPI_Put.
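>
> For reference, these calls typically combine into something like the
> following pattern (a minimal sketch, not the attached test code; the buffer
> size and ring-neighbor targeting are placeholders used for illustration):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, nprocs;
>     MPI_Win win;
>     MPI_Aint local_addr, remote_addr;
>     double *buf;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>     /* Create a dynamic window: no memory is associated at creation time. */
>     MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>     /* Allocate local memory and attach it to the window. */
>     buf = malloc(1024 * sizeof(double));
>     MPI_Win_attach(win, buf, 1024 * sizeof(double));
>
>     /* Each rank sends the address of its attached buffer to the rank that
>        will target it, and receives the address it will Put into. */
>     MPI_Get_address(buf, &local_addr);
>     MPI_Sendrecv(&local_addr, 1, MPI_AINT, (rank + nprocs - 1) % nprocs, 0,
>                  &remote_addr, 1, MPI_AINT, (rank + 1) % nprocs, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>
>     /* Put into the right neighbor's attached buffer inside a fence epoch. */
>     MPI_Win_fence(0, win);
>     MPI_Put(buf, 1024, MPI_DOUBLE, (rank + 1) % nprocs,
>             remote_addr, 1024, MPI_DOUBLE, win);
>     MPI_Win_fence(0, win);
>
>     MPI_Win_detach(win, buf);
>     MPI_Win_free(&win);
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }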
>
> Please let me know if you need more details about the error or the setup.
>
> [iforge127:mpi_rank_0][async_thread]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL
> event 3
>
> [iforge126:mpi_rank_31][async_thread]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL
> event 3
>
> [iforge127:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 14. MPI process died?
> [iforge127:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [0->47] send desc error, wc_opcode=0
> [iforge073:mpi_rank_63][async_thread]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL
> event 3
>
> [0->47] wc.status=10, wc.wr_id=0x9cc9e0, wc.opcode=0, vbuf->phead->type=0
> = MPIDI_CH3_PKT_EAGER_SEND
> [iforge073:mpi_rank_48][MPIDI_CH3I_MRAILI_Cq_poll]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got
> completion with error 10, vendor code=0x88, dest rank=47
> : No such file or directory (2)
> [iforge127:mpi_rank_15][async_thread]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL
> event 3
>
> [iforge073:mpispawn_3][readline] Unexpected End-Of-File on file descriptor
> 17. MPI process died?
> [iforge073:mpispawn_3][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [iforge126:mpispawn_1][readline] Unexpected End-Of-File on file descriptor
> 19. MPI process died?
> [iforge126:mpispawn_1][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [0<-15] recv desc error, wc_opcode=128
> [0->15] wc.status=10, wc.wr_id=0x1c9f600, wc.opcode=128,
> vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY
> [iforge126:mpi_rank_16][MPIDI_CH3I_MRAILI_Cq_poll]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got
> completion with error 10, vendor code=0x88, dest rank=15
> : No such file or directory (2)
> [iforge074:mpi_rank_47][async_thread]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL
> event 3
>
> [0->31] send desc error, wc_opcode=0
> [0->31] wc.status=10, wc.wr_id=0x1a39ad8, wc.opcode=0,
> vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY
> [iforge074:mpi_rank_32][MPIDI_CH3I_MRAILI_Cq_poll]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got
> completion with error 10, vendor code=0x88, dest rank=31
> : No such file or directory (2)
>
> Thanks,
> Van
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


-- 
Devendar