[mvapich-discuss] One-sided communication error on multiple nodes
Van Bui
vbui at mcs.anl.gov
Thu Aug 8 09:52:16 EDT 2013
Thanks for looking into this, Devendar. I had a couple of copies of mvapich2-latest in my download folder and had accidentally installed 1.9-r6297. With 1.9-r6338 the runtime error goes away, and the performance is very good too.
Van
----- Original Message -----
From: "Devendar Bureddy" <bureddy at cse.ohio-state.edu>
To: "Van Bui" <vbui at mcs.anl.gov>
Cc: mvapich-discuss at cse.ohio-state.edu
Sent: Tuesday, August 6, 2013 6:50:16 PM
Subject: Re: [mvapich-discuss] One-sided communication error on multiple nodes
Hi Van,
I tried your test program with a 1.9 build using the same configuration. It runs fine with 128 processes (16 nodes) on a QDR fabric. The error messages you reported indicate a bad event from the hardware. I'm not sure whether there is a bad HCA in the fabric. Can you try the OSU benchmarks across multiple nodes and see if they run fine?
$ install/bin/mpirun_rsh -np 128 -hostfile hostfile ./commtest
...
Total Time: 0.7618
Copy Buffer Time: 0.0177
Compute Time: 0.0262
Communication Time: 0.7179
Communication Setup Time: 0.5377
Communication Fence Time: 0.1802
Communication Other Time: 0.0000
Communication Time: 0.7182
Communication Setup Time: 0.5536
Communication Fence Time: 0.1646
Communication Other Time: 0.0000
...
$
-Devendar
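Note on the timing output above: the three top-level phases appear to add up to the total (0.0177 + 0.0262 + 0.7179 = 0.7618), and the setup and fence times add up to the communication time (0.5377 + 0.1802 = 0.7179). The Python sketch below checks that arithmetic and reports each phase as a share of the total. It assumes the "Label: seconds" layout shown in the output; parse_timings, shares_of_total, and is_consistent are hypothetical helper names, not part of commtest or MVAPICH2.

```python
import re

# First block of numbers from the commtest output quoted in the thread.
SAMPLE = """\
Total Time: 0.7618
Copy Buffer Time: 0.0177
Compute Time: 0.0262
Communication Time: 0.7179
Communication Setup Time: 0.5377
Communication Fence Time: 0.1802
Communication Other Time: 0.0000
"""


def parse_timings(text):
    """Parse 'Label: seconds' lines; the first occurrence of a label wins."""
    timings = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s*([0-9.]+)\s*$", line)
        if m:
            timings.setdefault(m.group(1), float(m.group(2)))
    return timings


def shares_of_total(timings):
    """Return each phase as a fraction of 'Total Time'."""
    total = timings["Total Time"]
    return {k: v / total for k, v in timings.items() if k != "Total Time"}


def is_consistent(timings, tol=1e-4):
    """Check that the sub-phases add up to their parent entries."""
    top = (timings["Copy Buffer Time"] + timings["Compute Time"]
           + timings["Communication Time"])
    comm = (timings["Communication Setup Time"]
            + timings["Communication Fence Time"]
            + timings["Communication Other Time"])
    return (abs(top - timings["Total Time"]) <= tol
            and abs(comm - timings["Communication Time"]) <= tol)


if __name__ == "__main__":
    t = parse_timings(SAMPLE)
    print("consistent:", is_consistent(t))
    for label, share in shares_of_total(t).items():
        print(f"{label}: {share:.1%}")
```

For the first block of numbers, communication accounts for roughly 94% of the total time, and communication setup alone for roughly 71%.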
On Tue, Aug 6, 2013 at 5:28 PM, Van Bui < vbui at mcs.anl.gov > wrote:
Attached is the test code.
Van
----- Original Message -----
From: "Van Bui" < vbui at mcs.anl.gov >
To: mvapich-discuss at cse.ohio-state.edu
Sent: Tuesday, August 6, 2013 4:38:23 PM
Subject: [mvapich-discuss] One-sided communication error on multiple nodes
Hi,
I am getting the following runtime error when I run my code with the latest version of MVAPICH2 (1.9). The code seems to run fine on a single node; I get the error only when I run it on multiple nodes of a Sandy Bridge cluster (2 sockets per node). The cluster uses a QDR InfiniBand fabric. The code also runs fine with MPICH on multiple nodes.
Here is my config line for MVAPICH2: -with-device=ch3:nemesis:ib,tcp CC=icc F77=ifort FC=ifort CXX=icpc
My code uses MPI one-sided communication. Here are some of the MPI calls in my code: MPI_Win_create_dynamic, MPI_Win_attach, MPI_Win_fence, and MPI_Put.
Please let me know if you need more details about the error or the setup.
[iforge127:mpi_rank_0][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3
[iforge126:mpi_rank_31][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3
[iforge127:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[iforge127:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[0->47] send desc error, wc_opcode=0
[iforge073:mpi_rank_63][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3
[0->47] wc.status=10, wc.wr_id=0x9cc9e0, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[iforge073:mpi_rank_48][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=47
: No such file or directory (2)
[iforge127:mpi_rank_15][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3
[iforge073:mpispawn_3][readline] Unexpected End-Of-File on file descriptor 17. MPI process died?
[iforge073:mpispawn_3][mtpmi_processops] Error while reading PMI socket. MPI process died?
[iforge126:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 19. MPI process died?
[iforge126:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[0<-15] recv desc error, wc_opcode=128
[0->15] wc.status=10, wc.wr_id=0x1c9f600, wc.opcode=128, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY
[iforge126:mpi_rank_16][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=15
: No such file or directory (2)
[iforge074:mpi_rank_47][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3
[0->31] send desc error, wc_opcode=0
[0->31] wc.status=10, wc.wr_id=0x1a39ad8, wc.opcode=0, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY
[iforge074:mpi_rank_32][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=31
: No such file or directory (2)
Thanks,
Van
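Note on the log above: the numeric codes can be mapped to names. The sketch below assumes that "wc.status" and "completion with error" carry a libibverbs enum ibv_wc_status value, and that "FATAL event" carries an enum ibv_event_type value. The tables hold only the leading entries of those enums from infiniband/verbs.h, and decode_line is a hypothetical helper, not an MVAPICH2 function.

```python
import re

# Leading entries of enum ibv_wc_status (infiniband/verbs.h), indexed by value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS", "IBV_WC_LOC_LEN_ERR", "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR", "IBV_WC_LOC_PROT_ERR", "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR", "IBV_WC_BAD_RESP_ERR", "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR", "IBV_WC_REM_ACCESS_ERR", "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR", "IBV_WC_RNR_RETRY_EXC_ERR",
]

# Leading entries of enum ibv_event_type (infiniband/verbs.h), indexed by value.
IBV_EVENT_TYPE = [
    "IBV_EVENT_CQ_ERR", "IBV_EVENT_QP_FATAL", "IBV_EVENT_QP_REQ_ERR",
    "IBV_EVENT_QP_ACCESS_ERR", "IBV_EVENT_COMM_EST", "IBV_EVENT_SQ_DRAINED",
    "IBV_EVENT_PATH_MIG", "IBV_EVENT_PATH_MIG_ERR", "IBV_EVENT_DEVICE_FATAL",
]

# (pattern, table) pairs; the interpretation of each field is an assumption.
PATTERNS = [
    (re.compile(r"wc\.status=(\d+)"), IBV_WC_STATUS),
    (re.compile(r"Got completion with error (\d+)"), IBV_WC_STATUS),
    (re.compile(r"Got FATAL event (\d+)"), IBV_EVENT_TYPE),
]


def decode_line(line):
    """Return the enum name for the first code found in a log line, else None."""
    for pattern, table in PATTERNS:
        m = pattern.search(line)
        if m:
            code = int(m.group(1))
            return table[code] if code < len(table) else f"unknown ({code})"
    return None


if __name__ == "__main__":
    samples = [
        "ibv_channel_manager.c:1002: Got FATAL event 3",
        "[0->47] wc.status=10, wc.wr_id=0x9cc9e0, wc.opcode=0",
        "[0->47] send desc error, wc_opcode=0",
    ]
    for s in samples:
        print(decode_line(s), "<-", s)
```

Under these assumptions, status 10 decodes to IBV_WC_REM_ACCESS_ERR and event 3 to IBV_EVENT_QP_ACCESS_ERR; lines without a recognized code return None.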
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Devendar