[mvapich-discuss] One-sided communication error on multiple nodes

Van Bui vbui at mcs.anl.gov
Thu Aug 8 09:52:16 EDT 2013


Thanks for looking into this, Devendar. I actually had a couple of copies of mvapich2-latest in my download folder and had accidentally installed 1.9-r6297. The runtime error goes away with 1.9-r6338, and the performance is very good too.

Van 

----- Original Message -----
From: "Devendar Bureddy" <bureddy at cse.ohio-state.edu>
To: "Van Bui" <vbui at mcs.anl.gov>
Cc: mvapich-discuss at cse.ohio-state.edu
Sent: Tuesday, August 6, 2013 6:50:16 PM
Subject: Re: [mvapich-discuss] One-sided communication error on multiple nodes


Hi Van,


I tried your test program with a 1.9 build using the same configuration. It runs fine with 128 procs (16 nodes) on a QDR fabric. The error messages you reported indicate a bad event from the hardware; I'm not sure if there is a bad HCA in the fabric. Can you try the OSU benchmarks across multiple nodes and see if they run fine?
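For example, something like the following (assuming the OSU micro-benchmarks were built alongside the library; the exact path under the install tree may differ):

$ install/bin/mpirun_rsh -np 2 -hostfile hostfile ./osu_bw
$ install/bin/mpirun_rsh -np 2 -hostfile hostfile ./osu_put_latency

Here is what I see with your test program: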


$ install/bin/mpirun_rsh -np 128 -hostfile hostfile ./commtest 

... 

Total Time: 0.7618 
Copy Buffer Time: 0.0177 
Compute Time: 0.0262 
Communication Time: 0.7179 
Communication Setup Time: 0.5377 
Communication Fence Time: 0.1802 
Communication Other Time: 0.0000 
Communication Time: 0.7182 
Communication Setup Time: 0.5536 
Communication Fence Time: 0.1646 
Communication Other Time: 0.0000 
... 
$ 


-Devendar 


On Tue, Aug 6, 2013 at 5:28 PM, Van Bui <vbui at mcs.anl.gov> wrote: 


Attached is the test code. 

Van 


----- Original Message ----- 
From: "Van Bui" < vbui at mcs.anl.gov > 
To: mvapich-discuss at cse.ohio-state.edu 
Sent: Tuesday, August 6, 2013 4:38:23 PM 
Subject: [mvapich-discuss] One-sided communication error on multiple nodes 



Hi, 

I am getting the following runtime error when I run my code using the latest version of MVAPICH2 (1.9). The code seems to run fine on a single node; I get the error only when I run it on multiple nodes of a Sandy Bridge cluster (2 sockets per node). The cluster uses a QDR InfiniBand fabric. The code also runs fine with MPICH on multiple nodes. 

Here is my configure line for MVAPICH2: --with-device=ch3:nemesis:ib,tcp CC=icc F77=ifort FC=ifort CXX=icpc 
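That is, roughly (the install prefix and build steps here are illustrative):

$ ./configure --with-device=ch3:nemesis:ib,tcp CC=icc F77=ifort FC=ifort CXX=icpc
$ make && make install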

My code uses MPI one-sided communication. Here are some of the MPI calls in my code: MPI_Win_create_dynamic, MPI_Win_attach, MPI_Win_fence, and MPI_Put. 
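The communication pattern is roughly the following (a stripped-down sketch, not the attached test program; the ring-style exchange and single-int buffer are just illustrative):

/* sketch of the one-sided pattern used in the test code */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* create a window with no memory attached yet */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* each rank attaches a local buffer to the window */
    int buf = -1;
    MPI_Win_attach(win, &buf, sizeof(int));

    /* with dynamic windows, the target displacement is the target's
       absolute address, so exchange the attached addresses up front */
    MPI_Aint my_disp;
    MPI_Aint *disps = malloc(nprocs * sizeof(MPI_Aint));
    MPI_Get_address(&buf, &my_disp);
    MPI_Allgather(&my_disp, 1, MPI_AINT, disps, 1, MPI_AINT, MPI_COMM_WORLD);

    /* fence-synchronized put: each rank writes its rank to the next rank */
    int target = (rank + 1) % nprocs;
    MPI_Win_fence(0, win);
    MPI_Put(&rank, 1, MPI_INT, target, disps[target], 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d received %d\n", rank, buf);

    MPI_Win_detach(win, &buf);
    MPI_Win_free(&win);
    free(disps);
    MPI_Finalize();
    return 0;
}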

Please let me know if you need more details about the error or the setup. 

[iforge127:mpi_rank_0][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3 

[iforge126:mpi_rank_31][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3 

[iforge127:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died? 
[iforge127:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died? 
[0->47] send desc error, wc_opcode=0 
[iforge073:mpi_rank_63][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3 

[0->47] wc.status=10, wc.wr_id=0x9cc9e0, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND 
[iforge073:mpi_rank_48][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=47 
: No such file or directory (2) 
[iforge127:mpi_rank_15][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3 

[iforge073:mpispawn_3][readline] Unexpected End-Of-File on file descriptor 17. MPI process died? 
[iforge073:mpispawn_3][mtpmi_processops] Error while reading PMI socket. MPI process died? 
[iforge126:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 19. MPI process died? 
[iforge126:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died? 
[0<-15] recv desc error, wc_opcode=128 
[0->15] wc.status=10, wc.wr_id=0x1c9f600, wc.opcode=128, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY 
[iforge126:mpi_rank_16][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=15 
: No such file or directory (2) 
[iforge074:mpi_rank_47][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1002: Got FATAL event 3 

[0->31] send desc error, wc_opcode=0 
[0->31] wc.status=10, wc.wr_id=0x1a39ad8, wc.opcode=0, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS_REPLY 
[iforge074:mpi_rank_32][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:580: [] Got completion with error 10, vendor code=0x88, dest rank=31 
: No such file or directory (2) 
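For reference, the numeric codes in the output can be mapped to libibverbs names with a small standalone program like this (my assumption is that they are raw ibv_wc_status / ibv_event_type values; link with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* wc.status=10 from the completion errors above */
    printf("wc.status=10 -> %s\n", ibv_wc_status_str((enum ibv_wc_status)10));
    /* "Got FATAL event 3" from the async event thread */
    printf("event 3 -> %s\n", ibv_event_type_str((enum ibv_event_type)3));
    return 0;
}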

Thanks, 
Van 
_______________________________________________ 
mvapich-discuss mailing list 
mvapich-discuss at cse.ohio-state.edu 
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss 






-- 
Devendar 

