[mvapich-discuss] GPUDIrect RDMA limitations with MPI3 RMA

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Sep 2 13:39:54 EDT 2014


Hello Jens.  Thank you for the report.  We have been able to reproduce
this behavior and are investigating it further.  We'll get back to you
once we come up with a fix or workaround.

On Sun, Aug 31, 2014 at 04:38:07PM -0400, Jens Glaser wrote:
> Hi,
> 
> I was experimenting a bit with GPUDirect RDMA and the new MVAPICH2 2.0 GDR release to track down a bug
> that occurred persistently in my code when using MPI-3 RMA communication with GDR.
> 
> It turns out that when I use MPI_Win_create_dynamic and MPI_Win_attach together with
> PSCW (post-start-complete-wait) active synchronization, I am limited to a GDR
> message size of 32K. Values of MV2_GPUDIRECT_LIMIT larger than 32768 lead
> to incorrect (or no) data transmission, and the application crashes as a result.
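> 
> For completeness, the access pattern in my code boils down to roughly the following
> minimal sketch (dynamic window plus PSCW synchronization on device buffers; the buffer
> size, peer setup, and lack of error checking are purely illustrative, not my actual code):
> 
> #include <mpi.h>
> #include <cuda_runtime.h>
> 
> /* Sketch: MPI_Put through a dynamic window with PSCW sync, device buffers. */
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int rank, peer;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     peer = 1 - rank;                       /* run with -np 2 */
> 
>     size_t nbytes = 65536;                 /* message size under test */
>     void *dbuf;
>     cudaMalloc(&dbuf, nbytes);             /* buffer lives on the GPU */
> 
>     MPI_Win win;
>     MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>     MPI_Win_attach(win, dbuf, nbytes);
> 
>     /* exchange the address of the attached buffer to use as target displacement */
>     MPI_Aint disp_local = (MPI_Aint) dbuf, disp_remote;
>     MPI_Sendrecv(&disp_local, 1, MPI_AINT, peer, 0,
>                  &disp_remote, 1, MPI_AINT, peer, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>     MPI_Group world_group, peer_group;
>     MPI_Comm_group(MPI_COMM_WORLD, &world_group);
>     MPI_Group_incl(world_group, 1, &peer, &peer_group);
> 
>     if (rank == 0) {                       /* origin */
>         MPI_Win_start(peer_group, 0, win);
>         MPI_Put(dbuf, (int) nbytes, MPI_BYTE, peer, disp_remote,
>                 (int) nbytes, MPI_BYTE, win);
>         MPI_Win_complete(win);
>     } else {                               /* target */
>         MPI_Win_post(peer_group, 0, win);
>         MPI_Win_wait(win);
>     }
> 
>     MPI_Win_detach(win, dbuf);
>     MPI_Win_free(&win);
>     cudaFree(dbuf);
>     MPI_Finalize();
>     return 0;
> }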
> 
> Example:
> 
> $ MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=65536 mpirun -np 2 sh osu-micro-benchmarks-4.4/get_local_rank osu-micro-benchmarks-4.4/mpi/one-sided/osu_put_latency -w dynamic -s pscw D D
> # OSU MPI_Put-CUDA Latency Test v4.4
> # Window creation: MPI_Win_create_dynamic
> # Synchronization: MPI_Win_post/start/complete/wait
> # Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
> # Size          Latency (us)
> 0                       2.05
> 1                       6.99
> 2                       7.00
> 4                       7.02
> 8                       7.08
> 16                      6.98
> 32                      7.06
> 64                      7.04
> 128                     7.05
> 256                     7.23
> 512                     7.43
> 1024                    7.87
> 2048                    8.24
> 4096                   16.57
> 8192                   21.21
> 16384                  28.19
> 32768                  44.89
> [ivb126:mpi_rank_1][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1275: Got FATAL event 3
> 
> mlx5: ivb127: got completion with error:
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000000 00008813 08001046 378ee1d1
> [ivb127:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
> [ivb127:mpi_rank_0][handle_cqe] Msg from 1: wc.status=10, wc.wr_id=0x2404300, wc.opcode=0, vbuf->phead->type=38 = MPIDI_CH3_PKT_PUT
> [ivb127:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got completion with error 10, vendor code=0x88, dest rank=1
> 
> The benchmark completes fine with the default options (Win_flush_local and static windows) and then gives the expected ~3 us small-message
> latency. It also completes with the above parameters when I use MV2_GPUDIRECT_LIMIT=32768 instead of 65536.
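> 
> For reference, the same invocation completes when the limit is lowered to the working value, e.g.:
> 
> $ MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=32768 mpirun -np 2 sh osu-micro-benchmarks-4.4/get_local_rank osu-micro-benchmarks-4.4/mpi/one-sided/osu_put_latency -w dynamic -s pscw D D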
> 
> The test is running on GPU 0 across two nodes, which both have CUDA 6.5 and the following PCIe topology:
> 
> $ nvidia-smi topo -m
> 	GPU0	GPU1	GPU2	GPU3	mlx5_0	CPU Affinity
> GPU0	 X 	PIX	SOC	SOC	PHB	0,1,2,3,4,5,6,7,8,9
> GPU1	PIX	 X 	SOC	SOC	PHB	0,1,2,3,4,5,6,7,8,9
> GPU2	SOC	SOC	 X 	PHB	SOC	10,11,12,13,14,15,16,17,18,19
> GPU3	SOC	SOC	PHB	 X 	SOC	10,11,12,13,14,15,16,17,18,19
> mlx5_0	PHB	PHB	SOC	SOC	 X 	
> 
> Legend:
> 
>   X   = Self
>   SOC = Path traverses a socket-level link (e.g. QPI)
>   PHB = Path traverses a PCIe host bridge
>   PXB = Path traverses multiple PCIe internal switches
>   PIX = Path traverses a PCIe internal switch
> 
> Any tips/explanations are appreciated!
> 
> best
> Jens


-- 
Jonathan Perkins
