[mvapich-discuss] GPUDirect RDMA limitations with MPI3 RMA

Jens Glaser jsglaser at umich.edu
Sun Aug 31 16:38:07 EDT 2014


Hi,

I have been experimenting a bit with GPUDirect RDMA and the new MVAPICH2 2.0 GDR release to track down a bug that occurs persistently when
using MPI-3 RMA communication with GDR in my code.

It turns out that when I use MPI_Win_create_dynamic and MPI_Win_attach together with
PSCW (post-start-complete-wait) active synchronization, I am limited to a GDR
message size of 32 KB. Setting MV2_GPUDIRECT_LIMIT to values larger than 32768 leads
to incorrect (or no) data transmission, and the application crashes as a result.
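
For reference, the failing pattern corresponds roughly to the following (just a
minimal sketch with placeholder names, not my actual code; d_sendbuf and
d_recvbuf stand for cudaMalloc'd device buffers):

/* Minimal sketch of the failing pattern: an MPI_Put from device memory into a
 * dynamically attached window, synchronized with post/start/complete/wait
 * between two ranks. */
#include <mpi.h>

void put_pscw(void *d_recvbuf, void *d_sendbuf, int nbytes,
              int peer, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Group world_grp, peer_grp;
    MPI_Aint laddr, raddr;

    MPI_Comm_group(comm, &world_grp);
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);

    /* dynamic window: memory is attached after window creation */
    MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);
    MPI_Win_attach(win, d_recvbuf, nbytes);

    /* for dynamic windows the target displacement is the absolute address */
    MPI_Get_address(d_recvbuf, &laddr);
    MPI_Sendrecv(&laddr, 1, MPI_AINT, peer, 0,
                 &raddr, 1, MPI_AINT, peer, 0, comm, MPI_STATUS_IGNORE);

    /* PSCW active synchronization, symmetric between the two ranks */
    MPI_Win_post(peer_grp, 0, win);     /* expose local window to peer */
    MPI_Win_start(peer_grp, 0, win);    /* open access epoch to peer   */
    MPI_Put(d_sendbuf, nbytes, MPI_BYTE, peer, raddr, nbytes, MPI_BYTE, win);
    MPI_Win_complete(win);              /* end access epoch            */
    MPI_Win_wait(win);                  /* end exposure epoch          */

    MPI_Win_detach(win, d_recvbuf);
    MPI_Win_free(&win);
    MPI_Group_free(&peer_grp);
    MPI_Group_free(&world_grp);
}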

Benchmark example:

$ MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=65536 mpirun -np 2 sh osu-micro-benchmarks-4.4/get_local_rank osu-micro-benchmarks-4.4/mpi/one-sided/osu_put_latency -w dynamic -s pscw D D
# OSU MPI_Put-CUDA Latency Test v4.4
# Window creation: MPI_Win_create_dynamic
# Synchronization: MPI_Win_post/start/complete/wait
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size          Latency (us)
0                       2.05
1                       6.99
2                       7.00
4                       7.02
8                       7.08
16                      6.98
32                      7.06
64                      7.04
128                     7.05
256                     7.23
512                     7.43
1024                    7.87
2048                    8.24
4096                   16.57
8192                   21.21
16384                  28.19
32768                  44.89
[ivb126:mpi_rank_1][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1275: Got FATAL event 3

mlx5: ivb127: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08001046 378ee1d1
[ivb127:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[ivb127:mpi_rank_0][handle_cqe] Msg from 1: wc.status=10, wc.wr_id=0x2404300, wc.opcode=0, vbuf->phead->type=38 = MPIDI_CH3_PKT_PUT
[ivb127:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got completion with error 10, vendor code=0x88, dest rank=1

The benchmark completes fine with the default options (Win_flush_local and static windows) and gives the expected ~3 us small-message
latency. It also completes with the above parameters when I use MV2_GPUDIRECT_LIMIT=32768 instead of 65536.
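
For comparison, the configuration that works at all sizes corresponds roughly to
this (again just a sketch, same placeholder device buffers as above):

/* Static window created directly over the device buffer, with passive-target
 * synchronization and MPI_Win_flush_local for local completion of the put. */
void put_flush_local(void *d_recvbuf, void *d_sendbuf, int nbytes,
                     int peer, MPI_Comm comm)
{
    MPI_Win win;

    /* static window: the device buffer is registered at creation time */
    MPI_Win_create(d_recvbuf, nbytes, 1, MPI_INFO_NULL, comm, &win);

    MPI_Win_lock(MPI_LOCK_SHARED, peer, 0, win);
    MPI_Put(d_sendbuf, nbytes, MPI_BYTE, peer, 0, nbytes, MPI_BYTE, win);
    MPI_Win_flush_local(peer, win);   /* local completion of the put */
    MPI_Win_unlock(peer, win);

    MPI_Win_free(&win);
}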

The test is running on GPU 0 across two nodes, both of which have CUDA 6.5 and the following PCIe topology:

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	mlx5_0	CPU Affinity
GPU0	 X 	PIX	SOC	SOC	PHB	0,1,2,3,4,5,6,7,8,9
GPU1	PIX	 X 	SOC	SOC	PHB	0,1,2,3,4,5,6,7,8,9
GPU2	SOC	SOC	 X 	PHB	SOC	10,11,12,13,14,15,16,17,18,19
GPU3	SOC	SOC	PHB	 X 	SOC	10,11,12,13,14,15,16,17,18,19
mlx5_0	PHB	PHB	SOC	SOC	 X 	

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Any tips/explanations are appreciated!

best
Jens