[mvapich-discuss] GPUDirect RDMA limitations with MPI3 RMA
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Tue Sep 2 13:39:54 EDT 2014
Hello Jens. Thank you for the report. We have been able to reproduce
this behavior and are investigating it further. We'll get back to you
once we are able to come up with a fix or workaround.
On Sun, Aug 31, 2014 at 04:38:07PM -0400, Jens Glaser wrote:
> Hi,
>
> I was experimenting a bit with GPUDirect RDMA and the new MVAPICH2 2.0 GDR release, to track down a bug that occurred persistently when
> using MPI3 RMA communication with GDR in my code.
>
> It turns out that when I use MPI_Win_create_dynamic and MPI_Win_attach, together with
> PSCW (post-start-complete-wait) active synchronization, I am limited to a GDR
> message size of 32K. Setting MV2_GPUDIRECT_LIMIT to values larger than 32768 leads
> to incorrect (or no) data transmission, and the application crashes as a result.
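> For reference, the pattern in question looks roughly like the following
> minimal host-memory sketch (names and sizes are illustrative; in the
> failing case the buffers would be CUDA device memory from cudaMalloc,
> and the transfer size would exceed MV2_GPUDIRECT_LIMIT). Run with
> mpirun -np 2:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                  /* assumes exactly 2 ranks */

    const MPI_Aint size = 65536;          /* larger than the 32768 limit */
    char *buf = malloc(size);

    /* Dynamic window: created empty, memory attached afterwards. */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, size);

    /* With dynamic windows the target displacement is the absolute
     * address of the attached memory, so exchange it with the peer. */
    MPI_Aint my_disp, target_disp;
    MPI_Get_address(buf, &my_disp);
    MPI_Sendrecv(&my_disp, 1, MPI_AINT, peer, 0,
                 &target_disp, 1, MPI_AINT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Group containing only the peer, for the PSCW epochs. */
    MPI_Group world_group, peer_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    if (rank == 0) {
        /* Origin: access epoch bracketed by start/complete. */
        MPI_Win_start(peer_group, 0, win);
        MPI_Put(buf, size, MPI_CHAR, peer, target_disp,
                size, MPI_CHAR, win);
        MPI_Win_complete(win);
    } else {
        /* Target: exposure epoch bracketed by post/wait. */
        MPI_Win_post(peer_group, 0, win);
        MPI_Win_wait(win);
    }

    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

> The osu_put_latency benchmark with -w dynamic -s pscw exercises
> essentially this sequence, with the buffers on the GPU.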
>
> Example:
>
> $ MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=65536 mpirun -np 2 sh osu-micro-benchmarks-4.4/get_local_rank osu-micro-benchmarks-4.4/mpi/one-sided/osu_put_latency -w dynamic -s pscw D D
> # OSU MPI_Put-CUDA Latency Test v4.4
> # Window creation: MPI_Win_create_dynamic
> # Synchronization: MPI_Win_post/start/complete/wait
> # Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
> # Size Latency (us)
> 0 2.05
> 1 6.99
> 2 7.00
> 4 7.02
> 8 7.08
> 16 6.98
> 32 7.06
> 64 7.04
> 128 7.05
> 256 7.23
> 512 7.43
> 1024 7.87
> 2048 8.24
> 4096 16.57
> 8192 21.21
> 16384 28.19
> 32768 44.89
> [ivb126:mpi_rank_1][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1275: Got FATAL event 3
>
> mlx5: ivb127: got completion with error:
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 00000000 00008813 08001046 378ee1d1
> [ivb127:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
> [ivb127:mpi_rank_0][handle_cqe] Msg from 1: wc.status=10, wc.wr_id=0x2404300, wc.opcode=0, vbuf->phead->type=38 = MPIDI_CH3_PKT_PUT
> [ivb127:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got completion with error 10, vendor code=0x88, dest rank=1
>
> The benchmark completes fine with default options (Win_flush_local and static windows) and then gives the expected ~3us small message
> latency. It also completes with the above parameters when I use MV2_GPUDIRECT_LIMIT=32768 instead of 65536.
>
> The test is running on GPU 0 across two nodes, which both have CUDA 6.5 and the following PCIe topology:
>
> $ nvidia-smi topo -m
> GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
> GPU0 X PIX SOC SOC PHB 0,1,2,3,4,5,6,7,8,9
> GPU1 PIX X SOC SOC PHB 0,1,2,3,4,5,6,7,8,9
> GPU2 SOC SOC X PHB SOC 10,11,12,13,14,15,16,17,18,19
> GPU3 SOC SOC PHB X SOC 10,11,12,13,14,15,16,17,18,19
> mlx5_0 PHB PHB SOC SOC X
>
> Legend:
>
> X = Self
> SOC = Path traverses a socket-level link (e.g. QPI)
> PHB = Path traverses a PCIe host bridge
> PXB = Path traverses multiple PCIe internal switches
> PIX = Path traverses a PCIe internal switch
>
> Any tips/explanations are appreciated!
>
> best
> Jens
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Jonathan Perkins