[mvapich-discuss] GPUDirect RDMA limitations with MPI3 RMA
Jens Glaser
jsglaser at umich.edu
Sun Aug 31 16:38:07 EDT 2014
Hi,
I was experimenting a bit with GPUDirect RDMA and the new MVAPICH2 2.0 GDR release, to track down a bug that occurred persistently when
using MPI-3 RMA communication with GDR in my code.
It turns out that when I use MPI_Win_create_dynamic and MPI_Win_attach together with
PSCW (post-start-complete-wait) active synchronization, I am limited to a GDR
message size of 32K. Values of MV2_GPUDIRECT_LIMIT larger than 32768 lead
to incorrect (or no) data transmission, and the application crashes as a result.
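For reference, the communication pattern in question boils down to something like the following sketch (not my actual code; two ranks assumed, d_buf is a device buffer allocated elsewhere with cudaMalloc, and error checking is omitted for brevity):

```c
#include <mpi.h>
#include <stddef.h>

/* Rank 0 puts nbytes from its device buffer into rank 1's attached
 * device region, using a dynamic window and PSCW synchronization. */
void put_pscw(void *d_buf, size_t nbytes, int rank)
{
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, d_buf, nbytes);

    /* Dynamic windows address remote memory by absolute address,
     * so the target's buffer address has to be exchanged first. */
    MPI_Aint laddr, raddr;
    MPI_Get_address(d_buf, &laddr);
    MPI_Sendrecv(&laddr, 1, MPI_AINT, 1 - rank, 0,
                 &raddr, 1, MPI_AINT, 1 - rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Build the single-peer group used by post/start. */
    MPI_Group world, peer;
    int p = 1 - rank;
    MPI_Comm_group(MPI_COMM_WORLD, &world);
    MPI_Group_incl(world, 1, &p, &peer);

    if (rank == 0) {                      /* origin */
        MPI_Win_start(peer, 0, win);
        /* This Put is where transfers above 32K go wrong with GDR. */
        MPI_Put(d_buf, (int)nbytes, MPI_CHAR, 1, raddr,
                (int)nbytes, MPI_CHAR, win);
        MPI_Win_complete(win);
    } else {                              /* target */
        MPI_Win_post(peer, 0, win);
        MPI_Win_wait(win);
    }

    MPI_Group_free(&peer);
    MPI_Group_free(&world);
    MPI_Win_detach(win, d_buf);
    MPI_Win_free(&win);
}
```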
Example:
$ MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=65536 mpirun -np 2 sh osu-micro-benchmarks-4.4/get_local_rank osu-micro-benchmarks-4.4/mpi/one-sided/osu_put_latency -w dynamic -s pscw D D
# OSU MPI_Put-CUDA Latency Test v4.4
# Window creation: MPI_Win_create_dynamic
# Synchronization: MPI_Win_post/start/complete/wait
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size Latency (us)
0 2.05
1 6.99
2 7.00
4 7.02
8 7.08
16 6.98
32 7.06
64 7.04
128 7.05
256 7.23
512 7.43
1024 7.87
2048 8.24
4096 16.57
8192 21.21
16384 28.19
32768 44.89
[ivb126:mpi_rank_1][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1275: Got FATAL event 3
mlx5: ivb127: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08001046 378ee1d1
[ivb127:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[ivb127:mpi_rank_0][handle_cqe] Msg from 1: wc.status=10, wc.wr_id=0x2404300, wc.opcode=0, vbuf->phead->type=38 = MPIDI_CH3_PKT_PUT
[ivb127:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got completion with error 10, vendor code=0x88, dest rank=1
The benchmark completes fine with the default options (MPI_Win_flush_local and static windows), giving the expected ~3 us small-message
latency; it also completes with the above parameters when I use MV2_GPUDIRECT_LIMIT=32768 instead of 65536.
The test is running on GPU 0 across two nodes, which both have CUDA 6.5 and the following PCIe topology:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
GPU0 X PIX SOC SOC PHB 0,1,2,3,4,5,6,7,8,9
GPU1 PIX X SOC SOC PHB 0,1,2,3,4,5,6,7,8,9
GPU2 SOC SOC X PHB SOC 10,11,12,13,14,15,16,17,18,19
GPU3 SOC SOC PHB X SOC 10,11,12,13,14,15,16,17,18,19
mlx5_0 PHB PHB SOC SOC X
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
Any tips/explanations are appreciated!
best
Jens