[mvapich-discuss] MPI_Alltoallv - performance question
Jens Glaser
jglaser at umn.edu
Wed Jun 12 21:36:31 EDT 2013
Hi,
I am writing a parallel FFT for GPUs. It relies on MPI_Alltoallv to do the data redistribution. With MVAPICH2 1.9b (the version installed on the node)
I am finding that most of the communication time is spent inside that function. Are there ways to tune the performance of that routine?
Specifically, I am running my code between two GPUs on the same IOH (mpirun -np 2), so MVAPICH2 should use IPC. All my data is on the device. In this first case,
I am sending large messages (2 MB). While the actual memory copies (cuMemcpyAsync) account for about 12.9% of the execution time, you can see that the
total fraction of time spent inside the MPI call is actually 70%! Even if some of the synchronization calls in the table should perhaps be counted toward that number,
I did not expect the host-side code of MPI_Alltoallv to take so much time.
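For context, the redistribution is essentially an Alltoallv transpose of a slab-decomposed array, so each rank sends an equal block of every local row to each peer. A minimal sketch of the count/displacement bookkeeping (the function name and the uniform decomposition are my illustration, not the actual code; plain Python just for clarity):

```python
# Sketch: count/displacement bookkeeping for an MPI_Alltoallv-based
# transpose of a slab-decomposed array over nprocs ranks.
# build_alltoallv_args is a hypothetical name, not from the real code.

def build_alltoallv_args(local_rows, cols, nprocs):
    """Each rank holds a local_rows x cols slab; the transpose sends a
    (cols // nprocs)-wide block of every local row to each peer."""
    block = cols // nprocs                      # columns destined per peer
    sendcounts = [local_rows * block] * nprocs  # elements sent to each rank
    sdispls = [i * local_rows * block for i in range(nprocs)]
    # With a uniform decomposition the receive side mirrors the send side.
    recvcounts = list(sendcounts)
    rdispls = list(sdispls)
    return sendcounts, sdispls, recvcounts, rdispls

counts, sdispls, rcounts, rdispls = build_alltoallv_args(256, 1024, 2)
# Two ranks, 256x1024 local slab: each peer gets 256*512 = 131072 elements.
```

With double-complex elements (16 bytes each), 131072 elements per message is exactly 2 MB, i.e. the large-message case described above.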
Here's the tau profile
(obtained with mpirun -n 2 tau_exec -T mpi,cupti -cupti my_program)
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Inclusive Name
msec total msec usec/call
---------------------------------------------------------------------------------------
100.0 1,826 14,795 1 21306 14795013 .TAU application
70.6 4,663 10,441 2000 2.5161E+06 5221 MPI_Alltoallv()
12.9 1,914 1,914 128000 0 15 cuMemcpyAsync
12.7 1,885 1,885 2.10748E+06 0 1 cuEventQuery [THROTTLED]
11.1 1,638 1,638 1000 0 1639 cuEventSynchronize
7.1 1,044 1,044 134004 0 8 cuEventRecord [THROTTLED]
6.0 891 891 128000 0 7 cuStreamWaitEvent [THROTTLED]
2.3 318 334 1 96 334490 MPI_Init()
1.7 251 251 1001 0 251 cuCtxSynchronize
0.9 139 139 2000 90 70 MPI_Barrier()
0.7 102 102 7000 0 15 cuLaunchKernel
0.2 27 27 2000 0 14 cuMemcpy
0.2 22 22 18709 0 1 cuPointerGetAttribute
0.1 10 10 1 0 10905 cuMemcpyHtoDAsync_v2
0.1 10 10 1 0 10095 cuMemcpyDtoH_v2
0.1 8 8 2 0 4251 cuMemHostRegister
0.1 8 8 9 0 894 cuMemAlloc_v2
0.0 6 6 4000 0 2 cuFuncSetCacheConfig
0.0 1 5 1 86 5830 MPI_Finalize()
0.0 5 5 294 0 18 cuDeviceGetAttribute
0.0 4 4 3 0 1358 cuIpcOpenMemHandle
0.0 3 3 1000 0 3 cuEventElapsedTime
0.0 2 2 2 0 1340 cuMemHostUnregister
0.0 1 1 7 0 271 cuMemFree_v2
0.0 1 1 3 0 380 cuIpcCloseMemHandle
etc.
If I change the code to send only very small messages (32 bytes), about 50% of the time is spent in MPI_Alltoallv, of which only 11% is used by cuMemcpyD2D and another
15% by synchronization, if I am reading the summary correctly. Is it true, then, that only about 50% of the time spent in the MPI call actually goes into data transfers?
That would correspond to a significant overhead!
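The large-message profile reads the same way if I do the arithmetic on the numbers in the table above (just a back-of-the-envelope check on the inclusive/exclusive times as I read them):

```python
# Back-of-the-envelope check against the TAU table above (large-message run).
app_total_ms = 14795   # .TAU application, inclusive msec
alltoallv_ms = 10441   # MPI_Alltoallv, inclusive msec
memcpy_ms = 1914       # cuMemcpyAsync, exclusive msec

# Fraction of total runtime spent inside MPI_Alltoallv (~0.706)
mpi_fraction = alltoallv_ms / app_total_ms
# Share of the Alltoallv time attributable to the device copies (~0.183)
copy_share = memcpy_ms / alltoallv_ms

print(f"MPI fraction of runtime: {mpi_fraction:.1%}")
print(f"copies as share of Alltoallv: {copy_share:.1%}")
```

So the cuMemcpyAsync calls account for well under a fifth of the time spent inside MPI_Alltoallv itself, which is what makes the host-side overhead look so large to me.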
Thanks
Jens