[mvapich-discuss] Re: MPI_Alltoallv - performance question
Jens Glaser
jglaser at umn.edu
Wed Jun 12 23:02:52 EDT 2013
Hi,
one update: after tentatively replacing MPI_Alltoallv with pairs of MPI_Isend/MPI_Irecv, performance improves only slightly (5-10%)
for my small n=2 test case. This suggests I am limited by real communication costs, not by MPI_Alltoallv overhead, even if that is not immediately
obvious from the profiling. So I have probably answered my own question.
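
For reference, a minimal sketch of the kind of substitution I tried (illustrative only: the names, the MPI_BYTE payloads, and the missing error handling are simplifications of my actual code):

#include <mpi.h>
#include <stdlib.h>

/* Emulate MPI_Alltoallv with nonblocking point-to-point pairs.
 * Arguments mirror those of MPI_Alltoallv: per-peer byte counts
 * and displacements into contiguous send/receive buffers. */
void alltoallv_p2p(const char *sendbuf, const int *sendcounts,
                   const int *sdispls, char *recvbuf,
                   const int *recvcounts, const int *rdispls,
                   MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);
    MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

    /* Post all receives first, then the matching sends. */
    for (int i = 0; i < size; i++)
        MPI_Irecv(recvbuf + rdispls[i], recvcounts[i], MPI_BYTE,
                  i, 0, comm, &reqs[i]);
    for (int i = 0; i < size; i++)
        MPI_Isend(sendbuf + sdispls[i], sendcounts[i], MPI_BYTE,
                  i, 0, comm, &reqs[size + i]);

    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
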
Jens
On Jun 12, 2013, at 8:36 PM, Jens Glaser <jglaser at umn.edu> wrote:
> Hi,
>
> I am writing a parallel FFT for GPUs. It relies on MPI_Alltoallv for data redistribution. With MVAPICH2 1.9b (the version installed on the node),
> I find that most of the communication code's time is spent inside that function. Are there ways to tune the performance of that routine?
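>
> For context, the redistribution step is essentially the following (a simplified sketch with illustrative names; d_send and d_recv are CUDA device pointers passed directly to MPI, relying on MVAPICH2's CUDA support):
>
> /* Simplified sketch of the redistribution step; counts and
>  * displacements are in bytes, and both buffers live in GPU memory. */
> MPI_Alltoallv(d_send, sendcounts, sdispls, MPI_BYTE,
>               d_recv, recvcounts, rdispls, MPI_BYTE,
>               MPI_COMM_WORLD);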
>
> Specifically, I am running my code between two GPUs on the same IOH (mpirun -np 2), so MVAPICH2 should use CUDA IPC. All my data is on the device. In this first case,
> I am sending large messages (2 MB). While the actual memory copies (cuMemcpyAsync) account for about 12.9% of the execution time, you can see that the
> total fraction of time spent inside the MPI call is actually 70%! Even if some of the synchronization calls in the table should arguably be added to that number,
> I did not expect the host-side code of MPI_Alltoallv to take so much time.
>
> Here's the TAU profile
> (obtained with mpirun -n 2 tau_exec -T mpi,cupti -cupti my_program)
>
> ---------------------------------------------------------------------------------------
> %Time    Exclusive    Inclusive        #Call      #Subrs   Inclusive  Name
>               msec   total msec                            usec/call
> ---------------------------------------------------------------------------------------
> 100.0        1,826       14,795            1       21306    14795013  .TAU application
>  70.6        4,663       10,441         2000  2.5161E+06        5221  MPI_Alltoallv()
>  12.9        1,914        1,914       128000           0          15  cuMemcpyAsync
>  12.7        1,885        1,885  2.10748E+06           0           1  cuEventQuery [THROTTLED]
>  11.1        1,638        1,638         1000           0        1639  cuEventSynchronize
>   7.1        1,044        1,044       134004           0           8  cuEventRecord [THROTTLED]
>   6.0          891          891       128000           0           7  cuStreamWaitEvent [THROTTLED]
>   2.3          318          334            1          96      334490  MPI_Init()
>   1.7          251          251         1001           0         251  cuCtxSynchronize
>   0.9          139          139         2000          90          70  MPI_Barrier()
>   0.7          102          102         7000           0          15  cuLaunchKernel
>   0.2           27           27         2000           0          14  cuMemcpy
>   0.2           22           22        18709           0           1  cuPointerGetAttribute
>   0.1           10           10            1           0       10905  cuMemcpyHtoDAsync_v2
>   0.1           10           10            1           0       10095  cuMemcpyDtoH_v2
>   0.1            8            8            2           0        4251  cuMemHostRegister
>   0.1            8            8            9           0         894  cuMemAlloc_v2
>   0.0            6            6         4000           0           2  cuFuncSetCacheConfig
>   0.0            1            5            1          86        5830  MPI_Finalize()
>   0.0            5            5          294           0          18  cuDeviceGetAttribute
>   0.0            4            4            3           0        1358  cuIpcOpenMemHandle
>   0.0            3            3         1000           0           3  cuEventElapsedTime
>   0.0            2            2            2           0        1340  cuMemHostUnregister
>   0.0            1            1            7           0         271  cuMemFree_v2
>   0.0            1            1            3           0         380  cuIpcCloseMemHandle
>
> etc.
>
> If I change the code to send only very small messages (32 bytes), about 50% of the time is spent in MPI_Alltoallv, of which only 11% goes to cuMemcpyD2D and another
> 15% to synchronization, if I am reading the summary correctly. Is it true, then, that only about 50% of the time spent in the MPI call is actually used for data transfers?
> That would amount to significant overhead!
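>
> To check this outside the profiler, I suppose one could time the call directly and compare against the expected transfer time (a sketch, reusing the illustrative names from above; NITER is a hypothetical iteration count):
>
> /* Time the collective with MPI_Wtime, averaged over NITER calls. */
> MPI_Barrier(MPI_COMM_WORLD);              /* line the ranks up first */
> double t0 = MPI_Wtime();
> for (int it = 0; it < NITER; it++)
>     MPI_Alltoallv(d_send, sendcounts, sdispls, MPI_BYTE,
>                   d_recv, recvcounts, rdispls, MPI_BYTE,
>                   MPI_COMM_WORLD);
> double per_call = (MPI_Wtime() - t0) / NITER, worst;
> /* The slowest rank defines the effective cost of the exchange. */
> MPI_Reduce(&per_call, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);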
>
> Thanks
> Jens