[mvapich-discuss] Re: MPI_Alltoallv - performance question

Jens Glaser jglaser at umn.edu
Wed Jun 12 23:02:52 EDT 2013


Hi,

one update: after tentatively replacing the Alltoallv with Isend/Irecv pairs, performance improves only slightly (5-10%)
for my small n=2 test case. This means I am limited by real communication costs, not by Alltoallv overhead, even if that is not immediately
clear from the profiling. I have therefore probably answered my own question.
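
In case it is useful to anyone, the replacement was essentially per-peer non-blocking pairs along these lines (a sketch only; buffer, count and displacement names are placeholders, and the element type is assumed to be MPI_BYTE):

  /* d_send/d_recv: char * CUDA device buffers; sendcounts/sdispls/
     recvcounts/rdispls: the same arrays previously passed to
     MPI_Alltoallv; nranks: communicator size (2 in my test case) */
  MPI_Request reqs[2 * nranks];
  int nreq = 0;
  for (int p = 0; p < nranks; ++p)
      MPI_Irecv(d_recv + rdispls[p], recvcounts[p], MPI_BYTE,
                p, 0, MPI_COMM_WORLD, &reqs[nreq++]);
  for (int p = 0; p < nranks; ++p)
      MPI_Isend(d_send + sdispls[p], sendcounts[p], MPI_BYTE,
                p, 0, MPI_COMM_WORLD, &reqs[nreq++]);
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);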

Jens

On Jun 12, 2013, at 8:36 PM, Jens Glaser <jglaser at umn.edu> wrote:

> Hi,
> 
> I am writing a parallel FFT for GPUs. It relies on MPI_Alltoallv for data redistribution. With MVAPICH2 1.9b (the version installed on the node),
> I am finding that most of the time in the communication code is spent inside that function. Are there ways to tune the performance of that routine?
> 
> Specifically, I am running my code between two GPUs on the same IOH (mpirun -np 2), so MVAPICH2 should use IPC. All my data is on the device. In this first case,
> I am sending large messages (2MB). While the actual memory copies (cuMemcpyAsync) account for about 12.9% of the execution time, you can see that the
> total fraction of time spent inside the MPI call is actually 70%! Even if some of the synchronization calls in the table should possibly be added to that number,
> I did not expect the host code of MPI_Alltoallv to take so much time.
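> 
> For reference, the call is essentially the following (just a sketch with placeholder names; the point is that the buffers are plain CUDA device pointers passed directly to MPI, relying on MVAPICH2 being built with CUDA support and run with MV2_USE_CUDA=1):
> 
>   /* d_send/d_recv: device buffers from cudaMalloc (placeholder names);
>      counts and displacements are in bytes since the datatype is MPI_BYTE */
>   MPI_Alltoallv(d_send, sendcounts, sdispls, MPI_BYTE,
>                 d_recv, recvcounts, rdispls, MPI_BYTE,
>                 MPI_COMM_WORLD);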
> 
> Here's the tau profile
> (obtained with mpirun -n 2 tau_exec -T mpi,cupti -cupti my_program)
> 
> ---------------------------------------------------------------------------------------
> %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
>              msec   total msec                          usec/call 
> ---------------------------------------------------------------------------------------
> 100.0        1,826       14,795           1       21306   14795013 .TAU application
> 70.6        4,663       10,441        2000  2.5161E+06       5221 MPI_Alltoallv() 
> 12.9        1,914        1,914      128000           0         15 cuMemcpyAsync
> 12.7        1,885        1,885 2.10748E+06           0          1 cuEventQuery [THROTTLED]
> 11.1        1,638        1,638        1000           0       1639 cuEventSynchronize
>  7.1        1,044        1,044      134004           0          8 cuEventRecord [THROTTLED]
>  6.0          891          891      128000           0          7 cuStreamWaitEvent [THROTTLED]
>  2.3          318          334           1          96     334490 MPI_Init() 
>  1.7          251          251        1001           0        251 cuCtxSynchronize
>  0.9          139          139        2000          90         70 MPI_Barrier() 
>  0.7          102          102        7000           0         15 cuLaunchKernel
>  0.2           27           27        2000           0         14 cuMemcpy
>  0.2           22           22       18709           0          1 cuPointerGetAttribute
>  0.1           10           10           1           0      10905 cuMemcpyHtoDAsync_v2
>  0.1           10           10           1           0      10095 cuMemcpyDtoH_v2
>  0.1            8            8           2           0       4251 cuMemHostRegister
>  0.1            8            8           9           0        894 cuMemAlloc_v2
>  0.0            6            6        4000           0          2 cuFuncSetCacheConfig
>  0.0            1            5           1          86       5830 MPI_Finalize() 
>  0.0            5            5         294           0         18 cuDeviceGetAttribute
>  0.0            4            4           3           0       1358 cuIpcOpenMemHandle
>  0.0            3            3        1000           0          3 cuEventElapsedTime
>  0.0            2            2           2           0       1340 cuMemHostUnregister
>  0.0            1            1           7           0        271 cuMemFree_v2
>  0.0            1            1           3           0        380 cuIpcCloseMemHandle
> 
> etc.
> 
> If I change the code to send only very small messages (32 bytes), about 50% of the time is spent in MPI_Alltoallv, of which only 11% is used by cuMemcpyD2D and another
> 15% by synchronization, if I am reading the summary correctly. Is it true, then, that only about 50% of the time spent in the MPI call actually goes to data transfers?
> That would correspond to a significant overhead!
> 
> Thanks
> Jens



