[mvapich-discuss] MPI_Alltoallv - performance question

Jens Glaser jglaser at umn.edu
Wed Jun 12 21:36:31 EDT 2013


Hi,

I am writing a parallel FFT for GPUs. It relies on MPI_Alltoallv for data redistribution. With MVAPICH2 1.9b (the version installed on the node),
I am finding that most of the time in the communication code is spent inside that function. Are there ways to tune the performance of that routine?
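
For context, the communication step boils down to the following (a minimal sketch, assuming a CUDA-aware MVAPICH2 build with MV2_USE_CUDA=1; buffer and count names are placeholders, not my actual code):

#include <mpi.h>
#include <cuda_runtime.h>

void redistribute(void *d_send, void *d_recv,
                  int *sendcounts, int *sdispls,
                  int *recvcounts, int *rdispls)
{
    /* d_send and d_recv are device pointers (cudaMalloc'ed);
       MVAPICH2 detects this and should use CUDA IPC for
       intra-node peer GPUs. */
    MPI_Alltoallv(d_send, sendcounts, sdispls, MPI_BYTE,
                  d_recv, recvcounts, rdispls, MPI_BYTE,
                  MPI_COMM_WORLD);
}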

Specifically, I am running my code between two GPUs on the same IOH (mpirun -np 2), so MVAPICH2 should use CUDA IPC. All my data is on the device. In this first case,
I am sending large messages (2 MB). While the actual memory copies (cuMemcpyAsync) account for about 12.9% of the execution time, you can see that the
total fraction of time spent inside the MPI call is actually 70%! Even if some of the synchronization calls in the table arguably belong to the data movement as well,
I did not expect the host-side part of MPI_Alltoallv to take so much time.
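
For what it's worth, the per-call cost can also be checked outside of TAU along these lines (a sketch reusing the placeholder buffers from above; the iteration count is arbitrary):

#include <mpi.h>
#include <cuda_runtime.h>

double time_alltoallv(void *d_send, void *d_recv,
                      int *scnt, int *sdsp, int *rcnt, int *rdsp,
                      int iters)
{
    cudaDeviceSynchronize();             /* drain pending GPU work */
    MPI_Barrier(MPI_COMM_WORLD);         /* align the ranks        */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i)
        MPI_Alltoallv(d_send, scnt, sdsp, MPI_BYTE,
                      d_recv, rcnt, rdsp, MPI_BYTE, MPI_COMM_WORLD);
    return (MPI_Wtime() - t0) / iters;   /* average seconds per call */
}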

Here's the TAU profile
(obtained with mpirun -n 2 tau_exec -T mpi,cupti -cupti my_program):

---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0        1,826       14,795           1       21306   14795013 .TAU application
 70.6        4,663       10,441        2000  2.5161E+06       5221 MPI_Alltoallv() 
 12.9        1,914        1,914      128000           0         15 cuMemcpyAsync
 12.7        1,885        1,885 2.10748E+06           0          1 cuEventQuery [THROTTLED]
 11.1        1,638        1,638        1000           0       1639 cuEventSynchronize
  7.1        1,044        1,044      134004           0          8 cuEventRecord [THROTTLED]
  6.0          891          891      128000           0          7 cuStreamWaitEvent [THROTTLED]
  2.3          318          334           1          96     334490 MPI_Init() 
  1.7          251          251        1001           0        251 cuCtxSynchronize
  0.9          139          139        2000          90         70 MPI_Barrier() 
  0.7          102          102        7000           0         15 cuLaunchKernel
  0.2           27           27        2000           0         14 cuMemcpy
  0.2           22           22       18709           0          1 cuPointerGetAttribute
  0.1           10           10           1           0      10905 cuMemcpyHtoDAsync_v2
  0.1           10           10           1           0      10095 cuMemcpyDtoH_v2
  0.1            8            8           2           0       4251 cuMemHostRegister
  0.1            8            8           9           0        894 cuMemAlloc_v2
  0.0            6            6        4000           0          2 cuFuncSetCacheConfig
  0.0            1            5           1          86       5830 MPI_Finalize() 
  0.0            5            5         294           0         18 cuDeviceGetAttribute
  0.0            4            4           3           0       1358 cuIpcOpenMemHandle
  0.0            3            3        1000           0          3 cuEventElapsedTime
  0.0            2            2           2           0       1340 cuMemHostUnregister
  0.0            1            1           7           0        271 cuMemFree_v2
  0.0            1            1           3           0        380 cuIpcCloseMemHandle

etc.

If I change the code to send only very small messages (32 bytes), about 50% of the total time is spent in MPI_Alltoallv, of which only 11% goes to cuMemcpyD2D and another
15% to synchronization, if I am reading the summary correctly. In other words, 11% + 15% = 26% of the total runtime out of the 50% spent inside MPI_Alltoallv,
so only about half of the time inside the MPI call is actually spent on data transfers and the associated synchronization.
This would correspond to a significant overhead!
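
As a sanity check for the large-message case, one could compare against the raw peer-to-peer copy bandwidth between the two GPUs, e.g. along these lines (standalone sketch; device IDs and the 2 MB size are placeholders matching my large-message case):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t nbytes = 2u << 20;       /* 2 MB, as in the large-message case */
    void *src, *dst;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     /* allow direct access to GPU 1 */
    cudaMalloc(&src, nbytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, nbytes);

    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpyPeer(dst, 1, src, 0, nbytes);  /* direct D2D copy over the IOH */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("raw P2P copy: %.3f ms (%.2f GB/s)\n", ms, nbytes / (ms * 1e6));
    return 0;
}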

Thanks
Jens

