[mvapich-discuss] The cost of MPI_Fence()?

wei huang huanwei at cse.ohio-state.edu
Sat Mar 18 10:30:55 EST 2006


Hi Guangming,

> Thank you for your suggestions!
> I have profiled my program in detail. The time cost of computation differs
> from that of communication in each loop iteration; sometimes the computation
> costs much more or much less than the communication.
> The times are measured in the non-overlapped communication/computation version as follows:
>  for (i = 0; i < loops; i++) {
>      comm_start = MPI_Wtime();
>      MPI_Win_fence(0, win); // moving this outside the loop cannot guarantee correct results
>      for (dest = 0; dest < p; dest++)
>          if (id != dest)
>              MPI_Put(...);
>      MPI_Win_fence(0, win);
>      comm_end = MPI_Wtime();
>      comm_time += (comm_end - comm_start);
>      comp_start = MPI_Wtime();
>      computation();
>      comp_end = MPI_Wtime();
>      comp_time += (comp_end - comp_start);
>  }

MPI_Wtime may not be a good indicator if the computation is very small and
finishes in microseconds. You may consider reading a cycle counter in
assembly, e.g. 'rdtsc'.
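
A rough sketch of such a cycle counter on x86 (assuming GCC-style inline
assembly; the counts are only meaningful relative to each other unless you
convert them with the TSC frequency of your machine):

    #include <stdint.h>

    /* read the x86 time-stamp counter via the rdtsc instruction */
    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* usage:
     *   uint64_t t0 = read_tsc();
     *   computation();
     *   comp_cycles += read_tsc() - t0;
     */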

> However, I have found another surprising case:
> I replaced MPI_Put with MPI_Alltoall. The experimental results show that
> MPI_Alltoall performs better than the one-sided MPI_Put version, even though
> I overlap computation with communication by moving the computation before the
> second fence. So I wonder whether collective communications such as
> MPI_Alltoall have also been optimized using one-sided communication. But I
> compiled the program with MPI-1, and the performance of MPI_Alltoall is
> independent of MPI-1/MPI-2.

The MPI_Alltoall in MPICH-2 is optimized at the algorithm level. It seems that
your communication pattern is actually an all-to-all exchange, so using
MPI_Alltoall may give better performance. By the way, you may need to remove
the first fence if you are moving the computation before the second
MPI_Win_fence: since a fence is itself a collective operation, two consecutive
calls definitely waste time.
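
To illustrate what I mean, here is a rough sketch of the overlapped loop
(buffer names, counts and datatypes are just placeholders for whatever your
code actually uses):

    /* open the first access/exposure epoch */
    MPI_Win_fence(0, win);
    for (i = 0; i < loops; i++) {
        for (dest = 0; dest < p; dest++)
            if (dest != id)
                MPI_Put(sendbuf + dest * count, count, MPI_DOUBLE,
                        dest, id * count, count, MPI_DOUBLE, win);

        computation();   /* overlapped with the outstanding puts */

        /* this fence completes the current epoch and also opens the
           next one, so no separate "first" fence is needed */
        MPI_Win_fence(0, win);
    }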

Thanks!


-- Wei
