[mvapich-discuss] Faster is Slower

Mon Aug 28 12:35:19 EDT 2006

Greetings Norm,

Taylor, Norm R. wrote:

>Hi,
>  I'm wondering if you can advise me on an issue I'm encountering with
>MPI+Infiniband, using MVAPICH. I'm finding that the high rate at which
>collective operations - e.g. MPI_GATHER - poll to determine if all nodes
>have entered the operation steals too many CPU cycles from other
>processes, slowing down overall performance. Is there a way I can tune
>these operations to be more CPU-efficient? I actually improve
>performance by adding a few microseconds of sleep time to the data
>transfer processes (these are the ones using MPI+Infiniband) to give
>more CPU cycles to the computational processes. This tuning is very
>specific to the problem at hand and the number of nodes in use. Tuning
>at the process level seems still inefficient - it would be better if the
>sleep time was applied inside the collective operations. Is there a way
>I can set a parameter somewhere to make that happen?
>  
>
Thanks for bringing this up on the list. In fact, we have thought about
this very situation, where one process busy-polls for a considerable
duration and steals CPU cycles from other "useful" processes.

MVAPICH has a mode (called BLOCKING_SUPPORT) in which an MPI process
does not busy-poll indefinitely, but instead yields the CPU to other
processes. The user can further fine-tune the "spin-block" threshold to
get the best CPU usage/message latency tradeoff for any specific
application.
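
To illustrate the general idea of a spin threshold, here is a minimal
sketch in C. It is not MVAPICH's implementation: the names
wait_spin_then_yield, poll_fn and countdown_poll are hypothetical, and
sched_yield() is only a stand-in for whatever mechanism the library
actually uses to give up the CPU.

```c
#include <sched.h>
#include <stdbool.h>

/* Hypothetical completion check: returns true once the awaited event arrived. */
typedef bool (*poll_fn)(void *ctx);

/*
 * Poll until poll() succeeds. After every max_spin_count unsuccessful
 * polls, give up the CPU once and start counting again.
 * Returns how many times the CPU was yielded.
 */
static unsigned long wait_spin_then_yield(poll_fn poll, void *ctx,
                                          unsigned long max_spin_count)
{
    unsigned long spins = 0;
    unsigned long yields = 0;

    while (!poll(ctx)) {
        if (++spins >= max_spin_count) {
            sched_yield();   /* stand-in: let other processes run */
            yields++;
            spins = 0;
        }
    }
    return yields;
}

/* Toy event source for illustration: "completes" after a fixed number of polls. */
struct countdown {
    unsigned long remaining;
};

static bool countdown_poll(void *ctx)
{
    struct countdown *c = ctx;
    if (c->remaining == 0)
        return true;
    c->remaining--;
    return false;
}
```

With this toy event source, a small threshold hands the CPU back many
times during one wait, while a large threshold may never yield at all
before the event arrives.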

In order to activate this mode, you can follow the instructions given in 
Section 4.4.1 in our user guide. Look under the bullet "Customize 
MVAPICH configuration" -> Blocking Progress.

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html#x1-100004.4.1

In order to further fine tune your application, you can adjust the spin 
count (after which the process yields the CPU) using the environment 
variable VIADEV_MAX_SPIN_COUNT.

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html#x1-860009.30

For example, a latency-sensitive application could set this parameter
high (like 20,000-30,000) so that the application yields the CPU less
often. On the other hand, an application which wants to yield the CPU as
often as possible can set this parameter low (like 20-30).
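
As a rough back-of-the-envelope illustration of this tradeoff, the
sketch below counts how often a process would yield during a single
wait. It assumes a simplified model (one yield after every
max_spin_count unsuccessful polls) and a made-up wait length of
1,000,000 polls; yields_for_wait is a hypothetical helper, not part of
MVAPICH.

```c
/*
 * Simplified model (an assumption, not measured behaviour): the process
 * yields once after every max_spin_count unsuccessful polls.
 * Returns the number of yields during a wait lasting polls_needed polls.
 */
static unsigned long yields_for_wait(unsigned long polls_needed,
                                     unsigned long max_spin_count)
{
    if (max_spin_count == 0)
        return polls_needed;          /* degenerate case: yield on every poll */
    return polls_needed / max_spin_count;
}
```

Under this model, a wait of 1,000,000 polls yields 40,000 times with a
spin count of 25, but only 40 times with a spin count of 25,000.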

Please let us know if this answers your question.

Thanks,
Sayantan.

-- 
http://www.cse.ohio-state.edu/~surs