[mvapich-discuss] Default of MV2_ENABLE_AFFINITY: why 1?

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Jun 18 21:48:19 EDT 2012


On Mon, Jun 18, 2012 at 01:17:28PM -0400, Stephen Cousins wrote:
> Hi,
> 
> I am revisiting some affinity issues. For jobs that run on all cores of a
> node, I can see that having MV2_ENABLE_AFFINITY=1 is beneficial. However,
> if users run jobs on only a subset of a node's cores, so that other jobs
> may be scheduled on the same node, this is a big problem. The first job is
> fine, but the second job gets pinned to the same cores as the first and
> performance drops dramatically.  We are using Torque and Moab.
> 
> I checked the mvapich2 configure script to see if there was a way to
> compile the code with a different default, but I didn't find one. Rather
> than changing the code, I have set the environment variable in the module
> file that loads the MVAPICH2 environment.
> 
> To my mind, 0 is the better default, because the consequence of having it
> set wrong is far worse than the benefit of having it set right. That is,
> if you get it wrong you currently see at least a 100% time penalty in
> your job, whereas if you get it right (that is, you really do want
> affinity set) you gain maybe 10% to 20%.
> 
> In general, I'd much rather have affinity enabled, just not the way it is
> currently implemented. How about this: when affinity is enabled and new
> processes are started, make sure they are placed on cores that aren't
> already in use, at least not by other MVAPICH2 programs. For non-MVAPICH2
> programs (at least the ones I'm seeing with CHARM or OpenMP; I still have
> to check OpenMPI jobs), the Linux scheduler seems to bounce them around
> appropriately, scattering them among the free sockets/cores.
> 
> I have seen on the list that the general answer to this problem is to use
> CPU mapping, but unless that can be done automatically by Moab/Torque it
> will not work for us. For one thing, each node assigned to the job may
> need a different mapping, depending on what else is running on that node
> (see the sketch further down).
> 
> What do you think?
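
As a rough illustration of the per-node mapping idea above: a small helper
run on each allocated node at launch time could look at which cores other
pinned processes already occupy and print a setting for that node.  This is
only a sketch under stated assumptions: the /proc scan, the busy_cores /
free_cores helpers, and the "pinned means taken" heuristic are illustrative
and not something MVAPICH2, Torque, or Moab provides.  It emits the
colon-separated core-list form of MV2_CPU_MAPPING and falls back to
MV2_ENABLE_AFFINITY=0 if no cores look free.

    #!/usr/bin/env python3
    # Hypothetical per-node helper: print an affinity setting for a new job
    # based on which cores already belong to pinned processes on this node.
    import os

    def busy_cores():
        # Union of the affinity masks of processes that are bound to a
        # strict subset of the node's cores (i.e. they look pinned).
        ncores = os.cpu_count()
        busy = set()
        for entry in os.listdir('/proc'):
            if not entry.isdigit() or int(entry) == os.getpid():
                continue
            try:
                mask = os.sched_getaffinity(int(entry))
            except OSError:
                continue                # process exited or not readable
            if len(mask) < ncores:
                busy |= mask
        return busy

    def free_cores():
        return sorted(set(range(os.cpu_count())) - busy_cores())

    if __name__ == '__main__':
        free = free_cores()
        if free:
            # e.g. "4:5:6:7": rank i of the new job goes to core free[i]
            print('MV2_CPU_MAPPING=' + ':'.join(str(c) for c in free))
        else:
            print('MV2_ENABLE_AFFINITY=0')

A job prologue or mpirun wrapper could run something like this on each node
and export the printed variable before the MPI processes start; since every
node computes its own free set, the mapping naturally differs from node to
node.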

Thank you for your note.  We are discussing the issue you're raising to
see whether there is anything more we can do to address this situation.

Ideally, a job manager such as Torque or SLURM would set the cpuset that
each job is allowed to run on.  If that is configured, this should not be
an issue.
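
For what it's worth, it is easy to check from inside a job whether such a
cpuset is actually in effect.  A minimal check (assuming a Linux node and a
batch system that confines each job with a cpuset/cgroup):

    # Print the cores this job is allowed to use.  When the batch system
    # places the job in a cpuset, the mask returned here is already
    # restricted to the job's allocation, and any binding done inside the
    # job (including MVAPICH2's) cannot escape it.
    import os

    print("cores available to this job:", sorted(os.sched_getaffinity(0)))

If two jobs share a node under such a setup, each one should report a
disjoint set of cores here.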

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

