[mvapich-discuss] Core binding oversubscription with batch schedulers

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Sep 12 12:56:07 EDT 2013


On Thu, Sep 12, 2013 at 09:37:41AM +0100, Mark Dixon wrote:
> On Wed, 11 Sep 2013, Lockwood, Glenn wrote:
> ...
> >As I understand it, Grid Engine's "affinity" options only populate
> >a machinefile ($pe_nodefile or something), and it is up to the MPI
> >task launcher (not the MPI stack itself) to grok that file's
> >contents and make the appropriate changes to the MPI stack's
> >binding options to actually have any effect.  Resource managers
> >like Torque create explicit cpusets on nodes for jobs, and these
> >make MVAPICH2's default binding (binding the first rank to the
> >first eligible core, etc) work automatically.
> 
> Hi Glenn,
> 
> Gridengine has a few different affinity options, including the one
> you mention. The MPI stack (in which I would include the launcher,
> BTW) used by MVAPICH2 completely ignores those fields in the
> PE_HOSTFILE.
> 
> Instead, I much prefer gridengine's affinity option which sets the
> core affinity as seen by numactl. This method means that any non-MPI
> programs run from the job script are also constrained to the
> allocated cores.
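
(For illustration of the option Mark describes: the binding is requested at
submit time and the job shell's resulting mask can be checked directly.  The
PE name and -binding syntax shown here are assumptions and vary by gridengine
version and site.)

Example:
# request core binding at submit time (assumed PE name and syntax)
qsub -pe mpi 8 -binding linear:8 job.sh
# inside job.sh: show the cores the job shell is constrained to
taskset -cp $$
numactl --show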

Can you elaborate on this?  My understanding is that gridengine provides
an rsh replacement which, when used by mpirun_rsh, results in the desired
behavior: the MPI library only sees the cores allocated to the job by grid
engine on each node (it may use numactl under the hood).  Is this correct?

If so, you can activate this rsh replacement by setting the environment
variable RSH_CMD to the appropriate command at configure time.

Example:
./configure --prefix=/install/prefix RSH_CMD=/the/rsh/replacement ...

Then at runtime you will use the -rsh option of mpirun_rsh.

Example:
mpirun_rsh -rsh -np $NSLOTS ...
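
A fuller sketch of how this might sit in a gridengine job script (the awk
step that expands $PE_HOSTFILE into a plain one-hostname-per-slot file is an
assumption about your setup, not something mpirun_rsh itself requires):

Example:
# expand the PE_HOSTFILE (host  slots  ...) into one hostname per slot
awk '{for (i = 0; i < $2; i++) print $1}' $PE_HOSTFILE > $TMPDIR/machines
mpirun_rsh -rsh -np $NSLOTS -hostfile $TMPDIR/machines ./a.out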

> 
> Interesting that you say that the correct behaviour is seen with
> cpusets. I'll look into that, but cpuset support in gridengine isn't
> there yet.
> 
> 
> >You (or others) may disagree, but I see this as a resource
> >manager/scheduler issue, not really an MPI issue.  A workaround
> >might be to have your prologue script get the list of assigned
> >cores from grid engine's machinefile and create a cpuset for the
> >job before the job runs.  This would make mvapich2 automatically
> >bind to the cores given to the job by the resource manager.
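
(A rough sketch of that prologue idea, purely illustrative: the cgroup mount
point, the group name, and the literal core list are all assumptions and will
differ per site and kernel.)

Example:
# in the gridengine prologue, assuming cpusets via cgroups under /sys/fs/cgroup
CS=/sys/fs/cgroup/cpuset/sge_$JOB_ID
mkdir $CS
echo 0-3 > $CS/cpuset.cpus   # cores granted to the job (shown literally here)
echo 0   > $CS/cpuset.mems
# the job's shepherd/shell PID must then be written to $CS/tasks
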
> 
> Hacking around with the job prologue or similar could do it, but
> there are an awful lot of gridengine users out there that would also
> need to do the same. In the meantime, presumably they're all seeing
> poor performance with MVAPICH2 by default on nodes shared between
> jobs. That's not good for anyone.
> 
> I may be misguided, but I think there's enough shared responsibility
> - and room for blame - between the resource manager and the MPI
> stack to warrant some defensive programming on both sides.
> 
> It just sounds sensible to me - if an application uses core affinity
> options, it at least has an option to only use the cores inherited
> by its environment. Certainly, both IntelMPI and OpenMPI do this by
> default.

We do in fact use only the cores inherited from the environment.  It
sounds like the method that you have tried in grid engine does not
actually set up that environment but only provides hints.  I think a
method where the shell sets up the affinity ahead of time is desirable,
but we can also consider extending mpirun_rsh's hostfile format to
accommodate this.
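
One way to picture the "set up ahead of time" approach is an rsh/ssh wrapper
that applies the binding on the remote node before handing control to the MPI
process, so that everything mpirun_rsh starts there inherits the mask (the
wrapper path and the source of the core list are assumptions):

Example:
#!/bin/sh
# hypothetical /the/rsh/replacement
host=$1; shift
# CORES (the core list for $host) must come from the site's scheduler integration
exec ssh "$host" numactl --physcpubind="$CORES" "$@"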

> 
> 
> You explicitly separated the MPI launcher from the rest of the MPI
> stack. Do you feel I should be trying to discuss this with the Hydra
> folks instead?
> 
> (Given the various MV2_* environment variables that control
> affinity, I figured I would start here)
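
(For reference, those run-time knobs look like this with mpirun_rsh, which
takes VAR=value pairs before the executable; the core list below is purely
illustrative.)

Example:
# disable MVAPICH2's own binding and rely on the inherited mask
mpirun_rsh -np 4 -hostfile hosts MV2_ENABLE_AFFINITY=0 ./a.out
# or pin ranks to explicit cores
mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:2:3 ./a.out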

This type of issue may affect a decent number of users, so there is no
harm in discussing it here.  Thanks for your note.  I hope that we can
find an acceptable solution for your situation.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


