[mvapich-discuss] Core binding oversubscription with batch schedulers

Mark Dixon m.c.dixon at leeds.ac.uk
Thu Sep 12 04:37:41 EDT 2013


On Wed, 11 Sep 2013, Lockwood, Glenn wrote:
...
> As I understand it, Grid Engine's "affinity" options only populate a 
> machinefile ($pe_nodefile or something), and it is up to the MPI task 
> launcher (not the MPI stack itself) to grok that file's contents and 
> make the appropriate changes to the MPI stack's binding options to 
> actually have any effect.  Resource managers like Torque create explicit 
> cpusets on nodes for jobs, and these make MVAPICH2's default binding 
> (binding the first rank to the first eligible core, etc) work 
> automatically.

Hi Glenn,

Gridengine has a few different affinity options, including the one you 
mention. The MVAPICH2 stack (in which I would include the launcher, BTW) 
completely ignores those fields in the PE_HOSTFILE.

Instead, I much prefer the gridengine affinity option that sets the core 
affinity mask directly (the one numactl reports). That way, any non-MPI 
programs run from the job script are also constrained to the allocated 
cores.
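
For what it's worth, that inherited mask is easy to inspect from inside a 
job - "numactl --show" reports it, as would a few lines of C (a throwaway 
sketch, nothing gridengine- or MVAPICH2-specific):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;

        /* ask the kernel which cores this process may run on */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }

        printf("pid %d inherited cores:", (int) getpid());
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &mask))
                printf(" %d", c);
        printf("\n");
        return 0;
    }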

Interesting that you say that the correct behaviour is seen with cpusets. 
I'll look into that, but cpuset support in gridengine isn't there yet.


> You (or others) may disagree, but I see this as a resource 
> manager/scheduler issue, not really an MPI issue.  A workaround might be 
> to have your prologue script get the list of assigned cores from grid 
> engine's machinefile and create a cpuset for the job before the job 
> runs.  This would make mvapich2 automatically bind to the cores given to 
> the job by the resource manager.

Hacking around with the job prologue or similar could do it, but there are 
an awful lot of gridengine users out there who would all need to do the 
same thing. In the meantime, presumably they're all seeing poor 
performance with MVAPICH2 by default on nodes shared between jobs. That's 
not good for anyone.
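
For reference, the sort of thing each site's prologue would have to do 
looks roughly like this (sketched in C for concreteness; it assumes the 
legacy cpuset filesystem is mounted at /dev/cpuset, that the core list has 
already been parsed out of the PE hostfile, and it glosses over picking 
the right memory nodes and cleaning up again in the epilogue):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* write a single value into one cpuset control file */
    static int write_file(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        int rc = (fputs(value, f) >= 0) ? 0 : -1;
        if (fclose(f) != 0)
            rc = -1;
        return rc;
    }

    /* Create /dev/cpuset/<jobid>, limit it to "cores" (e.g. "0-3") and
     * move the job's top-level process into it, so every child - mpirun
     * and the ranks included - inherits the restriction. */
    int make_job_cpuset(const char *jobid, const char *cores, pid_t job_pid)
    {
        char dir[128], path[160], pid[32];

        snprintf(dir, sizeof(dir), "/dev/cpuset/%s", jobid);
        if (mkdir(dir, 0755) != 0)
            return -1;

        snprintf(path, sizeof(path), "%s/cpus", dir);
        if (write_file(path, cores) != 0)
            return -1;

        /* cpusets also need a memory node list; "0" is a placeholder */
        snprintf(path, sizeof(path), "%s/mems", dir);
        if (write_file(path, "0") != 0)
            return -1;

        snprintf(path, sizeof(path), "%s/tasks", dir);
        snprintf(pid, sizeof(pid), "%d", (int) job_pid);
        return write_file(path, pid);
    }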

I may be misguided, but I think there's enough shared responsibility - and 
room for blame - between the resource manager and the MPI stack to warrant 
some defensive programming on both sides.

It just sounds sensible to me - if an application provides core affinity 
options, it should at least offer one that restricts it to the cores 
inherited from its environment. Certainly, both IntelMPI and OpenMPI do 
this by default.
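
The policy I have in mind fits in a few lines: each rank binds to the N-th 
core of the mask it inherited, rather than to physical core N on the node. 
A minimal sketch (my illustration only - not what any of those stacks 
actually does internally, and the node-local rank is taken on the command 
line just to keep it self-contained):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* bind the calling process to the local_rank-th core of the mask
     * it inherited, instead of to physical core number local_rank */
    static int bind_within_inherited(int local_rank)
    {
        cpu_set_t inherited, target;
        int seen = 0;

        if (sched_getaffinity(0, sizeof(inherited), &inherited) != 0)
            return -1;

        CPU_ZERO(&target);
        for (int c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &inherited)) {
                if (seen == local_rank) {
                    CPU_SET(c, &target);
                    return sched_setaffinity(0, sizeof(target), &target);
                }
                seen++;
            }
        }
        return -1;   /* more local ranks than inherited cores */
    }

    int main(int argc, char **argv)
    {
        int local_rank = (argc > 1) ? atoi(argv[1]) : 0;

        if (bind_within_inherited(local_rank) != 0) {
            fprintf(stderr, "could not bind local rank %d\n", local_rank);
            return 1;
        }
        return 0;
    }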


You explicitly separated the MPI launcher from the rest of the MPI stack. 
Do you feel I should be trying to discuss this with the Hydra folks 
instead?

(Given the various MV2_* environment variables that control affinity, I 
figured I would start here)

Cheers,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

