[mvapich-discuss] Core binding oversubscription with batch schedulers
Mark Dixon
m.c.dixon at leeds.ac.uk
Thu Sep 12 04:37:41 EDT 2013
On Wed, 11 Sep 2013, Lockwood, Glenn wrote:
...
> As I understand it, Grid Engine's "affinity" options only populate a
> machinefile ($pe_nodefile or something), and it is up to the MPI task
> launcher (not the MPI stack itself) to grok that file's contents and
> make the appropriate changes to the MPI stack's binding options to
> actually have any effect. Resource managers like Torque create explicit
> cpusets on nodes for jobs, and these make MVAPICH2's default binding
> (binding the first rank to the first eligible core, etc) work
> automatically.
Hi Glenn,
Gridengine has a few different affinity options, including the one you
mention. The MPI stack (in which I would include the launcher, BTW) used
by MVAPICH2 completely ignores those fields in the PE_HOSTFILE.
Instead, I much prefer gridengine's other affinity option, which sets the
core affinity of the job itself (as seen by numactl). That way, any
non-MPI programs run from the job script are also constrained to the
allocated cores.
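To illustrate what I mean, here's a quick way to see the affinity a job
script inherits (a sketch; these are standard Linux tools, not
gridengine-specific, and numactl may not be installed everywhere):

```shell
# The kernel's view of which cores this shell may run on:
grep Cpus_allowed_list /proc/self/status

# The cores numactl (and hence child processes) would be bound to,
# if numactl is available:
command -v numactl >/dev/null && numactl --show
```

With the numactl-style affinity option, everything forked from the job
script shows the restricted core list, MPI or not.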
Interesting that you say that the correct behaviour is seen with cpusets.
I'll look into that, but cpuset support in gridengine isn't there yet.
> You (or others) may disagree, but I see this as a resource
> manager/scheduler issue, not really an MPI issue. A workaround might be
> to have your prologue script get the list of assigned cores from grid
> engine's machinefile and create a cpuset for the job before the job
> runs. This would make mvapich2 automatically bind to the cores given to
> the job by the resource manager.
Hacking around with the job prologue or similar could do it, but there are
an awful lot of gridengine users out there that would also need to do the
same. In the meantime, presumably they're all seeing poor performance with
MVAPICH2 by default on nodes shared between jobs. That's not good for
anyone.
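For the record, such a prologue hack might look something like the
following sketch. It assumes a cpuset v1 filesystem mounted at
/dev/cpuset and the processor list in column 4 of this host's
PE_HOSTFILE line; both vary by site and gridengine version, so anyone
trying this should verify locally:

```shell
#!/bin/sh
# Hypothetical job prologue: create a cpuset from the core list in
# Grid Engine's $PE_HOSTFILE, so that MVAPICH2's default binding
# starts from the job's own cores rather than core 0.

this_host=$(hostname)
# Pull the processor list for this host (column layout is an assumption).
cores=$(awk -v h="$this_host" '$1 == h { print $4 }' "${PE_HOSTFILE:-/dev/null}")

cpuset_dir=/dev/cpuset/sge_${JOB_ID:-test}
if [ -n "$cores" ] && [ -d /dev/cpuset ]; then
    mkdir -p "$cpuset_dir"
    echo "$cores" > "$cpuset_dir/cpus"    # cores this job may use
    echo 0        > "$cpuset_dir/mems"    # memory node 0; adjust for NUMA
    echo $$       > "$cpuset_dir/tasks"   # children inherit the cpuset
fi
```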
I may be misguided, but I think there's enough shared responsibility - and
room for blame - between the resource manager and the MPI stack to warrant
some defensive programming on both sides.
It just sounds sensible to me: if an application offers core affinity
options, it should at least offer one that restricts it to the cores
inherited from its environment. Certainly, both IntelMPI and OpenMPI do
this by default.
You explicitly separated the MPI launcher from the rest of the MPI stack.
Do you feel I should be trying to discuss this with the Hydra folks
instead?
(Given the various MV2_* environment variables that control affinity, I
figured I would start here)
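For anyone following along, the two knobs I had in mind are below (a
sketch; check the MVAPICH2 user guide for the exact semantics in your
version). They would be set in the job script before launching with
mpirun_rsh or mpiexec:

```shell
# Disable MVAPICH2's own binding entirely, so ranks simply keep
# whatever affinity they inherit from the environment:
export MV2_ENABLE_AFFINITY=0

# ...or keep binding on, but map ranks explicitly onto the cores the
# scheduler allocated -- e.g. ranks 0-3 onto cores 4-7, one
# colon-separated entry per rank (core numbers here are made up):
export MV2_CPU_MAPPING=4:5:6:7
```

Neither is automatic, though, which is really my point.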
Cheers,
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------