[mvapich-discuss] Core binding oversubscription with batch schedulers

Mark Dixon m.c.dixon at leeds.ac.uk
Thu Oct 10 09:40:01 EDT 2013


On Thu, 12 Sep 2013, Jonathan Perkins wrote:
...
> Can you elaborate on this?  My understanding is that gridengine has an 
> rsh replacement which when used by mpirun_rsh will result in the desired 
> behavior that the MPI Library will only see the cpuset allocated by grid 
> engine on each node (may use numactl under the hood).  Is this correct?

(sorry for the delay - suffered from email overload for the past month)

Correct.

> If so, you can activate this rsh replacement by setting the environment 
> variable RSH_CMD to the appropriate command at configure time.

Yes; we used to do something similar to that when we used mpirun_rsh.

However, we recently switched to using hydra: among other advantages, its 
support for gridengine means we no longer need to convert the hostfile to 
something mpirun_rsh understands.

...
> We do in fact use only the cores inherited by its environment.  It
> sounds like the method that you have tried using in grid engine does not
> actually set up the environment but is giving us hints.  I think using
> a method where the shell sets up the affinity ahead of time is desirable
> but we can also consider adding support in mpirun_rsh to extend its
> hostfile format to accommodate this.

Depending on the submission flags given by the user, gridengine has 
traditionally supported the following:

* environment - the core assignment is published in an environment 
variable (no core binding actually done)

* parallel environment - the core assignment is published in the 
gridengine hostfile (no core binding actually done)

* set - the core assignment is applied via libnuma.so's affinity routines, 
which can be overridden (e.g. by the numactl command).
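
For the "environment" case, where nothing is bound and the job has to act 
on the hint itself, a minimal sketch might look like the following. It 
assumes the variable is called SGE_BINDING and holds space-separated 
processor numbers; both are assumptions to check against your gridengine 
fork. The helper names are made up for illustration, and the affinity 
calls are Linux-only.

```python
import os


def parse_core_list(value):
    """Parse a whitespace-separated list of processor numbers, e.g. "0 1 2"."""
    return {int(tok) for tok in value.split()}


def bind_from_env(var="SGE_BINDING"):
    """Bind this process to the cores named in `var`, if it is set.

    `var` defaults to SGE_BINDING, which is an assumed name and format.
    Returns the set of cores applied, or None if nothing was done.
    """
    value = os.environ.get(var)
    if not value:
        return None
    # Never ask for cores outside the mask this process inherited.
    cores = parse_core_list(value) & os.sched_getaffinity(0)
    if not cores:
        return None
    os.sched_setaffinity(0, cores)  # Linux-only call
    return cores
```

Intersecting with the inherited mask means the sketch can only narrow a 
restriction that is already in place, never widen it.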


"set" is the one I was testing: launching MVAPICH2 via Hydra completely 
stomped over the affinities set up by gridengine.
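
One way to check for this kind of override is to have a process report the 
mask it actually ended up with and compare it to what the scheduler 
granted. The sketch below is only illustrative: the function names are 
hypothetical, the expected core set has to come from wherever your 
scheduler records it, and os.sched_getaffinity is Linux-only.

```python
import os


def inherited_cores():
    """Return the set of cores this process is currently allowed to run on."""
    return os.sched_getaffinity(0)


def affinity_overridden(expected, actual):
    """True if the process may run on cores outside the expected set."""
    return not set(actual) <= set(expected)


if __name__ == "__main__":
    print("allowed cores:", sorted(inherited_cores()))
```

Running it once directly in the job script and once through the launcher, 
in place of the MPI program, shows whether the mask changed along the way.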

...
> This type of issue may affect a decent number of users so there is no
> harm in discussing this here.  Thanks for your note.  I hope that we can
> find an acceptable solution for your situation.
...

Thanks :)

Since the Oracle takeover of Sun and subsequent forking of the software, 
gridengine turned into a hard target to track. The description above 
essentially describes the situation pre-fork and probably describes the 
lowest common denominator today.

Some of the forks have since moved on to trying to use cpusets instead of 
libnuma's affinity routines to perform mandatory restrictions, but the 
situation is still in flux. I'm currently investigating these.

Cheers,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

