[mvapich-discuss] Core binding oversubscription with batch schedulers

Lockwood, Glenn glock at sdsc.edu
Wed Sep 11 12:44:39 EDT 2013


Mark,

As I understand it, Grid Engine's "affinity" options only populate a machinefile ($pe_nodefile or something), and it is up to the MPI task launcher (not the MPI stack itself) to parse that file and translate its contents into the MPI stack's binding options for it to have any effect.  Resource managers like Torque, by contrast, create explicit cpusets on the nodes for each job, and those cpusets make MVAPICH2's default binding (first rank to the first eligible core, and so on) work automatically.
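(For reference, MVAPICH2's default linear binding can be overridden per job with the MV2_CPU_MAPPING environment variable, a colon-separated core list with one entry per local rank. The core numbers and launch line below are illustrative assumptions; note the same mapping is applied on every host in the job, which is the limitation discussed in this thread.)

```shell
# Sketch: override MVAPICH2's default linear binding for one job.
# MV2_CPU_MAPPING is a colon-separated core list, one entry per rank
# on each host; the identical mapping is reused on every host.
export MV2_CPU_MAPPING=4:5:6:7   # local ranks 0-3 -> cores 4-7

# A real launch would then look something like this (hostfile path
# and executable name are assumptions):
#   mpirun_rsh -np 8 -hostfile ./hosts MV2_CPU_MAPPING=$MV2_CPU_MAPPING ./a.out
```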

You (or others) may disagree, but I see this as a resource manager/scheduler issue, not really an MPI issue.  A workaround might be to have your prologue script read the list of assigned cores from Grid Engine's machinefile and create a cpuset for the job before it runs.  MVAPICH2 would then automatically bind to the cores the resource manager gave the job.
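A minimal sketch of the parsing half of such a prologue, assuming Grid Engine writes machinefile lines of the form "host slots queue processor-list" with the bound cores in the fourth column (the file here is faked for illustration; a real prologue would read $PE_HOSTFILE and run the cpuset commands as root):

```shell
# Fabricated example machinefile, standing in for $PE_HOSTFILE
cat > pe_hostfile.example <<'EOF'
node001 4 all.q@node001 0,1,2,3
node002 4 all.q@node002 8,9,10,11
EOF

host=node002   # in a real prologue: host=$(hostname -s)

# Pull this host's core list out of the fourth column
cores=$(awk -v h="$host" '$1 == h { print $4 }' pe_hostfile.example)

# A real prologue (running as root, with /dev/cpuset mounted) would
# then create the cpuset along these lines:
#   mkdir /dev/cpuset/job_$JOB_ID
#   echo "$cores" > /dev/cpuset/job_$JOB_ID/cpus
#   echo 0        > /dev/cpuset/job_$JOB_ID/mems
echo "cores for $host: $cores"
```

With the job's shell placed inside that cpuset, MVAPICH2's default binding would see only the eligible cores, as it does under Torque.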

Glenn

--
Glenn K. Lockwood, Ph.D.
SDSC User Services
glock at sdsc.edu
(858) 246-1075

On Sep 11, 2013, at 9:00 AM, Mark Dixon <m.c.dixon at leeds.ac.uk> wrote:

> Hi,
> 
> I'm using MVAPICH2 1.9 on a Linux/Intel cluster where compute nodes can be shared between jobs (but cores are not oversubscribed) and have a few problems with core binding. We use gridengine to assign/launch jobs.
> 
> By default, an MPI application built with MVAPICH2 appears to bind processes to cores linearly - core 0 gets the 1st rank on that host, core 1 gets the 2nd and so on. As discussed previously on this list, this is not appropriate behaviour when there is more than one job on the same node, and there are some environment variables to help out with this.
> 
> Unfortunately, those variables only seem to allow the same mapping on all hosts. That does not cover the common case where the cores that are "yours" vary from one compute node to another.
> 
> Batch schedulers can launch processes on each of the compute nodes with the right NUMA settings for that host (as seen by the likes of numactl). What would be really useful would be for MVAPICH2 to notice which cores it has "inherited" and assign cores only from that pool.
> 
> Is this something that MVAPICH2 can do today and I've just not read the documentation properly?
> 
> Thanks,
> 
> Mark
> -- 
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



