[mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Hashmi, Jahanzeb hashmi.29 at buckeyemail.osu.edu
Fri Mar 15 20:02:36 EDT 2019


Hello, Andy.

Thanks for reporting the issue.

Could you please try setting the following environment variables for scenarios where multiple jobs share a node:

MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=compact
MV2_PIVOT_CORE_ID=<K>

K = j * (C / PPN)

j   = index of the job running on that node (0, 1, 2, ...)
C   = total number of cores on the node
PPN = number of processes per node launched by the submitted job

Example: You want to launch 4 jobs, each with 4 processes per node, on a 16-core, dual-socket node. You want the first two jobs to share the first socket (cores 0-3 and 4-7) and the next two jobs to share the second socket (cores 8-11 and 12-15). Here's what you can do:

1) j = 0; launch the first job with MV2_PIVOT_CORE_ID=0  (K = 0 * (16 / 4) = 0)
2) j = 1; launch the second job with MV2_PIVOT_CORE_ID=4  (K = 1 * (16 / 4) = 4)
3) j = 2; launch the third job with MV2_PIVOT_CORE_ID=8  (K = 2 * (16 / 4) = 8)
4) j = 3; launch the fourth job with MV2_PIVOT_CORE_ID=12 (K = 3 * (16 / 4) = 12)
... and so on (you get the idea).
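
For illustration, here is a minimal job-script sketch of the above, assuming srun is your launcher (consistent with the --with-pm=slurm build below) and ./app stands in for your MPI binary; the script name and the way the job index j is passed in are placeholders only:

#!/bin/bash
# pinned_job.sh -- illustrative sketch, not a tested recipe.
# Usage: sbatch pinned_job.sh <j>   where j is this job's index on the shared node.
j=$1                      # 0, 1, 2, ... for the jobs sharing the node
C=16                      # total cores per node
PPN=4                     # processes per node in this job
K=$(( j * (C / PPN) ))    # pivot core: 0, 4, 8, 12 for j = 0..3

export MV2_CPU_BINDING_POLICY=hybrid
export MV2_HYBRID_BINDING_POLICY=compact
export MV2_PIVOT_CORE_ID=$K

srun -n $PPN ./app        # srun propagates the exported environment by default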

Similarly, if the jobs have uneven process counts, you need to manually set MV2_PIVOT_CORE_ID of each subsequent job to the next available core ID on the node.
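
To make that bookkeeping concrete, a purely illustrative sketch of the running offset (the process counts 6, 4 and 6 are made up):

next_pivot=0
for ppn in 6 4 6; do      # per-job process counts on the shared node
    echo "launch the next job with MV2_PIVOT_CORE_ID=$next_pivot and $ppn processes"
    next_pivot=$(( next_pivot + ppn ))
done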

You can further set MV2_SHOW_CPU_BINDING=1 to make sure the bindings are being set appropriately.
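
For example, something along these lines (illustrative command line; swap in your own launcher and binary, assuming the launcher exports the environment as srun does by default) prints each rank's core binding at startup, so you can confirm that the second job really lands on cores 4-7:

MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=hybrid \
MV2_HYBRID_BINDING_POLICY=compact MV2_PIVOT_CORE_ID=4 \
srun -n 4 ./app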

Let us know if you face any issues.


Regards,

Jahanzeb

________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Andrew Parker <andrew.parker at fluidgravity.co.uk>
Sent: Friday, March 15, 2019 12:42:37 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)


Hi,



We have an odd (to us) problem with mvapich 2.3. Compiled as:



MVAPICH2 Version:       2.3
MVAPICH2 Release date:  Mon Jul 23 22:00:00 EST 2018
MVAPICH2 Device:        ch3:psm
MVAPICH2 configure:     --with-pmi=pmi1 --with-pm=slurm --with-slurm=/opt/software/slurm --with-device=ch3:psm --enable-fortran=no --enable-cxx --enable-romio --enable-shared --with-slurm-lib=/opt/software/slurm/lib/ --prefix=/opt/install/mvapich2-2.3
MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77:   gfortran
MVAPICH2 FC:    gfortran



We submit jobs via Slurm. Our codes are pure MPI and single threaded; there is no threading model whatsoever in them. Hyper-Threading is off on all nodes. Our nodes are of type:



Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               1200.035
BogoMIPS:              4195.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15



When we submit jobs to several nodes in multiples of 16 there are no problems. But we now find that if we submit two 8-way jobs to the same node through the queue, the second socket is never reached and both jobs end up sharing the same 8 cores, each getting 50% of them. Likewise, if we submit 4 jobs of 4 processes each, they all land on the same 4 cores as the first job and none of the other 12 cores are used. The only setting that seems to make a difference (though we need to do more digging) is export MV2_ENABLE_AFFINITY=0.
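
In case it is useful, one way to confirm this from outside the jobs (illustrative only; ./app is a placeholder for our binary name) is to query the affinity mask of every running rank with taskset:

for pid in $(pgrep -f ./app); do
    taskset -cp "$pid"    # prints the core list each rank is currently pinned to
done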


However, we have read that disabling MV2_ENABLE_AFFINITY may have knock-on effects for shared-memory utilisation, and that it means the OS rather than MVAPICH picks the cores. We have read that while this means all cores will be used, which our initial tests confirm, the OS is also free to move the processes during the run, and if it moves them this impacts memory locality. Finally, we've built our codes against Intel MPI and do not see this problem, so we believe that tells us our hardware/BIOS/environment/Slurm is set up correctly. We are happy to dig more, but given the effect of MV2_ENABLE_AFFINITY we believe it's simply that we don't really know how to control MVAPICH properly at this level. Our network is Omni-Path, if that helps.



Ideally, we'd just like all cores on the node to be used when the submitted jobs are smaller than 16 processes, instead of the jobs landing on the same cores. We'd also like each process to stay locked to a core for the life of the job for efficiency/speed/memory reasons. How would we go about getting this to work as described?



Thanks,

Andy


