[mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Andrew Parker andrew.parker at fluidgravity.co.uk
Fri Mar 15 12:42:37 EDT 2019


Hi,



We have an odd (to us) problem with mvapich 2.3. Compiled as:



MVAPICH2 Version:       2.3

MVAPICH2 Release date:  Mon Jul 23 22:00:00 EST 2018

MVAPICH2 Device:        ch3:psm

MVAPICH2 configure:     --with-pmi=pmi1 --with-pm=slurm --with-slurm=/opt/software/slurm --with-device=ch3:psm --enable-fortran=no --enable-cxx --enable-romio --enable-shared --with-slurm-lib=/opt/software/slurm/lib/ --prefix=/opt/install/mvapich2-2.3

MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2

MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2

MVAPICH2 F77:   gfortran

MVAPICH2 FC:    gfortran



We submit jobs via Slurm.  Our codes are pure MPI: there is no threading model in them whatsoever, and every process is single-threaded.  Hyper-Threading is off on all nodes. Our nodes are of this type:



Architecture:          x86_64

CPU op-mode(s):        32-bit, 64-bit

Byte Order:            Little Endian

CPU(s):                16

On-line CPU(s) list:   0-15

Thread(s) per core:    1

Core(s) per socket:    8

Socket(s):             2

NUMA node(s):          2

Vendor ID:             GenuineIntel

CPU family:            6

Model:                 79

Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Stepping:              1

CPU MHz:               1200.035

BogoMIPS:              4195.82

Virtualization:        VT-x

L1d cache:             32K

L1i cache:             32K

L2 cache:              256K

L3 cache:              20480K

NUMA node0 CPU(s):     0-7

NUMA node1 CPU(s):     8-15



When we submit jobs to several nodes in multiples of 16 there are no problems.  But we now find that if we submit two 8-way jobs to the same node, the second socket is never reached and both jobs end up sharing the same 8 cores, each getting 50%.  Likewise, if we submit four 4-way jobs, they all land on the same 4 cores as the first job and none of the other 12 cores are used.  The only setting that seems to make a difference (though we need to do more digging) is export MV2_ENABLE_AFFINITY=0.
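
To make the symptom concrete, this is the kind of diagnostic we have been running alongside our codes (a minimal sketch of our own, not anything from MVAPICH): each rank prints its hostname and the cores it is allowed to run on, and in the bad case both 8-way jobs report the same eight cores.

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[64];
    gethostname(host, sizeof(host));

    /* Ask the kernel which cores this process may run on. */
    cpu_set_t mask;
    sched_getaffinity(0, sizeof(mask), &mask);

    char cores[512] = "";
    for (int c = 0; c < 64; c++)  /* our nodes have 16 cores, 64 is plenty */
        if (CPU_ISSET(c, &mask))
            snprintf(cores + strlen(cores), sizeof(cores) - strlen(cores), "%d ", c);

    printf("rank %d on %s: allowed cores: %s\n", rank, host, cores);

    MPI_Finalize();
    return 0;
}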


However, we have read that MV2_ENABLE_AFFINITY=0 may have knock-on effects for shared-memory utilisation, and that it means the OS rather than MVAPICH picks the cores.  We have read that, while this means all cores will be used (and our initial tests confirm it), the OS is then also free to migrate processes during the run, and any migration hurts memory locality.  Finally, we've built our codes against Intel MPI and do not see this problem, which suggests to us that our hardware/BIOS/environment/Slurm is set up correctly.  We are happy to dig more, but given the effect of MV2_ENABLE_AFFINITY we suspect we simply don't know how to control MVAPICH properly at this level.  Our network is Omni-Path, if that helps.
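
Along the same lines, to check the migration worry directly we were planning to have each rank report which core it is currently sitting on a few times during a run (again, just a sketch of our own):

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Sample the current core a few times; with affinity disabled the
       number may change if the OS migrates the process. */
    for (int i = 0; i < 10; i++) {
        printf("rank %d is currently on core %d\n", rank, sched_getcpu());
        fflush(stdout);
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}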



Ideally, we'd just like all cores on a node to be usable when the submitted job is smaller than 16 ranks, instead of every job landing on the same cores.  We'd also like each rank to stay locked to its core for the life of the job, for efficiency/speed/memory reasons.  How would we go about getting this to work as described?
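
For what it's worth, if the answer is that we have to keep MV2_ENABLE_AFFINITY=0, the fallback we had in mind is to pin each process ourselves from inside the application, picking one core out of whatever set Slurm has granted the job, indexed by the node-local rank.  A rough sketch only (it assumes Slurm exports SLURM_LOCALID and constrains each job to its own cores); we'd much rather let MVAPICH do this properly:

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const char *s = getenv("SLURM_LOCALID");   /* node-local rank, set by srun */
    int local = s ? atoi(s) : 0;

    /* Start from the cores this job is already allowed to use ... */
    cpu_set_t allowed;
    sched_getaffinity(0, sizeof(allowed), &allowed);

    /* ... and keep only the local-th one of them. */
    int count = 0, target = -1;
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &allowed)) {
            if (count == local) { target = c; break; }
            count++;
        }
    }

    if (target >= 0) {
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(target, &one);
        sched_setaffinity(0, sizeof(one), &one);   /* 0 = this process */
    }

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}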



Thanks,

Andy
