[mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Chakraborty, Sourav chakraborty.52 at buckeyemail.osu.edu
Fri Apr 19 13:24:05 EDT 2019


Hi Chris,

Since separate MPI jobs don't share information with each other, it is not possible for the MPI library to reliably detect every such scenario and provide the best mapping.

Since you are using srun as the launcher, I'd suggest using Slurm's task affinity plugin to set affinity. As the resource manager, Slurm has a global view of how many processes are being launched on each node, so it should be able to provide an appropriate mapping in these scenarios.

You can find more information about this plugin here:
https://slurm.schedmd.com/cpu_management.html#Step2
https://slurm.schedmd.com/mc_support.html
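
For example, one way to hand the binding over to Slurm is to disable MVAPICH2's own affinity and let srun pin each task to a core. This is only a sketch: it assumes your slurm.conf enables TaskPlugin=task/affinity, the flag spelling varies across Slurm versions (--cpu-bind vs the older --cpu_bind), and ./my_app is a placeholder for your executable:

export MV2_ENABLE_AFFINITY=0        # let Slurm, not MVAPICH2, own the binding
srun --ntasks=4 --cpu-bind=cores ./my_app
# add "verbose" to see what Slurm chose, e.g. --cpu-bind=verbose,cores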

Thanks,
Sourav




On Tue, Apr 16, 2019 at 8:50 AM Christopher Grapes <chris.grapes at fluidgravity.co.uk> wrote:
Sourav,

Thank you very much for your suggestion.  If I submit four 4-way jobs then this appears to work well: each process is assigned its own CPU and all of the CPUs on the node are fully utilised.

However, if I change things slightly and submit one 12-way job and one 4-way job, for example, then Andy’s original observations return.
The CPU affinity for the first job is:
-------------CPU AFFINITY-------------
OMP_NUM_THREADS:         0
MV2_THREADS_PER_PROCESS: 1
RANK: 0  CPU_SET:    0
RANK: 1  CPU_SET:    1
RANK: 2  CPU_SET:    2
RANK: 3  CPU_SET:    3
RANK: 4  CPU_SET:    4
RANK: 5  CPU_SET:    5
RANK: 6  CPU_SET:    6
RANK: 7  CPU_SET:    7
RANK: 8  CPU_SET:    8
RANK: 9  CPU_SET:    9
RANK:10  CPU_SET:   10
RANK:11  CPU_SET:   11
-------------------------------------
And hence the first 12 CPUs on the node are utilised.  The 4-way job reports the following CPU affinity:

-------------CPU AFFINITY-------------
OMP_NUM_THREADS:         0
MV2_THREADS_PER_PROCESS: 1
RANK: 0  CPU_SET:    0   1   2   3
RANK: 1  CPU_SET:    4   5   6   7
RANK: 2  CPU_SET:    8   9  10  11
RANK: 3  CPU_SET:   12  13  14  15
-------------------------------------
Ranks 0, 1 and 2 clearly overlap with the 12-way job's affinity, so they end up on CPUs that are already running the 12-way job. At the same time, only rank 3 can land on CPUs 12-15, leaving three of them idle.

Any ideas how to stop this?

Thanks

Chris Grapes


From: "Chakraborty, Sourav" <chakraborty.52 at buckeyemail.osu.edu<mailto:chakraborty.52 at buckeyemail.osu.edu>>
Reply-To: "chakraborty.52 at osu.edu<mailto:chakraborty.52 at osu.edu>" <chakraborty.52 at osu.edu<mailto:chakraborty.52 at osu.edu>>
Date: Monday, 15 April 2019 at 18:00
To: Christopher Grapes <chris.grapes at fluidgravity.co.uk<mailto:chris.grapes at fluidgravity.co.uk>>
Cc: "mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>" <mvapich-discuss at mailman.cse.ohio-state.edu<mailto:mvapich-discuss at mailman.cse.ohio-state.edu>>
Subject: Re: [mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Hi Chris,

Can you please try setting the following environment variables?

export MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=spread
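
For reference, a minimal sketch of how these variables might be set in a Slurm batch script (the task count and application name below are placeholders):

#!/bin/bash
#SBATCH --ntasks=4
# Spread the ranks across the node instead of packing them onto the first cores
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_HYBRID_BINDING_POLICY=spread
srun ./my_app   # my_app is a placeholder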

Thanks,
Sourav


On Mon, Apr 15, 2019 at 5:58 AM Christopher Grapes <chris.grapes at fluidgravity.co.uk> wrote:
Hi all,

Are there any further thoughts on Andy’s problem below?  Is there a more practical solution that we can roll out to our users?

Thanks

Chris


From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Andrew Parker <andrew.parker at fluidgravity.co.uk>
Date: Wednesday, 20 March 2019 at 12:03
To: "Hashmi, Jahanzeb" <hashmi.29 at buckeyemail.osu.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Hi,

Thank you for your help.  I can confirm that the below works.  Can I ask if there is a fix that can be used in a production environment for users?  It is not practical to ask users to work this out for each job script.  Do you have a recommendation for enforcing this behaviour for our codes permanently, without allowing the OS to move processes between cores throughout the lifetime of the job?
Thanks,
Andy

From: "Hashmi, Jahanzeb" <hashmi.29 at buckeyemail.osu.edu<mailto:hashmi.29 at buckeyemail.osu.edu>>
Date: Saturday, 16 March 2019 at 00:02
To: Andrew Parker <andrew.parker at fluidgravity.co.uk<mailto:andrew.parker at fluidgravity.co.uk>>, "mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>" <mvapich-discuss at mailman.cse.ohio-state.edu<mailto:mvapich-discuss at mailman.cse.ohio-state.edu>>
Subject: Re: [mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)

Hello, Andy.

Thanks for reporting the issue.

Could you please try to set the following environment variables for multi-way, shared job launching scenarios:

MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=compact
MV2_PIVOT_CORE_ID=<K>

K = j * (C / PPN)

j = job index that runs on that node (starting with 0, 1, 2 ...)
C = total number of cores per node
PPN = number of processes per node that are launched in a submitted job

Example: you want to launch 4 jobs, each with 4 processes per node, on a 16-core dual-socket node. You want the first two jobs to share the first socket (cores 0-3 and 4-7) and the next two jobs to share the second socket (cores 8-11 and 12-15). Here's what you can do (a scripted sketch follows the list):

1) j = 0; launch the first job with MV2_PIVOT_CORE_ID=0 (K = 0 * (16 / 4) = 0)
2) j = 1; launch the second job with MV2_PIVOT_CORE_ID=4 (K = 1 * (16 / 4) = 4)
3) j = 2; launch the third job with MV2_PIVOT_CORE_ID=8 (K = 2 * (16 / 4) = 8)
4) j = 3; launch the fourth job with MV2_PIVOT_CORE_ID=12 (K = 3 * (16 / 4) = 12)
... and so on (you get the idea).
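
Concretely, that sketch might look like the following (job_script.sh and the loop bounds are placeholders, and it assumes sbatch exports the submission environment to the job, which is the default --export=ALL behaviour):

C=16      # cores per node
PPN=4     # processes per node in each job
for j in 0 1 2 3; do
    K=$(( j * (C / PPN) ))
    MV2_CPU_BINDING_POLICY=hybrid \
    MV2_HYBRID_BINDING_POLICY=compact \
    MV2_PIVOT_CORE_ID=$K \
    sbatch --ntasks=$PPN job_script.sh   # job_script.sh is a placeholder
done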

Similarly, for jobs with uneven process counts, you need to manually set the next available core ID for the subsequent job submission.

You can further use MV2_SHOW_CPU_BINDING=1 to make sure the bindings are being set appropriately.
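
For example (with ./my_app again a placeholder), the binding report can be requested directly at launch time:

MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=hybrid \
MV2_HYBRID_BINDING_POLICY=compact MV2_PIVOT_CORE_ID=4 \
srun --ntasks=4 ./my_app   # prints a CPU AFFINITY table like the ones earlier in this thread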

Let us know if you face any issue.


Regards,

Jahanzeb

________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Andrew Parker <andrew.parker at fluidgravity.co.uk>
Sent: Friday, March 15, 2019 12:42:37 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Help with odd affinity problem (under utilised cpus on nodes)


Hi,



We have an odd (to us) problem with MVAPICH2 2.3, compiled as follows:



MVAPICH2 Version:       2.3
MVAPICH2 Release date:  Mon Jul 23 22:00:00 EST 2018
MVAPICH2 Device:        ch3:psm
MVAPICH2 configure:     --with-pmi=pmi1 --with-pm=slurm --with-slurm=/opt/software/slurm --with-device=ch3:psm --enable-fortran=no --enable-cxx --enable-romio --enable-shared --with-slurm-lib=/opt/software/slurm/lib/ --prefix=/opt/install/mvapich2-2.3
MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77:   gfortran
MVAPICH2 FC:    gfortran



We submit jobs via Slurm.  Our codes are pure MPI: there is no threading model whatsoever and every process is single-threaded.  Hyper-Threading is off on all nodes. Our nodes are of this type:



Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               1200.035
BogoMIPS:              4195.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15



When we submit jobs to several nodes in multiples of 16 processes there are no problems.  But we now find that if we submit two 8-way jobs to the same node, the second socket is never reached and both jobs end up with 50% of the same 8 cores. Likewise, if we submit four 4-way jobs, they all land on the same 4 cores as the first job and none of the other 12 cores are used.  The only setting that seems to make a difference (though we need to do more digging) is export MV2_ENABLE_AFFINITY=0.



However, we have read that MV2_ENABLE_AFFINITY=0 may have knock-on effects for shared-memory utilisation, and it means the OS rather than MVAPICH picks the cores.  We have read that while this means all cores will be used, which our initial tests confirm, the OS is also free to move the processes during the run, and if it moves them this hurts memory locality.  Finally, we've built our codes against Intel MPI and do not see this problem, so we believe our hardware/BIOS/environment/Slurm is set up correctly: happy to dig more, but given the effect of MV2_ENABLE_AFFINITY we believe it's simply that we don't really know how to control MVAPICH properly at this level.  Our network is Omni-Path, if that helps.



Ideally, we'd just like all cores on a node to be used when jobs smaller than 16 processes are submitted to it, instead of them landing on the same cores. We'd also like each process to be locked to a core for the life of the job for efficiency/speed/memory reasons.  How would we go about getting this to work as described?
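
(For reference, one way to confirm what each rank is actually pinned to, independent of the MPI library, is to query the kernel with taskset from util-linux; my_app below is a placeholder for the executable name:

for pid in $(pgrep -f my_app); do
    taskset -cp "$pid"   # e.g. "pid 1234's current affinity list: 0-15"
done
)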



Thanks,

Andy

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss