[mvapich-discuss] mvapich2-1.4.0 question about CPU affinity

Bernd Kallies kallies at zib.de
Fri Oct 30 13:02:42 EDT 2009


Dear list members,

I ran mvapich2 v1.4.0 on clusters equipped with Intel Xeon E5472
(Harpertown/Penryn, 4 cores per socket, 2 sockets per node) and Intel
Xeon X5570 (Gainestown/Nehalem, 4 cores per socket, 2 sockets per node).

I analyzed the default CPU affinity maps applied by mvapich2 1.4.0
(MV2_CPU_MAPPING is unset, MV2_ENABLE_AFFINITY is 1). For code see
https://www.hlrn.de/home/view/System/PlaceMe
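
The check itself is simple: after MPI_Init, each rank reports the CPUs
in its affinity mask. A minimal sketch of such a probe (my own
illustration here, not the PlaceMe code itself):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu, off = 0;
    char buf[256] = "";
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pid 0 = calling process; the mask reflects whatever binding
     * the MPI library applied during startup */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    for (cpu = 0; cpu < CPU_SETSIZE && off < (int)sizeof(buf) - 8; cpu++)
        if (CPU_ISSET(cpu, &mask))
            off += snprintf(buf + off, sizeof(buf) - off, "%d ", cpu);

    printf("rank %d bound to CPUs: %s\n", rank, buf);
    MPI_Finalize();
    return 0;
}

Running this with 8 ranks per node and MV2_ENABLE_AFFINITY=1 prints one
CPU per rank and reproduces the maps below.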

It seems to me that the following maps are applied (the i-th entry is
the logical CPU that local rank i gets bound to):
1) Harpertown: 0:2:4:6:1:3:5:7
2) Gainestown: 0:1:2:3:4:5:6:7

The map found for Harpertown differs from the one used by previous
mvapich2 releases, but it is still far from the "Optimal runtime CPU
binding" announced in the Changelog.

The lstopo tool of the hwloc package
http://www.open-mpi.org/projects/hwloc/
gives for a node with Harpertown CPUs:

System(15GB)
  Socket#0
    L2(6144KB)
      L1(32KB) + Core#0 + P#0
      L1(32KB) + Core#1 + P#2
    L2(6144KB)
      L1(32KB) + Core#2 + P#4
      L1(32KB) + Core#3 + P#6
  Socket#1
    L2(6144KB)
      L1(32KB) + Core#0 + P#1
      L1(32KB) + Core#1 + P#3
    L2(6144KB)
      L1(32KB) + Core#2 + P#5
      L1(32KB) + Core#3 + P#7

So, on this architecture the "optimal" affinity map for a pure MPI
application is 0:1:4:5:2:3:6:7: as long as a node is not fully
populated, tasks should be spread over as many distinct L2 caches as
possible. Four tasks should therefore run on 0:1:4:5 (four separate L2
caches across both sockets), not on 0:2:4:6 as mvapich2 does (one
socket, with both L2 caches shared pairwise).
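
Until the default improves, the binding can of course be forced by
hand; assuming the usual mpiexec launcher, something like

  export MV2_CPU_MAPPING=0:1:4:5:2:3:6:7
  mpiexec -n 8 ./a.out

does the job, but such a mapping has to be maintained separately for
every architecture, which is exactly what I would like to avoid.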

On the other hand, the map applied on Gainestown is correct: with
0:1:2:3:4:5:6:7, consecutive ranks alternate between the two sockets,
which minimizes contention for the shared L3 caches and the NUMA node
memory. The topology map is:

System(47GB)
  Node#0(23GB) + Socket#0 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#0
      P#8
    L2(256KB) + L1(32KB) + Core#1
      P#2
      P#10
    L2(256KB) + L1(32KB) + Core#2
      P#4
      P#12
    L2(256KB) + L1(32KB) + Core#3
      P#6
      P#14
  Node#1(23GB) + Socket#1 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#1
      P#9
    L2(256KB) + L1(32KB) + Core#1
      P#3
      P#11
    L2(256KB) + L1(32KB) + Core#2
      P#5
      P#13
    L2(256KB) + L1(32KB) + Core#3
      P#7
      P#15
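
As an aside, the cache sharing that makes the difference between the
two architectures can also be read directly from sysfs, without hwloc;
a small sketch (assuming 8 logical CPUs, as on the Harpertown nodes):

#include <stdio.h>

int main(void)
{
    char path[128], line[128];
    int cpu, idx;

    for (cpu = 0; cpu < 8; cpu++) {
        for (idx = 0; ; idx++) {
            FILE *f;
            int level = 0;

            snprintf(path, sizeof(path),
                "/sys/devices/system/cpu/cpu%d/cache/index%d/level",
                cpu, idx);
            if (!(f = fopen(path, "r")))
                break;                  /* no more cache indices */
            if (fscanf(f, "%d", &level) != 1)
                level = 0;
            fclose(f);

            snprintf(path, sizeof(path),
                "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list",
                cpu, idx);
            if ((f = fopen(path, "r")) && fgets(line, sizeof(line), f))
                printf("cpu%d L%d shared with: %s", cpu, level, line);
            if (f)
                fclose(f);
        }
    }
    return 0;
}

On a Harpertown node this shows the L2 of cpu0 shared with cpu2, etc.,
so the information needed to build a better default map is available
to the library at startup.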

I'm wondering whether I made a mistake or misunderstood something,
whether there is a bug in the logic mvapich2 apparently uses to analyze
the CPU topology, or whether this logic might be improved in future
mvapich2 releases, so that one no longer needs to know, for each
particular architecture, a value of MV2_CPU_MAPPING that is more
suitable than the default.

Sincerely, BK

-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de



