[mvapich-discuss] mvapich2-1.4.0 question about CPU affinity

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Oct 30 14:49:13 EDT 2009


Dr. Kallies,

Thanks for your note. We are analyzing the situations you have described
and will get back to you soon. Our long-term objective is to provide
enough intelligence within the MVAPICH2 library to arrive at the most
efficient CPU binding for multi-core platforms. As you know, multi-core
platforms come in many different configurations, with varying cache
sizes/speeds and memory sizes/speeds. Similarly, parallel applications
have diverse computation and communication requirements. Thus, if a
specific user-defined CPU mapping works well for a particular
application and platform, it can always be applied through the
user-defined CPU mapping option of the MVAPICH2 library.
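
For example, the mapping you propose for the Harpertown nodes can be
requested explicitly; with mpirun_rsh, something along the following
lines should work (the hostfile and executable names here are only
placeholders):

  mpirun_rsh -np 8 -hostfile ./hosts MV2_CPU_MAPPING=0:1:4:5:2:3:6:7 ./a.out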

Best Regards,

DK

On Fri, 30 Oct 2009, Bernd Kallies wrote:

> Dear list members,
>
> I ran mvapich2 v1.4.0 on clusters equipped with Intel Xeon E5472
> (Harpertown/Penryn, 4 cores per socket, 2 sockets per node) and Intel
> Xeon X5570 (Gainestown/Nehalem, 4 cores per socket, 2 sockets per node).
>
> I analyzed the default CPU affinity maps applied by mvapich2 1.4.0
> (MV2_CPU_MAPPING is unset, MV2_ENABLE_AFFINITY is 1). For code see
> https://www.hlrn.de/home/view/System/PlaceMe
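>
> (The check itself is simple; the following is only a minimal sketch of
> such a probe, not the actual PlaceMe code. Each MPI rank reports the
> affinity mask it ended up with, using sched_getaffinity(2) on Linux.)
>
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, cpu;
>     cpu_set_t mask;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* report the affinity mask this rank was given by the MPI library */
>     CPU_ZERO(&mask);
>     if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
>         printf("rank %d bound to CPUs:", rank);
>         for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
>             if (CPU_ISSET(cpu, &mask))
>                 printf(" %d", cpu);
>         printf("\n");
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> Compiled with mpicc and started with 8 tasks per node, this prints one
> line per rank with the OS processor numbers in its mask.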
>
> It seems to me that the following maps are applied:
> 1) Harpertown: 0:2:4:6:1:3:5:7
> 2) Gainestown: 0:1:2:3:4:5:6:7
>
> The map found for Harpertown differs from previous mvapich2 releases,
> but is still far from the "Optimal runtime CPU binding" announced in
> the Changelog.
>
> The lstopo tool of the hwloc package
> http://www.open-mpi.org/projects/hwloc/
> gives the following output for a node with Harpertown CPUs:
>
> System(15GB)
>   Socket#0
>     L2(6144KB)
>       L1(32KB) + Core#0 + P#0
>       L1(32KB) + Core#1 + P#2
>     L2(6144KB)
>       L1(32KB) + Core#2 + P#4
>       L1(32KB) + Core#3 + P#6
>   Socket#1
>     L2(6144KB)
>       L1(32KB) + Core#0 + P#1
>       L1(32KB) + Core#1 + P#3
>     L2(6144KB)
>       L1(32KB) + Core#2 + P#5
>       L1(32KB) + Core#3 + P#7
>
> So, on this architecture the "optimal" affinity map for a pure MPI
> application is 0:1:4:5:2:3:6:7, because one has to minimize the use of
> shared L2 caches as much as possible: a 4-task job should run on
> 0:1:4:5, where each task gets an L2 cache of its own, and not on
> 0:2:4:6 as mvapich2 does, where P#0/P#2 and P#4/P#6 each share an L2
> and all four tasks sit on the same socket.
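>
> (The same sharing information can also be obtained programmatically
> through the hwloc API; the sketch below is only an illustration and
> assumes a newer hwloc release that provides the HWLOC_OBJ_L2CACHE
> object type.)
>
> #include <hwloc.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void)
> {
>     hwloc_topology_t topo;
>     int i, n;
>
>     hwloc_topology_init(&topo);
>     hwloc_topology_load(topo);
>
>     /* one line per L2 cache, listing the OS processors that share it */
>     n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L2CACHE);
>     for (i = 0; i < n; i++) {
>         hwloc_obj_t l2 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L2CACHE, i);
>         char *s;
>         hwloc_bitmap_asprintf(&s, l2->cpuset);  /* hex mask of the PUs below this L2 */
>         printf("L2 #%d: cpuset %s\n", i, s);
>         free(s);
>     }
>
>     hwloc_topology_destroy(topo);
>     return 0;
> }
>
> On the Harpertown node above this should report four L2 caches,
> covering P#{0,2}, P#{4,6}, P#{1,3} and P#{5,7}.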
>
> On the other hand, the map applied on Gainestown is correct (it
> minimizes the use of the shared L3 cache and of NUMA node memory). The
> topology map of such a node is:
>
> System(47GB)
>   Node#0(23GB) + Socket#0 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#0
>       P#8
>     L2(256KB) + L1(32KB) + Core#1
>       P#2
>       P#10
>     L2(256KB) + L1(32KB) + Core#2
>       P#4
>       P#12
>     L2(256KB) + L1(32KB) + Core#3
>       P#6
>       P#14
>   Node#1(23GB) + Socket#1 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#1
>       P#9
>     L2(256KB) + L1(32KB) + Core#1
>       P#3
>       P#11
>     L2(256KB) + L1(32KB) + Core#2
>       P#5
>       P#13
>     L2(256KB) + L1(32KB) + Core#3
>       P#7
>       P#15
>
> I'm wondering whether I made a mistake or misunderstood something,
> whether there is a bug in the intelligence mvapich2 apparently uses to
> analyze the CPU topology, or whether this intelligence might be
> improved in future mvapich2 releases, so that one no longer needs to
> know a value of MV2_CPU_MAPPING for a particular architecture that is
> more suitable than the default.
>
> Sincerely, BK
>
> --
> Dr. Bernd Kallies
> Konrad-Zuse-Zentrum für Informationstechnik Berlin
> Takustr. 7
> 14195 Berlin
> Tel: +49-30-84185-270
> Fax: +49-30-84185-311
> e-mail: kallies at zib.de
>
>



