[mvapich-discuss] mvapich2-1.5.0 CPU mapping

Bernd Kallies kallies at zib.de
Fri Jul 16 07:37:55 EDT 2010


Dear MVAPICH2 dev team,

within mvapich2-1.5.0 you introduced different task placement schemes
based on CPU topologies that are retrieved via the hwloc API.

It seems to me that your evaluation of the topology yields wrong results
under certain circumstances, namely with irregular topologies. These
occur when some cores of a machine are mapped out of the system, or when
a batch system creates a cpuset for a job that requests fewer than all
cores of a node.

Our systems have two quad-core Nehalem CPUs per node. When an MPI job
runs with mvapich2-1.5.0 and 5 tasks per node inside a cpuset created by
the Torque batch system that covers the first 5 cores (4 cores of the
1st socket/NUMA node, 1 core of the 2nd), the job gets the placement
0,4,1,4,2 with MV2_CPU_BINDING_POLICY=scatter. This is clearly wrong:
core 4, the last core of the cpuset, is used twice, and core 3 is not
used at all.

I guess this is because you applied some intelligence in calculating the
pinning map, which includes assumptions about the topology regarding
sockets, NUMA nodes and the like.

Attached is a small example program that implements a scatter scheme
for arbitrary topologies. It sorts a list of HWLOC_OBJ_PU objects
according to a maximum-distance approach in the topology tree, without
assuming anything beyond the fact that the topology is a tree. I believe
this scheme is robust and scales well on large SMP nodes like SGI Altix
4700 or UltraViolet.
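
To illustrate the idea without hwloc, here is a minimal, self-contained
sketch. It is not the attached scatter.c. The type pu_t, the fixed
two-level path (socket, core), the constants DEPTH and MAX_PUS, and the
greedy rule (pick the unused PU with the largest summed tree distance to
all PUs chosen so far, ties going to the earlier list entry) are
assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH   2   /* hypothetical levels above a PU: socket, core */
#define MAX_PUS 64  /* hypothetical upper bound for this sketch */

/* Stand-in for a HWLOC_OBJ_PU: its OS index plus the path of child
 * indices leading from the root of the topology tree down to it. */
typedef struct {
    int os_index;
    int path[DEPTH];
} pu_t;

/* Tree distance: number of levels below the deepest common ancestor. */
int pu_distance(const pu_t *a, const pu_t *b)
{
    int common = 0;
    while (common < DEPTH && a->path[common] == b->path[common])
        common++;
    return DEPTH - common;
}

/* Greedy scatter: start with the first PU, then repeatedly take the
 * unused PU whose summed distance to all chosen PUs is largest.
 * Writes the OS indices in placement order to order[0..n-1]. */
void scatter_order(const pu_t *pus, size_t n, int *order)
{
    int used[MAX_PUS] = {0};
    size_t chosen[MAX_PUS];
    size_t placed, i, j;

    assert(n <= MAX_PUS);
    if (n == 0)
        return;

    chosen[0] = 0;
    used[0] = 1;
    order[0] = pus[0].os_index;

    for (placed = 1; placed < n; placed++) {
        size_t best = 0;
        int best_score = -1;
        for (i = 0; i < n; i++) {
            int score = 0;
            if (used[i])
                continue;
            for (j = 0; j < placed; j++)
                score += pu_distance(&pus[i], &pus[chosen[j]]);
            if (score > best_score) {  /* strict: earlier entry wins ties */
                best_score = score;
                best = i;
            }
        }
        chosen[placed] = best;
        used[best] = 1;
        order[placed] = pus[best].os_index;
    }
}
```

For the cpuset described above (cores 0-3 on the first socket, core 4 on
the second), passing the five PUs in OS-index order uses every core
exactly once. The sketch makes no assumption about how many PUs sit
under each socket.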

I tested this algorithm successfully on our current systems containing
Intel Nehalem and Intel Harpertown. We use it to determine pinning maps
for various MPI implementations that are currently not able to do this,
and also for hybrid MPI+OpenMP codes. For the irregular example above,
the algorithm gives the correct pinning scheme 0,4,1,2,3.

I would be able to provide an mvapich2 patch, if you like. Otherwise
feel free to implement this in mvapich2 on your own.

Sincerely BK

-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scatter.c
Type: text/x-csrc
Size: 2314 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20100716/19b3e012/scatter.bin

