[mvapich-discuss] mvapich2-1.5.0 CPU mapping

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Fri Jul 16 17:42:39 EDT 2010


Dr. Kallies,
               Thank you for trying out the different CPU binding schemes
that we introduced in the 1.5 version. We would like to note that,
currently, we do not consider the information provided by the scheduler.
However, we have verified that when a job is launched through our
mpirun_rsh, irrespective of how many cores are requested by the user, our
binding policies provide the right mapping. We are definitely interested in
exploring the issues that you are facing and we also thank you for sharing
your enhanced version of scatter.
                On another note, could you please share the output that "lstopo"
generates on your nodes? Going by your explanation, we seem to be seeing a
slightly different core-to-socket mapping on our Nehalem machines.

Thanks,
Krishna



On Fri, Jul 16, 2010 at 7:37 AM, Bernd Kallies <kallies at zib.de> wrote:

> Dear MVAPICH2 dev team,
>
> within mvapich2-1.5.0 you introduced different task placement schemes
> based on CPU topologies that are retrieved via the hwloc API.
>
> It seems to me that your evaluation of the topology yields wrong results
> under certain circumstances. These circumstances arise with unusual
> topologies, which occur when some cores of a machine are mapped out of
> the system, or when a batch system creates cpusets for jobs and a job
> requests fewer cores than the node provides.
>
> On our systems with two quad-core Nehalem processors per node, running an
> MPI job with mvapich2-1.5.0 and 5 tasks per node under the constraint of a
> cpuset created by the Torque batch system that covers the first 5 cores
> (4 cores of the 1st socket/NUMA node, 1 core of the 2nd), the job gets
> the placement 0,4,1,4,2 when MV2_CPU_BINDING_POLICY=scatter is set.
> This is clearly wrong: the last core of the cpuset is used twice, and
> core 3 is not used at all.
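>
> For reference, the binding policy is selected purely via the environment;
> with mpirun_rsh it can also be given on the command line, e.g.
>
>   mpirun_rsh -np 5 -hostfile ./hosts MV2_CPU_BINDING_POLICY=scatter ./app
>
> (the hostfile and binary names here are only placeholders).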
>
> I suspect this happens because the calculation of the pinning map applies
> some intelligence that relies on assumptions about the topology regarding
> sockets, NUMA nodes and the like.
>
> Attached you will find a small example program that implements a scatter
> scheme for arbitrary topologies. The basic idea is to sort a list of
> HWLOC_OBJ_PU objects according to a maximum-distance approach in the
> topology tree, assuming nothing beyond the fact that the topology is a
> tree. I believe this scheme is robust and scales well on large SMP nodes
> like the SGI Altix 4700 or UltraViolet.
>
> I tested this algorithm successfully on our current systems containing
> Intel Nehalem and Intel Harpertown processors. We use it to determine
> pinning maps for various MPI implementations that currently cannot do this
> themselves, and also for hybrid MPI+OpenMP codes. For the strange example
> above, the algorithm gives the correct pinning scheme 0,4,1,2,3.
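>
> For illustration, below is a minimal sketch of one way to realize such a
> max-distance ordering (the attached file contains the complete
> implementation; the sketch assumes hwloc >= 1.0, and the choice of the
> first PU and the tie-breaking are arbitrary):
>
> /* scatter_sketch.c -- greedy max-distance ("scatter") ordering of the
>  * PUs that are allowed for the current process.  Sketch only; the
>  * attached file is the full version.
>  * Build e.g.: gcc scatter_sketch.c -lhwloc -o scatter_sketch */
> #include <stdio.h>
> #include <stdlib.h>
> #include <limits.h>
> #include <hwloc.h>
>
> /* Distance of two objects in the topology tree:
>  * depth(a) + depth(b) - 2 * depth(lowest common ancestor). */
> static int tree_distance(hwloc_topology_t topo, hwloc_obj_t a, hwloc_obj_t b)
> {
>     hwloc_obj_t lca = hwloc_get_common_ancestor_obj(topo, a, b);
>     return (int) a->depth + (int) b->depth - 2 * (int) lca->depth;
> }
>
> int main(void)
> {
>     hwloc_topology_t topo;
>     hwloc_obj_t *pu;
>     int *chosen;
>     int npus, i, j, k;
>
>     hwloc_topology_init(&topo);
>     /* Without HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM, hwloc reports only the
>      * PUs the process is allowed to use, so a Torque cpuset is honored. */
>     hwloc_topology_load(topo);
>
>     npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
>     if (npus < 1)
>         return 1;
>     pu = malloc(npus * sizeof(*pu));
>     chosen = calloc(npus, sizeof(*chosen));
>     for (i = 0; i < npus; i++)
>         pu[i] = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
>
>     /* Start with the first allowed PU, then repeatedly append the PU
>      * whose minimal tree distance to the PUs chosen so far is maximal
>      * (ties go to the lower logical index). */
>     printf("%u", pu[0]->os_index);
>     chosen[0] = 1;
>     for (k = 1; k < npus; k++) {
>         int best = -1, best_dist = -1;
>         for (i = 0; i < npus; i++) {
>             int mindist = INT_MAX;
>             if (chosen[i])
>                 continue;
>             for (j = 0; j < npus; j++)
>                 if (chosen[j]) {
>                     int d = tree_distance(topo, pu[i], pu[j]);
>                     if (d < mindist)
>                         mindist = d;
>                 }
>             if (mindist > best_dist) {
>                 best_dist = mindist;
>                 best = i;
>             }
>         }
>         chosen[best] = 1;
>         printf(",%u", pu[best]->os_index);
>     }
>     printf("\n");
>
>     free(chosen);
>     free(pu);
>     hwloc_topology_destroy(topo);
>     return 0;
> }
>
> The program prints the OS indexes of the allowed PUs in scatter order; for
> the 5-core cpuset described above it should print 0,4,1,2,3.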
>
> I would be happy to provide an mvapich2 patch if you like. Otherwise,
> feel free to implement this in mvapich2 on your own.
>
> Sincerely BK
>
> --
> Dr. Bernd Kallies
> Konrad-Zuse-Zentrum für Informationstechnik Berlin
> Takustr. 7
> 14195 Berlin
> Tel: +49-30-84185-270
> Fax: +49-30-84185-311
> e-mail: kallies at zib.de
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>