[mvapich-discuss] Cores are oversubscribed when running more than one mpirun instance

Wischert Raphael wischert at inorg.chem.ethz.ch
Thu Apr 19 06:22:24 EDT 2012


On 14.04.2012, at 16:42, Jonathan Perkins wrote:

> On Sat, Apr 14, 2012 at 10:26:57AM +0000, Wischert  Raphael wrote:
>> Devendar wrote: 
>>> You can find more details about CPU affinity settings in user guide section at : http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8rc1.html#x1-520006.3
>>> 
>>> You indicated that you are not able to build the official release. Is
>>> it mvapich2-1.8rc1? Can you give more details about this build
>>> issue?
>> 
>> I have the following problem when attempting to build the
>> mvapich2-1.8rc1 release with
>> --prefix=/opt/mvapich2/1.8rc1/intel/11.1/075/ CC=icc FC=ifort --with-hwloc
>> 
>> mv -f .deps/libnodelist_a-nodelist_parser.Tpo .deps/libnodelist_a-nodelist_parser.Po
>> /bin/sh ../../../../../confdb/ylwrap nodelist_scanner.l .c nodelist_scanner.c -- :
>> make[7]: *** [nodelist_scanner.c] Error 1
>> make[7]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src/slurm'
>> make[6]: *** [all] Error 2
>> make[6]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src/slurm'
>> make[5]: *** [all-recursive] Error 1
>> make[5]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src'
>> make[4]: *** [all-recursive] Error 1
>> make[4]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun'
>> make[3]: *** [all] Error 2
>> make[3]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun'
>> make[2]: *** [all-redirect] Error 1
>> make[2]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm'
>> make[1]: *** [all-redirect] Error 2
>> make[1]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src'
>> make: *** [all-redirect] Error 2
>> 
>> This is similar to what is described in this post:
>> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-March/003804.html
> 
> Thanks for providing this info.  This problem should be resolved in
> 1.8rc2.  In the meantime, can you try applying the attached patch in the
> top-level directory after extracting the tarball and before you build?
> 
> tar xf mvapich2-1.8rc1.tgz
> cd mvapich2-1.8rc1
> 
> patch -p0 < parser.patch
> 
> ./configure <options>
> make
> 
> Please let us know if this works for you.  After you get rc1 to build, I
> would suggest using the CPU binding policies that Devendar has suggested.

Thanks a lot for your quick reply, and sorry for the late answer. The patch worked; I was able to build the release successfully.
However, I still have the oversubscription issue, even when I set MV2_CPU_BINDING_LEVEL=socket and MV2_CPU_BINDING_POLICY=scatter.
The problem can be solved with explicit CPU mapping, but that is too difficult and tedious for "normal" users.
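
For reference, here is a sketch of what I mean; ./a.out, the process counts, and the core IDs are placeholders. With only the binding policy set, a second instance lands on the same cores:

export MV2_CPU_BINDING_LEVEL=socket
export MV2_CPU_BINDING_POLICY=scatter
mpirun -np 4 ./a.out > out1 &
mpirun -np 4 ./a.out > out2 &    # binds to the same cores as the first instance

Explicit per-rank mapping (MV2_CPU_MAPPING, described in section 6.3 of the user guide) avoids the overlap, but has to be chosen by hand for every job:

MV2_CPU_MAPPING=0:1:2:3 mpirun -np 4 ./a.out > out1 &
MV2_CPU_MAPPING=4:5:6:7 mpirun -np 4 ./a.out > out2 &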

> 
> Other than that using a resource manager (like slurm) sounds like your
> best bet as they can do more advanced scheduling and use numactl tricks
> to only expose allocated cpus to each job.

In the meantime I have installed slurm 2.3.4 and rebuilt mvapich2 with slurm support, as described in the manual.
I can now simply run MPI applications with "srun -nX executable > out &", as shown below.
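
For example, two concurrent jobs (the executables are placeholders) are started as:

srun -n 4 ./a.out > out1 &
srun -n 4 ./b.out > out2 &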

This works so far, but I have to set MV2_ENABLE_AFFINITY=0; otherwise I still run into the same oversubscription issue, regardless of whether bunch or scatter is set. The problem even persists when task affinity is activated in slurm.
However, I have the impression that slurm's resource allocation is not working properly for me, but for that I will probably have to consult the slurm mailing list.
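
For completeness, the combination that works for me right now disables MVAPICH2's affinity and leaves the binding to slurm (--cpu_bind comes from slurm's task/affinity plugin; I am not sure the exact spelling is the same in every version):

export MV2_ENABLE_AFFINITY=0
srun -n 4 --cpu_bind=cores ./a.out > out &

The numactl trick Jonathan mentions would presumably amount to restricting each job to a disjoint set of cores, something like:

numactl --physcpubind=0-3 ./a.out > out &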

> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> <parser.patch>



