[mvapich-discuss] Cores are oversubscribed when running more
than one mpirun instance
Wischert Raphael
wischert at inorg.chem.ethz.ch
Thu Apr 19 06:22:24 EDT 2012
On 14.04.2012, at 16:42, Jonathan Perkins wrote:
> On Sat, Apr 14, 2012 at 10:26:57AM +0000, Wischert Raphael wrote:
>> Devendar wrote:
>>> You can find more details about CPU affinity settings in user guide section at : http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8rc1.html#x1-520006.3
>>>
>>> You indicated that, you are not able to build official release. Is
>>> it mvapich2-1.8rc1? Can you give more details about this build
>>> issue?
>>
>> I have the following problem, when attempting to build the
>> mvapich2-1.8rc1 release with
>> --prefix=/opt/mvapich2/1.8rc1/intel/11.1/075/ CC=icc FC=ifort --with-hwloc
>>
>> mv -f .deps/libnodelist_a-nodelist_parser.Tpo .deps/libnodelist_a-nodelist_parser.Po
>> /bin/sh ../../../../../confdb/ylwrap nodelist_scanner.l .c nodelist_scanner.c -- :
>> make[7]: *** [nodelist_scanner.c] Error 1
>> make[7]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src/slurm'
>> make[6]: *** [all] Error 2
>> make[6]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src/slurm'
>> make[5]: *** [all-recursive] Error 1
>> make[5]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun/src'
>> make[4]: *** [all-recursive] Error 1
>> make[4]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun'
>> make[3]: *** [all] Error 2
>> make[3]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm/mpirun'
>> make[2]: *** [all-redirect] Error 1
>> make[2]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src/pm'
>> make[1]: *** [all-redirect] Error 2
>> make[1]: Leaving directory `/home/rwischert/Downloads/mvapich2-1.8rc1/src'
>> make: *** [all-redirect] Error 2
>>
>> This is similar to what is described in this post:
>> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-March/003804.html
>
> Thanks for providing this info. This problem should be resolved in
> 1.8rc2. In the meantime, can you try applying the attached patch at the
> top level directory after extracting the tarball and before you build?
>
> tar xf mvapich2-1.8rc1.tgz
> cd mvapich2-1.8rc1
>
> patch -p0 < parser.patch
>
> ./configure <options>
> make
>
> Please let us know if this works for you. After you get rc1 to build I
> would suggest using the CPU binding policies as Devendar has suggested.
Thanks a lot for your quick reply, and sorry for the late answer. The patch worked; I was able to build the release successfully.
However, I still have the oversubscription issue, even when I set MV2_CPU_BINDING_LEVEL=socket and MV2_CPU_BINDING_POLICY=scatter.
The problem can be solved with explicit CPU mapping, but that will be too difficult and tedious for "normal" users.
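For reference, a sketch of what such an explicit mapping could look like when two jobs share one 8-core node. The executable names and hostfile are placeholders, and the MV2_CPU_MAPPING syntax is as described in the MVAPICH2 user guide (colon-separated core list, one entry per rank):

```shell
# Sketch: two 4-process jobs on one 8-core node, pinned to disjoint cores.
# Rank i of each job is bound to the i-th core in the colon-separated list.

# Job 1: ranks 0-3 pinned to cores 0-3 (app_a is a placeholder name)
MV2_CPU_MAPPING=0:1:2:3 mpirun_rsh -np 4 -hostfile hosts ./app_a &

# Job 2: ranks 0-3 pinned to cores 4-7, so the jobs do not overlap
MV2_CPU_MAPPING=4:5:6:7 mpirun_rsh -np 4 -hostfile hosts ./app_b &
```

The tedium comes from having to know, per launch, which cores are already taken, which is exactly what a resource manager automates.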
>
> Other than that using a resource manager (like slurm) sounds like your
> best bet as they can do more advanced scheduling and use numactl tricks
> to only expose allocated cpus to each job.
In the meantime I have installed slurm 2.3.4 and rebuilt mvapich2 with slurm support, as described in the manual.
I can now simply run MPI applications with "srun -nX executable > out &".
This works so far, but I have to set MV2_ENABLE_AFFINITY=0; otherwise I still run into the same oversubscription issues, no matter whether bunch or scatter is set. The problem even persists when task affinity is activated in slurm.
However, I have the impression that slurm's resource allocation is not working properly for me. For that, I'll have to consult the slurm mailing list, I guess.
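A sketch of the slurm-based launch described above (executable names, task counts, and output files are placeholders). Setting MV2_ENABLE_AFFINITY=0 disables MVAPICH2's own CPU pinning so that it does not fight the core set slurm allocates to each job:

```shell
# Sketch: disable MVAPICH2's internal affinity and let slurm place tasks.
export MV2_ENABLE_AFFINITY=0

# Two concurrent 4-task jobs; with task affinity enabled in slurm,
# each job should receive a disjoint set of cores.
srun -n4 ./executable > out1 &
srun -n4 ./executable > out2 &
```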
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> <parser.patch>