[mvapich-discuss] MVAPICH2: MPI process distribution error using Slurm

Subramoni, Hari subramoni.1 at osu.edu
Tue Dec 15 13:34:41 EST 2020


Hi, Vineet.

Sorry to hear that you’re facing issues.

Could you try configuring with one of the following option sets?


  1.  For PMIX support - ../configure --prefix=/home/ext/mr/smer/soniv/tools/install/mvapich2-2.3.5-slurm CC=icc CXX=icpc FC=ifort F77=ifort --enable-romio --with-file-system=lustre --disable-silent-rules --with-slurm=/usr --with-slurm-include=/usr/include/slurm --with-pmix=/opt/pmix/3.1.5 --with-pm=slurm --with-pmi=pmix --disable-rdma-cm --enable-hybrid --with-ch3-rank-bits=32



OR



  1.  For PMI2 support - ../configure --prefix=/home/ext/mr/smer/soniv/tools/install/mvapich2-2.3.5-slurm CC=icc CXX=icpc FC=ifort F77=ifort --enable-romio --with-file-system=lustre --disable-silent-rules --with-slurm=/usr --with-slurm-include=/usr/include/slurm --with-pm=slurm --with-pmi=pmi2 --disable-rdma-cm --enable-hybrid --with-ch3-rank-bits=32

When running, use one of the following srun invocations, depending on which configure option you chose above, and also set MV2_ENABLE_AFFINITY=0.


  1.  For PMIX support - srun --distribution=cyclic --mpi=pmix…

OR


  1.  For PMI2 support - srun --distribution=cyclic --mpi=pmi2 …
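Put together, the PMI2 variant might look roughly like the following batch script (a sketch only; the node/task counts and the application name `./app` are placeholders, not taken from the original message):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Let Slurm handle placement and binding; disable MVAPICH2's own affinity
export MV2_ENABLE_AFFINITY=0

# Match --mpi to the --with-pmi value chosen at configure time
# (--mpi=pmix for the PMIX build, --mpi=pmi2 for the PMI2 build)
srun --distribution=cyclic --mpi=pmi2 --cpu_bind=cores ./app
```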

Here are some detailed comments.


  1.  The following runtime options do not make sense when used together: the first asks MVAPICH2 to bind processes at the core level (which is the default), while the second asks MVAPICH2 to disable its own affinity. If you would like to disable the process-to-core mapping MVAPICH2 does, setting MV2_ENABLE_AFFINITY=0 will suffice.
     *   MV2_CPU_BINDING_LEVEL=core and MV2_ENABLE_AFFINITY=0
  2.  For performance and portability reasons, it would be better to use Linux's CMA feature for intra-node transfers instead of KNEM. Support for CMA is enabled by default in MVAPICH2, so the following can be removed.
     *   --with-knem=/opt/knem-1.1.3.90mlnx1
  3.  Can you please let us know why the following options were chosen? They seem to tell MVAPICH2 to use PMIx, but then give the PMI version as PMI2. Shouldn't it be --with-pmi=pmix?
     *   --with-pmix=/opt/pmix/3.1.5 --with-pm=slurm --with-pmi=pmi2
  4.  MVAPICH2 does not need hcoll support from Mellanox. We have our own multi-level hierarchical collective communication support, which is enabled by default, so you can remove the following:
     *   --with-hcoll=/opt/mellanox/hcoll --with-hcoll-include=/opt/mellanox/hcoll/include --with-hcoll-lib=/opt/mellanox/hcoll/lib
  5.  MVAPICH2 enables XRC automatically at configure time if the support is available in the underlying OFED, so the following can be removed:
     *   --enable-xrc=yes
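As a quick sanity check after rebuilding, MVAPICH2 ships an `mpiname` utility that reports the configure line the library was built with; something along these lines can confirm the unwanted flags are gone (the install path below is the one from your configure prefix):

```shell
# Print the MVAPICH2 version and the configure command line it was built with
/home/ext/mr/smer/soniv/tools/install/mvapich2-2.3.5-slurm/bin/mpiname -a

# The reported configure line should no longer contain
# --with-knem, --with-hcoll, or --enable-xrc
```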

Please let us know if you have any other questions/comments.

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Vineet Soni
Sent: Tuesday, December 15, 2020 7:03 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2: MPI process distribution error using Slurm

Hello,

I am facing an issue of process distribution with MVAPICH2-2.3.5 on AMD EPYC 7742 while using slurm.

I configured MVAPICH2-2.3.5 with:
../configure --prefix=/home/ext/mr/smer/soniv/tools/install/mvapich2-2.3.5-slurm CC=icc CXX=icpc FC=ifort F77=ifort --enable-romio --with-file-system=lustre --disable-silent-rules --with-hcoll=/opt/mellanox/hcoll --with-hcoll-include=/opt/mellanox/hcoll/include --with-hcoll-lib=/opt/mellanox/hcoll/lib --with-slurm=/usr --with-slurm-include=/usr/include/slurm --with-pmix=/opt/pmix/3.1.5 --with-pm=slurm --with-pmi=pmi2 --enable-xrc=yes --with-knem=/opt/knem-1.1.3.90mlnx1 --with-rdma=gen2 --disable-rdma-cm --enable-hybrid --with-ch3-rank-bits=32

It works when the MPI process distribution is handled by MVAPICH2, i.e., without the --distribution argument in srun.
But, if I try to use MV2_CPU_BINDING_LEVEL=core and MV2_ENABLE_AFFINITY=0 with
srun --distribution=cyclic --mpi=pmi2 --cpu_bind=cores...
I get
slurmstepd: error: *** STEP CANCELLED ...
srun: Job step aborted: Waiting up to 362 seconds for job step to finish.
It says it waited for 362 seconds, but in reality, it gets cancelled in 2-3 seconds with the same message every time.

And, if I use SLURM_HOSTFILE to set the process distribution through a list of hostnames and run:
srun --distribution=arbitrary --mpi=pmi2 --cpu_bind=cores...
I get many errors like:
[error_sighandler] Caught error: Bus error (signal 7)

I can use both these methods of process distribution using Intel MPI 2018 and OpenMPI 4.0.2 with Slurm.

The Slurm version is 19.05.7-Bull.1.1. The OS is RHEL 7.8 and the kernel release is 3.10.0-1127.19.1.el7.x86_64.

I need to use the hostfile (or cyclic distribution) to make sure one of the executables (in MPMD mode) is sufficiently scattered across the nodes to avoid running out of memory.
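For reference, the arbitrary-distribution setup looks roughly like this (the hostnames `node01`/`node02` and the application name `./app` are hypothetical):

```shell
# One hostname per line, in rank order; repeating a hostname
# places additional consecutive ranks on that node
cat > hosts.txt <<EOF
node01
node02
node01
node02
EOF

export SLURM_HOSTFILE=$PWD/hosts.txt
srun --distribution=arbitrary --mpi=pmi2 --cpu_bind=cores ./app
```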

Is this an issue related to Slurm or MVAPICH2? And, if this is known, is there any available workaround?

Thank you in advance.

Best,
Vineet

--
Vineet Soni, PhD
HPC Expert
Mercator Ocean International
www.mercator-ocean.fr