[mvapich-discuss] MVAPICH2: MPI process distribution error using Slurm

Vineet Soni vsoni at mercator-ocean.fr
Tue Dec 15 07:02:56 EST 2020


Hello, 

I am facing an issue with MPI process distribution using MVAPICH2-2.3.5 on AMD EPYC 7742 nodes under Slurm. 

I configured MVAPICH2-2.3.5 with: 
../configure --prefix=/home/ext/mr/smer/soniv/tools/install/mvapich2-2.3.5-slurm CC=icc CXX=icpc FC=ifort F77=ifort --enable-romio --with-file-system=lustre --disable-silent-rules --with-hcoll=/opt/mellanox/hcoll --with-hcoll-include=/opt/mellanox/hcoll/include --with-hcoll-lib=/opt/mellanox/hcoll/lib --with-slurm=/usr --with-slurm-include=/usr/include/slurm --with-pmix=/opt/pmix/3.1.5 --with-pm=slurm --with-pmi=pmi2 --enable-xrc=yes --with-knem=/opt/knem-1.1.3.90mlnx1 --with-rdma=gen2 --disable-rdma-cm --enable-hybrid --with-ch3-rank-bits=32 

It works when the MPI process distribution is handled by MVAPICH2 itself, i.e., without the --distribution argument to srun. 
But if I try to use MV2_CPU_BINDING_LEVEL=core and MV2_ENABLE_AFFINITY=0 with 
srun --distribution=cyclic --mpi=pmi2 --cpu_bind=cores... 
I get 
slurmstepd: error: *** STEP CANCELLED ... 
srun: Job step aborted: Waiting up to 362 seconds for job step to finish. 
The message says it waited 362 seconds, but in reality the step is cancelled within 2-3 seconds, with the same message every time. 
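
For completeness, the launch I am attempting has roughly the following shape (node/task counts and the executable name are placeholders, not my real values): 

export MV2_ENABLE_AFFINITY=0        # let Slurm, not MVAPICH2, handle CPU binding 
export MV2_CPU_BINDING_LEVEL=core   # binding level requested on the MVAPICH2 side 
srun --nodes=2 --ntasks=8 \
     --distribution=cyclic --mpi=pmi2 --cpu_bind=cores \
     ./my_app                       # placeholder executable 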

And if I use SLURM_HOSTFILE to set the process distribution through a list of hostnames and launch with: 
srun --distribution=arbitrary --mpi=pmi2 --cpu_bind=cores... 
I get many errors like: 
[error_sighandler] Caught error: Bus error (signal 7) 
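
In case it helps, the hostfile-based launch follows the usual Slurm pattern, roughly like this (hostnames, task count, and the executable are placeholders): 

# one node name per task, in the order the ranks should be placed 
cat > hosts.txt <<EOF
node001
node002
node001
node002
EOF
export SLURM_HOSTFILE=$PWD/hosts.txt
srun --ntasks=4 --distribution=arbitrary --mpi=pmi2 --cpu_bind=cores ./my_app 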

I can use both these methods of process distribution using Intel MPI 2018 and OpenMPI 4.0.2 with Slurm. 

The Slurm version is 19.05.7-Bull.1.1, the OS is RHEL 7.8, and the kernel release is 3.10.0-1127.19.1.el7.x86_64. 

I need to use the hostfile (or cyclic distribution) to make sure that one of the executables (in an MPMD run) is sufficiently scattered across the nodes to avoid running out of memory; the rough shape of such a job is sketched below. 
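
To illustrate what I mean (rank ranges and executable names below are made up, and my actual MPMD launch may use a different mechanism), the idea is something like a multi-prog job where the memory-heavy component must not be packed onto a few nodes: 

# mpmd.conf (hypothetical): map task ranks to executables; 
# ranks 0-63 run the memory-heavy component that needs scattering, 
# ranks 64-95 run the lighter component 
0-63    ./ocean_model
64-95   ./io_server

srun --ntasks=96 --distribution=cyclic --mpi=pmi2 --cpu_bind=cores --multi-prog mpmd.conf 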

Is this an issue related to Slurm or MVAPICH2? And, if this is known, is there any available workaround? 

Thank you in advance. 

Best, 
Vineet 

-- 
Vineet Soni, PhD 
HPC Expert 
Mercator Ocean International 
http://www.mercator-ocean.fr 