[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Angel de Vicente angelv at iac.es
Wed Dec 19 11:52:05 EST 2018


Hi Sourav,

"Chakraborty, Sourav" <chakraborty.52 at buckeyemail.osu.edu> writes:
> Can you please try the following steps and see if the issue gets resolved?
>
> 1. Try the latest Mvapich2-2.3 GA release
> 2. Use mpirun_rsh instead of mpirun (Hydra)
> 3. Set MV2_USE_RDMA_CM=0

Setting MV2_USE_RDMA_CM=0 seems to fix the problem (I tried with up to
600 processes), though I'm not sure what it implies in terms of
performance.
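
In case it is useful, this is roughly how I am setting the variable in
the job script; the process count and binary name are just
placeholders, and the srun --mpi=pmi2 launch matches the pmi2/Slurm
build shown further down:

,----
| #!/bin/bash
| #SBATCH --ntasks=600            # up to 600 processes in my tests
|
| # Work around the start-up problem by disabling RDMA CM
| export MV2_USE_RDMA_CM=0
|
| # Launch through Slurm's PMI2 interface (the build was configured
| # with --with-pmi=pmi2 --with-pm=slurm); ./my_app is a placeholder
| srun --mpi=pmi2 ./my_app
`----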

I tried with the previously mentioned MVAPICH2 release, 2.3rc2, and
with the newest one (2.3, downloaded yesterday; I don't know what GA
means). In both cases disabling RDMA CM seems to work around the
problem.

I was looking at mpirun_rsh yesterday, but I couldn't figure out how to
use it together with Slurm, because the example I saw specified the
hosts in a hostfile. I will look at the documentation again, as I
understand that mpirun_rsh is the recommended way of starting jobs?
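
For the record, what I was going to try was something along these
lines inside the Slurm allocation (untested; ./my_app is a placeholder,
and I am not sure whether mpirun_rsh expects each hostname repeated
once per slot in the hostfile):

,----
| # Expand Slurm's compact nodelist into one hostname per line and
| # hand it to mpirun_rsh as a hostfile.
| scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID
|
| # Environment variables can be passed on the mpirun_rsh command line
| mpirun_rsh -np $SLURM_NTASKS -hostfile hosts.$SLURM_JOB_ID \
|            MV2_USE_RDMA_CM=0 ./my_app
`----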

> If the issue still persists, can you please share more details about
> the system you are using? (Number and type of HCA, etc). Please also
> share the output of the command mpiname -a

Output of mpiname -a with the newly installed version of MVAPICH2:
,----
| $mpiname -a
| MVAPICH2 2.3 Mon Jul 23 22:00:00 EST 2018 ch3:mrail
| 
| Compilation
| CC: icc    -DNDEBUG -DNVALGRIND -O2
| CXX: icpc   -DNDEBUG -DNVALGRIND -O2
| F77: ifort   -O2
| FC: ifort   -O2
| 
| Configuration
| CC=icc CXX=icpc FC=ifort --prefix=/storage/projects/can30/angelv/local/libraries/MPI/MVAPICH2/intel-18.0.2_mvapich2-2.3 --with-pmi=pmi2 --with-pm=slurm
`----

-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/

