[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Subramoni, Hari subramoni.1 at osu.edu
Wed Dec 19 12:14:17 EST 2018


Hi,

Looks like Sourav and I suggested the same thing.

Disabling RDMA_CM will not have any impact on communication performance; it is just one of the ways we set up connections. RDMA_CM gives better startup performance, but unless the IP addresses on the various machines are set up correctly, one may see odd issues such as applications hanging at startup.
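
For example, something along these lines disables RDMA_CM for one run and checks that the node actually has an IPoIB address configured (a minimal sketch; the interface name ib0 is just an assumption, adjust it to your fabric):

,----
| # Disable RDMA_CM-based connection setup for this run only.
| export MV2_USE_RDMA_CM=0
|
| # Sanity check: does the IB interface have an IP address assigned?
| # (interface name ib0 is an assumption; adjust as needed)
| ip addr show ib0
`----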

You cannot use mpirun_rsh together with SLURM.
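
Since your build is configured with --with-pm=slurm --with-pmi=pmi2 (see the mpiname output below), the natural launch path is srun itself. A minimal sketch, where the application name and task count are placeholders:

,----
| # Launch through Slurm's own launcher using the PMI2 interface
| # that MVAPICH2 was configured against.
| srun --mpi=pmi2 -n 600 ./your_application
`----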

Thx,
Hari.

-----Original Message-----
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Angel de Vicente
Sent: Wednesday, December 19, 2018 11:52 AM
To: Chakraborty, Sourav <chakraborty.52 at buckeyemail.osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Hi Sourav,

"Chakraborty, Sourav" <chakraborty.52 at buckeyemail.osu.edu> writes:
> Can you please try the following steps and see if the issue gets resolved?
>
> 1. Try the latest MVAPICH2-2.3 GA release
> 2. Use mpirun_rsh instead of mpirun (Hydra)
> 3. Set MV2_USE_RDMA_CM=0

Setting MV2_USE_RDMA_CM=0 seems to fix the problem (I tried with up to
600 processes), though I'm not sure what it implies in terms of performance.

I tried with the previously mentioned MVAPICH2 release 2.3rc2 and with the newest one (downloaded yesterday, 2.3; I don't know what GA means). In both cases disabling RDMA_CM seems to get around the problem.

I was looking at mpirun_rsh yesterday, but I didn't find out how to use it together with Slurm, because the example I saw had to specify the hosts via a hostfile (roughly as sketched below). I will look at the documentation again, as I understand mpirun_rsh is the recommended way of starting jobs?
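
The example I saw was roughly of this form (the hostfile and executable names are placeholders, and the hostfile would have to be generated by hand from the allocation):

,----
| # Hypothetical hostfile-based launch, one hostname per line in ./hosts,
| # e.g. generated with: scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts
| mpirun_rsh -np 600 -hostfile hosts MV2_USE_RDMA_CM=0 ./your_application
`----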

> If the issue still persists, can you please share more details about 
> the system you are using? (Number and type of HCA, etc). Please also 
> share the output of the command mpiname -a

mpiname with the newly installed version of mvapich2:
,----
| $mpiname -a
| MVAPICH2 2.3 Mon Jul 23 22:00:00 EST 2018 ch3:mrail
| 
| Compilation
| CC: icc    -DNDEBUG -DNVALGRIND -O2
| CXX: icpc   -DNDEBUG -DNVALGRIND -O2
| F77: ifort   -O2
| FC: ifort   -O2
| 
| Configuration
| CC=icc CXX=icpc FC=ifort 
| --prefix=/storage/projects/can30/angelv/local/libraries/MPI/MVAPICH2/intel-18.0.2_mvapich2-2.3 --with-pmi=pmi2 --with-pm=slurm
`----

--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss