[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Chakraborty, Sourav chakraborty.52 at buckeyemail.osu.edu
Wed Dec 19 10:45:29 EST 2018


Hi Angel,

Can you please try the following steps and see if the issue gets resolved?

1. Try the latest MVAPICH2 2.3 GA release
2. Use mpirun_rsh instead of mpirun (Hydra)
3. Set MV2_USE_RDMA_CM=0

The MVAPICH2 user guide has more details on using mpirun_rsh as the launcher and setting environment variables.
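For example, a Slurm script that applies steps 2 and 3 might look like the sketch below. This is an untested sketch based on the original submission script in this thread: the module names are taken from it, the hostfile derivation via scontrol is an assumption about the cluster setup, and mpirun_rsh's exact hostfile/process mapping should be checked against the MVAPICH2 user guide.

```shell
#!/bin/bash
#SBATCH -N 4
#SBATCH -n 40

module purge
module load intel/2018.2
module load mvapich2/intel/2.3rc2

# Build a hostfile from the Slurm allocation (assumption: one line per
# allocated node is sufficient; some setups may need one line per task).
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID

# Launch with mpirun_rsh instead of Hydra; MVAPICH2 environment variables
# are passed as KEY=VALUE arguments before the executable.
mpirun_rsh -np $SLURM_NTASKS -hostfile hosts.$SLURM_JOB_ID \
    MV2_USE_RDMA_CM=0 \
    build.intel.mvapich2/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
```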

If the issue still persists, can you please share more details about the system you are using? (Number and type of HCA, etc). Please also share the output of the command mpiname -a

Thanks,
Sourav

On Tue, Dec 18, 2018, 4:40 PM Angel de Vicente <angelv at iac.es<mailto:angelv at iac.es> wrote:
Hi,

I'm new to MVAPICH2. Our local cluster has MVAPICH2 2.3 installed and I'm trying to run our codes there. The problem is that, for some reason, when I submit a job using MVAPICH2, the job seems to start but gets stuck right at the initialization phase: I get no output, yet the job neither crashes nor ends.

To make sure this had nothing to do with our code (which uses
dynamic libraries, HDF5, etc.), I tried running the OSU bcast
benchmark, and I see the same problem.

MVAPICH2 is compiled with Intel compilers (version 2018.2), and we have
Slurm as the job manager (and InfiniBand interconnect). The submission script is:
,----
| #$ cat test_submit.sh
| #!/bin/bash
| #
| #SBATCH -J test_OSU_mvapich2
| #SBATCH -N 4
| #SBATCH -n 40
| #SBATCH -t 00:20:00
| #SBATCH -o test_OSU_mvapich2-%j.out
| #SBATCH -e test_OSU_mvapich2-%j.err
| #SBATCH -D .
|
| ######## MVAPICH2
| module purge
| module load intel/2018.2
| module load mvapich2/intel/2.3rc2
| mpirun -np $SLURM_NTASKS build.intel.mvapich2/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
`----

And the benchmark runs without problems, for example with 80 processes on 8 nodes:
,----
| #$ sbatch -N 8 -n 80 test_submit.sh
| Submitted batch job 30306
|
| #$ cat test_OSU_mvapich2-30306.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
| 1                       3.23
| 2                       2.98
| 4                       2.97
| [...]
`----

It does generate some output if I go to 9 nodes, but it gets stuck there:
,----
| #$ sbatch -N 9 -n 80 test_submit.sh
| Submitted batch job 30307
|
| #$ cat test_OSU_mvapich2-30307.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
`----

And it produces no output at all if I go to 100 processes:
,----
| #$ sbatch -N 10 -n 100 test_submit.sh
| Submitted batch job 30308
|
| #$ ls -ltr test_OSU_mvapich2-30308.out
| -rw-r--r-- 1 can30003 can30 0 Dec 18 16:09 test_OSU_mvapich2-30308.out
`----


Any pointers on things I can try to figure out what is going on and how
to solve it?

Many thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss