[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Angel de Vicente angelv at iac.es
Tue Dec 18 17:40:27 EST 2018


Hi,

I'm new to MVAPICH2. Our local cluster has MVAPICH2-2.3
installed and I'm trying to run our codes there. The problem is
that when I submit a job using MVAPICH2, the job seems
to start but gets stuck right at the initialization phase: I
don't get any output, but the job doesn't crash or end either.

To make sure that this had nothing to do with our code (which uses
dynamic libraries, HDF5, etc.), I tried running the OSU bcast
benchmark (osu_bcast), and I see the same problem.

MVAPICH2 is compiled with the Intel compilers (version 2018.2), and we use
Slurm as the job manager, with an InfiniBand interconnect. The submission
script is:
,----
| #$ cat test_submit.sh
| #!/bin/bash
| #
| #SBATCH -J test_OSU_mvapich2
| #SBATCH -N 4
| #SBATCH -n 40
| #SBATCH -t 00:20:00
| #SBATCH -o test_OSU_mvapich2-%j.out
| #SBATCH -e test_OSU_mvapich2-%j.err
| #SBATCH -D .
|
| ######## MVAPICH2
| module purge
| module load intel/2018.2
| module load mvapich2/intel/2.3rc2
| mpirun -np $SLURM_NTASKS build.intel.mvapich2/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
`----

The benchmark runs without problems with, for example, 80 processes on 8 nodes:
,----
| #$ sbatch -N 8 -n 80 test_submit.sh
| Submitted batch job 30306
|
| #$ cat test_OSU_mvapich2-30306.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
| 1                       3.23
| 2                       2.98
| 4                       2.97
| [...]
`----

With 9 nodes it prints the header lines, but then it gets stuck:
,----
| #$ sbatch -N 9 -n 80 test_submit.sh
| Submitted batch job 30307
|
| #$ cat test_OSU_mvapich2-30307.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
`----

And it doesn't produce any output at all if I go to 100 processes:
,----
| #$ sbatch -N 10 -n 100 test_submit.sh
| Submitted batch job 30308
|
| #$ ls -ltr test_OSU_mvapich2-30308.out
| -rw-r--r-- 1 can30003 can30 0 Dec 18 16:09 test_OSU_mvapich2-30308.out
`----
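
To tell the three outcomes above apart at a glance (latency rows written,
header only, nothing at all), something like the following sketch could be
run in the submission directory. The function name classify_output is just
illustrative, and it assumes that a run which reached the measurement loop
writes at least one "size latency" row after the '#' header lines; it says
nothing about whether such a run actually finished.

```shell
#!/bin/bash
# Classify an OSU benchmark output file by how far the run got.
# Assumption: data rows start with a message size followed by a latency,
# as in the osu_bcast output shown above; header lines start with '#'.
classify_output() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "no-output"      # missing or empty: nothing was printed at all
    elif grep -Eq '^[0-9]+[[:space:]]+[0-9.]+' "$file"; then
        echo "data"           # at least one latency row was written
    else
        echo "header-only"    # header printed, but no measurements followed
    fi
}

# Report on every output file written by the submission script above.
for f in test_OSU_mvapich2-*.out; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
    printf '%s: %s\n' "$f" "$(classify_output "$f")"
done
```

For the three jobs shown here, this would be expected to report "data" for
30306, "header-only" for 30307 and "no-output" for 30308.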


Any pointers on things I can try to figure out what is going on and how
to solve it?

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/

