[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Subramoni, Hari subramoni.1 at osu.edu
Wed Dec 19 08:01:19 EST 2018


Hi,

Sorry to hear that you are facing issues.

Can you please send the following information:
1. The output of mpiname -a
2. The output of a 9-node run after setting the following environment variables (see the sketch below for one way to set them): MV2_SHOW_ENV_INFO=2 MV2_SHOW_CPU_MAPPING=1 MV2_SHOW_HCA_MAPPING=1
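For instance, a minimal sketch of how those variables could be set in your submission script, assuming the Hydra-based mpirun propagates exported variables to the ranks (with mpirun_rsh they would instead be given on the command line as VAR=value):
,----
| # in test_submit.sh, just before the mpirun line:
| export MV2_SHOW_ENV_INFO=2      # print the MVAPICH2 runtime parameters in use
| export MV2_SHOW_CPU_MAPPING=1   # print the process-to-core mapping
| export MV2_SHOW_HCA_MAPPING=1   # print the process-to-HCA mapping
|
| mpirun -np $SLURM_NTASKS build.intel.mvapich2/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
`----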

I see that you are running the 9-node case also with only 80 processes. Could you please try running the 9-node case with 90 processes (I am assuming you are running 10 processes per node)? If that runs, could you then set MV2_USE_SHMEM_COLL=0 to see whether it makes the 9-node, 80-process case pass? A sketch of both runs follows.
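For example (a sketch reusing the test_submit.sh from your message; where exactly you place the export line is an assumption):
,----
| #$ sbatch -N 9 -n 90 test_submit.sh    # 9 nodes, 10 processes per node
|
| # If that runs, add this line to test_submit.sh before the mpirun line
| # and resubmit the failing 9-node / 80-process case:
| export MV2_USE_SHMEM_COLL=0
|
| #$ sbatch -N 9 -n 80 test_submit.sh
`----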

Thx,
Hari.

-----Original Message-----
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Angel de Vicente
Sent: Tuesday, December 18, 2018 5:40 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Hi,

I'm new to MVAPICH2. In our local cluster we have MVAPICH2-2.3 installed and I'm trying to run our codes there. The problem is that, for some reason, when I submit a job using MVAPICH2, the job seems to start but gets stuck right in the initialization phase, so I get no output, yet the job doesn't crash or end either.

To make sure this had nothing to do with our code (which uses dynamic libraries, HDF5, etc.), I tried running the bcast OSU benchmark instead, and I see the same problem.

MVAPICH2 is compiled with the Intel compilers (version 2018.2), and we use Slurm as the job manager (with an InfiniBand interconnect). The submission script is:
,----
| #$ cat test_submit.sh
| #!/bin/bash
| #
| #SBATCH -J test_OSU_mvapich2
| #SBATCH -N 4
| #SBATCH -n 40
| #SBATCH -t 00:20:00
| #SBATCH -o test_OSU_mvapich2-%j.out
| #SBATCH -e test_OSU_mvapich2-%j.err
| #SBATCH -D .
|
| ######## MVAPICH2
| module purge
| module load intel/2018.2
| module load mvapich2/intel/2.3rc2
| mpirun -np $SLURM_NTASKS build.intel.mvapich2/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
`----

And the code works with no problem, for example with 80 processes on 8 nodes:
,----
| #$ sbatch -N 8 -n 80 test_submit.sh
| Submitted batch job 30306
|
| #$ cat test_OSU_mvapich2-30306.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
| 1                       3.23
| 2                       2.98
| 4                       2.97
| [...]
`----

It does generate some output if I go to 9 nodes, but it gets stuck there:
,----
| #$ sbatch -N 9 -n 80 test_submit.sh
| Submitted batch job 30307
|
| #$ cat test_OSU_mvapich2-30307.out
|
| # OSU MPI Broadcast Latency Test v5.5
| # Size       Avg Latency(us)
`----

And it doesn't produce any output at all if I go to 100 processes:
,----
| #$ sbatch -N 10 -n 100 test_submit.sh
| Submitted batch job 30308
|
| #$ ls -ltr test_OSU_mvapich2-30308.out
| -rw-r--r-- 1 can30003 can30 0 Dec 18 16:09 test_OSU_mvapich2-30308.out
`----


Any pointers on things I can try to figure out what is going on and how to solve it?

Many thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss