[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Angel de Vicente angelv at iac.es
Wed Feb 6 11:51:41 EST 2019


Hi,

"Subramoni, Hari" <subramoni.1 at osu.edu> writes:
> Disabling RDMA_CM will not have any impact on performance. It is just one way we
> setup connections. RDMA_CM has better startup performance. However, unless the
> IP addresses on various machines are setup correctly, one may see weird issues
> with applications hanging at startup.
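
(For reference: RDMA_CM-based connection setup can be switched off with
the MV2_USE_RDMA_CM runtime parameter, which is documented in the
MVAPICH2 user guide. A minimal sketch, assuming the mpirun_rsh launcher;
the hostfile name and binary are placeholders:

    # Disable RDMA_CM connection setup; mpirun_rsh takes VAR=VALUE
    # pairs before the executable. "hosts" and "./our_code" are
    # placeholders for your site.
    mpirun_rsh -np 401 -hostfile hosts MV2_USE_RDMA_CM=0 ./our_code
)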

at last I got around to running some benchmarks with our code, using
three different software stacks:

(1)  Intel + IMPI 
   1) intel/2018.2
   2) szip/intel/2.1.1
   3) hdf5/intel/impi/1.10.1
   4) impi/2018.2

(2)  Intel + OpenMPI 
   1) intel/2018.2
   2) szip/intel/2.1.1
   3) openmpi/intel/3.0.1
   4) hdf5/intel/openmpi/1.10.1

(3)  Intel + MVAPICH2 
   1) intel/2018.2
   2) szip/intel/2.1.1
   3) mvapich2/intel/2.3rc2
   4) hdf5/intel/mvapich2/1.10.1
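
(Each stack is a set of environment modules; for stack (3) the loading
looks roughly like the following, assuming an Environment Modules/Lmod
setup — exact commands depend on the site:

    # Load the Intel + MVAPICH2 stack (module names as listed above)
    module purge
    module load intel/2018.2 szip/intel/2.1.1 \
                mvapich2/intel/2.3rc2 hdf5/intel/mvapich2/1.10.1
)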

The largest run I tried was with 401 cores across 40 nodes. The times obtained were:

|-------+--------------------------------------+-------------------------|
| Stack | Best Time [sec] (after iteration 10) | % slower (rel. to best) |
|-------+--------------------------------------+-------------------------|
|     1 |                                  340 |                     0.0 |
|     2 |                                  361 |                     6.2 |
|     3 |                                  383 |                    12.6 |
|-------+--------------------------------------+-------------------------|
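
(The last column is the relative slowdown with respect to the best
time; e.g. for stack 2: (361 - 340) / 340 * 100 ≈ 6.2%.)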


So, in this case MVAPICH2 was about 12.6% slower than Intel MPI and
about 6% slower than Intel + OpenMPI.

Since I'm basically new to MVAPICH2, I don't know what I can tune to
make it faster (if possible). Any advice is welcome.
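
(So far I have only spotted a few runtime parameters in the MVAPICH2
user guide that look relevant; a sketch of the kind of thing I mean —
the values are illustrative starting points, not recommendations, and
the hostfile/binary names are placeholders:

    # Commonly documented MVAPICH2 runtime parameters, passed as
    # VAR=VALUE pairs to mpirun_rsh; values are examples only.
    mpirun_rsh -np 401 -hostfile hosts \
        MV2_ENABLE_AFFINITY=1 \
        MV2_CPU_BINDING_POLICY=scatter \
        MV2_IBA_EAGER_THRESHOLD=65536 \
        ./our_code
)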

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/
