[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem
Angel de Vicente
angelv at iac.es
Wed Feb 6 11:51:41 EST 2019
Hi,
"Subramoni, Hari" <subramoni.1 at osu.edu> writes:
> Disabling RDMA_CM will not have any impact on performance. It is just one way we
> setup connections. RDMA_CM has better startup performance. However, unless the
> IP addresses on various machines are setup correctly, one may see weird issues
> with applications hanging at startup.
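For anyone hitting the startup hangs mentioned above: a minimal sketch, assuming the MV2_USE_RDMA_CM runtime parameter documented for the MVAPICH2 2.3 series, showing how RDMA_CM-based connection setup can be switched off per job:

```shell
# Fall back to MVAPICH2's default connection setup instead of RDMA_CM;
# useful when node IP addressing makes RDMA_CM hang at startup.
export MV2_USE_RDMA_CM=0
```

Per the quoted advice, this should not cost steady-state performance, only (potentially) some startup time.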
at last I got around to running some benchmarks with our code and three
different software stacks:
(1) Intel + IMPI
1) intel/2018.2
2) szip/intel/2.1.1
3) hdf5/intel/impi/1.10.1
4) impi/2018.2
(2) Intel + OpenMPI
1) intel/2018.2
2) szip/intel/2.1.1
3) openmpi/intel/3.0.1
4) hdf5/intel/openmpi/1.10.1
(3) Intel + MVAPICH2
1) intel/2018.2
2) szip/intel/2.1.1
3) mvapich2/intel/2.3rc2
4) hdf5/intel/mvapich2/1.10.1
The largest run I tried used 401 cores on 40 nodes. The times obtained were:
|-------+--------------------------------------+----------------------+-------------|
| Stack | Best time [s] (after iteration 10)   | % slower (vs. best)  | Other times |
|-------+--------------------------------------+----------------------+-------------|
| 1     | 340                                  | 0.0                  |             |
| 2     | 361                                  | 6.2                  |             |
| 3     | 383                                  | 12.6                 |             |
|-------+--------------------------------------+----------------------+-------------|
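For reference, the relative-slowdown column is just 100 * (t / t_best - 1), which can be reproduced from the measured times:

```python
# Recompute the "% slower" column from the measured best times [s].
best = 340  # stack 1 (Intel + IMPI)
for stack, t in [(1, 340), (2, 361), (3, 383)]:
    pct = 100 * (t / best - 1)
    print(f"Stack {stack}: {t} s, {pct:.1f}% slower than best")
```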
So, in this case MVAPICH2 was about 12% slower than Intel MPI and also
slower than Intel+OpenMPI.
Since I'm basically new to MVAPICH2, I don't know what I can tune to
make it faster (if that's possible). Any advice is welcome.
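Not an authoritative tuning recipe, but a starting sketch using runtime parameters documented for the MVAPICH2 2.3 series (parameter names are assumptions to verify against the installed version's user guide):

```shell
# Pin MPI ranks to cores and control placement: 'scatter' spreads ranks
# across sockets, 'bunch' packs them together; both are worth timing.
export MV2_ENABLE_AFFINITY=1
export MV2_CPU_BINDING_POLICY=scatter

# Print the binding actually applied, to confirm ranks land where expected.
export MV2_SHOW_CPU_BINDING=1
```

Comparing a run with 'scatter' against one with 'bunch' is a cheap first experiment before touching collective or rendezvous tunables.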
Many thanks,
--
Ángel de Vicente
Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/