[Mvapich-discuss] Very slow startup on mvapich 4.0
Alex
mgs.rus.52 at gmail.com
Sat Aug 23 05:06:53 EDT 2025
Hi,
Recently I compared MVAPICH (based on MPICH 4.3.0, as I recall) and MPICH
4.3.1 on a single-node Intel Xeon 6972P (it has a Mellanox fabric, but since
the run is single-node that's not relevant). The application itself is quite
tricky, but the same issue shows up in IMB: the more ranks you start, the
longer the delay (2 ranks start almost instantly). The test was as follows:
1. Both MPIs are configured the same way:
./configure --prefix=$HOMEINIT/mvapich/4.0x-mt-ucx --enable-silent-rules \
  --with-device=ch4:ucx:shm --with-pm=hydra --enable-romio \
  --with-ch3-rank-bits=32 --enable-threads=multiple --without-ze \
  --with-file-system=lustre+nfs \
  --enable-shared --with-hwloc=embedded --with-ucx=embedded \
  --with-libfabric=embedded --enable-fortran=all --with-ch4-shmmods=posix \
  CC=icx F77=ifx FC=ifx CXX=icpx \
  MPICHLIB_CPPFLAGS="-I$WORKINIT/misc.libs/lustre-release/lustre/include -I$WORKINIT/misc.libs/lustre-release/lustre/include/uapi" \
  MPICHLIB_CFLAGS='-Wno-unused-but-set-variable -Wno-tautological-constant-compare -Wno-initializer-overrides' \
  MPICHLIB_FCFLAGS='-Wno-unused-but-set-variable -Wno-tautological-constant-compare -Wno-initializer-overrides' \
  MPICHLIB_CXXFLAGS='-Wno-unused-but-set-variable -Wno-tautological-constant-compare -Wno-initializer-overrides' \
  2>&1 | tee configure.log
(the only difference is the installation path)
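To double-check which options a given build actually recorded, they can be
printed back from the install; a minimal sketch, assuming the MPICH-derived
install ships the standard mpichversion tool:

  $HOMEINIT/mvapich/4.0x-mt-ucx/bin/mpichversion | grep -i 'configure options'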
2. Execute the application (a standalone MPI_Init timer sketch follows this
   list):
   mpiexec.hydra -launcher ssh -genvall -bind-to core:1 -np 192 ./app
3. Review its report.
4. Recompile MVAPICH without --with-ch4-shmmods=posix
5. Repeat the MVAPICH test.
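As a cross-check, the init time can also be measured independently of the
application with a trivial timer around MPI_Init; a minimal sketch
(init_timer.c is just a hypothetical helper here; assumes mpicc and
mpiexec.hydra of the build under test are on PATH):

  cat > init_timer.c <<'EOF'
  #include <mpi.h>
  #include <stdio.h>
  #include <time.h>

  int main(int argc, char **argv)
  {
      /* MPI_Wtime is unavailable before MPI_Init, so use clock_gettime */
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      MPI_Init(&argc, &argv);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      if (rank == 0)
          printf("MPI_Init: %.2f s\n", dt);
      MPI_Finalize();
      return 0;
  }
  EOF
  mpicc -o init_timer init_timer.c
  mpiexec.hydra -launcher ssh -bind-to core:1 -np 192 ./init_timer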
So the results are as follows:
1. MPICH 4.3.1
Initialization time : 4.02 s
Elapsed time : 94.39 s
2. MVAPICH
Initialization time : 55.06 s
Elapsed time : 131.06 s
3. MVAPICH with no posix shmem
Initialization time : 4.03 s
Elapsed time : 108.99 s
As you can see, MVAPICH is noticeably faster in the execution stage (the
Elapsed figures include initialization), but startup ruins the picture.
Are there any differences in the shmem path (apart from MVAPICH having its
own MV_SHM or so), and how can this be fixed?
As I said earlier, the same issue can be observed with IMB (presumably on
any high-PPN run). The only reason I used this application is that it
reports its init phase :).
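For reproduction with IMB, launching any IMB-MPI1 benchmark at full PPN
should show the same growing delay before the first output as -np increases;
a sketch, assuming IMB-MPI1 is built against the library under test:

  mpiexec.hydra -launcher ssh -bind-to core:1 -np 192 ./IMB-MPI1 -npmin 192 Barrier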