[mvapich-discuss] Basic comm failures with 2.3.3, on new OS
Sashi Balasingam
sashibala2 at yahoo.com
Tue Jan 28 15:51:14 EST 2020
Hi,
I have been a long-time user of MVAPICH on our products, and it has performed very well for us. However, I am seeing some basic functionality issues on a next-gen platform. Details are below.
1. Platform: Three x86_64 SuperMicro servers, connected by Mellanox FDR InfiniBand, running MVAPICH 2.3.3 on SUSE Linux Enterprise 15.0, kernel 4.12.14-23-default, gcc version 7.3.1
2. Problem statement: MPI communication between the servers is functional at first, but fails very quickly and stops all further transmits and receives on all nodes.
3. Details:
a. The same s/w has run successfully (for years) on a similar h/w platform, but running MVAPICH 2.2.2a on SUSE Linux Enterprise 12 SP-1, kernel 3.12.49-11-default, gcc version 4.8.5
b. We use combinations of sync_MPI_Isend(), sync_MPI_Irecv() and sync_MPI_Test() to execute asynchronous communications between the nodes.
c. There are multiple 'concurrent' transmits and receives occurring on every node.
d. Problem: after some successful comms, the code will stall on sync_MPI_Test(), even though that buffer was received successfully on the target node.
4. MPI options used
a. Output of 'mpichversion'
i. MVAPICH2 Version: 2.3.3
ii. MVAPICH2 Release date: Thu January 09 22:00:00 EST 2019
iii. MVAPICH2 Device: ch3:mrail
iv. MVAPICH2 configure: --prefix=/usr/mpi/gcc/mvapich-2.3.3 --enable-hybrid --enable-shared --enable-g=all --enable-error-messages=all
v. MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -g -O2
vi. MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -g -O2
vii. MVAPICH2 F77: gfortran -L/lib -L/lib -g -O2
viii. MVAPICH2 FC: gfortran -g -O2
b. Launch cmd: mpirun_run -rsh -np 2 imc-host compute001 MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=2 MV2_DEBUG_SHOW_BACKTRACE=1
5. Questions:
a. Do you know if MVAPICH 2.3.3 has been run successfully on a platform similar to #1 above, or whether there are any known issues?
b. Are the build and run-time options shown above OK, or do you recommend changing or adding other options?
c. Are there any other log options we can enable to debug the above problem?
d. Any other debug hints?
Appreciate a prompt response.
Thanks,
Sashi