[mvapich-discuss] Basic comm failures with 2.3.3, on new OS

Sashi Balasingam sashibala2 at yahoo.com
Tue Jan 28 15:51:14 EST 2020


Hi,
I have been a long-time user of MVAPICH on our products, and it has performed very well for us. However, I am seeing some basic functionality issues on a next-gen platform. See details below - 

1.       Platform: Three x86_64 SuperMicro servers, connected by Mellanox FDR InfiniBand, running MVAPICH 2.3.3 on SuSe Linux Enterprise 15.0, Kernel – 4.12.14-23-default, gcc version 7.3.1



2.       Problem statement: MPI communication between the servers works at first, but fails very quickly and stops all further transmits / receives on all nodes.



3.       Details :

a.      The same s/w has run successfully (for years) on a similar h/w platform, but running MVAPICH 2.2.2a on SuSe Linux Enterprise 12, SP-1, Kernel – 3.12.49-11-default, gcc version 4.8.5

b.      We use combinations of: sync_MPI_Isend(), sync_MPI_Irecv(), sync_MPI_Test(), to execute asynchronous communications between the nodes

c.      There are multiple, ‘concurrent’ transmits and receives occurring on every node.

d.      Problem - after some successful comms, the code will stall on sync_MPI_Test(), even though that buffer was received successfully on the target node.



4.       MPI options used

a.      Output of ‘mpichversion’:

   i.     MVAPICH2 Version:       2.3.3

  ii.     MVAPICH2 Release date:  Thu January 09 22:00:00 EST 2019

 iii.     MVAPICH2 Device:        ch3:mrail

  iv.     MVAPICH2 configure:     --prefix=/usr/mpi/gcc/mvapich-2.3.3 --enable-hybrid --enable-shared --enable-g=all --enable-error-messages=all

   v.     MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -g -O2

  vi.     MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -g -O2

 vii.     MVAPICH2 F77:   gfortran -L/lib -L/lib   -g -O2

viii.     MVAPICH2 FC:    gfortran   -g -O2



b.      Launch cmd: mpirun_run -rsh -np 2 imc-host compute001 MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=2 MV2_DEBUG_SHOW_BACKTRACE=1



5.       Questions:

a.      Do you know if MVAPICH 2.3.3 has been run successfully on a platform similar to #1 above, or are there any known issues?

b.      Are the build and run-time options shown above OK, or do you recommend changing or adding other options?

c.      Are there any other log options we can enable to debug the above problem?

d.      Any other debug hints?

 

Appreciate a prompt response.  

Thanks,

Sashi
