[mvapich-discuss] Basic comm failures with 2.3.3, on new OS

Subramoni, Hari subramoni.1 at osu.edu
Thu Jan 30 09:50:31 EST 2020


Dear Sashi,

Sorry to hear that you are facing issues when running your program with MVAPICH2.

Would it be possible for us to have access to your reproducer program and/or your system (since we don’t have SuSE systems locally) so that we can debug the problem further?

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Sashi Balasingam
Sent: Tuesday, January 28, 2020 3:51 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Basic comm failures with 2.3.3, on new OS

Hi,

I have been a long-time user of MVAPICH in our products, and it has performed very well for us. However, I am seeing some basic functionality issues on a next-generation platform. Details below:

1.       Platform: Three x86_64 SuperMicro servers, connected by Mellanox FDR InfiniBand, running MVAPICH 2.3.3 on SUSE Linux Enterprise 15.0, kernel 4.12.14-23-default, gcc version 7.3.1

2.       Problem statement: MPI communication between the servers is functional, but fails very quickly and then stops all further transmits/receives on all nodes.

3.       Details :

a.       The same s/w has run successfully for years on a similar h/w platform, but with MVAPICH 2.2.2a, on SUSE Linux Enterprise 12 SP1, kernel 3.12.49-11-default, gcc version 4.8.5

b.       We use combinations of sync_MPI_Isend(), sync_MPI_Irecv(), and sync_MPI_Test() to perform asynchronous communication between the nodes.

c.       There are multiple ‘concurrent’ transmits and receives occurring on every node.

d.       Problem - after some successful comms, the code stalls in sync_MPI_Test(), even though that buffer was received successfully on the target node (a rough sketch of this pattern is included below, after the questions).

4.       MPI Options used

a.       Output of ‘mpichversion’:

         MVAPICH2 Version:       2.3.3
         MVAPICH2 Release date:  Thu January 09 22:00:00 EST 2019
         MVAPICH2 Device:        ch3:mrail
         MVAPICH2 configure:     --prefix=/usr/mpi/gcc/mvapich-2.3.3 --enable-hybrid --enable-shared --enable-g=all --enable-error-messages=all
         MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -g -O2
         MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -g -O2
         MVAPICH2 F77:   gfortran -L/lib -L/lib   -g -O2
         MVAPICH2 FC:    gfortran   -g -O2

b.       Launch cmd: mpirun_rsh -rsh -np 2 imc-host compute001 MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=2 MV2_DEBUG_SHOW_BACKTRACE=1

5.       Questions :

a.       Do you know if MVAPICH 2.3.3 has been run successfully on a platform similar to #1 above, and are there any known issues?

b.       Are the build and run-time options shown above OK, or do you recommend changing or adding other options?

c.       Are there any other logging options we can enable to debug the above problem?

d.       Any other debug hints?
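
For reference, here is a minimal plain-MPI sketch of the pattern described in 3(b)-3(d), assuming our sync_MPI_Isend()/sync_MPI_Irecv()/sync_MPI_Test() calls map onto the standard MPI_Isend()/MPI_Irecv()/MPI_Test(); the buffer size, tag, and peer rank are placeholders rather than our actual values:

    #include <mpi.h>
    #include <string.h>

    /* Sketch only: each rank posts a nonblocking receive and send, then
     * polls both requests with MPI_Test() until they complete.  In our
     * application the stall occurs in this kind of polling loop. */
    int main(int argc, char **argv)
    {
        int rank, size, peer, sdone = 0, rdone = 0;
        char sbuf[4096], rbuf[4096];              /* placeholder buffer size */
        MPI_Request sreq, rreq;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        peer = (rank + 1) % size;                 /* placeholder peer rank */
        memset(sbuf, 0, sizeof(sbuf));

        MPI_Irecv(rbuf, (int)sizeof(rbuf), MPI_CHAR, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &rreq);
        MPI_Isend(sbuf, (int)sizeof(sbuf), MPI_CHAR, peer, 0,
                  MPI_COMM_WORLD, &sreq);

        while (!sdone || !rdone) {                /* poll until both complete */
            if (!sdone) MPI_Test(&sreq, &sdone, MPI_STATUS_IGNORE);
            if (!rdone) MPI_Test(&rreq, &rdone, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

In the real code there are many such sends and receives in flight concurrently on each node, and the stall appears in the MPI_Test() polling even after the peer has already received the buffer.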



Appreciate a prompt response.

Thanks,

Sashi
