[mvapich-discuss] mvapich-discuss Digest, Vol 169, Issue 13

Sashi Balasingam sashibala2 at yahoo.com
Fri Jan 31 01:36:58 EST 2020


Hi Hari,

Thanks for the response. We did manage to find the root cause, which was completely unrelated to MPI.
It was due to the use of 'killall' in a different module (one that launches a child process). 'killall' is no longer available on SLES-15, so we had to use an alternate call. It was a weird interaction that somehow affected MPI communication.
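
For illustration only, here is a minimal sketch of one possible alternative to a name-based 'killall': keep a handle to the child that the module launched and signal that specific process. This is an assumption about the general approach, not the actual fix (the real code is not shown in this thread). The helper names spawn_child and stop_child are hypothetical, and Python is used only for brevity; the same idea maps to fork(), kill() and waitpid() in C.

```python
import subprocess
import sys


def spawn_child(argv):
    """Start a child process and keep its handle (hypothetical helper)."""
    return subprocess.Popen(argv)


def stop_child(child, timeout=5.0):
    """Stop exactly this child by PID instead of matching on process name.

    Sends SIGTERM, waits up to `timeout` seconds, then falls back to SIGKILL.
    Returns the child's exit status (a negative signal number on POSIX).
    """
    if child.poll() is None:          # child is still running
        child.terminate()             # SIGTERM on POSIX
        try:
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            child.kill()              # SIGKILL on POSIX
            child.wait()
    return child.returncode


if __name__ == "__main__":
    # Stand-in child for the sketch: a Python process that just sleeps.
    child = spawn_child([sys.executable, "-c", "import time; time.sleep(60)"])
    status = stop_child(child)
    print("child", child.pid, "exited with status", status)
```

Signalling a known PID does not depend on an external utility being installed, and it cannot hit an unrelated process that happens to share the same name.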
For IP reasons, we may not be able to share the source code or platform.
BTW, for future reference, do you have any response to questions 5b and 5c in my original email below?

Thanks,
Sashi
    On Thursday, January 30, 2020, 06:53:11 AM PST, <mvapich-discuss-request at cse.ohio-state.edu> wrote:  
 
 Send mvapich-discuss mailing list submissions to
    mvapich-discuss at cse.ohio-state.edu

To subscribe or unsubscribe via the World Wide Web, visit
    http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
or, via email, send a message with subject or body 'help' to
    mvapich-discuss-request at cse.ohio-state.edu

You can reach the person managing the list at
    mvapich-discuss-owner at cse.ohio-state.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of mvapich-discuss digest..."


Today's Topics:

  1. Re: Basic comm failures with 2.3.3, on new OS (Subramoni, Hari)


----------------------------------------------------------------------

Message: 1
Date: Thu, 30 Jan 2020 14:50:31 +0000
From: "Subramoni, Hari" <subramoni.1 at osu.edu>
To: Sashi Balasingam <sashibala2 at yahoo.com>,
    "mvapich-discuss at cse.ohio-state.edu"
    <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Basic comm failures with 2.3.3, on new
    OS
Message-ID:
    <CY4PR01MB2661B4097A4956C6B0F97358A4040 at CY4PR01MB2661.prod.exchangelabs.com>
    
Content-Type: text/plain; charset="utf-8"

Dear Sashi,

Sorry to hear that you are facing issues when running your program with MVAPICH2.

Would it be possible for us to have access to your reproducer program and/or your system (since we don't have SuSE systems locally) so that we can debug the problem further?

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Sashi Balasingam
Sent: Tuesday, January 28, 2020 3:51 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Basic comm failures with 2.3.3, on new OS

Hi,

I have been a long-time user of MVAPICH on our products, and it has performed very well for us. However, I am seeing some basic functionality issues on a next-gen platform. See details below:

1.      Platform: Three x86_64 SuperMicro servers, connected by Mellanox FDR InfiniBand, running MVAPICH 2.3.3 on SUSE Linux Enterprise 15.0, kernel 4.12.14-23-default, gcc version 7.3.1

2.      Problem statement: MPI communication between the servers is functional but fails very quickly, stopping all further transmits / receives on all nodes.

3.      Details:

a.      The same s/w has run successfully (for years) on a similar h/w platform, but running MVAPICH 2.2.2a on SUSE Linux Enterprise 12 SP-1, kernel 3.12.49-11-default, gcc version 4.8.5

b.      We use combinations of sync_MPI_Isend(), sync_MPI_Irecv() and sync_MPI_Test() to execute asynchronous communications between the nodes.

c.      There are multiple "concurrent" transmits and receives occurring on every node.

d.      Problem: after some successful comms, the code will stall on sync_MPI_Test(), even though that buffer was received successfully on the target node.

4.      MPI Options used

a.      Output of mpichversion:

        i.      MVAPICH2 Version: 2.3.3

        ii.     MVAPICH2 Release date: Thu January 09 22:00:00 EST 2019

        iii.    MVAPICH2 Device: ch3:mrail

        iv.     MVAPICH2 configure: --prefix=/usr/mpi/gcc/mvapich-2.3.3 --enable-hybrid --enable-shared --enable-g=all --enable-error-messages=all

        v.      MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -g -O2

        vi.     MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -g -O2

        vii.    MVAPICH2 F77: gfortran -L/lib -L/lib -g -O2

        viii.   MVAPICH2 FC: gfortran -g -O2

b.      Launch cmd: mpirun_rsh -rsh -np 2 imc-host compute001 MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=2 MV2_DEBUG_SHOW_BACKTRACE=1

5.      Questions:

a.      Do you know if MVAPICH 2.3.3 has been run successfully on a platform similar to #1 above, or are there any known issues?

b.      Are the build and run-time options shown above OK, or do you recommend changing or adding other options?

c.      Are there any other log options we can enable to debug the above problem?

d.      Any other debug hints?



Appreciate a prompt response.

Thanks,

Sashi



------------------------------

End of mvapich-discuss Digest, Vol 169, Issue 13
************************************************
  