[mvapich-discuss] performance problems with gath/scat

Dan Kokron daniel.kokron at nasa.gov
Tue Aug 3 11:19:09 EDT 2010


See comments below.

On Mon, 2010-08-02 at 17:25 -0500, Dhabaleswar Panda wrote:
> Hi Dan,
> 
> > The message sizes used for the three process counts are as follows
> >
> > np	Scatterv	Gatherv
> > 256	13-15K		13-14K
> > 512	7K		7K
> > 720	5K		5K
> 
> Thanks for this information.
> 
> > setting MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 did
> > improve the performance of the 256 case nicely.
> >
> > 00:22:57 -> 00:17:17
> 
> Good to know that you are able to get good performance here by changing
> the above two parameters at run-time.
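
For the archives, a minimal sketch of one way to apply these settings.  The
usual route is simply to export MV2_IBA_EAGER_THRESHOLD and
MV2_VBUF_TOTAL_SIZE in the job script or on the launcher command line; the
sketch below instead sets them in code before MPI_Init, on the assumption
that MVAPICH2 picks the MV2_* variables up from each process's environment
during initialization.

/* Sketch only: raise the eager/rendezvous crossover to 16K before
 * MPI_Init().  Assumes MVAPICH2 reads MV2_* variables from the process
 * environment at init time; normally these would simply be exported
 * from the job script or launcher instead. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* change both parameters together, as recommended */
    setenv("MV2_IBA_EAGER_THRESHOLD", "16384", 1);
    setenv("MV2_VBUF_TOTAL_SIZE", "16384", 1);

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}
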
> 
> > FYI, the default eager/rendezvous crossover size in Intel MPI is 256KB
> 
> This seems to be too high. We are analyzing this.
> 
> > Nemesis results:
> >
> > I configured mv2-1.5 with the following and ran a synthetic MPI
> > benchmark.  I found a dramatic speedup in gatherv.
> 
> This is very good to know.
> 
> > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -O0
> > -traceback -debug" CXXFLAGS="-fpic -O0 -traceback -debug" FFLAGS="-fpic
> > -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv"
> > F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all
> > -fp-stack-check -ftrapuv"
> > --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug
> > --enable-error-checking=all --enable-error-messages=all --enable-g=all
> > --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> > --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc
> >
> > Unfortunately, the GEOSgcm application would not run under the nemesis
> > build at 256 processes.  Interestingly, it would run on 512 processes.
> >
> > Here are some stack traces of the failure on 256.  Any ideas?
> 
> Thanks for the update here and sending us the stack traces. We are taking
> a look at these traces and trying to reproduce this error. We will get
> back to you on this soon.

I used TotalView to examine the actual call that fails.  It's quite a
complicated call: not all of the processes contribute data, and the ones
that do contribute different amounts.
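
For reference, here is a minimal, self-contained sketch (with made-up
counts, not the actual GEOS-5 code) of that kind of pattern: an
MPI_Allgatherv in which some ranks contribute nothing and the rest
contribute different amounts.

/* Sketch only: uneven MPI_Allgatherv where some ranks send no data. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* only even ranks contribute, and each contributes a different count */
    int mycount = (rank % 2 == 0) ? (rank + 1) * 64 : 0;

    int *recvcounts = malloc(nprocs * sizeof(int));
    int *displs     = malloc(nprocs * sizeof(int));

    /* every rank needs all the counts to build the displacements */
    MPI_Allgather(&mycount, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    int total = 0;
    for (int i = 0; i < nprocs; i++) {
        displs[i] = total;
        total += recvcounts[i];
    }

    double *sendbuf = malloc((mycount > 0 ? mycount : 1) * sizeof(double));
    double *recvbuf = malloc((total   > 0 ? total   : 1) * sizeof(double));
    for (int i = 0; i < mycount; i++) sendbuf[i] = (double) rank;

    MPI_Allgatherv(sendbuf, mycount, MPI_DOUBLE,
                   recvbuf, recvcounts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}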

> 
> > 16x16
> > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > MPIR_Allgatherv(799):
> > (unknown)(): Other MPI error
> > MPI process (rank: 0) terminated unexpectedly on borgl065
> > forrtl: error (69): process interrupted (SIGINT)
> > Image              PC                Routine            Line        Source
> > GEOSgcm.x          0000000008539C15  MPIDI_nem_ib_get_         951  ib_poll.c
> > GEOSgcm.x          0000000008537969  MPIDI_nem_ib_read         316  ib_poll.c
> > GEOSgcm.x          0000000008537B96  MPID_nem_ib_poll          459  ib_poll.c
> > GEOSgcm.x          000000000852DC06  MPID_nem_network_          16  mpid_nem_network_poll.c
> > GEOSgcm.x          00000000084C4096  MPID_nem_mpich2_t         799  mpid_nem_inline.h
> > GEOSgcm.x          00000000084BE817  MPIDI_CH3I_Progre         148  ch3_progress.c
> > GEOSgcm.x          000000000845F120  MPIC_Wait                 512  helper_fns.c
> > GEOSgcm.x          000000000845D143  MPIC_Sendrecv             163  helper_fns.c
> > GEOSgcm.x          0000000008455CA6  MPIR_Allgatherv           793  allgatherv.c
> > GEOSgcm.x          0000000008456AE6  PMPI_Allgatherv          1082  allgatherv.c
> > GEOSgcm.x          000000000849B793  pmpi_allgatherv_          195  allgathervf.c
> >
> > 8x32
> > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > MPIR_Allgatherv(799):
> > (unknown)(): Other MPI error
> > MPI process (rank: 67) terminated unexpectedly on borgj004
> > Exit code -5 signaled from borgj004
> > forrtl: error (69): process interrupted (SIGINT)
> > Image              PC                Routine            Line        Source
> > libc.so.6          00002AFF28564B17  Unknown               Unknown  Unknown
> > GEOSgcm.x          00000000084CA06C  MPIDI_CH3I_Progre         100  ch3_progress.c
> > GEOSgcm.x          000000000846ABE0  MPIC_Wait                 512  helper_fns.c
> > GEOSgcm.x          0000000008468C03  MPIC_Sendrecv             163  helper_fns.c
> > GEOSgcm.x          0000000008461766  MPIR_Allgatherv           793  allgatherv.c
> > GEOSgcm.x          00000000084625A6  PMPI_Allgatherv          1082  allgatherv.c
> > GEOSgcm.x          00000000084A7253  pmpi_allgatherv_          195  allgathervf.c
> 
> Thanks,
> 
> DK
> 
> 
> > On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> > > Hi Dan,
> > >
> > > Thanks for letting us know the details of the performance issues you are
> > > seeing. Good to know that MVAPICH2 1.5 is showing better performance
> > > compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> > > some of the pt-to-pt tunings we have done in 1.5.
> > >
> > > Here are some suggestions you can try to see if the performance for these
> > > collectives and the GCM application can be enhanced.
> > >
> > > 1. There are two runtime parameters in MVAPICH2 that control which MPI
> > > messages go through the eager protocol and which go through the rendezvous
> > > protocol. Messages going through the rendezvous protocol have higher
> > > overhead.  For different platforms and adapter types, default values are
> > > defined for these two parameters. However, it is very hard to know whether
> > > these values match the application's characteristics.
> > >
> > > a. MV2_IBA_EAGER_THRESHOLD
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> > >
> > > b. MV2_VBUF_TOTAL_SIZE
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> > >
> > > Currently, both of these parameters are set to 12K for the ConnectX
> > > adapter. I do not know the exact adapter being used on the Discover system.
> > >
> > > If the average message sizes in the GCM collectives are higher than 12K,
> > > it might be helpful to run your application with both parameters set to a
> > > higher value (say 16K, 20K, ...). You can do this at run time; change both
> > > parameters simultaneously.
> > >
> > > Let us know if this helps.
> > >
> > > 2. In MVAPICH2 1.5 release, we also introduced a new Nemesis-IB interface.
> > > This is based on Argonne's new Nemesis design. The following section in
> > > MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
> > > interface.
> > >
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> > >
> > > The Nemesis-IB interface has a slightly different design for intra-node
> > > communication. It also has a different set of algorithms for collectives.
> > > Currently, this interface is not as `feature-rich' as the `gen2' interface
> > > you are using; however, it is based on a new design. It would be helpful if
> > > you could build with this interface and see whether it delivers better
> > > performance for the collectives and the overall application.
> > >
> > > In the meantime, we will also take a look at the performance of the
> > > collectives you have mentioned and get back to you by next week.
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > > On Thu, 29 Jul 2010, Dan Kokron wrote:
> > >
> > > > Max Suarez asked me to respond to your questions and provide any support
> > > > necessary to enable us to effectively use MVAPICH2 with our
> > > > applications.
> > > >
> > > > We first noticed issues with performance when scaling the GEOS5 GCM to
> > > > 720 processes.  We had been using Intel MPI (3.2.x) before switching to
> > > > MVAPICH2 (1.4.1).  Walltimes (hh:mm:ss) for a test case are as follows
> > > > for 256p, 512p and 720p using the indicated MPI.  All codes were compiled
> > > > with the Intel-11.0.083 suite of compilers.  I have attached a text file
> > > > with hardware and software stack information for the platform used in
> > > > these tests (discover.HW_SWstack).
> > > >
> > > > GCM application run wall time
> > > > np	mv2-1.4.1 	iMPI-3.2.2.006 	mv2-1.5-2010-07-22
> > > > 256 	00:23:45 	00:15:53 	00:22:57
> > > > 512 	00:26:45 	00:11:06 	00:13:58
> > > > 720 	00:43:12 	00:11:28 	00:16:15
> > > >
> > > > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> > > >
> > > > Next I instrumented the application with TAU
> > > > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > > > level timings.
> > > >
> > > > Results from 256p, 512p and 720p runs show that the performance
> > > > difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
> > > > collective operations, specifically MPI_Scatterv, MPI_Gatherv and
> > > > MPI_Allgatherv.
> > > >
> > > > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > > > would be appreciated.
> > > >
> > > > Dan
> > > >
> > > > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > > > Hi Max,
> > > > >
> > > > > Thanks for your note.
> > > > >
> > > > > >   We are having serious performance problems
> > > > > > with collectives when using several hundred cores
> > > > > > on the Discover system at NASA Goddard.
> > > > >
> > > > > Could you please let us know some more details on the performance problems
> > > > > you are observing - which collectives, what data sizes, what system sizes,
> > > > > etc.?
> > > > >
> > > > > > I noticed some fixes were made to collectives in 1.5.
> > > > > > Would these help with scat/gath?
> > > > >
> > > > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > > > changed for point-to-point operations (based on platform and adapter
> > > > > characteristics) to obtain better performance. These changes will also
> > > > > have a positive impact on the performance of collectives.
> > > > >
> > > > > Thus, I suggest you upgrade to 1.5 first. If the performance issues
> > > > > with collectives still remain, we will be happy to debug further.
> > > > >
> > > > > > I noticed a couple of months ago someone reporting
> > > > > > very poor performance in global sums:
> > > > > >
> > > > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > > > >
> > > > > > But the thread ends unresolved.
> > > > >
> > > > > Since the 1.5 release procedure overlapped with the examination of this
> > > > > issue, we got context-switched. We will take a closer look at this issue
> > > > > with the 1.5 version.
> > > > >
> > > > > > Has anyone else had these problems?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > DK
> > > > >
> > >
> 
-- 
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax:   (301) 614-5304


