[mvapich-discuss] performance problems with gath/scat

Dan Kokron daniel.kokron at nasa.gov
Mon Aug 2 11:39:12 EDT 2010


The message sizes used for the three process counts are as follows:

Procs	Scatterv	Gatherv
256	13-15K		13-14K
512	7K		7K
720	5K		5K

Setting MV2_IBA_EAGER_THRESHOLD=16384 and MV2_VBUF_TOTAL_SIZE=16384 improved
the performance of the 256-process case nicely:

00:22:57 -> 00:17:17

FYI, the default eager/rendezvous crossover size in Intel MPI is 256 KB.
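
In case it is useful for benchmarking, below is a minimal standalone sketch of
a Scatterv at roughly the 256-process message size from the table above. The
~13 KB-per-rank count is a placeholder, not the actual GEOSgcm decomposition;
the point is only that messages of this size fall above the default 12K eager
threshold but under the 16384-byte setting that helped here.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* ~13 KB of doubles per rank (placeholder size only) */
    const int count = 13 * 1024 / sizeof(double);

    double *sendbuf = NULL;
    int *sendcounts = NULL, *displs = NULL;
    if (rank == 0) {
        /* root scatters a contiguous block of 'count' doubles to each rank */
        sendbuf    = malloc((size_t)nproc * count * sizeof(double));
        sendcounts = malloc(nproc * sizeof(int));
        displs     = malloc(nproc * sizeof(int));
        for (int i = 0; i < nproc; i++) {
            sendcounts[i] = count;
            displs[i]     = i * count;
        }
        for (int i = 0; i < nproc * count; i++)
            sendbuf[i] = (double)i;
    }

    double *recvbuf = malloc(count * sizeof(double));

    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_DOUBLE,
                 recvbuf, count, MPI_DOUBLE,
                 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(sendcounts);
    free(displs);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}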

Nemesis results:

I configured mvapich2-1.5 with the following options and ran a synthetic MPI
benchmark.  I found a dramatic speedup in Gatherv.

./configure CC=icc CXX=icpc F77=ifort F90=ifort \
    CFLAGS="-fpic -O0 -traceback -debug" \
    CXXFLAGS="-fpic -O0 -traceback -debug" \
    FFLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv" \
    F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv" \
    --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug \
    --enable-error-checking=all --enable-error-messages=all --enable-g=all \
    --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio \
    --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc

Unfortunately, the GEOSgcm application would not run under the nemesis
build at 256 processes.  Interestingly, it would run at 512 processes.


Here are some stack traces of the failure at 256 processes.  Any ideas?

16x16
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPIR_Allgatherv(799): 
(unknown)(): Other MPI error
MPI process (rank: 0) terminated unexpectedly on borgl065
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source             
GEOSgcm.x          0000000008539C15  MPIDI_nem_ib_get_         951  ib_poll.c
GEOSgcm.x          0000000008537969  MPIDI_nem_ib_read         316  ib_poll.c
GEOSgcm.x          0000000008537B96  MPID_nem_ib_poll          459  ib_poll.c
GEOSgcm.x          000000000852DC06  MPID_nem_network_          16  mpid_nem_network_poll.c
GEOSgcm.x          00000000084C4096  MPID_nem_mpich2_t         799  mpid_nem_inline.h
GEOSgcm.x          00000000084BE817  MPIDI_CH3I_Progre         148  ch3_progress.c
GEOSgcm.x          000000000845F120  MPIC_Wait                 512  helper_fns.c
GEOSgcm.x          000000000845D143  MPIC_Sendrecv             163  helper_fns.c
GEOSgcm.x          0000000008455CA6  MPIR_Allgatherv           793  allgatherv.c
GEOSgcm.x          0000000008456AE6  PMPI_Allgatherv          1082  allgatherv.c
GEOSgcm.x          000000000849B793  pmpi_allgatherv_          195  allgathervf.c

8x32
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPIR_Allgatherv(799):
(unknown)(): Other MPI error
MPI process (rank: 67) terminated unexpectedly on borgj004
Exit code -5 signaled from borgj004
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
libc.so.6          00002AFF28564B17  Unknown               Unknown  Unknown
GEOSgcm.x          00000000084CA06C  MPIDI_CH3I_Progre         100  ch3_progress.c
GEOSgcm.x          000000000846ABE0  MPIC_Wait                 512  helper_fns.c
GEOSgcm.x          0000000008468C03  MPIC_Sendrecv             163  helper_fns.c
GEOSgcm.x          0000000008461766  MPIR_Allgatherv           793  allgatherv.c
GEOSgcm.x          00000000084625A6  PMPI_Allgatherv          1082  allgatherv.c
GEOSgcm.x          00000000084A7253  pmpi_allgatherv_          195  allgathervf.c
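
For what it's worth, a small self-contained Allgatherv test along the lines of
the sketch below (the ~5 KB-per-rank count is a placeholder, not taken from
GEOSgcm) might show whether the nemesis failure reproduces outside the
application at 256 processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* ~5 KB of doubles per rank; vary this to probe the eager/rendezvous
     * crossover (placeholder size, not the GEOSgcm decomposition) */
    const int count = 5 * 1024 / sizeof(double);

    double *sendbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        sendbuf[i] = (double)rank;

    int *recvcounts = malloc(nproc * sizeof(int));
    int *displs     = malloc(nproc * sizeof(int));
    for (int i = 0; i < nproc; i++) {
        recvcounts[i] = count;
        displs[i]     = i * count;
    }
    double *recvbuf = malloc((size_t)nproc * count * sizeof(double));

    MPI_Allgatherv(sendbuf, count, MPI_DOUBLE,
                   recvbuf, recvcounts, displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allgatherv completed on %d ranks\n", nproc);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
    free(displs);
    MPI_Finalize();
    return 0;
}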


On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> Hi Dan,
> 
> Thanks for letting us know the details of the performance issues you are
> seeing. Good to know that MVAPICH2 1.5 is showing better performance
> compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> some of the pt-to-pt tunings we have done in 1.5.
> 
> Here are some suggestions you can try to see if the performance for these
> collectives and the GCM application can be enhanced.
> 
> 1. There are two runtime parameters in MVAPICH2 which control which MPI
> messages go through the eager protocol and which go through the rendezvous
> protocol. Messages going through the rendezvous protocol have higher overhead.  For
> different platforms and adapter types, some default values are defined for
> these two parameters. However, it is very hard to know whether these
> values match the application characteristics.
> 
> a. MV2_IBA_EAGER_THRESHOLD
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> 
> b. MV2_VBUF_TOTAL_SIZE
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> 
> Currently, both of these parameters are set to 12K for ConnectX adapters. I do
> not know the exact adapter being used on the Discover system.
> 
> If the average message sizes in the GCM collectives are higher than 12K,
> it might be helpful to run your application with both of these
> parameters set to a higher value (say 16K or 20K). You can do this at
> run time; just change both parameters simultaneously.
> 
> Let us know if this helps.
> 
> 2. In the MVAPICH2 1.5 release, we also introduced a new Nemesis-IB interface.
> This is based on Argonne's new Nemesis design. The following section in
> MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
> interface.
> 
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> 
> The Nemesis-IB interface has a slightly different design for intra-node
> communication. It also has a different set of algorithms for collectives.
> Currently, this interface is not as `feature-rich' as the `gen2' interface
> you are using. However, it is based on a new design. It would be helpful if
> you could create a build with this interface and see whether it delivers
> better performance for the collectives and the overall application.
> 
> In the meantime, we will also take a look at the performance of the
> collectives you have mentioned and get back to you by next week.
> 
> Thanks,
> 
> DK
> 
> On Thu, 29 Jul 2010, Dan Kokron wrote:
> 
> > Max Suarez asked me to respond to your questions and provide any support
> > necessary to enable us to effectively use MVAPICH2 with our
> > applications.
> >
> > We first noticed issues with performance when scaling the GEOS5 GCM to
> > 720 processes.  We had been using Intel MPI (3.2.x) before switching to
> > MVAPICH2 (1.4.1).  Walltimes (hh:mm:ss) for a test case are as follows
> > for 256p, 512p and 720p using the indicated MPI.  All codes were compiled
> > with the Intel-11.0.083 suite of compilers.  I have attached a text file
> > with hardware and software stack information for the platform used in
> > these tests (discover.HW_SWstack).
> >
> > GCM application run wall time (hh:mm:ss)
> > Procs	mv2-1.4.1 	iMPI-3.2.2.006 	mv2-1.5-2010-07-22
> > 256 	00:23:45 	00:15:53 	00:22:57
> > 512 	00:26:45 	00:11:06 	00:13:58
> > 720 	00:43:12 	00:11:28 	00:16:15
> >
> > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> >
> > Next I instrumented the application with TAU
> > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > level timings.
> >
> > Results from the 256p, 512p and 720p runs show that the performance
> > difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
> > collective operations, specifically Scatterv, Gatherv and
> > MPI_Allgatherv.
> >
> > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > would be appreciated.
> >
> > Dan
> >
> > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > Hi Max,
> > >
> > > Thanks for your note.
> > >
> > > >   We are having serious performance problems
> > > > with collectives when using several hundred cores
> > > > on the Discover system at NASA Goddard.
> > >
> > > Could you please let us know some more details on the performance problems
> > > you are observing - which collectives, what data sizes, what system sizes,
> > > etc.?
> > >
> > > > I noticed some fixes were made to collectives in 1.5.
> > > > Would these help with scat/gath?
> > >
> > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > changed for point-to-point operations (based on platform and adapter
> > > characteristics) to obtain better performance. These changes will also
> > > have a positive impact on the performance of collectives.
> > >
> > > Thus, I suggest that you upgrade to 1.5 first. If the performance
> > > issues for collectives still remain, we will be happy to debug this issue
> > > further.
> > >
> > > > I noticed a couple of months ago someone reporting
> > > > very poor performance in global sums:
> > > >
> > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > >
> > > > But the thread ends unresolved.
> > >
> > > Since the 1.5 release procedure overlapped with the examination of this
> > > issue, we got context-switched. We will take a closer look at this issue
> > > with the 1.5 version.
> > >
> > > > Has anyone else had these problems?
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > --
> > Dan Kokron
> > Global Modeling and Assimilation Office
> > NASA Goddard Space Flight Center
> > Greenbelt, MD 20771
> > Daniel.S.Kokron at nasa.gov
> > Phone: (301) 614-5192
> > Fax:   (301) 614-5304
> >
> 
-- 
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax:   (301) 614-5304


