[mvapich-discuss] performance problems with gath/scat

Dan Kokron daniel.kokron at nasa.gov
Thu Aug 5 12:27:15 EDT 2010


Your 'corner-case' hypothesis regarding Allgatherv is probably correct
as I am not able to reproduce the failure using IMB either.  The
application is quite large and has many external dependencies (ESMF,
NetCDF and others).  I'd like to send you a tarball containing just the
particular call and its arguments.  I will look into this.  Is there a checkpoint
feature available for the Nemesis channel?
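
Something roughly along the lines of the following standalone driver is what
I have in mind for that tarball; the per-rank counts and the datatype below
are placeholders rather than the values GEOSgcm actually uses:

/* Hypothetical standalone driver for the failing MPI_Allgatherv call.
 * The counts below are placeholders; the real reproducer would replay
 * the per-rank counts and displacements captured from GEOSgcm. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        counts[i] = 1000 + (i % 16);   /* placeholder, slightly uneven counts */
        displs[i] = total;
        total += counts[i];
    }

    double *sendbuf = malloc(counts[rank] * sizeof(double));
    double *recvbuf = malloc(total * sizeof(double));
    for (i = 0; i < counts[rank]; i++)
        sendbuf[i] = rank + 0.001 * i;

    MPI_Allgatherv(sendbuf, counts[rank], MPI_DOUBLE,
                   recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allgatherv completed, %d elements gathered\n", total);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}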

One other note regarding Nemesis: I had another failure, this time at
512 processes.  The failure message was:
[ib_vbuf.c 256] Cannot register vbuf region

I look forward to testing the new Scatterv/Gatherv code.

Dan

On Thu, 2010-08-05 at 10:52 -0500, Dhabaleswar Panda wrote:
> Hi Dan,
> 
> > > The message sizes used for the three process counts are as follows
> > >
> > > Procs	Scatterv	Gatherv
> > > 256	13-15K		13-14K
> > > 512	7K		7K
> > > 720	5K		5K
> >
> > Thanks for this information.
> >
> > > setting MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 did
> > > improve the performance of the 256 case nicely.
> > >
> > > 00:22:57 -> 00:17:17
> >
> > Good to know that you are able to get good performance here by changing
> > the above two parameters at run-time.
> 
> We have analyzed the Scatterv and Gatherv algorithms being used in 1.5 and
> think that they can be improved further to deliver better performance. We
> are working on these improvements. We might be able to send you a patch
> (or an updated tarball) by next week.
> 
> > > FYI, the default eager/rendezvous crossover size in Intel MPI is 256KB
> >
> > This seems to be too high. We are analyzing this.
> >
> > > Nemesis results:
> > >
> > > I configured mv2-1.5 with the following and ran a synthetic MPI
> > > benchmark.  I found a dramatic speedup in Gatherv.
> >
> > This is very good to know.
> >
> > > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -O0
> > > -traceback -debug" CXXFLAGS="-fpic -O0 -traceback -debug" FFLAGS="-fpic
> > > -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv"
> > > F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all
> > > -fp-stack-check -ftrapuv"
> > > --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug
> > > --enable-error-checking=all --enable-error-messages=all --enable-g=all
> > > --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> > > --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc
> > >
> > > Unfortunately, the GEOSgcm application would not run under the Nemesis
> > > build at 256 processes.  Interestingly, it would run on 512 processes.
> > >
> > > Here are some stack traces of the failure on 256.  Any ideas?
> >
> > Thanks for the update here and sending us the stack traces. We are taking
> > a look at these traces and trying to reproduce this error. We will get
> > back to you on this soon.
> 
> We tried the Allgatherv tests from the IMB suite on 256 cores (32 nodes
> with 8 cores/node) and the tests pass.
> 
> The failure could be due to a corner case being reached for the varying
> message sizes used in the MPI_Allgatherv call.
> 
> I do not know whether this GEOSgcm application is public. If this is
> public, we will be happy to run it on our cluster and try to debug the
> problem. Let us know how to get a copy of it.
> 
> Alternatively, would it be possible for you to send us a code snippet that
> uses MPI_Allgatherv as it is called in this application, with the
> appropriate message sizes?  This will help us run the snippet and debug
> this problem faster.
> 
> > > 16x16
> > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > MPIR_Allgatherv(799):
> > > (unknown)(): Other MPI error
> > > MPI process (rank: 0) terminated unexpectedly on borgl065
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image              PC                Routine            Line        Source
> > > GEOSgcm.x          0000000008539C15  MPIDI_nem_ib_get_         951  ib_poll.c
> > > GEOSgcm.x          0000000008537969  MPIDI_nem_ib_read         316  ib_poll.c
> > > GEOSgcm.x          0000000008537B96  MPID_nem_ib_poll          459  ib_poll.c
> > > GEOSgcm.x          000000000852DC06  MPID_nem_network_          16  mpid_nem_network_poll.c
> > > GEOSgcm.x          00000000084C4096  MPID_nem_mpich2_t         799  mpid_nem_inline.h
> > > GEOSgcm.x          00000000084BE817  MPIDI_CH3I_Progre         148  ch3_progress.c
> > > GEOSgcm.x          000000000845F120  MPIC_Wait                 512  helper_fns.c
> > > GEOSgcm.x          000000000845D143  MPIC_Sendrecv             163  helper_fns.c
> > > GEOSgcm.x          0000000008455CA6  MPIR_Allgatherv           793  allgatherv.c
> > > GEOSgcm.x          0000000008456AE6  PMPI_Allgatherv          1082  allgatherv.c
> > > GEOSgcm.x          000000000849B793  pmpi_allgatherv_          195  allgathervf.c
> > >
> > > 8x32
> > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > MPIR_Allgatherv(799):
> > > (unknown)(): Other MPI error
> > > MPI process (rank: 67) terminated unexpectedly on borgj004
> > > Exit code -5 signaled from borgj004
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image              PC                Routine            Line        Source
> > > libc.so.6          00002AFF28564B17  Unknown               Unknown  Unknown
> > > GEOSgcm.x          00000000084CA06C  MPIDI_CH3I_Progre         100  ch3_progress.c
> > > GEOSgcm.x          000000000846ABE0  MPIC_Wait                 512  helper_fns.c
> > > GEOSgcm.x          0000000008468C03  MPIC_Sendrecv             163  helper_fns.c
> > > GEOSgcm.x          0000000008461766  MPIR_Allgatherv           793  allgatherv.c
> > > GEOSgcm.x          00000000084625A6  PMPI_Allgatherv          1082  allgatherv.c
> > > GEOSgcm.x          00000000084A7253  pmpi_allgatherv_          195  allgathervf.c
> 
> Thanks,
> 
> DK
> 
> >
> > > On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> > > > Hi Dan,
> > > >
> > > > Thanks for letting us know the details of the performance issues you are
> > > > seeing. Good to know that MVAPICH2 1.5 is showing better performance
> > > > compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> > > > some of the pt-to-pt tunings we have done in 1.5.
> > > >
> > > > Here are some suggestions you can try to see if the performance for these
> > > > collectives and the GCM application can be enhanced.
> > > >
> > > > 1. There are two runtime parameters in MVAPICH2 that control which MPI
> > > > messages go through the eager protocol and which go through the rendezvous
> > > > protocol.  Messages going through the rendezvous protocol have higher
> > > > overhead.  For different platforms and adapter types, default values are
> > > > defined for these two parameters.  However, it is hard to know whether
> > > > these values match the application characteristics.
> > > >
> > > > a. MV2_IBA_EAGER_THRESHOLD
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> > > >
> > > > b. MV2_VBUF_TOTAL_SIZE
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> > > >
> > > > Currently, both of these parameters are set to 12K for the ConnectX adapter.
> > > > I do not know the exact adapter being used on the Discover system.
> > > >
> > > > If the average message sizes in the GCM collectives are higher than 12K,
> > > > it might be helpful to run your application with both of these parameters
> > > > set to a higher value (say 16K, 20K, ...). You can do this at run time;
> > > > change both parameters simultaneously.
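> > > > As a minimal sketch (assuming mpirun_rsh and a hostfile named hosts;
> > > > adjust the launcher, process count and values to your setup):
> > > >
> > > >   mpirun_rsh -np 256 -hostfile hosts \
> > > >     MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 ./GEOSgcm.x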
> > > >
> > > > Let us know if this helps.
> > > >
> > > > 2. In the MVAPICH2 1.5 release, we also introduced a new Nemesis-IB interface.
> > > > This is based on Argonne's new Nemesis design. The following section in
> > > > MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
> > > > interface.
> > > >
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> > > >
> > > > The Nemesis-IB interface has a slightly different design for intra-node
> > > > communication and a different set of algorithms for collectives.
> > > > Currently, this interface is not as `feature-rich' as the `gen2' interface
> > > > you are using; however, it is based on a new design.  It would be helpful
> > > > if you could make a build with this interface and see whether it delivers
> > > > better performance for the collectives and the overall application.
> > > >
> > > > In the meantime, we will also take a look at the performance of the
> > > > collectives you have mentioned and get back to you by next week.
> > > >
> > > > Thanks,
> > > >
> > > > DK
> > > >
> > > > On Thu, 29 Jul 2010, Dan Kokron wrote:
> > > >
> > > > > Max Suarez asked me to respond to your questions and provide any support
> > > > > necessary to enable us to effectively use MVAPICH2 with our
> > > > > applications.
> > > > >
> > > > > We first noticed issues with performance when scaling the GEOS5 GCM to
> > > > > 720 processes.  We had been using Intel MPI (3.2.x) before switching to
> > > > > MVAPICH2 (1.4.1).  Walltimes (hh:mm:ss) for a test case are as follows
> > > > > for 256p, 512p and 720p using the indicated MPI.  All codes were compiled
> > > > > with the Intel-11.0.083 suite of compilers.  I have attached a text file
> > > > > with hardware and software stack information for the platform used in
> > > > > these tests (discover.HW_SWstack).
> > > > >
> > > > > GCM application run wall time
> > > > > Procs	mv2-1.4.1	iMPI-3.2.2.006	mv2-1.5-2010-07-22
> > > > > 256	00:23:45	00:15:53	00:22:57
> > > > > 512	00:26:45	00:11:06	00:13:58
> > > > > 720	00:43:12	00:11:28	00:16:15
> > > > >
> > > > > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> > > > >
> > > > > Next I instrumented the application with TAU
> > > > > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > > > > level timings.
> > > > >
> > > > > Results from the 256p, 512p and 720p runs show that the performance
> > > > > difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
> > > > > collective operations, specifically Scatterv, Gatherv and
> > > > > MPI_Allgatherv.
> > > > >
> > > > > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > > > > would be appreciated.
> > > > >
> > > > > Dan
> > > > >
> > > > > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > > > > Hi Max,
> > > > > >
> > > > > > Thanks for your note.
> > > > > >
> > > > > > >   We are having serious performance problems
> > > > > > > with collectives when using several hundred cores
> > > > > > > on the Discover system at NASA Goddard.
> > > > > >
> > > > > > Could you please let us know some more details on the performance problems
> > > > > > you are observing - which collectives, what data sizes, what system sizes,
> > > > > > etc.?
> > > > > >
> > > > > > > I noticed some fixes were made to collectives in 1.5.
> > > > > > > Would these help with scat/gath?
> > > > > >
> > > > > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > > > > changed for point-to-point operations (based on platform and adapter
> > > > > > characteristics) to obtain better performance. These changes will also
> > > > > > have a positive impact on the performance of collectives.
> > > > > >
> > > > > > Thus, I would suggest that you upgrade to 1.5 first. If the performance
> > > > > > issues with collectives still remain, we will be happy to debug them
> > > > > > further.
> > > > > >
> > > > > > > I noticed a couple of months ago someone reporting
> > > > > > > very poor performance in global sums:
> > > > > > >
> > > > > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > > > > >
> > > > > > > But the thread ends unresolved.
> > > > > >
> > > > > > Since the 1.5 release process overlapped with our examination of this
> > > > > > issue, we got context-switched. We will take a closer look at this issue
> > > > > > with the 1.5 version.
> > > > > >
> > > > > > > Has anyone else had these problems?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > DK
> > > > > >
> > > >
> >
> >
> 
-- 
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax:   (301) 614-5304


