[mvapich-discuss] performance problems with gath/scat

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Aug 5 13:26:24 EDT 2010


> Your 'corner-case' hypothesis regarding Allgatherv is probably correct
> as I am not able to reproduce the failure using IMB either.  The
> application is quite large and has many external dependencies (ESMF,
> NetCDF and others).  I'd like to pass a tarball with just the particular
> call and its arguments.  I will look into this.  Is there a checkpoint
> feature available for the Nemesis channel?

Thanks. We look forward to your tarball; it will help us debug the
problem.

At this point, the checkpoint feature is not available for the Nemesis
channel. It is on our roadmap.

> One other note regarding Nemesis.  I had another failure, this time at
> 512 processes.  The failure message was
> [ib_vbuf.c 256] Cannot register vbuf region

This error indicates that memory is being exhausted on the system and the
InfiniBand library is not able to register any new vbuf (an internal data
structure in MVAPICH2) to send messages. Is this happening with
MPI_Allgatherv or with the overall application? Do you see this error if
you reduce the application size (process count)? This may give us more
clues about what could be happening. Once again, if we can reproduce this
error with a code snippet or a tarball, we will be able to debug it
faster.
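
For reference, a standalone reproducer along the lines of the following
sketch would be ideal (the per-rank counts here are placeholders; the
actual counts and displacements from the GEOSgcm call would need to be
substituted):

    #include <mpi.h>
    #include <stdlib.h>

    /* Minimal MPI_Allgatherv reproducer sketch; the per-rank counts are
       placeholders for the actual GEOSgcm values. */
    int main(int argc, char **argv)
    {
        int rank, size, i, total;
        int *counts, *displs;
        double *sendbuf, *recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        counts = (int *) malloc(size * sizeof(int));
        displs = (int *) malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            counts[i] = 1024 + 512 * (i % 7);   /* varying per-rank sizes */
            displs[i] = (i == 0) ? 0 : displs[i - 1] + counts[i - 1];
        }
        total = displs[size - 1] + counts[size - 1];

        sendbuf = (double *) malloc(counts[rank] * sizeof(double));
        recvbuf = (double *) malloc(total * sizeof(double));
        for (i = 0; i < counts[rank]; i++)
            sendbuf[i] = (double) rank;

        MPI_Allgatherv(sendbuf, counts[rank], MPI_DOUBLE,
                       recvbuf, counts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }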

> I look forward to testing the new Scatterv/Gatherv code.

Thanks.

DK

> Dan
>
> On Thu, 2010-08-05 at 10:52 -0500, Dhabaleswar Panda wrote:
> > Hi Dan,
> >
> > > > The message sizes used for the three process counts are as follows
> > > >
> > > > Procs	Scatterv	Gatherv
> > > > 256	13-15K		13-14K
> > > > 512	7K		7K
> > > > 720	5K		5K
> > >
> > > Thanks for this information.
> > >
> > > > setting MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 did
> > > > improve the performance of the 256 case nicely.
> > > >
> > > > 00:22:57 -> 00:17:17
> > >
> > > Good to know that you are able to get good performance here by changing
> > > the above two parameters at run-time.
> >
> > We have analyzed the Scatterv and Gatherv algorithms being used in 1.5 and
> > think that they can be improved further to deliver better performance. We
> > are working on these improvements. We might be able to send you a patch
> > (or an updated tarball) by next week.
> >
> > > > FYI, the default eager/rendezvous crossover size in Intel MPI is 256KB
> > >
> > > This seems to be too high. We are analyzing this.
> > >
> > > > Nemesis results:
> > > >
> > > > I configured mva2-1.5 with the following and ran a synthetic MPI
> > > > benchmark.  I found a dramatic speedup in gatherv.
> > >
> > > This is very good to know.
> > >
> > > > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -O0
> > > > -traceback -debug" CXXFLAGS="-fpic -O0 -traceback -debug" FFLAGS="-fpic
> > > > -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv"
> > > > F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all
> > > > -fp-stack-check -ftrapuv"
> > > > --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug
> > > > --enable-error-checking=all --enable-error-messages=all --enable-g=all
> > > > --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> > > > --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc
> > > >
> > > > Unfortunately, the GEOSgcm application would not run under the nemesis
> > > > build at 256 processes.  Interestingly, it would run on 512 processes.
> > > >
> > > > Here are some stack traces of the failure on 256.  Any ideas?
> > >
> > > Thanks for the update here and sending us the stack traces. We are taking
> > > a look at these traces and trying to reproduce this error. We will get
> > > back to you on this soon.
> >
> > We tried the Allgatherv tests from the IMB suite on 256 cores (32 nodes
> > with 8 cores/node) and the tests pass.
> >
> > The failure could be due to a corner case being hit with the varying
> > message sizes used in the MPI_Allgatherv call.
> >
> > I do not know whether this GEOSgcm application is public. If this is
> > public, we will be happy to run it on our cluster and try to debug the
> > problem. Let us know how to get a copy of it.
> >
> > Alternatively, would it be possible for you to send us a code snippet
> > which uses the MPI_Allgatherv call from this application with the
> > corresponding message sizes? This would help us run the snippet and debug
> > the problem faster.
> >
> > > > 16x16
> > > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > > MPIR_Allgatherv(799):
> > > > (unknown)(): Other MPI error
> > > > MPI process (rank: 0) terminated unexpectedly on borgl065
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > Image              PC                Routine            Line        Source
> > > > GEOSgcm.x          0000000008539C15  MPIDI_nem_ib_get_         951  ib_poll.c
> > > > GEOSgcm.x          0000000008537969  MPIDI_nem_ib_read         316  ib_poll.c
> > > > GEOSgcm.x          0000000008537B96  MPID_nem_ib_poll          459  ib_poll.c
> > > > GEOSgcm.x          000000000852DC06  MPID_nem_network_          16  mpid_nem_network_poll.c
> > > > GEOSgcm.x          00000000084C4096  MPID_nem_mpich2_t         799  mpid_nem_inline.h
> > > > GEOSgcm.x          00000000084BE817  MPIDI_CH3I_Progre         148  ch3_progress.c
> > > > GEOSgcm.x          000000000845F120  MPIC_Wait                 512  helper_fns.c
> > > > GEOSgcm.x          000000000845D143  MPIC_Sendrecv             163  helper_fns.c
> > > > GEOSgcm.x          0000000008455CA6  MPIR_Allgatherv           793  allgatherv.c
> > > > GEOSgcm.x          0000000008456AE6  PMPI_Allgatherv          1082  allgatherv.c
> > > > GEOSgcm.x          000000000849B793  pmpi_allgatherv_          195  allgathervf.c
> > > >
> > > > 8x32
> > > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > > MPIR_Allgatherv(799):
> > > > (unknown)(): Other MPI error
> > > > MPI process (rank: 67) terminated unexpectedly on borgj004
> > > > Exit code -5 signaled from borgj004
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > Image              PC                Routine            Line        Source
> > > > libc.so.6          00002AFF28564B17  Unknown               Unknown  Unknown
> > > > GEOSgcm.x          00000000084CA06C  MPIDI_CH3I_Progre         100  ch3_progress.c
> > > > GEOSgcm.x          000000000846ABE0  MPIC_Wait                 512  helper_fns.c
> > > > GEOSgcm.x          0000000008468C03  MPIC_Sendrecv             163  helper_fns.c
> > > > GEOSgcm.x          0000000008461766  MPIR_Allgatherv           793  allgatherv.c
> > > > GEOSgcm.x          00000000084625A6  PMPI_Allgatherv          1082  allgatherv.c
> > > > GEOSgcm.x          00000000084A7253  pmpi_allgatherv_          195  allgathervf.c
> >
> > Thanks,
> >
> > DK
> >
> > >
> > > > On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> > > > > Hi Dan,
> > > > >
> > > > > Thanks for letting us know the details of the performance issues you are
> > > > > seeing. Good to know that MVAPICH2 1.5 is showing better performance
> > > > > compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> > > > > some of the pt-to-pt tunings we have done in 1.5.
> > > > >
> > > > > Here are some suggestions you can try to see if the performance for these
> > > > > collectives and the GCM application can be enhanced.
> > > > >
> > > > > 1. There are two runtime parameters in MVAPICH2 which control which MPI
> > > > > messages go through the eager protocol and which go through the rendezvous
> > > > > protocol. Messages going through the rendezvous protocol have higher
> > > > > overhead.  For different platforms and adapter types, default values are
> > > > > defined for these two parameters. However, it is hard to know whether these
> > > > > values match the application's characteristics.
> > > > >
> > > > > a. MV2_IBA_EAGER_THRESHOLD
> > > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> > > > >
> > > > > b. MV2_VBUF_TOTAL_SIZE
> > > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> > > > >
> > > > > Currently, both of these parameters are set to 12K for the ConnectX
> > > > > adapter. I do not know the exact adapter being used on the Discover system.
> > > > >
> > > > > If the average message sizes in the GCM collectives are higher than 12K,
> > > > > it might be helpful to run your application with both of these parameters
> > > > > set to a higher value (say 16K, 20K, ...). You can do this at run time;
> > > > > change both parameters simultaneously, as in the example below.
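> > > > >
> > > > > Assuming mpirun_rsh is used as the launcher (adjust the process count
> > > > > and hostfile to your setup), both values can be passed on the command
> > > > > line:
> > > > >
> > > > >   mpirun_rsh -np 256 -hostfile hosts \
> > > > >       MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 ./GEOSgcm.x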
> > > > >
> > > > > Let us know if this helps.
> > > > >
> > > > > 2. In the MVAPICH2 1.5 release, we also introduced a new Nemesis-IB
> > > > > interface. It is based on Argonne's new Nemesis design. The following
> > > > > section in the MVAPICH2 user guide shows how to configure a build for the
> > > > > Nemesis-IB interface.
> > > > >
> > > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> > > > >
> > > > > The Nemesis-IB interface has a somewhat different design for intra-node
> > > > > communication. It also has a different set of algorithms for collectives.
> > > > > Currently, this interface is not as `feature-rich' as the `gen2' interface
> > > > > you are using, but it is based on a new design. It would be helpful if you
> > > > > can make a build with this interface and see whether it delivers better
> > > > > performance for the collectives and the overall application.
> > > > >
> > > > > In the meantime, we will also take a look at the performance of the
> > > > > collectives you have mentioned and get back to you by next week.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > DK
> > > > >
> > > > > On Thu, 29 Jul 2010, Dan Kokron wrote:
> > > > >
> > > > > > Max Suarez asked me to respond to your questions and provide any support
> > > > > > necessary to enable us to effectively use MVAPICH2 with our
> > > > > > applications.
> > > > > >
> > > > > > We first noticed issues with performance when scaling the GEOS5 GCM to
> > > > > > 720 processes.  We had been using Intel MPI (3.2.x) before switching to
> > > > > > MVAPICH2 (1.4.1).  Walltimes (hh:mm:ss) for a test case are as follows
> > > > > > for 256p, 512p and 720p using the indicated MPI.  All codes were compiled
> > > > > > with the Intel-11.0.083 suite of compilers.  I have attached a text file
> > > > > > with hardware and software stack information for the platform used in
> > > > > > these tests (discover.HW_SWstack).
> > > > > >
> > > > > > GCM application run wall time
> > > > > > Procs	mv2-1.4.1 	iMPI-3.2.2.006 	mv2-1.5-2010-07-22
> > > > > > 256 	00:23:45 	00:15:53 	00:22:57
> > > > > > 512 	00:26:45 	00:11:06 	00:13:58
> > > > > > 720 	00:43:12 	00:11:28 	00:16:15
> > > > > >
> > > > > > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> > > > > >
> > > > > > Next I instrumented the application with TAU
> > > > > > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > > > > > level timings.
> > > > > >
> > > > > > Results from 256p, 512p and 720p runs show that the performance
> > > > > > difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
> > > > > > collective operations, specifically Scatterv, Gatherv and
> > > > > > MPI_Allgatherv.
> > > > > >
> > > > > > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > > > > > would be appreciated.
> > > > > >
> > > > > > Dan
> > > > > >
> > > > > > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > > > > > Hi Max,
> > > > > > >
> > > > > > > Thanks for your note.
> > > > > > >
> > > > > > > >   We are having serious performance problems
> > > > > > > > with collectives when using several hundred cores
> > > > > > > > on the Discover system at NASA Goddard.
> > > > > > >
> > > > > > > Could you please let us know some more details on the performance problems
> > > > > > > you are observing - which collectives, what data sizes, what system sizes,
> > > > > > > etc.?
> > > > > > >
> > > > > > > > I noticed some fixes were made to collectives in 1.5.
> > > > > > > > Would these help with scat/gath?
> > > > > > >
> > > > > > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > > > > > changed for point-to-point operations (based on platform and adapter
> > > > > > > characteristics) to obtain better performance. These changes will also
> > > > > > > have a positive impact on the performance of collectives.
> > > > > > >
> > > > > > > Thus, I suggest that you upgrade to 1.5 first. If the performance issues
> > > > > > > with collectives still remain, we will be happy to debug them further.
> > > > > > >
> > > > > > > > I noticed a couple of months ago someone reporting
> > > > > > > > very poor performance in global sums:
> > > > > > > >
> > > > > > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > > > > > >
> > > > > > > > But the thread ends unresolved.
> > > > > > >
> > > > > > > Since the 1.5 release procedure overlapped with the examination of this
> > > > > > > issue, we got context-switched. We will take a closer look at this issue
> > > > > > > with the 1.5 version.
> > > > > > >
> > > > > > > > Has anyone else had these problems?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > DK
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
> --
> Dan Kokron
> Global Modeling and Assimilation Office
> NASA Goddard Space Flight Center
> Greenbelt, MD 20771
> Daniel.S.Kokron at nasa.gov
> Phone: (301) 614-5192
> Fax:   (301) 614-5304
>


