[mvapich-discuss] performance problems with gath/scat
Dan Kokron
daniel.kokron at nasa.gov
Thu Aug 5 12:27:15 EDT 2010
Your 'corner-case' hypothesis regarding Allgatherv is probably correct,
as I am not able to reproduce the failure using IMB either. The
application is quite large and has many external dependencies (ESMF,
NetCDF and others). I'd like to pass along a tarball with just the
particular call and its arguments; I will look into this. Is there a
checkpoint feature available for the Nemesis channel?
One other note regarding Nemesis. I had another failure, this time at
512 processes. The failure message was
[ib_vbuf.c 256] Cannot register vbuf region
I look forward to testing the new Scatterv/Gatherv code.
Dan
On Thu, 2010-08-05 at 10:52 -0500, Dhabaleswar Panda wrote:
> Hi Dan,
>
> > > The message sizes used for the three process counts are as follows
> > >
> > > Procs   Scatterv   Gatherv
> > > 256     13-15K     13-14K
> > > 512     7K         7K
> > > 720     5K         5K
> >
> > Thanks for this information.
> >
> > > setting MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 did
> > > improve the performance of the 256 case nicely.
> > >
> > > 00:22:57 -> 00:17:17
> >
> > Good to know that you are able to get good performance here by changing
> > the above two parameters at run-time.
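[For reference, these two parameters can be set per job on the launcher
command line. A sketch, assuming the mpirun_rsh launcher shipped with
MVAPICH2; the hostfile name and process count are placeholders:]

```shell
# One-off run with a 16K eager threshold and matching vbuf size
# (the values that gave the 00:22:57 -> 00:17:17 improvement above).
# Change both parameters simultaneously.
mpirun_rsh -np 256 -hostfile ./hosts \
    MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 ./GEOSgcm.x
```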
>
> We have analyzed the Scatterv and Gatherv algorithms being used in 1.5 and
> think that they can be improved further to deliver better performance. We
> are working on these improvements. We might be able to send you a patch
> (or an updated tarball) by next week.
>
> > > FYI, the default eager/rendezvous crossover size in Intel MPI is 256KB
> >
> > This seems to be too high. We are analyzing this.
> >
> > > Nemesis results:
> > >
> > > I configured mvapich2-1.5 with the following and ran a synthetic MPI
> > > benchmark. I found a dramatic speedup in Gatherv.
> >
> > This is very good to know.
> >
> > > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -O0
> > > -traceback -debug" CXXFLAGS="-fpic -O0 -traceback -debug" FFLAGS="-fpic
> > > -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv"
> > > F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all
> > > -fp-stack-check -ftrapuv"
> > > --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug
> > > --enable-error-checking=all --enable-error-messages=all --enable-g=all
> > > --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> > > --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc
> > >
> > > Unfortunately, the GEOSgcm application would not run under the nemesis
> > > build at 256 processes. Interestingly, it would run on 512 processes.
> > >
> > > Here are some stack traces of the failure on 256. Any ideas?
> >
> > Thanks for the update here and sending us the stack traces. We are taking
> > a look at these traces and trying to reproduce this error. We will get
> > back to you on this soon.
>
> We tried the Allgatherv tests from the IMB suite on 256 cores (32 nodes
> with 8 cores/node), and the tests pass.
>
> The failure could be due to a corner case triggered by the varying
> message sizes used in the MPI_Allgatherv call.
>
> I do not know whether this GEOSgcm application is public. If this is
> public, we will be happy to run it on our cluster and try to debug the
> problem. Let us know how to get a copy of it.
>
> Alternatively, would it be possible for you to send us a code snippet
> that uses MPI_Allgatherv the way this application does, with the
> corresponding message sizes? That would help us run the snippet and
> debug the problem faster.
>
> > > 16x16
> > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > MPIR_Allgatherv(799):
> > > (unknown)(): Other MPI error
> > > MPI process (rank: 0) terminated unexpectedly on borgl065
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image PC Routine Line Source
> > > GEOSgcm.x 0000000008539C15 MPIDI_nem_ib_get_ 951 ib_poll.c
> > > GEOSgcm.x 0000000008537969 MPIDI_nem_ib_read 316 ib_poll.c
> > > GEOSgcm.x 0000000008537B96 MPID_nem_ib_poll 459 ib_poll.c
> > > GEOSgcm.x 000000000852DC06 MPID_nem_network_ 16 mpid_nem_network_poll.c
> > > GEOSgcm.x 00000000084C4096 MPID_nem_mpich2_t 799 mpid_nem_inline.h
> > > GEOSgcm.x 00000000084BE817 MPIDI_CH3I_Progre 148 ch3_progress.c
> > > GEOSgcm.x 000000000845F120 MPIC_Wait 512 helper_fns.c
> > > GEOSgcm.x 000000000845D143 MPIC_Sendrecv 163 helper_fns.c
> > > GEOSgcm.x 0000000008455CA6 MPIR_Allgatherv 793 allgatherv.c
> > > GEOSgcm.x 0000000008456AE6 PMPI_Allgatherv 1082 allgatherv.c
> > > GEOSgcm.x 000000000849B793 pmpi_allgatherv_ 195 allgathervf.c
> > >
> > > 8x32
> > > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > > MPIR_Allgatherv(799):
> > > (unknown)(): Other MPI error
> > > MPI process (rank: 67) terminated unexpectedly on borgj004
> > > Exit code -5 signaled from borgj004
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image PC Routine Line Source
> > > libc.so.6 00002AFF28564B17 Unknown Unknown Unknown
> > > GEOSgcm.x 00000000084CA06C MPIDI_CH3I_Progre 100 ch3_progress.c
> > > GEOSgcm.x 000000000846ABE0 MPIC_Wait 512 helper_fns.c
> > > GEOSgcm.x 0000000008468C03 MPIC_Sendrecv 163 helper_fns.c
> > > GEOSgcm.x 0000000008461766 MPIR_Allgatherv 793 allgatherv.c
> > > GEOSgcm.x 00000000084625A6 PMPI_Allgatherv 1082 allgatherv.c
> > > GEOSgcm.x 00000000084A7253 pmpi_allgatherv_ 195 allgathervf.c
>
> Thanks,
>
> DK
>
> >
> > > On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> > > > Hi Dan,
> > > >
> > > > Thanks for letting us know the details of the performance issues you are
> > > > seeing. Good to know that MVAPICH2 1.5 is showing better performance
> > > > compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> > > > some of the pt-to-pt tunings we have done in 1.5.
> > > >
> > > > Here are some suggestions you can try to see if the performance for these
> > > > collectives and the GCM application can be enhanced.
> > > >
> > > > 1. There are two runtime parameters in MVAPICH2 which control which MPI
> > > > messages go through the eager protocol and which go through the
> > > > rendezvous protocol. Messages going through the rendezvous protocol have
> > > > higher overhead. For different platforms and adapter types, default
> > > > values are defined for these two parameters. However, it is very hard to
> > > > know whether these values match the application's characteristics.
> > > >
> > > > a. MV2_IBA_EAGER_THRESHOLD
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> > > >
> > > > b. MV2_VBUF_TOTAL_SIZE
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> > > >
> > > > Currently, both these parameters are set to 12K for ConnectX adapter. I do
> > > > not know the exact adapter being used on the Discover system.
> > > >
> > > > If the average message sizes in the GCM collectives are higher than 12K,
> > > > it might be helpful to run your application with both parameters set to
> > > > a higher value (say 16K, 20K, ...). You can do this at run time; change
> > > > both parameters simultaneously.
> > > >
> > > > Let us know if this helps.
> > > >
> > > > 2. In MVAPICH2 1.5 release, we also introduced a new Nemesis-IB interface.
> > > > This is based on Argonne's new Nemesis design. The following section in
> > > > MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
> > > > interface.
> > > >
> > > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> > > >
> > > > The Nemesis-IB interface has a slightly different design for intra-node
> > > > communication. It also has a different set of algorithms for collectives.
> > > > Currently, this interface is not as `feature-rich' as the `gen2' interface
> > > > you are using. However, it is based on a new design. It would be helpful
> > > > if you could build with this interface and see if it delivers better
> > > > performance for the collectives and the overall application.
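[A minimal build along those lines might look like this; the install
prefix is a placeholder, and the user-guide section linked above has the
full set of options:]

```shell
# Configure MVAPICH2 1.5 with the Nemesis-IB channel (minimal sketch).
./configure --prefix=$HOME/mvapich2-1.5-nemesis \
    --with-device=ch3:nemesis:ib
make && make install
```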
> > > >
> > > > In the meantime, we will also take a look at the performance of the
> > > > collectives you have mentioned and get back to you by next week.
> > > >
> > > > Thanks,
> > > >
> > > > DK
> > > >
> > > > On Thu, 29 Jul 2010, Dan Kokron wrote:
> > > >
> > > > > Max Suarez asked me to respond to your questions and provide any support
> > > > > necessary to enable us to effectively use MVAPICH2 with our
> > > > > applications.
> > > > >
> > > > > We first noticed issues with performance when scaling the GEOS5 GCM to
> > > > > 720 processes. We had been using Intel MPI (3.2.x) before switching to
> > > > > MVAPICH2 (1.4.1). Walltimes (hh:mm:ss) for a test case are as follows
> > > > > for 256p, 512p and 720p using the indicated MPI library. All codes were
> > > > > compiled
> > > > > with the Intel-11.0.083 suite of compilers. I have attached a text file
> > > > > with hardware and software stack information for the platform used in
> > > > > these tests (discover.HW_SWstack).
> > > > >
> > > > > GCM application run wall time
> > > > > Procs   mv2-1.4.1   iMPI-3.2.2.006   mv2-1.5-2010-07-22
> > > > > 256     00:23:45    00:15:53         00:22:57
> > > > > 512     00:26:45    00:11:06         00:13:58
> > > > > 720     00:43:12    00:11:28         00:16:15
> > > > >
> > > > > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> > > > >
> > > > > Next I instrumented the application with TAU
> > > > > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > > > > level timings.
> > > > >
> > > > > Results from the 256p, 512p and 720p runs show that the performance
> > > > > difference between Intel MPI and MVAPICH2-1.5 is accounted for by
> > > > > collective operations, specifically Scatterv, Gatherv and
> > > > > MPI_Allgatherv.
> > > > >
> > > > > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > > > > would be appreciated.
> > > > >
> > > > > Dan
> > > > >
> > > > > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > > > > Hi Max,
> > > > > >
> > > > > > Thanks for your note.
> > > > > >
> > > > > > > We are having serious performance problems
> > > > > > > with collectives when using several hundred cores
> > > > > > > on the Discover system at NASA Goddard.
> > > > > >
> > > > > > Could you please let us know some more details on the performance problems
> > > > > > you are observing - which collectives, what data sizes, what system sizes,
> > > > > > etc.?
> > > > > >
> > > > > > > I noticed some fixes were made to collectives in 1.5.
> > > > > > > Would these help with scat/gath?
> > > > > >
> > > > > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > > > > changed for point-to-point operations (based on platform and adapter
> > > > > > characteristics) to obtain better performance. These changes should
> > > > > > also have a positive impact on the performance of collectives.
> > > > > >
> > > > > > Thus, I suggest you upgrade to 1.5 first. If the performance
> > > > > > issues with collectives remain, we will be happy to debug the issue
> > > > > > further.
> > > > > >
> > > > > > > I noticed a couple of months ago someone reporting
> > > > > > > very poor performance in global sums:
> > > > > > >
> > > > > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > > > > >
> > > > > > > But the thread ends unresolved.
> > > > > >
> > > > > > Since the 1.5 release procedure was getting overlapped with the
> > > > > > examination of this issue, we got context-switched. We will take a closer
> > > > > > look at this issue with 1.5 version.
> > > > > >
> > > > > > > Has anyone else had these problems?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > DK
> > > > > >
> > > > > > _______________________________________________
> > > > > > mvapich-discuss mailing list
> > > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > > --
> > > > > Dan Kokron
> > > > > Global Modeling and Assimilation Office
> > > > > NASA Goddard Space Flight Center
> > > > > Greenbelt, MD 20771
> > > > > Daniel.S.Kokron at nasa.gov
> > > > > Phone: (301) 614-5192
> > > > > Fax: (301) 614-5304
> > > > >
> > > >
> > >
> >
> >
>
--
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax: (301) 614-5304