[mvapich-discuss] performance problems with gath/scat
Dhabaleswar Panda
panda at cse.ohio-state.edu
Thu Aug 5 11:52:36 EDT 2010
Hi Dan,
> > The message sizes used for the three process counts are as follows
> >
> > Procs  Scatterv  Gatherv
> > 256    13-15K    13-14K
> > 512    7K        7K
> > 720    5K        5K
>
> Thanks for this information.
>
> > setting MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 did
> > improve the performance of the 256 case nicely.
> >
> > 00:22:57 -> 00:17:17
>
> Good to know that you are able to get good performance here by changing
> the above two parameters at run-time.
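This is consistent with the message sizes above: at 256 processes the 13-15K
messages exceeded the 12K default threshold and were taking the rendezvous
path, while the 7K and 5K messages at 512 and 720 processes were already
going eager. For reference, both parameters can be given on the mpirun_rsh
command line (the hostfile name here is only illustrative):

    mpirun_rsh -np 256 -hostfile hosts \
        MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 ./GEOSgcm.x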
We have analyzed the Scatterv and Gatherv algorithms being used in 1.5 and
think that they can be improved further to deliver better performance. We
are working on these improvements. We might be able to send you a patch
(or an updated tarball) by next week.
> > FYI, the default eager/rendezvous crossover size in Intel MPI is 256KB
>
> This seems to be too high. We are analyzing this.
>
> > Nemesis results:
> >
> > I configured mv2-1.5 with the following and ran a synthetic MPI
> > benchmark. I found a dramatic speedup in Gatherv.
>
> This is very good to know.
>
> > ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -O0
> > -traceback -debug" CXXFLAGS="-fpic -O0 -traceback -debug" FFLAGS="-fpic
> > -O0 -traceback -debug -nolib-inline -check all -fp-stack-check -ftrapuv"
> > F90FLAGS="-fpic -O0 -traceback -debug -nolib-inline -check all
> > -fp-stack-check -ftrapuv"
> > --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083_nemesis_debug
> > --enable-error-checking=all --enable-error-messages=all --enable-g=all
> > --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> > --enable-threads=default --with-device=ch3:nemesis:ib --with-hwloc
> >
> > Unfortunately, the GEOSgcm application would not run under the nemesis
> > build at 256 processes. Interestingly, it would run on 512 processes.
> >
> > Here are some stack traces of the failure on 256. Any ideas?
>
> Thanks for the update here and sending us the stack traces. We are taking
> a look at these traces and trying to reproduce this error. We will get
> back to you on this soon.
We tried the Allgatherv tests from the IMB suite on 256 cores (32 nodes with
8 cores/node) and the tests pass.
The failure could be due to a corner case triggered by the varying message
sizes used in the MPI_Allgatherv call.
I do not know whether the GEOSgcm application is public. If it is, we will be
happy to run it on our cluster and try to debug the problem. Let us know how
to get a copy of it.
Alternatively, would it be possible for you to send us a code snippet that
uses MPI_Allgatherv the way this application does, with representative
message sizes? This would help us run the snippet and debug the problem
faster.
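For example, something along these lines would be enough. This is only a
minimal sketch; the per-rank counts below are placeholders, since we do not
know the actual distribution GEOSgcm uses:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal MPI_Allgatherv sketch with non-uniform counts.
       Replace the placeholder counts with the per-rank counts and
       displacements that GEOSgcm actually uses. */
    int main(int argc, char **argv)
    {
        int rank, size, i, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *recvcounts = malloc(size * sizeof(int));
        int *displs     = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            recvcounts[i] = 1024 + 512 * (i % 3);  /* placeholder variation */
            displs[i] = total;
            total += recvcounts[i];
        }

        double *sendbuf = malloc(recvcounts[rank] * sizeof(double));
        double *recvbuf = malloc(total * sizeof(double));
        for (i = 0; i < recvcounts[rank]; i++)
            sendbuf[i] = (double)rank;

        MPI_Allgatherv(sendbuf, recvcounts[rank], MPI_DOUBLE,
                       recvbuf, recvcounts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);

        if (rank == 0)
            printf("Allgatherv of %d doubles completed\n", total);

        free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
        MPI_Finalize();
        return 0;
    }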
> > 16x16
> > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > MPIR_Allgatherv(799):
> > (unknown)(): Other MPI error
> > MPI process (rank: 0) terminated unexpectedly on borgl065
> > forrtl: error (69): process interrupted (SIGINT)
> > Image      PC                Routine            Line  Source
> > GEOSgcm.x  0000000008539C15  MPIDI_nem_ib_get_  951   ib_poll.c
> > GEOSgcm.x  0000000008537969  MPIDI_nem_ib_read  316   ib_poll.c
> > GEOSgcm.x  0000000008537B96  MPID_nem_ib_poll   459   ib_poll.c
> > GEOSgcm.x  000000000852DC06  MPID_nem_network_  16    mpid_nem_network_poll.c
> > GEOSgcm.x  00000000084C4096  MPID_nem_mpich2_t  799   mpid_nem_inline.h
> > GEOSgcm.x  00000000084BE817  MPIDI_CH3I_Progre  148   ch3_progress.c
> > GEOSgcm.x  000000000845F120  MPIC_Wait          512   helper_fns.c
> > GEOSgcm.x  000000000845D143  MPIC_Sendrecv      163   helper_fns.c
> > GEOSgcm.x  0000000008455CA6  MPIR_Allgatherv    793   allgatherv.c
> > GEOSgcm.x  0000000008456AE6  PMPI_Allgatherv    1082  allgatherv.c
> > GEOSgcm.x  000000000849B793  pmpi_allgatherv_   195   allgathervf.c
> >
> > 8x32
> > Fatal error in MPI_Allgatherv: Other MPI error, error stack:
> > MPIR_Allgatherv(799):
> > (unknown)(): Other MPI error
> > MPI process (rank: 67) terminated unexpectedly on borgj004
> > Exit code -5 signaled from borgj004
> > forrtl: error (69): process interrupted (SIGINT)
> > Image      PC                Routine            Line     Source
> > libc.so.6  00002AFF28564B17  Unknown            Unknown  Unknown
> > GEOSgcm.x  00000000084CA06C  MPIDI_CH3I_Progre  100      ch3_progress.c
> > GEOSgcm.x  000000000846ABE0  MPIC_Wait          512      helper_fns.c
> > GEOSgcm.x  0000000008468C03  MPIC_Sendrecv      163      helper_fns.c
> > GEOSgcm.x  0000000008461766  MPIR_Allgatherv    793      allgatherv.c
> > GEOSgcm.x  00000000084625A6  PMPI_Allgatherv    1082     allgatherv.c
> > GEOSgcm.x  00000000084A7253  pmpi_allgatherv_   195      allgathervf.c
Thanks,
DK
>
> > On Thu, 2010-07-29 at 16:49 -0500, Dhabaleswar Panda wrote:
> > > Hi Dan,
> > >
> > > Thanks for letting us know the details of the performance issues you are
> > > seeing. Good to know that MVAPICH2 1.5 is showing better performance
> > > compared to MVAPICH2 1.4.1 as the system size scales. This is because of
> > > some of the pt-to-pt tunings we have done in 1.5.
> > >
> > > Here are some suggestions you can try to see if the performance for these
> > > collectives and the GCM application can be enhanced.
> > >
> > > 1. There are two run-time parameters in MVAPICH2 that control which MPI
> > > messages go through the eager protocol and which go through the rendezvous
> > > protocol. Messages going through the rendezvous protocol have higher
> > > overhead. Default values are defined for these two parameters for different
> > > platforms and adapter types. However, it is very hard to know whether these
> > > defaults match the application's characteristics.
> > >
> > > a. MV2_IBA_EAGER_THRESHOLD
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21
> > >
> > > b. MV2_VBUF_TOTAL_SIZE
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86
> > >
> > > Currently, both of these parameters are set to 12K for ConnectX adapters. I
> > > do not know the exact adapter being used on the Discover system.
> > >
> > > If the average message sizes in the GCM collectives are higher than 12K, it
> > > might be helpful to run your application with both of these parameters set
> > > to a higher value (say 16K or 20K). You can do this at run time; change both
> > > parameters simultaneously.
> > >
> > > Let us know if this helps.
> > >
> > > 2. In the MVAPICH2 1.5 release, we also introduced a new Nemesis-IB
> > > interface, based on Argonne's new Nemesis design. The following section of
> > > the MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
> > > interface.
> > >
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
> > >
> > > The Nemesis-IB interface has a somewhat different design for intra-node
> > > communication. It also has a different set of algorithms for collectives.
> > > Currently, this interface is not as `feature-rich' as the `gen2' interface
> > > you are using. However, it is based on a new design. It would be helpful if
> > > you could make a build with this interface and see whether it delivers
> > > better performance for the collectives and the overall application.
> > >
> > > In the meantime, we will also take a look at the performance of the
> > > collectives you have mentioned and get back to you by next week.
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > > On Thu, 29 Jul 2010, Dan Kokron wrote:
> > >
> > > > Max Suarez asked me to respond to your questions and provide any support
> > > > necessary to enable us to effectively use MVAPICH2 with our
> > > > applications.
> > > >
> > > > We first noticed issues with performance when scaling the GEOS5 GCM to
> > > > 720 processes. We had been using Intel MPI (3.2.x) before switching to
> > > > MVAPICH2 (1.4.1). Wall times (hh:mm:ss) for a test case are as follows
> > > > for 256p, 512p and 720p using the indicated MPI. All codes were compiled
> > > > with the Intel-11.0.083 suite of compilers. I have attached a text file
> > > > with hardware and software stack information for the platform used in
> > > > these tests (discover.HW_SWstack).
> > > >
> > > > GCM application run wall time (hh:mm:ss)
> > > > Procs  mv2-1.4.1  iMPI-3.2.2.006  mv2-1.5-2010-07-22
> > > > 256    00:23:45   00:15:53        00:22:57
> > > > 512    00:26:45   00:11:06        00:13:58
> > > > 720    00:43:12   00:11:28        00:16:15
> > > >
> > > > The test with the mv2-1.5 nightly snapshot was run at your suggestion.
> > > >
> > > > Next I instrumented the application with TAU
> > > > (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
> > > > level timings.
> > > >
> > > > Results from the 256p, 512p and 720p runs show that the performance
> > > > difference between Intel MPI and MVAPICH2-1.5 is accounted for by
> > > > collective operations, specifically MPI_Scatterv, MPI_Gatherv and
> > > > MPI_Allgatherv.
> > > >
> > > > Any suggestions for further tuning of mv2-1.5 for our particular needs
> > > > would be appreciated.
> > > >
> > > > Dan
> > > >
> > > > On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > > > > Hi Max,
> > > > >
> > > > > Thanks for your note.
> > > > >
> > > > > > We are having serious performance problems
> > > > > > with collectives when using several hundred cores
> > > > > > on the Discover system at NASA Goddard.
> > > > >
> > > > > Could you please let us know some more details on the performance problems
> > > > > you are observing - which collectives, what data sizes, what system sizes,
> > > > > etc.?
> > > > >
> > > > > > I noticed some fixes were made to collectives in 1.5.
> > > > > > Would these help with scat/gath?
> > > > >
> > > > > In 1.5, in addition to some fixes in collectives, several thresholds were
> > > > > changed for point-to-point operations (based on platform and adapter
> > > > > characteristics) to obtain better performance. These changes will also
> > > > > have a positive impact on the performance of collectives.
> > > > >
> > > > > Thus, I suggest you upgrade to 1.5 first. If the performance issues
> > > > > with collectives remain, we will be happy to debug them further.
> > > > >
> > > > > > I noticed a couple of months ago someone reporting
> > > > > > very poor performance in global sums:
> > > > > >
> > > > > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > > > > >
> > > > > > But the thread ends unresolved.
> > > > >
> > > > > Since the 1.5 release procedure overlapped with the examination of this
> > > > > issue, we got context-switched. We will take a closer look at this issue
> > > > > with the 1.5 version.
> > > > >
> > > > > > Has anyone else had these problems?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > DK
> > > > >
> > > > --
> > > > Dan Kokron
> > > > Global Modeling and Assimilation Office
> > > > NASA Goddard Space Flight Center
> > > > Greenbelt, MD 20771
> > > > Daniel.S.Kokron at nasa.gov
> > > > Phone: (301) 614-5192
> > > > Fax: (301) 614-5304
> > > >
> > >
> > --
> > Dan Kokron
> > Global Modeling and Assimilation Office
> > NASA Goddard Space Flight Center
> > Greenbelt, MD 20771
> > Daniel.S.Kokron at nasa.gov
> > Phone: (301) 614-5192
> > Fax: (301) 614-5304
> >
>