[mvapich-discuss] performance problems with gath/scat

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Jul 29 17:49:02 EDT 2010


Hi Dan,

Thanks for letting us know the details of the performance issues you are
seeing. Good to know that MVAPICH2 1.5 is showing better performance
compared to MVAPICH2 1.4.1 as the system size scales. This is because of
some of the pt-to-pt tunings we have done in 1.5.

Here are some suggestions you can try to see whether the performance of
these collectives, and of the GCM application overall, can be improved.

1. There are two runtime parameters in MVAPICH2 which control which MPI
messages go through the eager protocol and which go through the rendezvous
protocol. Messages going through the rendezvous protocol have higher
overhead. For different platforms and adapter types, default values are
defined for these two parameters. However, it is very hard to know whether
these values match the application's characteristics.

a. MV2_IBA_EAGER_THRESHOLD
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-11000011.21

b. MV2_VBUF_TOTAL_SIZE
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-17500011.86

Currently, both of these parameters are set to 12K for ConnectX adapters. I
do not know the exact adapter being used on the Discover system.

If the average message sizes in the GCM collectives are larger than 12K,
it might be helpful to run your application with both of these parameters
set to a higher value (say 16K, 20K, ...). You can do this at run time;
just be sure to change both parameters simultaneously, as in the sketch
below.
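
As a rough example, with mpirun_rsh the parameters can be passed on the
command line (the process count, hostfile name and executable name below
are only placeholders for your actual job setup):

  mpirun_rsh -np 256 -hostfile ./hosts \
      MV2_IBA_EAGER_THRESHOLD=16384 MV2_VBUF_TOTAL_SIZE=16384 ./geos5_gcm.x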

Let us know if this helps.

2. In the MVAPICH2 1.5 release, we also introduced a new Nemesis-IB
interface, based on Argonne's new Nemesis design. The following section of
the MVAPICH2 user guide shows how to configure a build for the Nemesis-IB
interface.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-110004.5
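
As a rough sketch, the configuration step looks like the following (the
install prefix here is only a placeholder; please follow the user guide
section above for the exact, up-to-date options):

  ./configure --prefix=/path/to/install --with-device=ch3:nemesis:ib
  make && make install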

The Nemesis-IB interface has a somewhat different design for intra-node
communication and a different set of algorithms for collectives. Currently,
this interface is not as `feature-rich' as the `gen2' interface you are
using, but it is based on a new design. It would be helpful if you could
make a build with this interface and see whether it delivers better
performance for the collectives and the overall application.

In the meantime, we will also take a look at the performance of the
collectives you have mentioned and get back to you by next week.

Thanks,

DK

On Thu, 29 Jul 2010, Dan Kokron wrote:

> Max Suarez asked me to respond to your questions and provide any support
> necessary to enable us to effectively use MVAPICH2 with our
> applications.
>
> We first noticed issues with performance when scaling the GEOS5 GCM to
> 720 processes.  We had been using Intel MPI (3.2.x) before switching to
> MVAPICH2 (1.4.1).  Walltimes (hh:mm:ss) for a test case are as follows
> for 256p, 512p and 720p using the indicated MPI.  All codes were compiled
> with the Intel-11.0.083 suite of compilers.  I have attached a text file
> with hardware and software stack information for the platform used in
> these tests (discover.HW_SWstack).
>
> GCM application run wall time (hh:mm:ss)
> Procs   mv2-1.4.1   iMPI-3.2.2.006   mv2-1.5-2010-07-22
> 256     00:23:45    00:15:53        00:22:57
> 512     00:26:45    00:11:06        00:13:58
> 720     00:43:12    00:11:28        00:16:15
>
> The test with the mv2-1.5 nightly snapshot was run at your suggestion.
>
> Next I instrumented the application with TAU
> (http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine-level
> timings.
>
> Results from the 256p, 512p and 720p runs show that the performance
> difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
> collective operations, specifically MPI_Scatterv, MPI_Gatherv and
> MPI_Allgatherv.
>
> Any suggestions for further tuning of mv2-1.5 for our particular needs
> would be appreciated.
>
> Dan
>
> On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> > Hi Max,
> >
> > Thanks for your note.
> >
> > >   We are having serious performance problems
> > > with collectives when using several hundred cores
> > > on the Discover system at NASA Goddard.
> >
> > Could you please let us know some more details on the performance problems
> > you are observing - which collectives, what data sizes, what system sizes,
> > etc.?
> >
> > > I noticed some fixes were made to collectives in 1.5.
> > > Would these help with scat/gath?
> >
> > In 1.5, in addition to some fixes in collectives, several thresholds were
> > changed for point-to-point operations (based on platform and adapter
> > characteristics) to obtain better performance. These changes will also
> > have a positive impact on the performance of collectives.
> >
> > Thus, I would suggest that you upgrade to 1.5 first. If the performance
> > issues with collectives remain, we will be happy to debug them
> > further.
> >
> > > I noticed a couple of months ago someone reporting
> > > very poor performance in global sums:
> > >
> > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> > >
> > > But the thread ends unresolved.
> >
> > Since the 1.5 release procedure overlapped with the examination of that
> > issue, we got context-switched. We will take a closer look at it with
> > the 1.5 version.
> >
> > > Has anyone else had these problems?
> >
> > Thanks,
> >
> > DK
> >
> --
> Dan Kokron
> Global Modeling and Assimilation Office
> NASA Goddard Space Flight Center
> Greenbelt, MD 20771
> Daniel.S.Kokron at nasa.gov
> Phone: (301) 614-5192
> Fax:   (301) 614-5304
>


