[mvapich-discuss] Single node performance issue

Mayer, Benjamin W. mayerbw at ornl.gov
Mon Oct 24 20:51:14 EDT 2016


Using MV2_USE_SHARED_MEM=0 is about 30x slower than the previous runs.


I rebuilt with:

./configure --with-pm=hydra \
    --with-pbs=/opt/torque \
    --with-rdma=gen2 \
    --prefix=$SW_BLDDIR


The MV2_USE_SHARED_MEM=0 case was run 10 times; the result shown below is representative of the sample.
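
For reference, a minimal sketch of how the two cases can be launched with the Hydra process manager from the build above, assuming osu_allgather from the OSU micro-benchmarks sits in the current directory (an assumption, not taken from the runs above):

# 32 ranks on one node, intra-node shared memory enabled (the default)
mpiexec -n 32 ./osu_allgather

# Same run with intra-node shared memory disabled
mpiexec -n 32 -env MV2_USE_SHARED_MEM 0 ./osu_allgather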


Shared memory = 1

# OSU MPI Allgather Latency Test v5.3.2
# Size       Avg Latency(us)
1                       4.46
2                       4.42
4                       4.46
8                       4.56
16                      4.43
32                      4.86
64                      6.04
128                     7.20
256                    11.67
512                    16.49
1024                   25.16
2048                   76.92
4096                   99.80
8192                  306.14
16384                 409.94
32768                 827.37
65536                1716.86
131072               2108.85
262144               4060.77
524288               8127.10
1048576             18490.57


Shared memory = 0 (MV2_USE_SHARED_MEM=0)

# OSU MPI Allgather Latency Test v5.3.2
# Size       Avg Latency(us)
1                     121.62
2                     141.59
4                     121.11
8                     124.60
16                    126.33
32                    125.98
64                    133.60
128                   146.16
256                   180.88
512                   206.15
1024                  279.51
2048                  572.69
4096                  934.59
8192                 1892.55
16384                3215.50
32768                6456.95
65536               12965.76
131072              27705.45
262144              54529.73
524288             109218.00
1048576            216808.96
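
As a rough check on the "about 30x" figure: at the smallest sizes the slowdown works out to roughly 27x (121.62 us vs 4.46 us) and narrows to about 12x at 1 MB. A throwaway sketch for computing the per-size ratio, assuming the two outputs above are saved under the placeholder names shm1.txt and shm0.txt:

# Join the two outputs line by line; print size, both latencies, and the slowdown
paste shm1.txt shm0.txt | awk 'NF == 4 { printf "%8s %10.2f %10.2f %7.1fx\n", $1, $2, $4, $4 / $2 }'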



________________________________
From: hari.subramoni at gmail.com <hari.subramoni at gmail.com> on behalf of Hari Subramoni <subramoni.1 at osu.edu>
Sent: Monday, October 24, 2016 6:54 PM
To: Mayer, Benjamin W.
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Single node performance issue

Hello,

Sorry to hear that you're getting performance variations. Looks like there are a few issues here.

Please do not use "--with-device=ch3:nemesis:ib", as it is not the default communication channel. You don't need to specify any extra configure options to build for the InfiniBand channel; please see the following section of the userguide for more information. That said, the extra flags you've used will not negatively affect performance.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-120004.4
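
For example, a stripped-down build along those lines could look like the sketch below; it reuses the prefix and Torque path from your configure lines, and configure then picks the default OFA-IB-CH3 (ch3:mrail) device on its own:

./configure --prefix=$SW_BLDDIR \
    --with-pm=hydra \
    --with-pbs=/opt/torque
make -j8 && make install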

Can you try running the single node case after disabling shared memory support (MV2_USE_SHARED_MEM=0)?

Regards,
Hari.

On Mon, Oct 24, 2016 at 10:25 AM, Mayer, Benjamin W. <mayerbw at ornl.gov> wrote:

We are seeing a performance issue with MVAPICH2 2.2 and 2.2rc1 on a single node, and worse behavior on multiple nodes. This behavior is not seen when using OpenMPI.



All data comes from running the OSU Allgather microbenchmark with sample sizes of 50-100 instances per configuration. MVAPICH2 has been compiled and tested with Intel 2017.0.098, Intel 2016.1, and GNU 5.3.0. The machine has Mellanox CX4 adaptors.
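
For completeness, the device and configure line of each build can be double-checked with the mpiname utility that MVAPICH2 installs next to mpicc; a minimal check, assuming the install prefix used here:

# Prints the MVAPICH2 version, configured device, and configure options
$SW_BLDDIR/bin/mpiname -a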



With MVAPICH2 2.2 on a single node (32 tasks, 1 thread, ~50 samples), we see high variability at small data sizes. A large percentage of runs have normal run times (4-5 us), but a moderate number are up to 2x slower (10 us), a handful are extreme outliers (10,000 us) at each data size, and a small number have to be killed because they never finish.



MVAPICH2 2.2rc1 in the same configuration (80 samples) behaves similarly, except that it has about 4x the rate of extreme outliers and is generally a bit slower once the outliers are removed.



OpenMPI in the same configuration (100 samples) has no outliers and the expected level of performance.



A small number of runs have also been performed across 32 nodes, where MVAPICH2 2.2 performance is much worse; for example, at the 16k data size the time was 134,000 us.



For the above, I have raw data and plots that I can share if those would be helpful.



The configuration for 2.2rc1:

./configure --prefix=$SW_BLDDIR \
    --with-pbs=/opt/torque \
    --enable-fortran=yes \
    --enable-cxx \
    --with-device=ch3:mrail \
    --with-rdma=gen2



The configuration for 2.2:

./configure --prefix=$SW_BLDDIR \
    --with-pbs=/opt/torque \
    --with-pm=hydra \
    --with-device=ch3:mrail \
    --with-rdma=gen2 \
    --with-hwloc



We have also tried a new configuration with 2.2 that explicitly calls out the IB interface.

./configure --prefix=$SW_BLDDIR \
    --with-pbs=/opt/torque \
    --with-pm=hydra \
    --with-device=ch3:nemesis:ib \
    --with-hwloc \
    --with-rdma=gen2



With this configuration, the application terminates with a bus error.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 42148 RUNNING AT mod-pbs-c01.ornl.gov
=   EXIT CODE: 7
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
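
In case it helps narrow down the bus error, one option is to allow core dumps before the run and open the resulting core in gdb; a rough sketch, with the core file name (here taken from the PID in the log above) and its location being system-dependent:

# Allow core dumps so the bus error leaves a core file behind
ulimit -c unlimited
mpiexec -n 32 ./osu_allgather

# Inspect the faulting frame (core file name/location depends on core_pattern)
gdb ./osu_allgather core.42148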



- What is the likely solution to the single-node performance issue?

- What configuration should be given to use the IB adaptors?


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

