[mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Raghu Reddy
raghu.reddy at noaa.gov
Mon Mar 5 13:22:17 EST 2018
Hi Adam and Hari,
Following up on this thread, I ran a few tests experimenting with some of the tuning variables suggested at the following link:
http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/#_tuning_and_usage_parameters
Since there are multiple combinations possible, I will include the key environment variables along with the output of each. But here is a brief summary:
The hardware and software environment was summarized before; I will add that the OFED stack used for these tests is the one that comes with RHEL 7.4.
Currently Loaded Modules:
1) intel/18.1.163 2) cuda/8.0 3) mvapich2-gdr/2.2-4-cuda-8.0-intel
Running osu_bw from osu-micro-benchmarks-5.3.2, and I will first summarize three runs:
- Host-Host transfers, to get an idea about the best achievable numbers (does not involve GPUs)
- Device-Device with MV2_USE_GPUDIRECT_GDRCOPY=1
- Device-Device with MV2_USE_GPUDIRECT_GDRCOPY=0
sfe01% cat osu_bw.summary
osu_bw:
The first data column (after the message size) is host-to-host between two nodes;
the second and third data columns are device-to-device between two nodes,
with MV2_USE_GPUDIRECT_GDRCOPY=1 and =0 respectively:
Bytes host-host dev-dev-1 dev-dev-0
1 1.83 0.75 0.03
2 3.70 1.41 0.06
4 7.56 3.03 0.12
8 14.74 5.61 0.24
16 29.89 7.26 0.47
32 57.65 0.99 0.94
64 112.11 1.99 1.89
128 214.41 3.92 3.77
256 422.24 7.91 7.55
512 813.87 15.77 15.05
1024 1506.18 31.41 29.99
2048 2664.97 62.50 59.58
4096 3399.11 124.17 118.58
8192 3238.29 380.02 421.15
16384 3642.88 966.68 451.32
32768 3732.85 1545.72 1330.50
65536 3779.08 2043.06 1835.08
131072 3814.95 2283.34 2148.80
262144 3840.66 1947.62 1922.57
524288 3854.33 3823.84 3747.85
1048576 3864.26 3844.70 3810.02
2097152 3868.47 3856.94 3838.38
4194304 3870.53 3862.78 3853.08
sfe01%
The summary above is the same information I had included in my previous e-mail on this thread; the only change is that I added the host-to-host column.
Then I tried a few of the environment variables from the link above; the actual output is included below. But here is a brief summary:
- MV2_USE_CUDA is required for Device-Device transfers, which is of course expected.
- An MV2_CUDA_BLOCK_SIZE setting of 8k caused transfers over 8k to fail, so I set it back to the default for this test.
- Setting MV2_GPUDIRECT_LIMIT to 0 gave better performance than the default value of 8192, with a less drastic drop-off at 32 bytes than seen above with the default values.
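For reference, here is a sketch of the settings for the best-performing run, written in sh syntax for portability (my actual shell is tcsh, and host names and paths are specific to my site):

```shell
#!/bin/sh
# Sketch of the environment for the MV2_GPUDIRECT_LIMIT=0 run above.
export MV2_USE_CUDA=1                # required for device-to-device transfers
export MV2_USE_GPUDIRECT_GDRCOPY=1   # keep GDRCopy enabled for small messages
export MV2_GPUDIRECT_LIMIT=0         # performed better than the default 8192
export MV2_CUDA_BLOCK_SIZE=262144    # the default; 8192 caused failures

# Echo the benchmark command rather than running it, since it needs two hosts:
echo "mpirun -np 2 -hosts sg001,sg002 osu_bw -d cuda D D"
```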
Some of the output is included below, along with a sample command I used to gather it, to provide context for what you're seeing (my default shell is tcsh):
sg001% ( set echo ; env | egrep 'CUDA|GDR|MV2' ; env LD_PRELOAD=$MPIROOT/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D ) | & tee out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_GPUDIRECT_LIMIT-0
sfe01% cat out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_CUDA_BLOCK_SIZE-8192
env
egrep --color=auto CUDA|GDR|MV2
CUDALIBDIR=/apps/cuda/cuda-8.0/lib64
CUDA_INCLUDE_OPTS=-I /apps/cuda/cuda-8.0/include
CUDA_PATH=/apps/cuda/cuda-8.0
CUDA_ROOT=/apps/cuda/cuda-8.0
GDRCOPY_ROOT=/apps/gdrcopy/0.0.0
MPICH_RDMA_ENABLED_CUDA=1
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/gdrcopy/0.0.0/libgdrapi.so
MV2_USE_CUDA=1
MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_CUDA_BLOCK_SIZE=8192
env LD_PRELOAD=/apps/mvapich2-gdr/2.2-4/cuda8.0-intel/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
1 0.76
2 1.41
4 3.04
8 5.64
16 7.39
32 1.05
64 2.11
128 4.19
256 8.38
512 16.68
1024 33.16
2048 65.88
4096 131.43
[sg002:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[sg001:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[sg001:mpi_rank_0][handle_cqe] Msg from 1: wc.status=9, wc.wr_id=0x1e10040, wc.opcode=0, vbuf->phead->type=20 = MPIDI_CH3_PKT_PACKETIZED_SEND_START
[sg001:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:547: [] Got completion with error 9, vendor code=0x8a, dest rank=1
: Bad address (14)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 127904 RUNNING AT sg002
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
sfe01%
sfe01% cat out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_GPUDIRECT_LIMIT-0
env
egrep --color=auto CUDA|GDR|MV2
CUDALIBDIR=/apps/cuda/cuda-8.0/lib64
CUDA_INCLUDE_OPTS=-I /apps/cuda/cuda-8.0/include
CUDA_PATH=/apps/cuda/cuda-8.0
CUDA_ROOT=/apps/cuda/cuda-8.0
GDRCOPY_ROOT=/apps/gdrcopy/0.0.0
MPICH_RDMA_ENABLED_CUDA=1
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/gdrcopy/0.0.0/libgdrapi.so
MV2_USE_CUDA=1
MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_CUDA_BLOCK_SIZE=262144
MV2_GPUDIRECT_LIMIT=0
env LD_PRELOAD=/apps/mvapich2-gdr/2.2-4/cuda8.0-intel/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
1 0.76
2 1.41
4 3.04
8 5.63
16 7.28
32 3.01
64 6.07
128 12.02
256 24.14
512 46.16
1024 93.68
2048 181.33
4096 348.49
8192 382.70
16384 981.47
32768 1547.27
65536 2048.77
131072 2332.59
262144 2211.07
524288 3823.78
1048576 3845.12
2097152 3857.25
4194304 3862.84
sfe01%
I’m sharing this information to find out how others are normally using these settings and what kind of performance they’re getting.
Thanks,
Raghu
From: Subramoni, Hari [mailto:subramoni.1 at osu.edu]
Sent: Tuesday, February 27, 2018 9:48 PM
To: Moody, Adam T.; Raghu Reddy; mvapich-discuss at cse.ohio-state.edu
Cc: Subramoni, Hari
Subject: RE: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Hi, Adam.
Yes, tuning *may be* possible. It depends on various factors including the type of GPU available and its hardware characteristics. We will discuss this and get back to you.
Thx,
Hari.
From: Moody, Adam T. [mailto:moody20 at llnl.gov]
Sent: Tuesday, February 27, 2018 7:18 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; Raghu Reddy <raghu.reddy at noaa.gov>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Just browsing the GDRCOPY=1 numbers, it looks like the bandwidth drop-off after 16 bytes is sharp.
Can that be smoothed out by moving a threshold up to higher byte counts?
-Adam
From: "mvapich-discuss-bounces at cse.ohio-state.edu" <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Tuesday, February 27, 2018 at 8:15 AM
To: Raghu Reddy <raghu.reddy at noaa.gov>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Hello,
GDRCopy is meant for the very small message range. So the behavior is expected.
GPUDirect RDMA and GDRCopy are two different technologies; one is not needed for the other to work. Both technologies need certain drivers from NVIDIA and Mellanox to be installed. Without these installed they will not work, and thus MVAPICH2-GDR will not be able to take advantage of them. To the best of our knowledge, GPUDirect RDMA will need MLNX_OFED and will not work with a non-MLNX_OFED stack.
From your performance numbers, it looks like you have the GDRCopy module installed; hence MVAPICH2-GDR is able to take advantage of GDRCopy to deliver better performance for the smaller message range.
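As a quick sanity check, one way to see whether the relevant kernel modules are present on a node (the module names here are assumptions: gdrdrv ships with gdrcopy, nv_peer_mem with the Mellanox GPUDirect RDMA package):

```shell
#!/bin/sh
# Sketch: report whether the kernel modules that GDRCopy and GPUDirect RDMA
# rely on are currently loaded. Module names are assumptions, not taken
# from this thread: gdrdrv (gdrcopy) and nv_peer_mem (GPUDirect RDMA).
check_mod() {
    if lsmod 2>/dev/null | grep -q "^$1 "; then
        echo "$1: loaded"
    else
        echo "$1: not loaded"
    fi
}

check_mod gdrdrv
check_mod nv_peer_mem
```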
Please let me know if you have any other questions.
Thx,
Hari.
From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Raghu Reddy
Sent: Tuesday, February 27, 2018 9:16 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
In the context of our cluster configuration:
Hardware:
Intel Haswell processors, 2 sockets/node (20 cores/node)
8 P100 GPUs – 4 GPUs connected to socket 0 and 4 GPUs connected to socket 1
Single rail MLNX QDR fabric connected to socket 1
Software:
Running RHEL 7.4
Using stock OFED (Later there is a question about whether MLNX OFED is required)
Intel 18.1 compiler
Mvapich2-GDR/2.2-4 Intel version
As an aside, the reason for sticking with stock OFED is that we have a mixed environment: the non-GPU part of the machine, about 1K nodes, has an Intel TrueScale fabric, while about 100 nodes have a Mellanox fabric, and we would prefer to have a single image for all the nodes.
I am looking at the following documentation, specifically section 7:
http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/
From my understanding, since P100 GPUs have the GDR capability, it is not necessary to use GPUDIRECT; is that a correct statement? Or is GDRCOPY necessary even for nodes with these latest GPUs?
But I am seeing some differences with and without MV2_USE_GPUDIRECT_GDRCOPY being set. I am including the output from two runs, one with this variable set to 1 and one with it set to 0; the osu_bw output from both is pasted side by side below:
sg001% paste osu_bw.out-gdrcopy-0 osu_bw.out-gdrcopy-1
1 0.03 1 0.75
2 0.06 2 1.41
4 0.12 4 3.03
8 0.24 8 5.61
16 0.47 16 7.26
32 0.94 32 0.99
64 1.89 64 1.99
128 3.77 128 3.92
256 7.55 256 7.91
512 15.05 512 15.77
1024 29.99 1024 31.41
2048 59.58 2048 62.50
4096 118.58 4096 124.17
8192 421.15 8192 380.02
16384 451.32 16384 966.68
32768 1330.50 32768 1545.72
65536 1835.08 65536 2043.06
131072 2148.80 131072 2283.34
262144 1922.57 262144 1947.62
524288 3747.85 524288 3823.84
1048576 3810.02 1048576 3844.70
2097152 3838.38 2097152 3856.94
4194304 3853.08 4194304 3862.78
sg001%
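To quantify the gap per message size, one can compute the ratio of the gdrcopy-1 column to the gdrcopy-0 column directly from a paste like the one above. This is a hypothetical helper, not something I ran; it assumes the four-column layout shown (size, bw with gdrcopy=0, size, bw with gdrcopy=1):

```shell
#!/bin/sh
# Hypothetical helper: per-size speedup of MV2_USE_GPUDIRECT_GDRCOPY=1
# over =0, reading the four-column `paste` layout (size bw0 size bw1).
speedup() {
    awk '{ printf "%8d %6.1fx\n", $1, $4 / $2 }'
}

# A few rows copied from the pasted output above:
speedup <<'EOF'
1       0.03    1       0.75
16      0.47    16      7.26
4194304 3853.08 4194304 3862.78
EOF
```

On those rows this shows roughly a 25x advantage at 1 byte shrinking to parity at 4 MB, which matches the observation that the difference is confined to small messages.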
For long messages there is no significant difference in performance, but for smaller messages there is quite a bit of difference. Is this what is expected?
A similar question about regular OFED from the same section: since these are newer GPUs, is the mvapich2-gdr library capable of taking advantage of the GDR capability even without MLNX OFED?
I am not sure whether the second question should instead be put to Red Hat support or Mellanox support.
We are trying to determine whether we have a problem in our hardware/software configuration or whether this is the expected behavior.
We appreciate any comments and suggestions about our observations above!
Thanks,
Raghu