[mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs

Raghu Reddy raghu.reddy at noaa.gov
Tue Feb 27 10:16:18 EST 2018


In the context of our cluster configuration:

Hardware:

    Intel Haswell processors, 2 sockets/node (20 cores/node)
    8 P100 GPUs per node: 4 GPUs connected to socket 0 and 4 GPUs connected to socket 1
    Single-rail MLNX QDR fabric connected to socket 1 (topology check sketch below)

Software:

    RHEL 7.4
    Stock OFED (there is a question below about whether MLNX OFED is required)
    Intel 18.1 compiler
    MVAPICH2-GDR/2.2-4, Intel build
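Since the HCA sits on socket 1 while the GPUs are split across both sockets, GPU-to-HCA affinity may matter for these tests. For reference, the stock NVIDIA tool prints the hop type (e.g. PIX, PHB, SYS) between each GPU and the mlx HCA:

sg001% nvidia-smi topo -m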


As an aside, the reason for sticking with stock OFED is that we have a mixed environment: the non-GPU part of the machine, about 1K nodes, has an Intel TrueScale fabric, while about 100 nodes have the MLNX fabric, and we would prefer to keep a single image on all of the nodes.


I am looking at the following documentation, specifically section 7:


http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/


From my understanding, since P100 GPUs have the GDR capability, it is not necessary to use GPUDIRECT; is that a correct statement? Or is GDRCOPY necessary even for nodes with these latest GPUs?
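To check this empirically, the setting can be toggled per run on the mpirun_rsh command line. A sketch of how the two runs below can be launched (host names are placeholders, and osu_bw is run with device (D) buffers on both sides, following the mpirun_rsh style used in the user guide):

sg001% mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 ./osu_bw D D > osu_bw.out-gdrcopy-0
sg001% mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=1 ./osu_bw D D > osu_bw.out-gdrcopy-1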


But I am seeing some differences depending on whether MV2_USE_GPUDIRECT_GDRCOPY is set. Below is the osu_bw output from two runs, one with this variable set to 0 and one set to 1 (osu_bw.out-gdrcopy-0 and osu_bw.out-gdrcopy-1, merged into one table):


    Size (bytes)    gdrcopy=0 (MB/s)    gdrcopy=1 (MB/s)
    1                       0.03                0.75
    2                       0.06                1.41
    4                       0.12                3.03
    8                       0.24                5.61
    16                      0.47                7.26
    32                      0.94                0.99
    64                      1.89                1.99
    128                     3.77                3.92
    256                     7.55                7.91
    512                    15.05               15.77
    1024                   29.99               31.41
    2048                   59.58               62.50
    4096                  118.58              124.17
    8192                  421.15              380.02
    16384                 451.32              966.68
    32768                1330.50             1545.72
    65536                1835.08             2043.06
    131072               2148.80             2283.34
    262144               1922.57             1947.62
    524288               3747.85             3823.84
    1048576              3810.02             3844.70
    2097152              3838.38             3856.94
    4194304              3853.08             3862.78


For large messages there is no significant difference in performance, but for small messages there is quite a bit of difference. Is this what is expected?
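In case it helps in diagnosing this, one thing we can check on our end (a sketch, assuming the standard gdrcopy packaging, where the host-side library is backed by a gdrdrv kernel module) is whether that module is loaded on the compute nodes and its device node exists:

sg001% lsmod | grep gdrdrv
sg001% ls -l /dev/gdrdrv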


A similar question about regular OFED, from the same section of the user guide: since these are newer GPUs, is the mvapich2-gdr library capable of taking advantage of the GDR capability even without MLNX OFED?
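(My understanding, stated as an assumption on my part rather than something from the user guide, is that GPUDirect RDMA relies on the nv_peer_mem / nvidia-peer-memory kernel module, which is typically built against MLNX OFED. A quick way to check whether it is present:

sg001% lsmod | grep nv_peer_mem

If that module is what stock OFED lacks, it might explain a GDR capability gap.)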


I am not sure whether this second question should instead be put to Red Hat support or to MLNX support.


We are trying to determine whether we have a problem in our hardware/software configuration or whether this is the expected behavior.


We appreciate any comments and suggestions about our observations above!


Thanks,

Raghu

