[mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs

Tue Feb 27 11:14:08 EST 2018

Hello,

GDRCopy is meant for the very small message range, so the behavior you are seeing is expected.

GPUDirectRDMA and GDRCopy are two different technologies. Neither is needed for the other to work. Both technologies need certain drivers from NVIDIA and Mellanox to be installed. Without these drivers they will not work, and MVAPICH2-GDR will not be able to take advantage of them. To the best of our knowledge, GPUDirectRDMA needs MLNX_OFED and will not work with a non-MLNX OFED.

From your performance numbers, it looks like you have the GDRCopy module installed, so MVAPICH2-GDR is able to take advantage of GDRCopy to deliver better performance for the smaller message range.
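
One way to confirm which of the two driver stacks is present on a node is to look at the loaded kernel modules. The following is a minimal sketch, not an official MVAPICH2-GDR check. It assumes that the GDRCopy kernel module is named gdrdrv and that the Mellanox peer-memory module used for GPUDirectRDMA is named nv_peer_mem; both names are assumptions based on common packaging and should be verified against what is actually installed.

```python
from pathlib import Path

# Assumed kernel module names (not taken from this thread); verify locally.
MODULES = {
    "gdrdrv": "GDRCopy",
    "nv_peer_mem": "GPUDirectRDMA peer-memory client",
}


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def report(proc_modules_text):
    """Map each technology label to True if its assumed module is loaded."""
    loaded = loaded_modules(proc_modules_text)
    return {label: (name in loaded) for name, label in MODULES.items()}


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        for label, present in report(path.read_text()).items():
            print(f"{label}: {'loaded' if present else 'not loaded'}")
    else:
        print("/proc/modules not found; this check only applies to Linux")
```

A result of "not loaded" only means that a module with the assumed name is absent; it does not by itself prove that the technology is unavailable.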

Please let me know if you have any other questions.

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Raghu Reddy
Sent: Tuesday, February 27, 2018 9:16 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs

In the context of our cluster configuration:

Hardware:
               Intel Haswell processors, 2 sockets/node (20 cores/node)
               8 P100 GPUs - 4 GPUs connected to socket 0 and 4 GPUs connected to socket 1
               Single rail MLNX QDR fabric connected to socket 1

Software:
               Running RHEL 7.4
               Using stock OFED (Later there is a question about whether MLNX OFED is required)
               Intel 18.1 compiler
               Mvapich2-GDR/2.2-4 Intel version

As an aside, the reason for sticking with stock OFED is because we have a mixed environment; non-GPU part of the machine with about 1K nodes has Intel TrueScale fabric, and we have about 100 nodes with MLNX fabric, and we would prefer to have a single image for all the nodes.

I am looking at the following documentation, specifically section 7:

http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/

From my understanding, since P100 GPUs have the GDR capability, it is not necessary to use GPUDIRECT, is that a correct statement? Or is GDRCOPY necessary even for nodes with these latest GPUs?

But I am seeing some differences with and without MV2_USE_GPUDIRECT_GDRCOPY being set.  I am including the output from two runs, one with this variable set to 1 and one to 0, and I have pasted output from osu_bw:

sg001% paste osu_bw.out-gdrcopy-0 osu_bw.out-gdrcopy-1
1                       0.03    1                       0.75
2                       0.06    2                       1.41
4                       0.12    4                       3.03
8                       0.24    8                       5.61
16                      0.47    16                      7.26
32                      0.94    32                      0.99
64                      1.89    64                      1.99
128                     3.77    128                     3.92
256                     7.55    256                     7.91
512                    15.05    512                    15.77
1024                   29.99    1024                   31.41
2048                   59.58    2048                   62.50
4096                  118.58    4096                  124.17
8192                  421.15    8192                  380.02
16384                 451.32    16384                 966.68
32768                1330.50    32768                1545.72
65536                1835.08    65536                2043.06
131072               2148.80    131072               2283.34
262144               1922.57    262144               1947.62
524288               3747.85    524288               3823.84
1048576              3810.02    1048576              3844.70
2097152              3838.38    2097152              3856.94
4194304              3853.08    4194304              3862.78
sg001%

For long messages there is no significant difference in performance, but for smaller messages there is quite a bit of difference.  Is this what is expected?
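
To make the comparison easier to read, the ratio between the two bandwidth columns can be computed per message size. The sketch below is only an illustration: it embeds a few rows copied from the pasted output above and assumes the four-column layout produced by paste (size, bandwidth with GDRCOPY=0, size, bandwidth with GDRCOPY=1). The function name speedups is made up for this example.

```python
# A few rows copied from the pasted osu_bw output above:
# size, bandwidth (GDRCOPY=0), size, bandwidth (GDRCOPY=1)
PASTED = """\
1                       0.03    1                       0.75
16                      0.47    16                      7.26
32                      0.94    32                      0.99
4096                  118.58    4096                  124.17
16384                 451.32    16384                 966.68
4194304              3853.08    4194304              3862.78
"""


def speedups(text):
    """Return {message_size: bw_with_gdrcopy / bw_without_gdrcopy}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # blank lines, bare prompts
        try:
            size0, size1 = int(fields[0]), int(fields[2])
            bw0, bw1 = float(fields[1]), float(fields[3])
        except ValueError:
            continue  # non-numeric rows such as the shell command line
        if size0 != size1:
            raise RuntimeError(f"message sizes do not line up: {line!r}")
        result[size0] = bw1 / bw0
    return result


if __name__ == "__main__":
    for size, ratio in speedups(PASTED).items():
        print(f"{size:>8d}  {ratio:6.2f}x")
```

For the rows included here, the ratio is large at 1 and 16 bytes, close to 1 from 32 bytes through 4096 bytes, roughly 2 at 16384 bytes, and close to 1 again at 4194304 bytes.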

A similar question about regular OFED, from the same section: since these are newer GPUs, is the mvapich2-gdr library capable of taking advantage of the GDR capability even without the MLNX OFED?

Not sure if the second question is something that should be put to either Red Hat support or MLNX support?

We are trying to determine if we have a problem in our hardware/software configuration or if this is what is expected.

We appreciate any comments and suggestions about our observations above!

Thanks,
Raghu
