[mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Raghu Reddy
raghu.reddy at noaa.gov
Mon Mar 5 13:22:17 EST 2018
Hi Adam and Hari,
Following up on this thread, I ran a few tests experimenting with some of the tuning variables suggested at the following link:
http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/#_tuning_and_usage_parameters
Since there are multiple combinations possible, I will include the key environment variables along with the output of each. But here is a brief summary:
The hardware and software environment was summarized before; I will add that the OFED stack used for these tests is the one that comes with RHEL 7.4.
Currently Loaded Modules:
1) intel/18.1.163 2) cuda/8.0 3) mvapich2-gdr/2.2-4-cuda-8.0-intel
Running osu_bw from osu-micro-benchmarks-5.3.2, and I will first summarize three runs:
- Host-Host transfers, to get an idea about the best achievable numbers (does not involve GPUs)
- Device-Device with MV2_USE_GPUDIRECT_GDRCOPY=1
- Device-Device with MV2_USE_GPUDIRECT_GDRCOPY=0
sfe01% cat osu_bw.summary
osu_bw:
The first data column (after the message size) is host-to-host between two nodes;
the second and third data columns are device-to-device between two nodes,
with MV2_USE_GPUDIRECT_GDRCOPY=1 and =0 respectively:
Bytes host-host dev-dev-1 dev-dev-0
1 1.83 0.75 0.03
2 3.70 1.41 0.06
4 7.56 3.03 0.12
8 14.74 5.61 0.24
16 29.89 7.26 0.47
32 57.65 0.99 0.94
64 112.11 1.99 1.89
128 214.41 3.92 3.77
256 422.24 7.91 7.55
512 813.87 15.77 15.05
1024 1506.18 31.41 29.99
2048 2664.97 62.50 59.58
4096 3399.11 124.17 118.58
8192 3238.29 380.02 421.15
16384 3642.88 966.68 451.32
32768 3732.85 1545.72 1330.50
65536 3779.08 2043.06 1835.08
131072 3814.95 2283.34 2148.80
262144 3840.66 1947.62 1922.57
524288 3854.33 3823.84 3747.85
1048576 3864.26 3844.70 3810.02
2097152 3868.47 3856.94 3838.38
4194304 3870.53 3862.78 3853.08
sfe01%
The summary above is the same information I had included in my previous e-mail on this thread; the only change is that I added the host-to-host column.
Then I tried a few of the environment variables from the link above; the actual output is included below. But here is a brief summary:
- MV2_USE_CUDA is required for Device-Device transfers, which is of course expected.
- An MV2_CUDA_BLOCK_SIZE setting of 8k caused transfers over 8k to fail, so I set it back to the default for this test.
- Setting MV2_GPUDIRECT_LIMIT to 0 gave better performance than the default value of 8192, with a less drastic drop-off at 32 bytes than seen above with the default values.
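For reference, here is a sketch of the settings for the best-performing run, written in sh syntax for portability (my actual shell is tcsh, and host names and paths are specific to my site):

```shell
#!/bin/sh
# Sketch of the environment for the MV2_GPUDIRECT_LIMIT=0 run above.
export MV2_USE_CUDA=1                # required for device-to-device transfers
export MV2_USE_GPUDIRECT_GDRCOPY=1   # keep GDRCopy enabled for small messages
export MV2_GPUDIRECT_LIMIT=0         # performed better than the default 8192
export MV2_CUDA_BLOCK_SIZE=262144    # the default; 8192 caused failures

# Echo the benchmark command rather than running it, since it needs two hosts:
echo "mpirun -np 2 -hosts sg001,sg002 osu_bw -d cuda D D"
```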
Some of the output is included below, along with a sample command I used to gather it, to provide context for what you're seeing (my default shell is tcsh):
sg001% ( set echo ; env | egrep 'CUDA|GDR|MV2' ; env LD_PRELOAD=$MPIROOT/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D ) | & tee out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_GPUDIRECT_LIMIT-0
sfe01% cat out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_CUDA_BLOCK_SIZE-8192
env
egrep --color=auto CUDA|GDR|MV2
CUDALIBDIR=/apps/cuda/cuda-8.0/lib64
CUDA_INCLUDE_OPTS=-I /apps/cuda/cuda-8.0/include
CUDA_PATH=/apps/cuda/cuda-8.0
CUDA_ROOT=/apps/cuda/cuda-8.0
GDRCOPY_ROOT=/apps/gdrcopy/0.0.0
MPICH_RDMA_ENABLED_CUDA=1
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/gdrcopy/0.0.0/libgdrapi.so
MV2_USE_CUDA=1
MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_CUDA_BLOCK_SIZE=8192
env LD_PRELOAD=/apps/mvapich2-gdr/2.2-4/cuda8.0-intel/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
1 0.76
2 1.41
4 3.04
8 5.64
16 7.39
32 1.05
64 2.11
128 4.19
256 8.38
512 16.68
1024 33.16
2048 65.88
4096 131.43
[sg002:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[sg001:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[sg001:mpi_rank_0][handle_cqe] Msg from 1: wc.status=9, wc.wr_id=0x1e10040, wc.opcode=0, vbuf->phead->type=20 = MPIDI_CH3_PKT_PACKETIZED_SEND_START
[sg001:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:547: [] Got completion with error 9, vendor code=0x8a, dest rank=1
: Bad address (14)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 127904 RUNNING AT sg002
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
sfe01%
sfe01% cat out-cuda-D-D-MV2_USE_GPUDIRECT_GDRCOPY-1-MV2_GPUDIRECT_LIMIT-0
env
egrep --color=auto CUDA|GDR|MV2
CUDALIBDIR=/apps/cuda/cuda-8.0/lib64
CUDA_INCLUDE_OPTS=-I /apps/cuda/cuda-8.0/include
CUDA_PATH=/apps/cuda/cuda-8.0
CUDA_ROOT=/apps/cuda/cuda-8.0
GDRCOPY_ROOT=/apps/gdrcopy/0.0.0
MPICH_RDMA_ENABLED_CUDA=1
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/gdrcopy/0.0.0/libgdrapi.so
MV2_USE_CUDA=1
MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_CUDA_BLOCK_SIZE=262144
MV2_GPUDIRECT_LIMIT=0
env LD_PRELOAD=/apps/mvapich2-gdr/2.2-4/cuda8.0-intel/lib64/libmpi.so mpirun -np 2 -hosts sg001,sg002 libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
1 0.76
2 1.41
4 3.04
8 5.63
16 7.28
32 3.01
64 6.07
128 12.02
256 24.14
512 46.16
1024 93.68
2048 181.33
4096 348.49
8192 382.70
16384 981.47
32768 1547.27
65536 2048.77
131072 2332.59
262144 2211.07
524288 3823.78
1048576 3845.12
2097152 3857.25
4194304 3862.84
sfe01%
I’m sharing this information to find out how others are normally using these settings and what kind of performance they’re getting.
Thanks,
Raghu
From: Subramoni, Hari [mailto:subramoni.1 at osu.edu]
Sent: Tuesday, February 27, 2018 9:48 PM
To: Moody, Adam T.; Raghu Reddy; mvapich-discuss at cse.ohio-state.edu
Cc: Subramoni, Hari
Subject: RE: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Hi, Adam.
Yes, tuning *may be* possible. It depends on various factors including the type of GPU available and its hardware characteristics. We will discuss this and get back to you.
Thx,
Hari.
From: Moody, Adam T. [mailto:moody20 at llnl.gov]
Sent: Tuesday, February 27, 2018 7:18 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; Raghu Reddy <raghu.reddy at noaa.gov>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Just browsing the GDRCOPY=1 numbers, it looks like the bandwidth drop-off after 16 bytes is sharp.
Can that be smoothed out by moving a threshold up to higher byte counts?
-Adam
From: "mvapich-discuss-bounces at cse.ohio-state.edu" <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Tuesday, February 27, 2018 at 8:15 AM
To: Raghu Reddy <raghu.reddy at noaa.gov>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
Hello,
GDRCopy is meant for the very small message range. So the behavior is expected.
GPUDirect RDMA and GDRCopy are two different technologies; one is not needed for the other to work. Both technologies need certain drivers from NVIDIA and Mellanox to be installed. Without these installed they will not work, and thus MVAPICH2-GDR will not be able to take advantage of them. To the best of our knowledge, GPUDirect RDMA will need MLNX_OFED and will not work with a non-MLNX_OFED stack.
From your performance numbers, it looks like you have the GDRCopy module installed; hence MVAPICH2-GDR is able to take advantage of GDRCopy to deliver better performance for the smaller message range.
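As a quick sanity check, one way to see whether the relevant kernel modules are present on a node (the module names here are assumptions: gdrdrv ships with gdrcopy, nv_peer_mem with the Mellanox GPUDirect RDMA package):

```shell
#!/bin/sh
# Sketch: report whether the kernel modules that GDRCopy and GPUDirect RDMA
# rely on are currently loaded. Module names are assumptions, not taken
# from this thread: gdrdrv (gdrcopy) and nv_peer_mem (GPUDirect RDMA).
check_mod() {
    if lsmod 2>/dev/null | grep -q "^$1 "; then
        echo "$1: loaded"
    else
        echo "$1: not loaded"
    fi
}

check_mod gdrdrv
check_mod nv_peer_mem
```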
Please let me know if you have any other questions.
Thx,
Hari.
From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Raghu Reddy
Sent: Tuesday, February 27, 2018 9:16 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Regular OFED vs MLNX OFED on systems with P100 GPUs
In the context of our cluster configuration:
Hardware:
Intel Haswell processors, 2 sockets/node (20 cores/node)
8 P100 GPUs – 4 GPUs connected to socket 0 and 4 GPUs connected to socket 1
Single rail MLNX QDR fabric connected to socket 1
Software:
Running RHEL 7.4
Using stock OFED (Later there is a question about whether MLNX OFED is required)
Intel 18.1 compiler
Mvapich2-GDR/2.2-4 Intel version
As an aside, the reason for sticking with stock OFED is that we have a mixed environment: the non-GPU part of the machine, about 1K nodes, has an Intel TrueScale fabric, while about 100 nodes have a Mellanox fabric, and we would prefer to have a single image for all the nodes.
I am looking at the following documentation, specifically section 7:
http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2/
From my understanding, since P100 GPUs have the GDR capability, it is not necessary to use GPUDIRECT; is that a correct statement? Or is GDRCOPY necessary even for nodes with these latest GPUs?
But I am seeing some differences with and without MV2_USE_GPUDIRECT_GDRCOPY being set. I am including the output from two runs, one with this variable set to 1 and one with it set to 0; the osu_bw output from both is pasted side by side below:
sg001% paste osu_bw.out-gdrcopy-0 osu_bw.out-gdrcopy-1
1 0.03 1 0.75
2 0.06 2 1.41
4 0.12 4 3.03
8 0.24 8 5.61
16 0.47 16 7.26
32 0.94 32 0.99
64 1.89 64 1.99
128 3.77 128 3.92
256 7.55 256 7.91
512 15.05 512 15.77
1024 29.99 1024 31.41
2048 59.58 2048 62.50
4096 118.58 4096 124.17
8192 421.15 8192 380.02
16384 451.32 16384 966.68
32768 1330.50 32768 1545.72
65536 1835.08 65536 2043.06
131072 2148.80 131072 2283.34
262144 1922.57 262144 1947.62
524288 3747.85 524288 3823.84
1048576 3810.02 1048576 3844.70
2097152 3838.38 2097152 3856.94
4194304 3853.08 4194304 3862.78
sg001%
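To quantify the gap per message size, one can compute the ratio of the gdrcopy-1 column to the gdrcopy-0 column directly from a paste like the one above. This is a hypothetical helper, not something I ran; it assumes the four-column layout shown (size, bw with gdrcopy=0, size, bw with gdrcopy=1):

```shell
#!/bin/sh
# Hypothetical helper: per-size speedup of MV2_USE_GPUDIRECT_GDRCOPY=1
# over =0, reading the four-column `paste` layout (size bw0 size bw1).
speedup() {
    awk '{ printf "%8d %6.1fx\n", $1, $4 / $2 }'
}

# A few rows copied from the pasted output above:
speedup <<'EOF'
1       0.03    1       0.75
16      0.47    16      7.26
4194304 3853.08 4194304 3862.78
EOF
```

On those rows this shows roughly a 25x advantage at 1 byte shrinking to parity at 4 MB, which matches the observation that the difference is confined to small messages.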
For long messages there is no significant difference in performance, but for smaller messages there is quite a bit of difference. Is this what is expected?
A similar question about regular OFED from the same section: since these are newer GPUs, is the mvapich2-gdr library capable of taking advantage of the GDR capability even without MLNX OFED?
I am not sure whether the second question should instead be put to Red Hat support or Mellanox support.
We are trying to determine whether we have a problem in our hardware/software configuration or whether this is the expected behavior.
We appreciate any comments and suggestions about our observations above!
Thanks,
Raghu