[Mvapich-discuss] MVAPICH2 GDR from source code?

John Moore john at flexcompute.com
Tue Jan 11 14:58:43 EST 2022


Hi DK,

Do the CUDA and GCC versions on our system need to match the RPM exactly?
We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu.

Thank you,
John

On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar <panda at cse.ohio-state.edu>
wrote:

> Hi,
>
> Thanks for your note. For GPU support with MVAPICH2, it is strongly
> recommended to use the MVAPICH2-GDR package. This package supports many
> features related to GPUs and delivers the best performance and scalability
> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR
> download page for your system. Please refer to the corresponding user guide
> also. The MVAPICH2-GDR package can also be installed through Spack. Let us
> know if you experience any issues in using the MVAPICH2-GDR package on your
> GPU cluster.
>
> Thanks,
>
> DK
>
>
> ________________________________________
> From: Mvapich-discuss <mvapich-discuss-bounces+panda.2=
> osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss <
> mvapich-discuss at lists.osu.edu>
> Sent: Tuesday, January 11, 2022 2:48 PM
> To: mvapich-discuss at lists.osu.edu
> Cc: Maitham Alhubail
> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
>
> Hello,
>
> We have been struggling to get MVAPICH2 to work with CUDA-aware support
> and RDMA. We have compiled MVAPICH2 from source with the --enable-cuda
> option, but when we run the osu_bibw bandwidth test using device-to-device
> communication, we get a segmentation fault.
>
> Below is the output from osu_bibw using MVAPICH2:
>  MVAPICH2-2.3.6 Parameters
> ---------------------------------------------------------------------
>         PROCESSOR ARCH NAME            : MV2_ARCH_AMD_EPYC_7401_48
>         PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_AMD
>         PROCESSOR MODEL NUMBER         : 1
>         HCA NAME                       : MV2_HCA_MLX_CX_HDR
>         HETEROGENEOUS HCA              : NO
>         MV2_EAGERSIZE_1SC              : 0
>         MV2_SMP_EAGERSIZE              : 16385
>         MV2_SMP_QUEUE_LENGTH           : 65536
>         MV2_SMP_NUM_SEND_BUFFER        : 16
>         MV2_SMP_BATCH_SIZE             : 8
>         Tuning Table:                  : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR
> ---------------------------------------------------------------------
> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size      Bandwidth (MB/s)
> 1                       0.07
> 2                       0.15
> 4                       0.29
> 8                       0.57
> 16                      1.12
> 32                      2.30
> 64                      4.75
> 128                     9.41
> 256                    18.44
> 512                    37.22
> 1024                   74.82
> 2048                  144.70
> 4096                  289.96
> 8192                  577.33
> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 471850 RUNNING AT cell3
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> And this is with OpenMPI:
> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size      Bandwidth (MB/s)
> 1                       0.43
> 2                       0.83
> 4                       1.68
> 8                       3.37
> 16                      6.72
> 32                     13.42
> 64                     27.02
> 128                    53.78
> 256                   107.88
> 512                   219.45
> 1024                  437.81
> 2048                  875.12
> 4096                 1747.23
> 8192                 3528.97
> 16384                7015.15
> 32768               13973.59
> 65536               27702.68
> 131072              51877.67
> 262144              94556.99
> 524288             157755.18
> 1048576            236772.67
> 2097152            333635.13
> 4194304            408865.93
>
>
> Can GDR support be obtained by compiling from source, as we are trying to
> do, or do we have to use an RPM? We export MV2_USE_CUDA=1. Any
> recommendations would be greatly appreciated.
>
> Thanks,
> John
>
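
For readers comparing the two osu_bibw tables quoted above, here is a minimal sketch that lines them up over the message sizes both runs completed (1 to 8192 bytes). The numbers are copied from the quoted output. The dictionary and function names are made up for illustration and are not part of either MPI library or of the OSU benchmarks. Note that the two runs used different benchmark versions (v5.7.1 and v5.8), so the ratio is only a rough indication.

```python
# Bandwidth in MB/s, keyed by message size in bytes, copied from the quoted
# osu_bibw output. Only sizes that BOTH runs completed are listed; the
# MVAPICH2 run crashed after printing the 8192-byte row.
mvapich2 = {
    1: 0.07, 2: 0.15, 4: 0.29, 8: 0.57, 16: 1.12, 32: 2.30, 64: 4.75,
    128: 9.41, 256: 18.44, 512: 37.22, 1024: 74.82, 2048: 144.70,
    4096: 289.96, 8192: 577.33,
}
openmpi = {
    1: 0.43, 2: 0.83, 4: 1.68, 8: 3.37, 16: 6.72, 32: 13.42, 64: 27.02,
    128: 53.78, 256: 107.88, 512: 219.45, 1024: 437.81, 2048: 875.12,
    4096: 1747.23, 8192: 3528.97,
}


def ratio(size):
    """Return OpenMPI bandwidth divided by MVAPICH2 bandwidth at one size."""
    return openmpi[size] / mvapich2[size]


if __name__ == "__main__":
    print(f"{'Size':>6} {'MVAPICH2':>10} {'OpenMPI':>10} {'Ratio':>7}")
    for size in sorted(mvapich2):
        print(f"{size:>6} {mvapich2[size]:>10.2f} "
              f"{openmpi[size]:>10.2f} {ratio(size):>6.2f}x")
```

Over these sizes the OpenMPI figures are roughly 5.5 to 6.1 times the MVAPICH2 figures, and the MVAPICH2 run stops before a 16384-byte result is printed.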