[Mvapich-discuss] MVAPICH2 GDR from source code?

John Moore john at flexcompute.com
Wed Jan 12 11:14:15 EST 2022


Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working
on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very
helpful.

We're running Ubuntu 20.04 with kernel 5.4.0-92-generic.

GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

CUDA version: 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John


On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat <shineman.5 at osu.edu> wrote:

> Hi John,
>
> You should be able to use the RPMs on Ubuntu by converting them with
> alien. Regarding the CUDA and compiler versioning, the CUDA version needs
> to be an exact match, but the compiler only needs to match the major
> version. You will also want to match at least the MOFED major version,
> though we recommend matching the exact version if possible. Please take a
> look at the download page and see if any of the RPMs there match your
> needs. Otherwise, we would be happy to generate a custom RPM based on your
> system specifications.
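>
> For example, a conversion along these lines should work (the exact name of
> the .deb that alien generates may differ on your system):
>
>     sudo alien --scripts --to-deb \
>         mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
>     sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_*.deb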
>
> Thanks,
> Nat
> ------------------------------
> *From:* Mvapich-discuss <mvapich-discuss-bounces+shineman.5=
> osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss <
> mvapich-discuss at lists.osu.edu>
> *Sent:* Tuesday, January 11, 2022 14:58
> *To:* Panda, Dhabaleswar <panda at cse.ohio-state.edu>
> *Cc:* Maitham Alhubail <maitham at flexcompute.com>;
> mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code?
>
> Hi DK,
>
> Do the CUDA and GCC versions on our system need to match the RPM exactly?
> We are running on Ubuntu, which does not have GCC 8.4.1.
>
> Thank you,
> John
>
> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar <
> panda at cse.ohio-state.edu> wrote:
>
> Hi,
>
> Thanks for your note. For GPU support with MVAPICH2, it is strongly
> recommended to use the MVAPICH2-GDR package. This package supports many
> features related to GPUs and delivers the best performance and scalability
> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR
> download page for your system, and please also refer to the corresponding
> user guide. The MVAPICH2-GDR package can also be installed through Spack. Let us
> know if you experience any issues in using the MVAPICH2-GDR package on your
> GPU cluster.
>
> Thanks,
>
> DK
>
>
> ________________________________________
> From: Mvapich-discuss <mvapich-discuss-bounces+panda.2=
> osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss <
> mvapich-discuss at lists.osu.edu>
> Sent: Tuesday, January 11, 2022 2:48 PM
> To: mvapich-discuss at lists.osu.edu
> Cc: Maitham Alhubail
> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
>
> Hello,
>
> We have been struggling to get MVAPICH2 to work with CUDA-aware support
> and RDMA. We have compiled MVAPICH2 from source with the --enable-cuda
> option, but when we run the osu_bibw bandwidth test using device-to-device
> communication, we get a segmentation fault.
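>
> For reference, this is roughly how we configured and built it (the install
> prefix and CUDA path below are placeholders for our local ones):
>
>     ./configure --prefix=/opt/mvapich2-2.3.6 \
>         --enable-cuda --with-cuda=/usr/local/cuda
>     make -j && make install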
>
> Below is the output from osu_bibw using MVAPICH2:
>  MVAPICH2-2.3.6 Parameters
> ---------------------------------------------------------------------
>         PROCESSOR ARCH NAME            : MV2_ARCH_AMD_EPYC_7401_48
>         PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_AMD
>         PROCESSOR MODEL NUMBER         : 1
>         HCA NAME                       : MV2_HCA_MLX_CX_HDR
>         HETEROGENEOUS HCA              : NO
>         MV2_EAGERSIZE_1SC              : 0
>         MV2_SMP_EAGERSIZE              : 16385
>         MV2_SMP_QUEUE_LENGTH           : 65536
>         MV2_SMP_NUM_SEND_BUFFER        : 16
>         MV2_SMP_BATCH_SIZE             : 8
>         Tuning Table:                  : MV2_ARCH_AMD_EPYC_7401_48
> MV2_HCA_MLX_CX_HDR
> ---------------------------------------------------------------------
> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size      Bandwidth (MB/s)
> 1                       0.07
> 2                       0.15
> 4                       0.29
> 8                       0.57
> 16                      1.12
> 32                      2.30
> 64                      4.75
> 128                     9.41
> 256                    18.44
> 512                    37.22
> 1024                   74.82
> 2048                  144.70
> 4096                  289.96
> 8192                  577.33
> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault
> (signal 11)
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 471850 RUNNING AT cell3
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
> And this is the output of the same test with OpenMPI:
> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size      Bandwidth (MB/s)
> 1                       0.43
> 2                       0.83
> 4                       1.68
> 8                       3.37
> 16                      6.72
> 32                     13.42
> 64                     27.02
> 128                    53.78
> 256                   107.88
> 512                   219.45
> 1024                  437.81
> 2048                  875.12
> 4096                 1747.23
> 8192                 3528.97
> 16384                7015.15
> 32768               13973.59
> 65536               27702.68
> 131072              51877.67
> 262144              94556.99
> 524288             157755.18
> 1048576            236772.67
> 2097152            333635.13
> 4194304            408865.93
>
>
> Can GDR support be obtained by compiling from source as we are trying to
> do, or do we have to use an RPM? We export MV2_USE_CUDA=1 when running. Any
> recommendations would be greatly appreciated.
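>
> For context, we launch the test roughly like this (both ranks on cell3, as
> in the output above; the exact launcher invocation may differ):
>
>     mpirun_rsh -np 2 cell3 cell3 MV2_USE_CUDA=1 ./osu_bibw D D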
>
> Thanks,
> John
>
>