[mvapich-discuss] Announcing the release of MVAPICH2-GDR 2.3.5 GA and OSU Micro-Benchmarks (OMB) 5.7

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Dec 11 21:14:27 EST 2020


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.5 GA and OSU Micro-Benchmarks (OMB) 5.7.

The MVAPICH2-GDR 2.3.5 release incorporates initial support for the
new Radeon Instinct series of GPUs from AMD via the Radeon Open
Compute (ROCm) platform. In particular, MVAPICH2-GDR 2.3.5 supports ROCm
PeerDirect, ROCm IPC, and unified memory based device-to-device
communication. Detailed performance numbers are available from the
MVAPICH website (under Performance->MV2-GDR->ROCM).
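To illustrate what this enables at the application level, here is a
minimal sketch of GPU-aware point-to-point communication on a ROCm
system; the two-rank layout and message size are assumptions for the
example, and the device buffer is handed directly to MPI:

    #include <mpi.h>
    #include <hip/hip_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the message buffer in AMD GPU device memory. */
        const int n = 1 << 20;
        double *buf;
        hipMalloc((void **)&buf, n * sizeof(double));

        /* The device pointer goes straight to MPI; a ROCm-aware
         * library moves the data without manual host staging. */
        if (rank == 0)
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        hipFree(buf);
        MPI_Finalize();
        return 0;
    }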

MVAPICH2-GDR 2.3.5 also provides enhanced designs for GPU-aware
MPI_Alltoall and MPI_Allgather, along with support for enhanced MPI
derived datatype processing via GPU kernel fusion techniques.
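The kernel fusion itself is internal to the library; from the
application side, a derived datatype on a device buffer is used as
usual. A minimal sketch (the matrix size and CUDA allocation are
assumptions for illustration, not code from the release):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One column of a 1024x1024 row-major matrix on the GPU:
         * 1024 blocks of 1 double each, with a stride of 1024. */
        const int n = 1024;
        double *mat;
        cudaMalloc((void **)&mat, (size_t)n * n * sizeof(double));

        MPI_Datatype column;
        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* Packing/unpacking this non-contiguous layout on the GPU
         * is what the fused-kernel datatype engine accelerates. */
        if (rank == 0)
            MPI_Send(mat, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(mat, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(mat);
        MPI_Finalize();
        return 0;
    }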

MVAPICH2-GDR 2.3.5 provides optimized support at the MPI level for
deep learning workloads. These include efficient large-message
collectives (e.g., Allreduce) on CPUs and GPUs, and GPU-Direct
algorithms for all collective operations (including those commonly
used for model parallelism, e.g., Allgather and Alltoall).
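The pattern these optimizations target is the per-iteration gradient
reduction of data-parallel training. As a sketch (the buffer size and
datatype are arbitrary, and the buffer stands in for a gradient
tensor resident on the GPU):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* A large, gradient-sized buffer in GPU device memory. */
        const int n = 16 * 1024 * 1024;
        float *grad;
        cudaMalloc((void **)&grad, (size_t)n * sizeof(float));

        /* Large-message Allreduce directly on device memory: the
         * collective that dominates data-parallel deep learning. */
        MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(grad);
        MPI_Finalize();
        return 0;
    }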

MVAPICH2-GDR 2.3.5 is based on the standard MVAPICH2 2.3.5 release and
incorporates designs that take advantage of the GPUDirect RDMA (GDR)
on NVIDIA GPUs and ROCmRDMA on AMD GPUs for inter-node data movement
on GPU clusters with Mellanox InfiniBand interconnect. It also
provides support for DGX-2, OpenPOWER and NVLink2, GDRCopyv2,
efficient intra-node CUDA-Aware unified memory communication and
support for RDMA_CM, RoCE-V1, and RoCE-V2.
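At run time, communication from device buffers is enabled through the
library's environment variables. As a hedged sketch (the hostnames
and application binary are placeholders, and mpirun_rsh is only one
of the supported launchers), a two-node run using the MV2_USE_CUDA
run-time parameter from the MVAPICH2-GDR user guide could look like:

    $ mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 ./gpu_app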

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.5 GA are
listed here.

* Features and Enhancements (Since 2.3.4)
    - Based on MVAPICH2 2.3.5
    - Added support for AMD GPUs via Radeon Open Compute (ROCm) platform
    - Added support for ROCm PeerDirect, ROCm IPC, and unified memory based
      device-to-device communication for AMD GPUs
    - Enhanced designs for GPU-aware MPI_Alltoall
    - Enhanced designs for GPU-aware MPI_Allgather
    - Added support for enhanced MPI derived datatype processing via kernel
      fusion
    - Added architecture-specific flags to improve the performance of CUDA
      operations
    - Added support for Apache MXNet Deep Learning Framework
    - Tested with the PyTorch and DeepSpeed frameworks for distributed Deep
      Learning

* Bug Fixes (Since 2.3.4)
    - Fix performance degradation in first CUDA call due to CUDA JIT
      compilation for PTX Compatibility
    - Added GPU-based point-to-point tuning for AMD Mi50 and Mi60 GPUs
    - Enhanced GPU-based Alltoall and Allgather tuning for POWER9 systems
    - Enhanced GPU-based Allreduce tuning for Frontera RTX system
    - Fix validation issue with kernel-based datatype processing
    - Fix validation issue with GPU based MPI_Scatter
    - Fix a potential issue when using MPI_Win_allocate
      - Thanks to Bert Wesarg at TU Dresden and George Katevenis at ICS-FORTH
        for reporting the issue and providing the initial patch
    - Fix out of memory issue when allocating CUDA events
    - Fix compilation errors with PGI 20.x compilers
    - Fix compilation warnings and memory leaks

Further, MVAPICH2-GDR 2.3.5 GA provides support on GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.5 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds (8
bytes) with CUDA 10.1 and Volta GPUs. On OpenPOWER platforms with
NVLink2, it delivers up to 70.4 GBps unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GBps unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
from the MVAPICH website (under Performance->MV2-GDR->CUDA).

OSU Micro-Benchmarks 5.7 introduces support to benchmark
point-to-point, collective, and one-sided MPI operations for
intra-node and inter-node configurations for the new Radeon Instinct
series of GPUs from AMD. The new features, enhancements, and bug
fixes for OSU Micro-Benchmarks (OMB) 5.7 are listed here.

* New Features & Enhancements (since v5.6.3)
    - Added support in OMB to evaluate the performance of various primitives
      with AMD GPU devices and ROCm support (see the usage sketch after this
      list)
        - This functionality is exposed when configured with the --enable-rocm
          option
        - Thanks to AMD for the initial patch
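
As a usage sketch (the launcher and the path of the benchmark binary
inside the build tree are assumptions; '--enable-rocm' is the
configure option noted above, and 'D D' is the usual OMB convention
for placing both the send and receive buffers in device memory):

    $ ./configure CC=mpicc --enable-rocm
    $ make
    $ mpirun -np 2 ./mpi/pt2pt/osu_latency D D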

* Bug Fixes (since v5.6.3)
    - Enhance one-sided window creation and fix a potential issue when using
      MPI_Win_allocate
        * Thanks to Bert Wesarg and George Katevenis for reporting the issue
          and providing the initial patch
    - Remove additional '-M' option that gets printed with help message for
      osu_latency_mt and osu_latency_mp
        * Thanks to Nick Papior for the report
    - Added missing '-W' option support for one-sided bandwidth tests

For downloading MVAPICH2-GDR 2.3.5 GA, OMB 5.7, and associated user
guides, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to share that the number of organizations using
MVAPICH2 libraries (and registered at the MVAPICH site) has crossed
3,125 worldwide (in 89 countries). The number of downloads from the
MVAPICH site has crossed 1,166,000 (1.16 million). The MVAPICH team
would like to thank all its users and organizations!


