[Mvapich] Announcing the Release of MVAPICH2-GDR 2.3.6 GA and OSU Micro-Benchmarks (OMB) 5.8

Panda, Dhabaleswar panda at cse.ohio-state.edu
Thu Aug 12 19:48:30 EDT 2021


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.6 GA and OSU Micro-Benchmarks (OMB) 5.8.

The MVAPICH2-GDR 2.3.6 release incorporates several novel features,
as listed below (a brief usage sketch follows the list):

* Support for 'on-the-fly' compression of point-to-point messages used for
  GPU to GPU communication for NVIDIA GPUs.

* Support for hybrid communication protocols using NCCL-based, CUDA-based,
  and IB verbs-based primitives for the following MPI collective operations
  - MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv,
    MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv,
    MPI_Gather, MPI_Gatherv, and MPI_Bcast.

* Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2 A-100
  systems.
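
To illustrate how an application exercises these features, below is a
minimal sketch of GPU-to-GPU point-to-point communication with a
CUDA-aware MPI such as MVAPICH2-GDR: device buffers are passed
directly to MPI calls, and the library internally handles GDR
transfers and, when enabled, the on-the-fly compression. The buffer
size, tag, and one-GPU-per-rank device selection are illustrative
assumptions, not part of the release.

    /* Sketch: GPU-to-GPU point-to-point with a CUDA-aware MPI.
     * Device pointers are handed straight to MPI_Send/MPI_Recv;
     * build flags and library paths are installation-specific. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 22;          /* 4M floats: a large GPU message */
        float *dbuf;
        cudaSetDevice(0);                   /* illustrative: one GPU per rank */
        cudaMalloc((void **)&dbuf, count * sizeof(float));

        if (rank == 0)
            MPI_Send(dbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }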

MVAPICH2-GDR 2.3.6 provides optimized support at the MPI level for
HPC, deep learning, machine learning, and data science workloads.
These include efficient large-message collectives (e.g., Allreduce) on
CPUs and GPUs, and GPU-Direct algorithms for all collective operations
(including those commonly used for model parallelism, e.g., Allgather
and Alltoall).
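
For the deep learning workloads mentioned above, the common pattern is
a large-message Allreduce issued directly on GPU-resident gradient
buffers. A minimal sketch of that pattern follows (the parameter count
is an arbitrary illustration); whether the collective is serviced by
NCCL-, CUDA-, or IB verbs-based primitives is decided inside the
library, and the application code is unchanged.

    /* Sketch: data-parallel gradient aggregation with a large-message
     * Allreduce on a GPU buffer, via a CUDA-aware MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int nparams = 25000000;       /* e.g. a 25M-parameter model */
        float *grads;
        cudaMalloc((void **)&grads, (size_t)nparams * sizeof(float));

        /* Sum gradients across all ranks, in place, from device memory. */
        MPI_Allreduce(MPI_IN_PLACE, grads, nparams, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(grads);
        MPI_Finalize();
        return 0;
    }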

MVAPICH2-GDR 2.3.6 is based on the standard MVAPICH2 2.3.6 release and
incorporates designs that take advantage of GPUDirect RDMA (GDR) on
NVIDIA GPUs and ROCmRDMA on AMD GPUs for inter-node data movement on
GPU clusters with Mellanox InfiniBand interconnects. It also provides
support for DGX-2, OpenPOWER with NVLink2, GDRCopy v2, efficient
intra-node CUDA-aware unified memory communication, and RDMA_CM,
RoCE v1, and RoCE v2.
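
The CUDA-aware unified memory support mentioned above means a managed
allocation can also be handed to MPI directly. A minimal sketch,
assuming a CUDA-aware build (buffer size and root rank are arbitrary):

    /* Sketch: communication from a CUDA managed (unified memory) buffer. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 1 << 20;
        float *mbuf;
        cudaMallocManaged((void **)&mbuf, count * sizeof(float),
                          cudaMemAttachGlobal);

        /* Broadcast the managed buffer from rank 0 to all other ranks. */
        MPI_Bcast(mbuf, count, MPI_FLOAT, 0, MPI_COMM_WORLD);

        cudaFree(mbuf);
        MPI_Finalize();
        return 0;
    }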

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.6 GA are
listed here.

* Features and Enhancements (Since 2.3.5)
    - Based on MVAPICH2 2.3.6
    - Added support for 'on-the-fly' compression of point-to-point messages
      used for GPU to GPU communication
        - Applicable to NVIDIA GPUs
    - Added NCCL communication substrate for various MPI collectives
        - Support for hybrid communication protocols using NCCL-based,
          CUDA-based, and IB verbs-based primitives
        - MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv,
          MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv,
          MPI_Gather, MPI_Gatherv, and MPI_Bcast
    - Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2 A-100
      systems
        - Enhanced architecture detection, process placement, and HCA selection
        - Enhanced intra-node and inter-node point-to-point tuning
        - Enhanced collective tuning

    - Introduced architecture detection, point-to-point tuning, and
      collective tuning for ThetaGPU @ANL
    - Enhanced point-to-point and collective tuning for NVIDIA GPUs on
      Frontera @TACC, Lassen @LLNL, and Sierra @LLNL
    - Enhanced point-to-point and collective tuning for AMD MI50 and MI60
      GPUs on Corona @LLNL
    - Added several new MPI_T PVARs (an enumeration sketch follows this list)
    - Added support for CUDA 11.3
    - Added support for ROCm 4.1+
    - Enhanced output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL Frameworks
        - TensorFlow, PyTorch, and MXNet
    - Tested with MPI4Dask 0.2
        - MPI4Dask is a custom Dask Distributed package with MPI support
    - Tested with MPI4cuML 0.1
        - MPI4cuML is a custom cuML package with MPI support
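
The new MPI_T PVARs can be enumerated with the standard MPI tool
information interface. The sketch below uses only standard MPI_T calls
and does not assume any MVAPICH2-specific variable names:

    /* Sketch: listing the performance variables (PVARs) exposed by the
     * MPI library, including library-specific ones, via MPI_T. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, num_pvars;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_T_pvar_get_num(&num_pvars);
        if (rank == 0) {
            for (int i = 0; i < num_pvars; i++) {
                char name[256], desc[1024];
                int name_len = sizeof(name), desc_len = sizeof(desc);
                int verbosity, var_class, bind, readonly, continuous, atomic;
                MPI_Datatype datatype;
                MPI_T_enum enumtype;
                MPI_T_pvar_get_info(i, name, &name_len, &verbosity,
                                    &var_class, &datatype, &enumtype,
                                    desc, &desc_len, &bind, &readonly,
                                    &continuous, &atomic);
                printf("PVAR %d: %s - %s\n", i, name, desc);
            }
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }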

* Bug Fixes (Since 2.3.5)
    - Fix a bug where GPUs and HCAs were incorrectly identified as being
      on different sockets
        - Thanks to Chris Chambreau @LLNL for the report
    - Fix issues in collective tuning tables
    - Fix issues with adaptive HCA selection
    - Fix compilation warnings and memory leaks

Further, MVAPICH2-GDR 2.3.6 GA provides support on GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.6 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds (8
bytes) with CUDA 10.1 and Volta GPUs. On OpenPOWER platforms with
NVLink2, it delivers up to 70.4 GBps unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GBps unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
from the MVAPICH website (under Performance->MV2-GDR->CUDA).

OSU Micro-Benchmarks 5.8 introduces support for benchmarking
NCCL-based point-to-point and collective operations in intra-node and
inter-node configurations. The new features, enhancements, and bug
fixes for OSU Micro-Benchmarks (OMB) 5.8 are listed below; a brief
sketch of the NCCL allreduce pattern exercised by these benchmarks
follows the feature list.

* New Features & Enhancements (since v5.7.1)
    - Add support for NCCL pt2pt benchmarks
        * osu_nccl_bibw
        * osu_nccl_bw
        * osu_nccl_latency
    - Add support for NCCL collective benchmarks
        * osu_nccl_allgather
        * osu_nccl_allreduce
        * osu_nccl_bcast
        * osu_nccl_reduce
        * osu_nccl_reduce_scatter
    - Add data validation support for
        * osu_allreduce
        * osu_nccl_allreduce
        * osu_reduce
        * osu_nccl_reduce
        * osu_alltoall

* Bug Fixes (since v5.7.1)
    - Fix bug in support for CUDA managed memory benchmarks
        - Thanks to Adam Goldman @Intel for the report and the
          initial patch
    - Protect managed memory functionality with appropriate compile
      time flag

For downloading MVAPICH2-GDR 2.3.6 GA, OMB 5.8, and associated user
guides, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to inform you that the number of organizations
using MVAPICH2 libraries (and registered at the MVAPICH site) has
crossed 3,200 worldwide (in 89 countries). The number of downloads
from the MVAPICH site has crossed 1,419,000 (1.41 million). The MVAPICH
team would like to thank all its users and organizations!!


