[Mvapich-discuss] Announcing the release of MVAPICH2-GDR 2.3.7 GA

Panda, Dhabaleswar panda at cse.ohio-state.edu
Sat May 28 07:57:00 EDT 2022


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.7 GA.

The MVAPICH2-GDR 2.3.7 release incorporates several novel features, as
listed below:

* Support for 'on-the-fly' compression of point-to-point messages used for
  GPU-to-GPU communication on NVIDIA GPUs (a usage sketch follows this
  list).

* Support for hybrid communication protocols using NCCL-based, CUDA-based,
  and IB verbs-based primitives for the following MPI blocking and
  non-blocking collective operations:
  - MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv,
    MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv,
    MPI_Gather, MPI_Gatherv, and MPI_Bcast.
  - MPI_Iallreduce, MPI_Ireduce, MPI_Iallgather, MPI_Iallgatherv,
    MPI_Ialltoall, MPI_Ialltoallv, MPI_Iscatter, MPI_Iscatterv,
    MPI_Igather, MPI_Igatherv, and MPI_Ibcast.

* Full support for NVIDIA DGX, NVIDIA DGX V-100, NVIDIA DGX A-100, and
  AMD systems with Mi100 GPUs.
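
Below is a minimal sketch (not taken from the release itself) of the
CUDA-aware point-to-point pattern the compression feature applies to:
device pointers are passed directly to MPI, and the library moves the
data GPU to GPU. The message size and initialization are illustrative
assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;              /* 1M floats (~4 MB) */
        float *d_buf;
        cudaMalloc((void **)&d_buf, count * sizeof(float));
        cudaMemset(d_buf, 0, count * sizeof(float));

        /* Device pointers go straight into MPI_Send/MPI_Recv; the
         * library handles the GPU-to-GPU transfer (and, in 2.3.7,
         * can apply on-the-fly compression to such messages). */
        if (rank == 0)
            MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }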

MVAPICH2-GDR 2.3.7 provides optimized support at the MPI level for HPC,
deep learning, machine learning, and data science workloads. These
include efficient large-message collectives (e.g., Allreduce) on CPUs
and GPUs, and GPU-Direct algorithms for all collective operations
(including those commonly used for model parallelism, e.g., Allgather
and Alltoall).
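
As a concrete illustration of the GPU-Direct collectives mentioned
above, here is a hedged sketch of an Allgather issued directly on
device buffers, the pattern typical of model-parallel workloads. The
buffer sizes are illustrative assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 1 << 22;            /* 4M floats per rank */
        float *d_send, *d_recv;
        cudaMalloc((void **)&d_send, chunk * sizeof(float));
        cudaMalloc((void **)&d_recv, (size_t)chunk * size * sizeof(float));
        cudaMemset(d_send, 0, chunk * sizeof(float));

        /* Gather each rank's device buffer onto every rank, directly
         * from GPU memory to GPU memory. */
        MPI_Allgather(d_send, chunk, MPI_FLOAT,
                      d_recv, chunk, MPI_FLOAT, MPI_COMM_WORLD);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }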

MVAPICH2-GDR 2.3.7 is based on the standard MVAPICH2 2.3.7 release and
incorporates designs that take advantage of GPUDirect RDMA (GDR) on
NVIDIA GPUs and ROCmRDMA on AMD GPUs for inter-node data movement on
GPU clusters with Mellanox InfiniBand interconnects. It also provides
support for DGX-2, OpenPOWER, NVLink2, and GDRCopy v2; efficient
intra-node CUDA-aware unified memory communication; and RDMA_CM,
RoCE-V1, and RoCE-V2.

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.7 GA are
listed here.

* Features and Enhancements (Since 2.3.6)
    - Enhanced performance for GPU-aware MPI_Alltoall and MPI_Alltoallv
    - Added automatic rebinding of processes to cores based on GPU NUMA domain
        - This is enabled by setting the environment variable
          MV2_GPU_AUTO_REBIND=1
    - Added NCCL communication substrate for various non-blocking MPI
      collectives (see the sketch after this list)
        - MPI_Iallreduce, MPI_Ireduce, MPI_Iallgather, MPI_Iallgatherv,
          MPI_Ialltoall, MPI_Ialltoallv, MPI_Iscatter, MPI_Iscatterv,
          MPI_Igather, MPI_Igatherv, and MPI_Ibcast
    - Enhanced point-to-point and collective tuning for AMD Milan processors
      with NVIDIA A-100 and AMD Mi100 GPUs
    - Enhanced point-to-point and collective tuning for NVIDIA DGX A-100 systems
    - Added support for Cray Slingshot-10 interconnect
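
As referenced in the list above, here is a minimal sketch of one of
the newly supported non-blocking collectives (MPI_Iallreduce) issued
on device buffers so the reduction can overlap other host-side work.
The buffer size is an illustrative assumption; the NCCL substrate and
MV2_GPU_AUTO_REBIND=1 are selected at run time through the
environment, not through the MPI API.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 16 * 1024 * 1024;   /* 16M floats (64 MB) */
        float *d_buf;
        cudaMalloc((void **)&d_buf, count * sizeof(float));
        cudaMemset(d_buf, 0, count * sizeof(float));

        MPI_Request req;
        MPI_Iallreduce(MPI_IN_PLACE, d_buf, count, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD, &req);

        /* ... other host-side work can proceed here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }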

Further, MVAPICH2-GDR 2.3.7 GA provides support for GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.7 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds (8
bytes) with CUDA 10.1 and Volta GPUs. On OpenPOWER platforms with
NVLink2, it delivers up to 70.4 GBps unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GBps unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
on the MVAPICH website (under Performance->MV2-GDR->CUDA).

For downloading MVAPICH2-GDR 2.3.7 GA and associated user
guides, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at lists.osu.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to inform you that the number of organizations
using MVAPICH2 libraries (and registered at the MVAPICH site) has
crossed 3,200 worldwide (in 89 countries). The number of downloads
from the MVAPICH site has crossed 1,590,000 (1.59 million). The
MVAPICH team would like to thank all its users and organizations!



