[mvapich-discuss] Announcing the release of MVAPICH2-GDR 2.3 GA and OMB 5.5

Panda, Dhabaleswar panda at cse.ohio-state.edu
Sun Nov 11 01:04:51 EST 2018


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR 2.3 GA and OSU
Micro-Benchmarks (OMB) 5.5.

MVAPICH2-GDR 2.3 is based on the standard MVAPICH2 2.3 release and
incorporates designs that take advantage of the GPUDirect RDMA (GDR) technology
for inter-node data movement on NVIDIA GPU clusters with Mellanox InfiniBand
interconnects. It also provides support for OpenPOWER and NVLink, efficient
intra-node CUDA-aware unified memory communication, and support for RDMA_CM,
RoCE-V1, and RoCE-V2. Further, MVAPICH2-GDR 2.3 provides optimized large-message
collectives (broadcast, reduce, and allreduce) for emerging Deep Learning and
streaming frameworks.
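
For readers less familiar with the CUDA-aware model these designs target, the
minimal sketch below (illustrative only, not code from the MVAPICH2-GDR
distribution) shows the intended usage: device pointers returned by cudaMalloc
are passed directly to MPI calls, and the library moves the data, using
GPUDirect RDMA where available, with no explicit host staging. The message
size and tag are arbitrary.

    /* cuda_aware_pingpong.c -- minimal sketch of CUDA-aware MPI
     * point-to-point; compile with the MVAPICH2-GDR compiler wrapper,
     * e.g. mpicc, and link against the CUDA runtime (-lcudart). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        int nbytes = 1 << 20;            /* 1 MB message, illustrative */
        void *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc(&d_buf, nbytes);      /* buffer lives in GPU memory */

        if (rank == 0) {
            /* Device pointer handed straight to MPI_Send */
            MPI_Send(d_buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d bytes into GPU memory\n", nbytes);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

A typical launch runs one process per GPU with the library's CUDA support
enabled (e.g., MV2_USE_CUDA=1), as described in the user guide.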

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3 GA are listed here.

* Features and Enhancements (Since 2.2 GA)
    - Based on MVAPICH2 2.3 GA
    - Support for CUDA 10.0, 9.2, 9.0
    - Support for Volta (V100) GPU
    - Support for OpenPOWER9 with NVLink
    - Support for IBM XLC and PGI compilers with CUDA kernel features
    - Enhanced point-to-point performance for small messages
    - Enhanced performance of GPU-based point-to-point communication
    - Efficient Multiple CUDA stream-based IPC communication for
      multi-GPU systems with and without NVLink
    - Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based
      communication
    - Enhanced Alltoallv operation for host buffers
    - Support for collective offload using Mellanox's SHArP for Allreduce on
      host buffers
        - Enhanced tuning framework for Allreduce using SHArP
    - Enhanced large-message Reduce, Broadcast and Allreduce for Deep Learning
      workloads
    - Enhanced performance of MPI_Allreduce for GPU-resident data (a usage
      sketch appears after this list)
    - InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and
      streaming applications
        * Basic support for IB-MCAST designs with GPUDirect RDMA
        * Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
        * Advanced reliability support for IB-MCAST designs
    - Add new runtime variables 'MV2_USE_GPUDIRECT_RDMA' and 'MV2_USE_GDR' to
      replace 'MV2_USE_GPUDIRECT'
    - Enhanced CUDA-based collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems
    - Enhanced host-based collectives for IBM POWER8/9, Intel
      Skylake, Intel KNL, and Intel Broadwell architectures
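
As a usage sketch for the GPU-resident MPI_Allreduce entry above (again
illustrative, not code from the distribution), the pattern these CUDA-based
collectives optimize is shown below; the element count, datatype, and
initialization are arbitrary.

    /* gpu_allreduce.c -- sketch of MPI_Allreduce on GPU-resident buffers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 22;       /* ~16 MB of floats: a "large message" */
        float *d_in, *d_out;

        MPI_Init(&argc, &argv);

        cudaMalloc((void **)&d_in,  count * sizeof(float));
        cudaMalloc((void **)&d_out, count * sizeof(float));
        /* In a real application d_in would be filled by a CUDA kernel;
         * zeroing it here just keeps the reduction well defined. */
        cudaMemset(d_in, 0, count * sizeof(float));

        /* Device pointers are passed directly; the library selects a
         * CUDA-aware reduction algorithm. */
        MPI_Allreduce(d_in, d_out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(d_in);
        cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }

At run time, the GPUDirect RDMA path can be toggled with the
MV2_USE_GPUDIRECT_RDMA / MV2_USE_GDR variables listed above.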

* Bug Fixes (since 2.2 GA):
    - Fix memory leaks in CUDA-based collectives
    - Fix memory leaks in CUDA IPC cache designs
    - Fix segfault when freeing NULL IPC resources
    - Fix issues with InfiniBand Multicast (IB-MCAST) based designs for GPU-based Broadcast
    - Fix hang issue with the zero-copy Broadcast operation
    - Fix issue with datatype processing for host buffers
    - Fix application crash with GDRCopy feature
    - Fix memory leak in CUDA-based Allreduce algorithms
    - Fix data validation issue for Allreduce algorithms
    - Fix data validation issue for non-blocking Gather operation
    - Fix issue with MPI_Finalize when MV2_USE_GPUDIRECT=0
    - Fix data validation issue with GDRCOPY and Loopback
    - Fix issue with runtime error when MV2_USE_CUDA=0
    - Fix issue with MPI_Allreduce for R3 protocol
    - Fix warning message when GDRCOPY module cannot be used

Further, MVAPICH2-GDR 2.3 GA also supports GPU clusters using regular OFED
(without GPUDirect RDMA).

MVAPICH2-GDR 2.3 GA continues to deliver excellent performance. It provides
inter-node Device-to-Device latency of 1.85 microsec (8 bytes) with CUDA 9.2 and
Volta GPUs. On OpenPOWER platforms with NVLink, it delivers up to 34.4 Gbps
unidirectional intra-node Device-to-Device bandwidth for large messages. More
performance numbers are available on the MVAPICH website (under the Performance
link).

New features, enhancements and bug fixes for OSU Micro-Benchmarks
(OMB) 5.5 are listed here.

OSU Micro-Benchmarks v5.5

* New Features & Enhancements (since 5.4.4)
    - Introduce new MPI non-blocking collective benchmarks with support for
      measuring the overlap of computation and communication for CPUs and GPUs
      (the overlap pattern is sketched after this list)
        - osu_ireduce
        - osu_iallreduce
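
The overlap pattern quantified by these benchmarks can be sketched as follows;
this is an illustrative skeleton rather than the benchmark code itself, and
do_compute is a hypothetical stand-in for application work performed while the
non-blocking collective progresses.

    /* overlap_sketch.c -- computation/communication overlap pattern
     * measured by osu_iallreduce; not the benchmark itself. */
    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical placeholder for independent application work. */
    static void do_compute(double *x, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] = x[i] * 1.000001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;
        double *in   = malloc(count * sizeof(double));
        double *out  = malloc(count * sizeof(double));
        double *work = calloc(count, sizeof(double));
        MPI_Request req;

        MPI_Init(&argc, &argv);
        for (int i = 0; i < count; i++)
            in[i] = 1.0;

        /* Post the non-blocking collective ... */
        MPI_Iallreduce(in, out, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... overlap it with independent computation ... */
        do_compute(work, count);

        /* ... then complete it.  Comparing the total time against the
         * non-overlapped case yields the overlap percentage. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(in); free(out); free(work);
        MPI_Finalize();
        return 0;
    }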

To download MVAPICH2-GDR 2.3 GA and OSU Micro-Benchmarks (OMB) 5.5, along with
the associated user guides and sample performance numbers, please visit the
following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning, and
enhancements are welcome. Please post them to the mvapich-discuss mailing list
(mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to report that the number of organizations using the
MVAPICH2 libraries (and registered at the MVAPICH site) has crossed 2,950
worldwide (in 86 countries). The number of downloads from the MVAPICH site has
crossed 505,000 (>0.5 million). The MVAPICH team would like to thank all of its
users and organizations!


