[mvapich-discuss] Announcing the Release of MVAPICH2-GDR 2.3.2 GA and OSU Micro-Benchmarks (OMB) 5.6.2

Panda, Dhabaleswar panda at cse.ohio-state.edu
Thu Aug 8 22:10:03 EDT 2019


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.2 GA and OSU Micro-Benchmarks (OMB) 5.6.2.

MVAPICH2-GDR 2.3.2 is based on the standard MVAPICH2 2.3.1 release and
incorporates designs that take advantage of the GPUDirect RDMA (GDR)
technology for inter-node data movement on NVIDIA GPU clusters with
Mellanox InfiniBand interconnects. It also provides support for DGX-2,
OpenPOWER, and NVLink2 systems, efficient intra-node CUDA-Aware unified
memory communication, and support for RDMA_CM, RoCE-V1, and RoCE-V2.
Further, MVAPICH2-GDR 2.3.2 provides optimized large-message collectives
(broadcast, reduce, and allreduce) for emerging Deep Learning and
Streaming frameworks.
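
The sketch below illustrates the CUDA-aware point-to-point usage model
that MVAPICH2-GDR supports: MPI calls are issued directly on GPU device
pointers, and the library (with GPUDirect RDMA where available) moves
the data without explicit host staging. This is a minimal illustration,
not code from the library; the buffer size and the one-GPU-per-rank
layout are assumptions.

    /* Minimal sketch: CUDA-aware MPI send/recv on device buffers.
     * Assumes two ranks, one visible GPU per rank, and an MPI library
     * built with CUDA support (e.g. MVAPICH2-GDR). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 1 << 20;         /* 1 MiB message, illustrative */
        char *d_buf;
        cudaSetDevice(0);                   /* assumes one GPU per rank */
        cudaMalloc((void **)&d_buf, nbytes);

        /* A CUDA-aware MPI accepts the device pointer directly; with
         * GPUDirect RDMA the HCA reads/writes GPU memory without a host
         * staging copy. */
        if (rank == 0)
            MPI_Send(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }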

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.2 GA are
listed here.

* Features and Enhancements (Since 2.3.1)
    - Based on MVAPICH2 2.3.1
    - Support for CUDA 10.1
    - Support for PGI 19.x
    - Enhanced intra-node and inter-node point-to-point performance
    - Enhanced MPI_Allreduce performance for DGX-2 system
    - Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
      (see the usage sketch after this list)
    - Enhanced performance of datatype support for GPU-resident data
      *  Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe
    - Enhanced GPU-based point-to-point and collective tuning on
      *  OpenPOWER systems such as Summit @ORNL and Sierra/Lassen @LLNL
      *  ABCI system @AIST, Owens and Pitzer systems @OSC
    - Scaled Allreduce to 24,576 Volta GPUs on Summit
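
For the MPI_THREAD_MULTIPLE item above, the sketch below shows what the
usage model looks like from the application side: the program requests
MPI_THREAD_MULTIPLE and several host threads issue MPI calls on their
own slices of GPU memory. It is an illustration only; the thread count,
message size, and two-rank exchange pattern are assumptions, and the
OpenMP parallel region requires compiling with OpenMP enabled.

    /* Sketch: multiple host threads communicating GPU data concurrently
     * under MPI_THREAD_MULTIPLE.  Assumes exactly two ranks and a
     * CUDA-aware MPI library such as MVAPICH2-GDR. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nthreads = 4;             /* illustrative thread count */
        const int nbytes   = 1 << 16;       /* 64 KiB per thread */

        char *d_send, *d_recv;
        cudaMalloc((void **)&d_send, (size_t)nthreads * nbytes);
        cudaMalloc((void **)&d_recv, (size_t)nthreads * nbytes);

        /* Each thread exchanges its own slice of device memory with the
         * peer rank, using the thread id as the message tag. */
        #pragma omp parallel num_threads(nthreads)
        {
            int tid  = omp_get_thread_num();
            int peer = rank ^ 1;            /* assumes exactly two ranks */
            size_t off = (size_t)tid * nbytes;

            MPI_Sendrecv(d_send + off, nbytes, MPI_CHAR, peer, tid,
                         d_recv + off, nbytes, MPI_CHAR, peer, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }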

* Bug Fixes (Since 2.3.1)
    - Fix hang issue in host-based MPI_Alltoallv
    - Fix GPU communication progress in MPI_THREAD_MULTIPLE mode
    - Fix potential failures in GDRCopy registration
    - Fix compilation warnings

Further, MVAPICH2-GDR 2.3.2 GA provides support for GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.2 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds for
8-byte messages with CUDA 10.1 and Volta GPUs. On OpenPOWER platforms
with NVLink2, it delivers up to 70.4 GB/sec unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GB/sec unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
from the MVAPICH website (under the Performance link).
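
Latency figures like the one above are typically obtained with a timed
ping-pong over GPU buffers, in the spirit of the osu_latency test from
OMB. The sketch below shows the idea; it is not the OMB source code, and
the warm-up and iteration counts are illustrative.

    /* Sketch: inter-node device-to-device latency via a timed ping-pong.
     * Assumes two ranks on different nodes and a CUDA-aware MPI library. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int size = 8;                 /* 8-byte messages, as quoted */
        const int skip = 100, iters = 10000;

        char *d_buf;
        cudaMalloc((void **)&d_buf, size);

        double start = 0.0;
        for (int i = 0; i < skip + iters; i++) {
            if (i == skip)                  /* discard warm-up iterations */
                start = MPI_Wtime();

            if (rank == 0) {
                MPI_Send(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            /* Half the average round-trip time gives the one-way latency. */
            double usec = (MPI_Wtime() - start) * 1e6 / (2.0 * iters);
            printf("one-way device-to-device latency: %.2f us\n", usec);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }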

New features, enhancements and bug fixes for OSU Micro-Benchmarks
(OMB) 5.6.2 are listed here.

* New Features & Enhancements (since v5.6.1)
    - Add support for benchmarking GPU-Aware multi-threaded point-to-point
      operations
        * osu_latency_mt

* Bug Fixes (since v5.6.1)
    - Fix issue with freeing in osu_get_bw benchmark
    - Fix issues with out-of-tree builds
        - Thanks to Joseph Schuchart at HLRS for reporting the issue
    - Fix incorrect header in osu_mbw_mr benchmark
    - Fix memory alignment for non-heap allocations in OpenSHMEM message rate
      benchmarks
        - Thanks to Yossi at Mellanox for pointing out the issue

To download MVAPICH2-GDR 2.3.2 GA, OMB 5.6.2, and the associated user
guides and quick start guide, or to access the SVN repository, please
visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to report that the number of organizations using
the MVAPICH2 libraries (and registered at the MVAPICH site) has crossed
3,025 worldwide (in 89 countries). The number of downloads from the
MVAPICH site has crossed 558,000 (0.55 million). The MVAPICH team
would like to thank all of its users and organizations!

