[mvapich-discuss] Announcing the release of MVAPICH2-GDR 2.3.1 GA and OSU Micro-Benchmarks (OMB) 5.6.1

Panda, Dhabaleswar panda at cse.ohio-state.edu
Sun Mar 17 00:16:54 EDT 2019


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.1 GA and OSU Micro-Benchmarks (OMB) 5.6.1.

MVAPICH2-GDR 2.3.1 is based on the standard MVAPICH2 2.3.1 release and
incorporates designs that take advantage of GPUDirect RDMA (GDR)
technology for inter-node data movement on NVIDIA GPU clusters with
Mellanox InfiniBand interconnects. It also provides support for DGX-2,
OpenPOWER, and NVLink2; efficient intra-node CUDA-Aware unified memory
communication; and support for RDMA_CM, RoCE-V1, and RoCE-V2. Further,
MVAPICH2-GDR 2.3.1 provides optimized large-message collectives
(broadcast, reduce, and allreduce) for emerging Deep Learning and
Streaming frameworks.
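
To make concrete what CUDA-Aware communication looks like from the
application side, here is a minimal point-to-point sketch (ours, not
from the release) that passes GPU device pointers directly to MPI;
the library moves the data over GPUDirect RDMA where available. The
buffer size and tag are arbitrary, and error checking is omitted for
brevity.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;      /* 1 Mi floats; arbitrary size */
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Buffer lives in GPU memory; no host staging copy needed. */
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        /* A CUDA-Aware MPI accepts the device pointer directly. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Built with the mpicc wrapper shipped with MVAPICH2-GDR and run with
two ranks, the transfer above goes GPU-to-GPU without an explicit
cudaMemcpy to the host.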

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.1 GA are
listed here.

* Features and Enhancements (Since 2.3)
    - Based on MVAPICH2 2.3.1
    - Enhanced intra-node and inter-node point-to-point performance for DGX-2
      and IBM POWER8/POWER9 systems
    - Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
    - Enhanced small message performance for CUDA-Aware MPI_Put and MPI_Get
    - Support for PGI 18.10
    - Add new runtime variables (see the usage sketch after this list)
      - 'MV2_GDRCOPY_LIMIT' to replace 'MV2_USE_GPUDIRECT_GDRCOPY_LIMIT'
      - 'MV2_GDRCOPY_NAIVE_LIMIT' to replace 'MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT'
      - 'MV2_USE_GDRCOPY' to replace 'MV2_USE_GPUDIRECT_GDRCOPY'
    - Flexible support for running TensorFlow (Horovod) jobs
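
As a launch-time usage sketch for the renamed variables above, one
might enable GDRCOPY and adjust its message-size threshold as shown
below. The threshold value is illustrative, './app' stands for any
CUDA-Aware MPI application, and the launcher is assumed to forward
the environment to all ranks.

    $ export MV2_USE_GDRCOPY=1       # formerly MV2_USE_GPUDIRECT_GDRCOPY
    $ export MV2_GDRCOPY_LIMIT=8192  # formerly MV2_USE_GPUDIRECT_GDRCOPY_LIMIT
    $ mpirun -np 2 ./app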

* Bug Fixes (Since 2.3)
    - Fix data validation issue in CUDA-Aware MPI_Reduce (a validation
      sketch follows this list)
    - Fix hang in CUDA-Aware MPI_Get_accumulate
    - Fix compilation errors with clang
    - Fix compilation warnings
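
For users who want to sanity-check the MPI_Reduce fix on their own
systems, the following validation sketch (ours, not the team's test
case) reduces device-resident buffers and verifies the result on the
host; the element count is arbitrary and error checking is omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        const int n = 1024;             /* arbitrary element count */
        float *h_buf, *d_send, *d_recv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        h_buf = (float *)malloc(n * sizeof(float));
        for (i = 0; i < n; i++)
            h_buf[i] = 1.0f;

        cudaMalloc((void **)&d_send, n * sizeof(float));
        cudaMalloc((void **)&d_recv, n * sizeof(float));
        cudaMemcpy(d_send, h_buf, n * sizeof(float),
                   cudaMemcpyHostToDevice);

        /* CUDA-Aware reduce: both buffers live in GPU memory. */
        MPI_Reduce(d_send, d_recv, n, MPI_FLOAT, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0) {
            /* Every element should equal the number of ranks. */
            cudaMemcpy(h_buf, d_recv, n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            for (i = 0; i < n; i++)
                if (h_buf[i] != (float)size)
                    fprintf(stderr, "mismatch at %d: %f\n",
                            i, h_buf[i]);
        }

        free(h_buf);
        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }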

Further, MVAPICH2-GDR 2.3.1 GA provides support for GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.1 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds (8
bytes) with CUDA 10 and Volta GPUs. On OpenPOWER platforms with
NVLink2, it delivers up to 70.34 GB/s unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GB/s unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
on the MVAPICH website (under the Performance link).
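
The device-to-device numbers above can be reproduced with the OSU
Micro-Benchmarks; the 'D D' arguments select GPU device buffers on
both sides. The commands below are indicative only: the latency test
needs the two ranks placed on different nodes, and benchmark paths
depend on the install layout.

    $ mpirun -np 2 ./osu_latency D D   # inter-node D2D latency
    $ mpirun -np 2 ./osu_bw D D        # unidirectional D2D bandwidth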

Bug fixes for OSU Micro-Benchmarks (OMB) 5.6.1 are listed here.

* Bug Fixes (since v5.6)
    - Fix issue with latency computation in osu_latency_mt benchmark.
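
As a usage note, osu_latency_mt measures point-to-point latency with
multiple receiver threads; a basic two-rank invocation (default
thread configuration) looks like:

    $ mpirun -np 2 ./osu_latency_mt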

To download MVAPICH2-GDR 2.3.1 GA, OMB 5.6.1, and the associated user
guides and quick start guide, or to access the SVN repository, please
visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to report that the number of organizations using
MVAPICH2 libraries (and registered at the MVAPICH site) has crossed
2,975 worldwide (in 88 countries). The number of downloads from the
MVAPICH site has crossed 528,000 (0.52 million). The MVAPICH team
would like to thank all its users and organizations!

