[mvapich] Announcing the release of MVAPICH2-GDR 2.3rc1

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Sep 21 23:34:06 EDT 2018


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3rc1.

MVAPICH2-GDR 2.3rc1 is based on the standard MVAPICH2 2.3 release and
incorporates designs that take advantage of GPUDirect RDMA (GDR)
technology for inter-node data movement on NVIDIA GPU clusters with
Mellanox InfiniBand interconnects. It also provides support for
OpenPOWER and NVLink, efficient intra-node CUDA-aware unified memory
communication, and support for RDMA_CM, RoCE-V1, and RoCE-V2. Further,
MVAPICH2-GDR 2.3rc1 provides optimized large-message collectives
(broadcast, reduce, and allreduce) for emerging Deep Learning and
streaming frameworks.
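
To a user, "CUDA-aware" means that device pointers can be passed
directly to MPI calls. The following minimal sketch (illustrative
only; the buffer name and message size are assumptions, not code from
the release) shows inter-node point-to-point communication on GPU
buffers, which MVAPICH2-GDR services with GPUDirect RDMA or pipelined
staging internally:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, n = 1 << 20;          /* 1M floats (4 MB) */
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        /* The device pointer goes straight into MPI; no explicit
         * cudaMemcpy staging to host memory is needed. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }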

The features, enhancements, and bug fixes in MVAPICH2-GDR 2.3rc1 are
listed below.

* Features and Enhancements (since MVAPICH2-GDR 2.3a)

    - Based on MVAPICH2 2.3
    - Support for CUDA 9.2
    - Support for OpenPOWER9 with NVLink
    - Support for IBM XLC and PGI compilers with CUDA kernel features
    - Enhanced point-to-point performance for small messages
    - Enhanced Alltoallv operation for host buffers
    - Enhanced CUDA-based collective tuning on OpenPOWER8/9 systems
    - Enhanced large-message Reduce, Broadcast, and Allreduce for Deep Learning
      workloads (see the Allreduce sketch after this list)
    - Added new runtime variables 'MV2_USE_GPUDIRECT_RDMA' and 'MV2_USE_GDR' to
      replace 'MV2_USE_GPUDIRECT'
    - Support for collective offload of Allreduce on host buffers using
      Mellanox's SHARP
        - Enhanced tuning framework for Allreduce using SHARP
    - Enhanced host-based collectives for IBM POWER8/9, Intel
      Skylake, Intel KNL, and Intel Broadwell architectures
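
To make the Deep Learning oriented collectives above concrete, here
is a minimal sketch (the buffer name, size, and the gradient-averaging
use case are illustrative assumptions) of a large-message Allreduce on
GPU buffers, the pattern used for gradient aggregation in DL
frameworks:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int n = 16 * 1024 * 1024;       /* 64 MB of float "gradients" */
        float *d_grad;

        MPI_Init(&argc, &argv);
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        /* ... a training step would fill d_grad here (omitted) ... */

        /* Sum the gradients from all ranks, in place on the GPU; for
         * messages this large MVAPICH2-GDR applies its CUDA-based
         * large-message Allreduce designs. */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
        /* ... a scaling kernel dividing by the communicator size
         * would follow to turn the sum into an average (omitted) ... */

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }

At run time, the GPUDirect path can be toggled with the new
'MV2_USE_GPUDIRECT_RDMA' and 'MV2_USE_GDR' variables listed above.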

* Bug Fixes (since MVAPICH2-GDR 2.3a)
    - Fix issues with InfiniBand Multicast (IB-MCAST) based designs
      for GPU-based Broadcast
    - Fix hang issue with the zero-copy Broadcast operation
    - Fix issue with datatype processing for host buffers
    - Fix application crash with GDRCopy feature
    - Fix memory leak in CUDA-based Allreduce algorithms
    - Fix data validation issue for Allreduce algorithms
    - Fix data validation issue for non-blocking Gather operation

Further, MVAPICH2-GDR 2.3rc1 supports GPU clusters running regular
OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3rc1 continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.88 microseconds (8
bytes) with CUDA 9.2 and Volta GPUs. On OpenPOWER platforms with
NVLink, it delivers up to 34.4 GB/s unidirectional intra-node
Device-to-Device bandwidth for large messages. More performance
numbers are available on the MVAPICH website (under the Performance
link).
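
Numbers of this kind are typically obtained with a ping-pong test
such as osu_latency from the OSU micro-benchmarks. A bare-bones
version (the iteration count is an arbitrary choice for illustration)
looks like this:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        char *d_buf;                    /* 8-byte device buffer */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_buf, 8);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {            /* send, then await the echo */
                MPI_Send(d_buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(d_buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* echo the message back */
                MPI_Recv(d_buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(d_buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)                  /* one-way = half round trip */
            printf("latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }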

To download MVAPICH2-GDR 2.3rc1, the associated user guide, and
sample performance numbers, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
and suggestions for enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to share that the number of organizations using
the MVAPICH2 libraries (and registered at the MVAPICH site) has
crossed 2,950 worldwide (in 86 countries). The number of downloads
from the MVAPICH site has crossed 493,000 (0.49 million). The MVAPICH
team would like to thank all of its users and organizations!

