[mvapich-discuss] Announcing the Release of MVAPICH2-GDR 2.3.4

Panda, Dhabaleswar panda at cse.ohio-state.edu
Thu Jun 4 22:32:15 EDT 2020


The MVAPICH team is pleased to announce the release of MVAPICH2-GDR
2.3.4 GA.

MVAPICH2-GDR 2.3.4 is based on the standard MVAPICH2 2.3.4 release and
incorporates designs that take advantage of the GPUDirect RDMA (GDR)
technology for inter-node data movement on NVIDIA GPU clusters with
Mellanox InfiniBand interconnect. It also provides support for DGX-2,
OpenPOWER with NVLink2, GDRCopy v2, efficient intra-node CUDA-aware
unified memory communication, RDMA_CM, RoCE-V1, and RoCE-V2. Further,
MVAPICH2-GDR 2.3.4 provides optimized large-message collectives
(broadcast, reduce, and allreduce) for emerging Deep Learning and
Streaming frameworks.
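
As a quick illustration of what CUDA-awareness means in practice, the
minimal sketch below (our illustration, not code from the release)
passes a cudaMalloc'd device pointer directly to MPI_Allreduce, the
pattern Deep Learning frameworks use for gradient averaging; no
explicit staging through host memory is required:

    /* Minimal sketch of a CUDA-aware collective: MPI_Allreduce is
     * called directly on a device buffer, as DL frameworks do (e.g.
     * via Horovod) for gradient averaging. Error checks omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 22;          /* 4M floats: a large message */
        float *d_grad;

        MPI_Init(&argc, &argv);
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        /* The device pointer goes straight into the collective; no
         * cudaMemcpy to a host buffer is needed. */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }

Compile with the mpicc shipped in the GDR install and link against
the CUDA runtime (e.g., -lcudart).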

Features, Enhancements, and Bug Fixes for MVAPICH2-GDR 2.3.4 GA are
listed below.

* Features and Enhancements (Since 2.3.3)

    - Based on MVAPICH2 2.3.4
    - Enhanced MPI_Allreduce performance on DGX-2 systems
    - Enhanced MPI_Allreduce performance on POWER9 systems
    - Reduced the CUDA interception overhead for non-CUDA symbols
    - Enhanced performance for point-to-point and collective operations on
      Frontera's RTX nodes
    - Added a new runtime variable 'MV2_SUPPORT_DL' to replace
      'MV2_SUPPORT_TENSOR_FLOW'
    - Added compilation and runtime methods for checking CUDA support
      (see the sketch after this list)
    - Enhanced GDR output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL Frameworks (TensorFlow, PyTorch, and
      MXNet)
    - Tested with PyTorch Distributed
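
The exact interface for the CUDA-support checks is described in the
user guide; a minimal sketch, assuming the Open MPI-style extension
names (mpi-ext.h, MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support)
rather than a confirmed MVAPICH2-GDR spelling, would look like:

    /* Sketch of compile-time and runtime CUDA-support checks.
     * ASSUMPTION: the Open MPI-style names below are available;
     * consult the MVAPICH2-GDR 2.3.4 user guide for the exact API. */
    #include <stdio.h>
    #include <mpi.h>
    #if defined(__has_include)
    #  if __has_include(<mpi-ext.h>)
    #    include <mpi-ext.h>        /* where such extensions live */
    #  endif
    #endif

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

    /* Compile-time check: was the library built with CUDA support? */
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        printf("compile time: CUDA-aware\n");
        /* Runtime check: is CUDA support enabled in this run? */
        printf("run time: %s\n",
               MPIX_Query_cuda_support() ? "enabled" : "disabled");
    #else
        printf("compile time: CUDA awareness not advertised\n");
    #endif

        MPI_Finalize();
        return 0;
    }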

* Bug Fixes (Since 2.3.3)

    - Fix hang caused by the use of multiple communicators
    - Fix detection of Intel CPU Model name
    - Fix intermediate buffer size for Allreduce when DL workload is expected
    - Fix the random hangs in IMB4-RMA tests
    - Fix hang in OMP offloading
    - Fix hang with -w dynamic option when using one-sided benchmarks for
      device buffers
    - Add proper fallback and warning message when shared RMA window cannot be
      created
    - Fix potential FP exception error in MPI_Allreduce
      - Thanks to Shinichiro Takizawa at AIST for the report
    - Fix data validation issue of MPI_Allreduce
      - Thanks to Andreas Herten at JSC for the report
    - Fix the need for preloading libmpi.so
      - Thanks to Andreas Herten at JSC for the feedback
    - Fix compilation warnings and memory leaks

Further, MVAPICH2-GDR 2.3.4 GA provides support for GPU clusters using
regular OFED (without GPUDirect RDMA).

MVAPICH2-GDR 2.3.4 GA continues to deliver excellent performance. It
provides inter-node Device-to-Device latency of 1.85 microseconds (8
bytes) with CUDA 10.1 and Volta GPUs. On OpenPOWER platforms with
NVLink2, it delivers up to 70.4 GBps unidirectional intra-node
Device-to-Device bandwidth for large messages. On DGX-2 platforms, it
delivers up to 144.79 GBps unidirectional intra-node Device-to-Device
bandwidth for large messages. More performance numbers are available
from the MVAPICH website (under the Performance link).
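
For reference, point-to-point numbers like the 1.85 microsecond
figure above are typically gathered with the OSU Micro-Benchmarks
(e.g., osu_latency run on device buffers), also available from the
MVAPICH website. A stripped-down sketch of such a ping-pong timing
loop over device buffers (our illustration, not the actual benchmark
source) looks like:

    /* Sketch of a device-to-device ping-pong latency loop in the
     * spirit of osu_latency; run with exactly two ranks. */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000, skip = 100, nbytes = 8;
        int rank, i;
        char *d_buf;
        double t0 = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_buf, nbytes);

        for (i = 0; i < iters + skip; i++) {
            if (i == skip)              /* discard warm-up iterations */
                t0 = MPI_Wtime();
            if (rank == 0) {
                MPI_Send(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)                  /* one-way latency estimate */
            printf("latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }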

To download MVAPICH2-GDR 2.3.4 GA and the associated user guide and
quick start guide, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to report that the number of organizations using
MVAPICH2 libraries (and registered at the MVAPICH site) has crossed
3,075 worldwide (in 89 countries). The number of downloads from the
MVAPICH site has crossed 760,000 (0.76 million). The MVAPICH team
would like to thank all its users and organizations!


