[mvapich-discuss] Announcing the Release of MVAPICH2 2.3.5 GA

Panda, Dhabaleswar panda at cse.ohio-state.edu
Mon Nov 30 20:42:09 EST 2020


The MVAPICH team is pleased to announce the release of MVAPICH2 2.3.5 GA.

The new features, enhancements, and bug fixes in MVAPICH2 2.3.5 GA are as follows:

* Features and Enhancements (since 2.3.4):
    - Enhanced performance for MPI_Allreduce and MPI_Barrier (a minimal
      micro-test exercising these paths is sketched after this list)
    - Support collective offload using Mellanox's SHARP for Barrier
        - Enhanced tuning framework for Barrier using SHARP
    - Remove the hard dependency on the underlying libibverbs, libibmad,
      libibumad, and librdmacm libraries by loading them via dlopen
    - Add support for Broadcom NetXtreme RoCE HCA
        - Enhanced inter-node point-to-point support
    - Support architecture detection for Fujitsu A64FX processor
    - Enhanced point-to-point and collective tuning for Fujitsu A64FX processor
    - Enhanced point-to-point and collective tuning for AMD Rome processor
    - Add support for process placement aware HCA selection
        - Add "MV2_PROCESS_PLACEMENT_AWARE_HCA_MAPPING" environment variable to
          enable process placement aware HCA mapping
    - Add support to select HWLOC v1 and HWLOC v2 at configure time
        - Select using the configure-time flag --with-hwloc=<version>
        - Accepted values are v1 (default) and v2
    - Add support to auto-detect RoCE HCAs and auto-detect GID index
    - Add support to use RoCE/Ethernet and InfiniBand HCAs at the same time
    - Add architecture-specific flags to improve performance of certain CUDA
      operations
        - Thanks to Chris Chambreau @LLNL for the report
    - Read MTU and maximum outstanding RDMA operations from the device
    - Improved performance and scalability for UD-based communication
    - Increase the maximum number of HCAs supported by default from 4 to 10
    - Enhanced collective tuning for Frontera at TACC, Expanse at SDSC,
      Ookami at Stony Brook, and bb5 at EPFL
    - Enhanced support for SHARP v2.1.0
    - Generalize code for GPU support
    - Update hwloc v2 code to v2.3.0
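
To exercise the enhanced MPI_Allreduce/MPI_Barrier paths on a small scale,
a minimal micro-test along the following lines can be compiled with mpicc
and launched with the SHARP and HCA-mapping knobs set. The launch line is
illustrative only: MV2_PROCESS_PLACEMENT_AWARE_HCA_MAPPING comes from the
list above, while the MV2_ENABLE_SHARP setting is an assumption; please
consult the MVAPICH2 2.3.5 user guide for the exact runtime parameters on
your system.

    /*
     * Minimal micro-test for MPI_Barrier and MPI_Allreduce.
     * Illustrative launch (environment variable names other than
     * MV2_PROCESS_PLACEMENT_AWARE_HCA_MAPPING are assumptions):
     *
     *   mpirun_rsh -np 4 -hostfile hosts MV2_ENABLE_SHARP=1 \
     *       MV2_PROCESS_PLACEMENT_AWARE_HCA_MAPPING=1 ./collective_test
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Barrier path (offloaded via SHARP when enabled and supported) */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Small Allreduce: sum the rank numbers across all processes */
        local = (double) rank;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Allreduce sum over %d ranks: %.0f\n", size, global);

        MPI_Finalize();
        return 0;
    }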

* Bug Fixes (since 2.3.4):
    - Fix issue with mpiexec+PBS when calling MPI_Abort
        - Thanks to Matthew W. Anderson @INL for the report and initial patch
    - Fix validation failure with multi-threaded applications when the
      InfiniBand registration cache is enabled
        - Thanks to Alexander Melnikov for the report and initial patch
    - Fix issue with realloc when the InfiniBand registration cache is
      enabled (see the sketch of the affected usage pattern after this list)
        - Thanks to Si Lu @TACC and Viet-Duc Le @KISTI for reporting the issue
    - Fix out-of-tree builds for ROMIO
        - Thanks to Per Berg @Defense Center for Operative Oceanography, Denmark
          for the report
    - Fix integer overflow errors in the collective code path
        - Thanks to Kiran Ravikumar @GaTech for the report and reproducer
    - Fix issue with Hybrid+Spread mapping on hyper-threaded systems
    - Fix out-of-memory issue when allocating CUDA events
    - Fix issue with large message UD transfers where packets were incorrectly
      marked as dropped/missing
    - Fix spelling mistakes
        - Thanks to Jens Schleusener @fossies.org for the report
    - Revert changes which caused dependencies on lex/yacc at configure time
        - Thanks to Daniel Pou @HPE for the report
    - Fix issues with UD Zcopy data transfers
    - Fix issues with handling datatypes in the collective code
    - Revert moving -lmpi, -lmpicxx, and -lmpifort before other LDFLAGS in
      compiler wrappers like mpicc, mpicxx, mpif77, and mpif90
        - This was causing issues with certain legacy applications
        - Thanks to Nicolas Morey-Chaisemartin @SUSE for the report
    - Fix compilation warnings and memory leaks
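
For context on the two registration-cache fixes above, the usage pattern
they target looks roughly like the sketch below: a buffer grown with
realloc between point-to-point transfers, in a program initialized with
MPI_THREAD_MULTIPLE. This is an illustrative shape only, not the
reporters' original code.

    /*
     * Sketch of the pattern targeted by the registration-cache fixes:
     * realloc relocates the buffer between transfers, so the cache must
     * not reuse the stale pinned region. Run with at least two processes.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size, provided;
        size_t n = 1 << 20;                /* start with a 1 MiB buffer */
        char *buf = malloc(n);

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "Run this sketch with at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        for (int iter = 0; iter < 4; iter++) {
            memset(buf, iter, n);
            if (rank == 0)
                MPI_Send(buf, (int) n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, (int) n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            /* Grow the buffer; the old registration entry for buf must be
             * invalidated once realloc moves the allocation. */
            n *= 2;
            char *tmp = realloc(buf, n);
            if (tmp == NULL)
                break;
            buf = tmp;
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }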

MVAPICH2 2.3.5 delivers impressive performance and scalability. Some highlights
from recent runs on the TACC Frontera system include:
    -  Complete job startup in only 31 seconds for 229,376 processes on 4,096
       nodes with 56 processes per node.
    -  Improvement in the latency of MPI_Bcast by up to a factor of two
       at 2,048 nodes while using InfiniBand hardware-based multicast support.
    -  Accelerated performance of MPI_Allreduce, MPI_Reduce, and MPI_Barrier
       at 7,861 nodes (full system scale) by factors of 5.1, 5.2, and 7.1,
       respectively, using SHARP.

To download MVAPICH2 2.3.5 GA with the associated user guides and quick
start guide, and to access the SVN repository, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to report that the number of organizations using
MVAPICH2 libraries (and registered at the MVAPICH site) has crossed
3,100 worldwide (in 89 countries). The number of downloads from the
MVAPICH site has crossed 1,150,000 (1.15 million). The MVAPICH team
would like to thank all of its users and organizations!


