[mvapich-discuss] Announcing the Release of MVAPICH2 1.6

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Mar 9 23:48:13 EST 2011


The MVAPICH team is pleased to announce the release of MVAPICH2 1.6
with the following NEW features/enhancements and bug fixes:

* NEW Features and Enhancements (since MVAPICH2-1.5.1)

    - Optimizations and enhanced performance for clusters with NVIDIA
      GPU adapters (with and without GPUDirect technology)
    - Support for InfiniBand Quality of Service (QoS) with multiple lanes
    - Support for 3D torus topology with appropriate SL settings
        - For both CH3 and Nemesis interfaces
        - Thanks to Jim Schutt, Marcus Epperson and John Nagle from
          Sandia for the initial patch
    - Enhanced R3 rendezvous protocol
        - For both CH3 and Nemesis interfaces
    - Robust RDMA Fast Path setup to avoid memory allocation
      failures
        - For both CH3 and Nemesis interfaces
    - Multiple design enhancements for better performance of
      small and medium-sized messages
    - Using LiMIC2 for efficient intra-node RMA transfer to avoid extra
      memory copies
    - Upgraded to LiMIC2 version 0.5.4
    - Support for the Shared-Memory-Nemesis interface on multi-core
      platforms requiring intra-node communication only (SMP-only
      systems, laptops, etc.)
    - Enhancements to mpirun_rsh job start-up scheme on large-scale systems
    - Optimization in MPI_Finalize
    - XRC support with Hydra Process Manager
    - Updated Hydra launcher with MPICH2-1.3.3 Hydra process manager
    - Hydra is the default mpiexec process manager
    - Enhancements and optimizations for one-sided Put and Get operations
      (see the sketch after this list)
    - Removal of the limitation on the number of concurrent windows in
      RMA operations
    - Optimized thresholds for one-sided RMA operations
    - Support for process-to-rail binding policy (bunch, scatter and
      user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3,
      and OFA-RoCE-CH3 interfaces)
    - Enhancements to the multi-rail design and features, including
      striping of one-sided messages
    - Dynamic detection of multiple InfiniBand adapters and use of these
      by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and
      OFA-RoCE-CH3 interfaces)
    - Optimized and tuned algorithms for Gather, Scatter, Reduce,
      AllReduce and AllGather collective operations
    - Enhanced support for multi-threaded applications
    - Fast Checkpoint-Restart support with aggregation scheme
    - Job Pause-Migration-Restart Framework for proactive fault tolerance
    - Support for new standardized Fault Tolerant Backplane (FTB) Events
      for Checkpoint-Restart and Job Pause-Migration-Restart Framework
    - Enhanced designs for automatic detection of various
      architectures and adapters
    - Configuration file support (similar to the one available in MVAPICH),
      providing a convenient method for handling all runtime variables in
      one place
    - User-friendly configuration options to enable/disable various
      checkpoint/restart and migration features
    - Enabled ROMIO's auto-detection scheme for filetypes
      on the Lustre file system
    - Improved error checking for system and BLCR calls in the
      checkpoint-restart and migration code paths
    - Enhanced OSU Micro-benchmarks suite (version 3.3)
    - Building and installation of the OSU micro-benchmarks during the
      default MVAPICH2 installation
    - Improved configure help for MVAPICH2 features
    - Improved usability of process-to-CPU mapping with support for
      delimiters (',' and '-') in the CPU listing
        - Thanks to Gilles Civario for the initial patch
    - Use of gfortran as the default F77 compiler
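
As a quick illustration of the one-sided Put/Get operations mentioned
above, the following minimal sketch uses only standard MPI-2 calls
(MPI_Win_create, MPI_Put, MPI_Win_fence); it is generic MPI code for
reference, not an MVAPICH2-specific API:

    /* Minimal one-sided Put sketch (standard MPI-2; run with 2 processes). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local = 0, value = 42;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one integer per process as an RMA window. */
        MPI_Win_create(&local, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)
            /* Write 'value' into rank 1's window; rank 1 posts no receive. */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 received %d via MPI_Put\n", local);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }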

* Bug fixes (since MVAPICH2-1.5.1)

    - Fix for shmat() return code check
    - Fix for issues in one-sided RMA
    - Fix for issues with inter-communicator collectives in Nemesis
    - KNEM patch for osu_bibw issue with KNEM version 0.9.2
    - Fix for osu_bibw error with Shared-memory-Nemesis interface
    - Fix for a hang in collectives when the thread level is set to multiple
    - Fix for Intel test errors with rsend, bsend and ssend
      operations in Nemesis
    - Fix for a memory free issue with memory allocated by scandir
    - Fix for a hang in Finalize
    - Fix for issue with MPIU_Find_local_and_external when it is called
      from MPIDI_CH3I_comm_create
    - Fix for handling CPPFLAGS values with spaces
    - Fix for Dynamic Process Management to work with XRC support
    - Fix related to disabling CPU affinity when shared memory is
      turned off at run time
    - Resolving a hang in mpirun_rsh termination when CR is enabled
    - Fix for an issue in MPI_Allreduce and MPI_Reduce when called with
      MPI_IN_PLACE (see the example after this list)
        - Thanks to Alexander Alekhin for the initial patch
    - Fix for threading-related errors with comm_dup
    - Fix for alignment issues in RDMA Fast Path
    - Fix for extra memcpy in header caching
    - Only set FC and F77 if gfortran is executable
    - Fix in aggregate ADIO alignment
    - XRC connection management
    - Fixes in registration cache
    - Fixes for multiple memory leaks
    - Fix for issues in mpirun_rsh
    - Checks before enabling aggregation and migration
    - Fix for build errors with --disable-cxx
        - Thanks to Bright Yang for reporting this issue
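
For reference, the MPI_IN_PLACE usage covered by the MPI_Allreduce and
MPI_Reduce fix above corresponds to the following minimal sketch
(standard MPI, nothing MVAPICH2-specific), where each rank's receive
buffer also serves as its send buffer:

    /* MPI_IN_PLACE reduction sketch (standard MPI). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank contributes its rank number; the result overwrites
           the same buffer because MPI_IN_PLACE is passed as sendbuf. */
        sum = rank;
        MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", nprocs - 1, sum);

        MPI_Finalize();
        return 0;
    }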

MVAPICH2 1.6 is being made available with OFED 1.5.3. It continues to
deliver excellent performance. Sample performance numbers include:

  OpenFabrics/Gen2 on Westmere quad-core (2.53 GHz) with PCIe-Gen2
      and ConnectX2-QDR (Two-sided Operations):
        - 1.63 microsec one-way latency (4 bytes)
        - 3394 MB/sec unidirectional bandwidth
        - 6540 MB/sec bidirectional bandwidth

  QLogic InfiniPath Support on Westmere quad-core (2.53 GHz) with
      PCIe-Gen2 and QLogic-QDR (Two-sided Operations):
        - 2.00 microsec one-way latency (4 bytes)
        - 3139 MB/sec unidirectional bandwidth
        - 4255 MB/sec bidirectional bandwidth

  OpenFabrics/Gen2-RoCE (RDMA over Converged Ethernet) Support on
      Xeon quad-core (2.4 GHz) with ConnectX-EN
      (Two-sided operations):
        - 2.92 microsec one-way latency (4 bytes)
        - 1143 MB/sec unidirectional bandwidth
        - 2253 MB/sec bidirectional bandwidth

  Intra-node performance on Westmere quad-core (2.53 GHz)
      (Two-sided operations, intra-socket)
        - 0.33 microsec one-way latency (4 bytes)
        - 10135 MB/sec unidirectional bandwidth with LiMIC2
        - 16651 MB/sec bidirectional bandwidth with LiMIC2

Performance numbers for several other platforms and system configurations
can be viewed by visiting the `Performance' section of the project's web page.
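
For readers who want to reproduce such measurements, two-sided latency
numbers of this kind are typically obtained with a ping-pong pattern
along the lines of the simplified sketch below; the osu_latency test in
the bundled OSU micro-benchmarks is the authoritative implementation:

    /* Simplified ping-pong latency sketch (run with 2 processes). */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000
    #define SIZE  4                /* 4-byte messages, as reported above */

    int main(int argc, char **argv)
    {
        char buf[SIZE] = {0};
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half the round-trip time */
            printf("avg one-way latency: %.2f usec\n",
                   (t1 - t0) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }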

To download MVAPICH2 1.6 and the associated user guide, or to access
the SVN repository, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

We are also happy to report that the number of organizations using
MVAPICH/MVAPICH2 (and registered at the MVAPICH site) has crossed
1,400 worldwide (in 60 countries). The MVAPICH team extends its thanks
to all of these organizations.

Thanks,

The MVAPICH Team




