[mvapich-discuss] Announcing the Release of MVAPICH2 1.9 GA, MVAPICH2-X 1.9 GA and OSU Micro-Benchmarks (OMB) 4.0.1

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon May 6 23:54:06 EDT 2013


The MVAPICH team is pleased to announce the release of MVAPICH2 1.9 GA, 
MVAPICH2-X 1.9 GA (Hybrid MPI+PGAS with UPC and OpenSHMEM support through 
Unified Communication Runtime) and OSU Micro-Benchmarks (OMB) 4.0.1.

Features, Enhancements, and Bug Fixes for MVAPICH2 1.9 are listed
below.

* New Features and Enhancements (since MVAPICH2 1.8.1). (**) indicates
   enhancement since 1.9RC1:
     - Based on MPICH-3.0.3
          - Support for all MPI-3 features; a brief sketch follows this list
            (Available for all interfaces: OFA-IB-CH3, OFA-iWARP-CH3,
            OFA-RoCE-CH3, uDAPL-CH3, OFA-IB-Nemesis and PSM-CH3)
     - Support for Mellanox Connect-IB HCA
     - Adaptive number of registration cache entries based on job size
     - Support for single-copy intra-node communication using the
       Linux-supported CMA (Cross Memory Attach) mechanism
         - Provides flexibility for intra-node communication: shared memory,
           LiMIC2, and CMA
     - New version of LiMIC2 (v0.5.6)
         - Provides support for unlocked ioctl calls
     - Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
         - Using SCR version 1.1.8
         - Support for application-level checkpointing
         - Support for hierarchical system-level checkpointing
         - Install utility scripts included with SCR
     - Scalable UD-multicast-based designs for collectives
       (Bcast, Allreduce and Scatter)
     - LiMIC-based design for Gather collective
     - Improved performance for shared-memory-aware collectives
       (Reduce and Bcast)
     - (**) Tuned Bcast, Alltoall, AllReduce, Allgather, Reduce, Scatter,
                 Reduce_Scatter, Allgatherv collectives
     - Tuned MPI performance on Kepler GPUs
     - Improved intra-node communication performance with GPU buffers
       using pipelined design
     - Improved inter-node communication performance with GPU buffers
       with non-blocking CUDA copies
     - Improved small message communication performance with
       GPU buffers using CUDA IPC design
     - Efficient vector, hindexed datatype processing on GPU buffers
     - Improved automatic GPU device selection and CUDA context management
     - Optimal communication channel selection for different
       GPU communication modes (device-device, device-host and
       host-device) in different configurations (intra-IOH and
       inter-IOH, i.e., within and across I/O hubs)
     - Provided option to use CUDA library call instead of CUDA driver to
       check buffer pointer type
         - Thanks to Christian Robert from Sandia for the suggestion
     - Revamped Build system:
         - Uses automake instead of simplemake
         - Renamed "maint/updatefiles" to "autogen.sh"
         - Allows for parallel builds ("make -j8" and similar)
     - Improved job startup time
         - A new runtime variable, MV2_HOMOGENEOUS_CLUSTER, for optimized
           startup on homogeneous clusters
     - Introduced option to export environment variables automatically with
       mpirun_rsh
     - Support for automatic detection of path to utilities used by
       mpirun_rsh during configuration
       - Utilities supported: rsh, ssh, xterm, TotalView
     - Support for launching jobs on heterogeneous networks with mpirun_rsh
     - Removed libibumad dependency for building the library
     - Tuned thresholds for various architectures
     - Set DAPL-2.0 as the default version for the uDAPL interface
     - (**) Updated to hwloc v1.7
     - Option to use IP address as a fallback if hostname
       cannot be resolved
     - Introduced MV2_RDMA_CM_CONF_FILE_PATH parameter which specifies
       path to mv2.conf
     - Improved debug messages and error reporting
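
   As an illustration of the MPI-3 support noted above, here is a minimal
   sketch of a program using MPI_Iallreduce, one of the non-blocking
   collectives introduced in MPI-3 (standard MPI code, nothing
   MVAPICH2-specific; the mpicc wrapper and the launch line below are the
   usual ones and should be checked against the user guide):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, in, out;
            MPI_Request req;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            in = rank;
            /* MPI-3 non-blocking allreduce: returns immediately */
            MPI_Iallreduce(&in, &out, 1, MPI_INT, MPI_SUM,
                           MPI_COMM_WORLD, &req);

            /* ... independent computation can be overlapped here ... */

            MPI_Wait(&req, MPI_STATUS_IGNORE);
            if (rank == 0)
                printf("sum of ranks = %d\n", out);

            MPI_Finalize();
            return 0;
        }

   After compiling with mpicc, such a job can be launched with mpirun_rsh;
   on a homogeneous cluster the MV2_HOMOGENEOUS_CLUSTER variable mentioned
   above can be set on the command line, e.g.
   "mpirun_rsh -np 4 -hostfile hosts MV2_HOMOGENEOUS_CLUSTER=1 ./a.out",
   to take advantage of the optimized startup.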

* Bug Fixes (since 1.9RC1):
     - Fix CUDA context issue with the async progress thread
         - Thanks to Osuna Escamilla Carlos from env.ethz.ch for the report
     - Overwrite pre-existing PSM environment variables
         - Thanks to Adam Moody from LLNL for the patch
     - Fix several warnings
         - Thanks to Adam Moody from LLNL for some of the patches

   For the complete set of bug fixes in MVAPICH2 1.9 (compared to 1.8.1),
   please refer to the following URL:

   http://mvapich.cse.ohio-state.edu/download/mvapich2/changes-1.9.shtml

MVAPICH2-X 1.9 software package (released as a technology preview)
provides support for hybrid MPI+PGAS (UPC and OpenSHMEM) programming
models with unified communication runtime for emerging exascale
systems.  This software package provides flexibility for users to
write applications using the following programming models with a
unified communication runtime: MPI, MPI+OpenMP, pure UPC, and pure
OpenSHMEM programs as well as hybrid MPI(+OpenMP) + PGAS (UPC and
OpenSHMEM) programs.
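
As an illustration of this hybrid model, below is a minimal sketch of a
single program that mixes OpenSHMEM and MPI calls over the unified
runtime. It uses the OpenSHMEM v1.0-style start_pes() initialization;
the initialization order shown and the oshcc compiler wrapper mentioned
afterwards are assumptions to be checked against the MVAPICH2-X user
guide:

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, npes, max_val = 0;
        static int symmetric_val = 0;   /* symmetric data object */

        /* Assumed ordering: MPI first, then OpenSHMEM; the unified
           runtime is what allows both models in one executable */
        MPI_Init(&argc, &argv);
        start_pes(0);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        npes = _num_pes();

        /* OpenSHMEM one-sided put: write my rank to my right neighbor */
        shmem_int_put(&symmetric_val, &rank, 1, (rank + 1) % npes);
        shmem_barrier_all();

        /* MPI collective over the same set of processes */
        MPI_Allreduce(&symmetric_val, &max_val, 1, MPI_INT, MPI_MAX,
                      MPI_COMM_WORLD);
        if (rank == 0)
            printf("largest received value = %d\n", max_val);

        MPI_Finalize();
        return 0;
    }

A hybrid program of this kind would typically be built with the OpenSHMEM
compiler wrapper shipped with MVAPICH2-X (oshcc) and launched with
mpirun_rsh or oshrun.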

Features for MVAPICH2-X 1.9 are as follows. (**) indicates features
since 1.9RC1:

* MPI Features
     - (**) Based on MVAPICH2 1.9 (OFA-IB-CH3 interface) including
       MPI-3 features. MPI programs can take advantage of all
       the features enabled by default in the OFA-IB-CH3 interface
       of MVAPICH2 1.9
     - High performance two-sided communication scalable to
       multi-thousand nodes
     - Optimized collective communication operations:
         - Shared-memory optimized algorithms for barrier, broadcast,
           reduce and allreduce operations
         - Optimized two-level designs for scatter and gather operations
         - Improved implementation of allgather, alltoall operations
     - High-performance and scalable support for one-sided communication
     - Direct RDMA based designs for one-sided communication
     - Shared-memory-backed windows for one-sided communication
     - Support for truly passive locking for intra-node RMA
       in shared-memory-backed windows (see the sketch after this list)
     - Multi-threading support
     - Enhanced support for multi-threaded MPI applications
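
   As an illustration of the one-sided support noted above, here is a
   minimal sketch (standard MPI-3 code) that allocates a window with
   MPI_Win_allocate and performs a passive-target put; with this release
   such windows can be backed by shared memory within a node and the lock
   can be truly passive for intra-node targets:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size, target;
            int *win_buf;
            MPI_Win win;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* One-integer window; the library may back it with shared
               memory for processes on the same node */
            MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                             MPI_COMM_WORLD, &win_buf, &win);
            *win_buf = -1;
            MPI_Barrier(MPI_COMM_WORLD);

            /* Passive-target epoch: no MPI call needed at the target */
            target = (rank + 1) % size;
            MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
            MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
            MPI_Win_unlock(target, win);

            MPI_Barrier(MPI_COMM_WORLD);
            /* Local lock/unlock ensures the remotely written value is
               visible before reading the window memory directly */
            MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
            printf("rank %d received %d\n", rank, *win_buf);
            MPI_Win_unlock(rank, win);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }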

* Unified Parallel C (UPC) Features
     - UPC Language Specification v1.2 standard compliance
     - Based on Berkeley UPC v2.16.2
     - Optimized RDMA-based implementation of UPC data movement routines
     - Improved UPC memput design for small/medium size messages

* OpenSHMEM Features:
     - (**) Added 'shmem_ptr' functionality
       (see the sketch after this list)
     - OpenSHMEM v1.0d standard compliance
     - Optimized RDMA-based implementation of OpenSHMEM
       data movement routines
     - Efficient implementation of OpenSHMEM atomics using RDMA atomics
     - High performance intra-node communication using
       shared memory based schemes
     - Optimized OpenSHMEM put routines for small/medium message sizes
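
   As an illustration of the newly added shmem_ptr functionality noted
   above, the sketch below obtains a direct load/store pointer to a
   symmetric object on a neighboring PE and falls back to a regular get
   when no such pointer is available (e.g., across nodes). It assumes the
   OpenSHMEM v1.0-style start_pes() initialization:

        #include <shmem.h>
        #include <stdio.h>

        int main(void)
        {
            int me, npes, peer;
            static int counter = 0;     /* symmetric data object */
            int *remote;

            start_pes(0);
            me   = _my_pe();
            npes = _num_pes();
            peer = (me + 1) % npes;

            counter = me;
            shmem_barrier_all();

            /* shmem_ptr returns a pointer usable for direct load/store
               access to the peer's copy of 'counter' when possible
               (e.g., both PEs on the same node); otherwise it is NULL */
            remote = (int *) shmem_ptr(&counter, peer);
            if (remote != NULL) {
                printf("PE %d reads peer %d directly: %d\n",
                       me, peer, *remote);
            } else {
                int val;
                shmem_int_get(&val, &counter, 1, peer);  /* fallback */
                printf("PE %d reads peer %d via get: %d\n", me, peer, val);
            }

            shmem_barrier_all();
            return 0;
        }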

* Hybrid Program Features:
     - (**) Based on MVAPICH2 1.9 (OFA-IB-CH3 interface). All the runtime
       features enabled by default in the OFA-IB-CH3 interface of
       MVAPICH2 1.9 are available in MVAPICH2-X 1.9
     - Supports hybrid programming using MPI(+OpenMP),
       MPI(+OpenMP)+UPC and MPI(+OpenMP)+OpenSHMEM
     - Support for MPI-3, UPC v1.2 and OpenSHMEM v1.0d
     - Optimized network resource utilization through the
       unified communication runtime
     - Efficient deadlock-free progress of MPI and UPC/OpenSHMEM calls

* Unified Runtime Features:
     - (**) Based on MVAPICH2 1.9 (OFA-IB-CH3 interface). All the
       runtime features enabled by default in the OFA-IB-CH3 interface
       of MVAPICH2 1.9 are available in MVAPICH2-X 1.9. MPI, UPC,
       OpenSHMEM and hybrid programs benefit from the runtime features
       listed below:
     - Scalable inter-node communication with highest performance
       and reduced memory usage
     - Integrated RC/XRC design to get best performance on
       large-scale systems with reduced/constant memory footprint
     - RDMA Fast Path connections for efficient small
       message communication
     - Shared Receive Queue (SRQ) with flow control to significantly
       reduce memory footprint of the library
     - AVL tree-based resource-aware registration cache
     - Automatic tuning based on network adapter and host architecture
     - Optimized intra-node communication support by taking
       advantage of shared-memory communication
     - Efficient buffer organization for memory scalability of
       intra-node communication
     - Automatic intra-node communication parameter tuning
       based on platform
     - Flexible CPU binding capabilities (see the example after this list)
     - Portable Hardware Locality (hwloc v1.7) support for
       defining CPU affinity
     - Efficient CPU binding policies (bunch and scatter patterns,
       socket and numanode granularities) to specify CPU binding
       per job for modern multi-core platforms
     - Allow user-defined flexible processor affinity
     - Two modes of communication progress
         - Polling
          - Blocking (enables running multiple processes per processor)
     - Flexible process manager support
     - Support for mpirun_rsh, hydra and oshrun process managers
     - Support for upcrun process manager
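
   As a usage illustration of the binding and progress options listed
   above, a job could be launched with mpirun_rsh while requesting a
   scatter binding at socket granularity and blocking progress (the
   parameter names below follow the MVAPICH2 user guide; treat their
   exact spellings and values as assumptions to be verified against the
   guide shipped with this release):

        mpirun_rsh -np 16 -hostfile hosts MV2_CPU_BINDING_POLICY=scatter \
            MV2_CPU_BINDING_LEVEL=socket MV2_USE_BLOCKING=1 ./a.out

   The same variables can typically also be set in the environment when
   launching with hydra or oshrun and, per the unified-runtime description
   above, they benefit MPI, UPC and OpenSHMEM programs alike.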

Bug fixes for OSU Micro-Benchmarks (OMB) 4.0.1 are listed below.

* Bug Fixes (since OMB 4.0)
     - Fix several warnings

The complete list of changes to OMB is available at the following URL:

http://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/4.0/CHANGES

Various performance numbers for MVAPICH2 1.9 and MVAPICH2-X 1.9
on different platforms and system configurations can be viewed
by visiting the `Performance' section of the project's web page.

To download MVAPICH2 1.9, MVAPICH2-X 1.9, OMB 4.0.1, the associated
user guides and quick start guide, or to access the SVN repository,
please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

Thanks,

The MVAPICH Team

