[mvapich-discuss] Announcing the Release of MVAPICH2 1.8RC1 and OSU Micro-Benchmarks (OMB) 3.5.2

Jens Glaser jglaser at umn.edu
Thu Mar 22 15:14:14 EDT 2012


Hi,

I am having trouble using the new version of MVAPICH2 with CUDA support.

I am running on a host with 3 GPUs attached to two I/O hubs (GPU0 on IOH1, GPU1 and GPU2 on IOH2), and MPI_Init hangs on this system when I run the test program below with mpirun -np 3.

Details:

Configure line:

./configure --prefix=/nics/d/home/jglaser/mpich2-install --enable-cuda --with-cuda-include=/sw/keeneland/cuda/4.1/linux_binary/include/ --with-cuda-libpath=/sw/keeneland/cuda/4.1/linux_binary/lib64 --enable-shared --with-ib-libpath=/usr/lib64/

Test program:
================
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
    {
    /* Bind this process to the GPU matching its node-local rank;
       MV2_COMM_WORLD_LOCAL_RANK is exported by the MVAPICH2 launcher
       before MPI_Init is called. */
    const char *local_rank = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    cudaSetDevice(local_rank ? atoi(local_rank) : 0);

    printf("before init\n");
    MPI_Init(&argc, &argv);
    printf("after init\n");
    MPI_Finalize();
    printf("after finalize\n");
    return 0;
    }
================

Compiled with nvcc, using the include and library options obtained from mpicc -show.
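For reference, the invocation looks roughly like this (the source file name is illustrative, and only my install prefix and the MPI library itself are spelled out; the exact -I/-L/-l flags should be copied from what mpicc -show prints for your install):

mpicc -show        # prints the full compile/link line used by the wrapper
nvcc mpitest.c -o mpitest \
    -I/nics/d/home/jglaser/mpich2-install/include \
    -L/nics/d/home/jglaser/mpich2-install/lib -lmpich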

Test program output

mpirun -np 3 ./mpitest
before init
before init
before init
Ctrl-C caught... cleaning up processes
(the run hangs here until I interrupt it with Ctrl-C)

It works with two GPUs:
mpirun -np 2 ./mpitest
before init
before init
after init
after init
after finalize
after finalize

The previous version of MVAPICH2 (1.8a2) worked without problems.

Any ideas?

Thanks,

Jens

On Mar 22, 2012, at 12:21 PM, Dhabaleswar Panda wrote:

> The MVAPICH team is pleased to announce the release of MVAPICH2 1.8RC1
> and OSU Micro-Benchmarks (OMB) 3.5.2.
> 
> Features, Enhancements, and Bug Fixes for MVAPICH2 1.8RC1 are listed
> here.
> 
> * New Features and Enhancements (since 1.8a2):
> 
>    - New design for intra-node communication from GPU Device buffers
>      using CUDA IPC for better performance and correctness
>        - Thanks to Joel Scherpelz from NVIDIA for his suggestions
>    - Enabled shared memory communication for host transfers when CUDA is
>      enabled
>    - Optimized and tuned collectives for GPU device buffers
>    - Enhanced pipelined inter-node device transfers
>    - Enhanced shared memory design for GPU device transfers for
>      large messages
>    - Enhanced support for CPU binding with socket and numanode level
>      granularity
>    - Support suspend/resume functionality with mpirun_rsh
>    - Exporting local rank, local size, global rank and global size
>      through environment variables (both mpirun_rsh and hydra)
>    - Update to hwloc v1.4
>    - Checkpoint-Restart support in OFA-IB-Nemesis interface
>    - Enabling run-through stabilization support to handle process
>      failures in OFA-IB-Nemesis interface
>    - Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
>    - Performance tuning on various platforms
>    - Support for Mellanox IB FDR adapter
> 
> * Bug Fixes (since 1.8a2):
> 
>    - Fix a hang issue on InfiniHost SDR/DDR cards
>        - Thanks to Nirmal Seenu from Fermilab for the report
>    - Fix an issue with runtime parameter MV2_USE_COALESCE usage
>    - Fix an issue with LiMIC2 when CUDA is enabled
>    - Fix an issue with intra-node communication using datatypes and GPU
>      device buffers
>    - Fix an issue with Dynamic Process Management when launching
>      processes on multiple nodes
>        - Thanks to Rutger Hofman from VU Amsterdam for the report
>    - Fix build issue in hwloc source with mcmodel=medium flags
>        - Thanks to Nirmal Seenu from Fermilab for the report
>    - Fix a build issue in hwloc with --disable-shared or
>      --disable-static options
>    - Use portable stdout and stderr redirection
>        - Thanks to Dr. Axel Philipp from MTU Aero Engines for the patch
>    - Fix a build issue with PGI 12.2
>        - Thanks to Thomas Rothrock from U.S. Army SMDC for the patch
>    - Fix an issue with send message queue in OFA-IB-Nemesis interface
>    - Fix a process cleanup issue in Hydra when MPI_ABORT is called
>      (upstream MPICH2 patch)
>    - Fix an issue with non-contiguous datatypes in MPI_Gather
>    - Fix a few memory leaks and warnings
> 
> The bug fix for OSU Micro-Benchmarks (OMB) 3.5.2 is listed here.
> 
> * Bug Fix (since OMB 3.5.1):
>  - Fix typo which led to use of incorrect buffers
> 
> The complete set of features and enhancements for MVAPICH2 1.8RC1 compared
> to MVAPICH2 1.7 is as follows:
> 
> * Features & Enhancements:
>    - Support for MPI communication from NVIDIA GPU device memory
>        - High performance RDMA-based inter-node point-to-point
>          communication (GPU-GPU, GPU-Host and Host-GPU)
>        - High performance intra-node point-to-point communication for
>          multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
>        - Taking advantage of CUDA IPC (available in CUDA 4.1) in
>          intra-node communication for multiple GPU adapters/node
>        - Optimized and tuned collectives for GPU device buffers
>        - MPI datatype support for point-to-point and collective
>          communication from GPU device buffers
>    - Support suspend/resume functionality with mpirun_rsh
>    - Enhanced support for CPU binding with socket and numanode level
>      granularity
>    - Exporting local rank, local size, global rank and global size
>      through environment variables (both mpirun_rsh and hydra)
>    - Update to hwloc v1.4
>    - Checkpoint-Restart support in OFA-IB-Nemesis interface
>    - Enabling run-through stabilization support to handle process
>      failures in OFA-IB-Nemesis interface
>    - Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
>    - Performance tuning on various architecture clusters
>    - Support for Mellanox IB FDR adapter
>    - Adjust shared-memory communication block size at runtime
>    - Enable XRC by default at configure time
>    - New shared memory design for enhanced intra-node small message
>      performance
>    - Tuned inter-node and intra-node performance on different cluster
>      architectures
>    - Support for fallback to R3 rendezvous protocol if RGET fails
>    - SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated
>      hosts without specifying a hostfile
>    - Support added to automatically use PBS_NODEFILE in Torque and PBS
>      environments
>    - Enable signal-triggered (SIGUSR2) migration
>    - Reduced memory footprint of the library
>    - Enhanced one-sided communication design with reduced memory
>      requirement
>    - Enhancements and tuned collectives (Bcast and Alltoallv)
>    - Flexible HCA selection with Nemesis interface
>        - Thanks to Grigori Inozemtsev, Queens University
>    - Support iWARP interoperability between Intel NE020 and
>      Chelsio T4 Adapters
>    - The environment variable that enables RoCE is renamed from
>      MV2_USE_RDMAOE to MV2_USE_RoCE
> 
> Sample performance numbers for MPI communication from NVIDIA GPU memory
> using MVAPICH2 1.8RC1 and OMB 3.5.2 can be obtained from the following
> URL:
> 
> http://mvapich.cse.ohio-state.edu/performance/gpu.shtml
> 
> To download MVAPICH2 1.8RC1, OMB 3.5.2, the associated user guide and
> quick start guide, and to access the SVN repository, please visit the
> following URL:
> 
> http://mvapich.cse.ohio-state.edu
> 
> All questions, feedback, bug reports, hints for performance tuning,
> patches, and enhancements are welcome. Please post them to the
> mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).
> 
> We are also happy to report that the number of downloads from the
> MVAPICH project site has crossed 100,000. The MVAPICH team extends its
> thanks to all MVAPICH/MVAPICH2 users and their organizations.
> 
> Thanks,
> 
> The MVAPICH Team
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



