From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Wed, 2 Jul 2025 02:50:12 +0000
Subject: [Hidl-announce] Announcing the release of the High-Performance Deep Learning (HiDL) 2.0 package with MPI backend

The High-Performance Deep Learning (HiDL) team is pleased to announce the 2.0 release of HiDL, a vendor-neutral, high-performance deep learning stack based on the MVAPICH-Plus MPI backend.

HiDL 2.0 uses PyTorch 2.0 and later versions with the MVAPICH-Plus backend to support large-scale Distributed Data Parallel (DDP) training workloads, targeting modern GPU clusters and high-performance interconnects. This vendor-neutral approach does not require any vendor-supplied collective communication library (such as NCCL or RCCL) and delivers competitive performance on the latest GPU clusters. (A minimal usage sketch appears at the end of this announcement.)

The modified PyTorch 2.0 stack that uses the latest MVAPICH-Plus is available as open source from the following location:

https://github.com/OSU-Nowlab/pytorch/tree/hidl-2.0

* HiDL 2.0: PyTorch 2.0 with MVAPICH-Plus Features

- Support for PyTorch 2.0 and later versions
- Full support for PyTorch native Distributed Data Parallel (DDP) training
- Optimized support for the MPI communication backend in model training workloads
- Efficient large-message collectives (e.g., Allreduce) on various CPUs and GPUs
- GPU-Direct Ring and two-level multi-leader algorithms for Allreduce operations
- Support for fork safety in distributed training environments
- Exploits efficient large-message collectives in MVAPICH-Plus 4.0 and later
- Open-source PyTorch version with advanced MPI backend support
  - Available in our PyTorch tag (https://github.com/OSU-Nowlab/pytorch/tree/hidl-2.0)
- Vendor-neutral stack with performance and throughput competitive with GPU-based collective libraries (e.g., NCCL, RCCL)
- Battle-tested on modern HPC clusters (OLCF Frontier, TACC Vista, etc.) with current-generation NVIDIA and AMD GPUs
- Compatible with:
  - InfiniBand networks: Mellanox InfiniBand adapters (EDR, FDR, HDR, NDR)
  - Slingshot networks: HPE Slingshot
  - GPU and CPU support:
    - NVIDIA A100, H100, and GH200 GPUs
    - AMD MI200-series GPUs
  - Software stack:
    - CUDA 12.x and the latest cuDNN
    - ROCm 6.x
    - (NEW) PyTorch 2.x
    - (NEW) Python 3.x

For setting up the HiDL stack and the associated user guide, please visit the following URL:

http://hidl.cse.ohio-state.edu

Sample performance numbers for DDP training using the HiDL 2.0 stack are available from:

http://hidl.cse.ohio-state.edu/performance/pytorch-ddp-gpu/

All questions, feedback, and bug reports are welcome. Please post to hidl-discuss at lists.osu.edu.

Thanks,

The High-Performance Deep Learning (HiDL) Team
http://hidl.cse.ohio-state.edu
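
----

Appendix: For reference, below is a minimal sketch of what a DDP training script driven by the MPI backend might look like. The model, dimensions, and rank-to-GPU mapping are hypothetical illustrations, and the sketch assumes a PyTorch build with MPI support (such as the hidl-2.0 tree above) linked against a CUDA-aware MPI like MVAPICH-Plus; consult the user guide for the supported configuration.

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # With the MPI backend, rank and world size come from the MPI launcher,
    # not from environment variables or explicit arguments.
    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()

    # Hypothetical mapping of one rank per local GPU.
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    # Toy model for illustration only.
    model = nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model)  # gradient Allreduce is routed through the MPI backend

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        inputs = torch.randn(32, 1024, device=device)   # stand-in for real data
        targets = torch.randn(32, 1024, device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # triggers bucketed Allreduce over MPI on GPU buffers
        optimizer.step()

    dist.destroy_process_group()

Such a script would typically be launched with an MPI launcher, e.g. "mpirun -np 8 python train.py" (the script name here is a placeholder); the exact launcher and flags depend on your MVAPICH-Plus installation and are covered in the user guide above.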