From panda at cse.ohio-state.edu  Tue May  5 21:55:00 2026
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Wed, 6 May 2026 01:55:00 +0000
Subject: [Hidl-discuss] Announcing the release of a new HPC-Accelerated AI
 (HPC-AI) v1.0 software stack
Message-ID: <DM6PR01MB4315C1A29163D07A90E50142DA3F2@DM6PR01MB4315.prod.exchangelabs.com>

The HPC-Accelerated AI (HPC-AI) team, formerly known as the
High-Performance Deep Learning (HiDL) team, is pleased to announce the
1.0 release of the HPC-AI software stack.  The HPC-AI project
introduces a vendor-neutral software stack for high-performance,
scalable distributed training and inference. It builds on the popular
MVAPICH-Plus MPI-based communication library, which supports
scale-up and scale-out on modern CPUs, GPUs, and
interconnects. The objective of the HPC-AI project is to exploit
modern HPC technologies to provide high-performance and scalable
solutions for foundational model training, agentic workflows, and
reinforcement learning.

The 1.0 release of the HPC-AI stack introduces the following key
features:

    - Full-stack integration of Training and Inference Frameworks:
      PyTorch, DeepSpeed, vLLM, and SGLang with MVAPICH-Plus
    - Native PyTorch Distributed Data Parallel (DDP) training
      with MPI backend
    - Advanced decoding method (MAC-Attention) and communication
      runtime (MCR-DL)
    - Efficient large-message collectives (e.g., Allreduce) on
      various CPUs and GPUs
    - GPU-Direct Ring and Two-level multi-leader algorithms for
      Allreduce operations
    - Support for fork safety in distributed training and
      inference environments
    - Exploits efficient large-message collectives in MVAPICH-Plus 4.1
      and later
    - Open-source framework builds with advanced MPI
      backend support
    - Vendor-neutral stack with performance competitive with GPU-based
      collective libraries (e.g., NCCL, RCCL)
    - Battle-tested on modern HPC clusters (e.g., OLCF Frontier,
      TACC Vista, SDSC Cosmos) with current accelerator
      generations from AMD and NVIDIA
    - Compatible with
        - InfiniBand Networks: Mellanox InfiniBand adapters
          (EDR, FDR, HDR, NDR)
        - Slingshot Networks: HPE Slingshot
        - GPU & CPU Support:
            - NVIDIA A100, H100, GH200 GPUs
            - AMD MI250X, MI300A GPUs
        - Software Stack:
            - CUDA [12.x] and latest cuDNN
            - (NEW) ROCm [7.x]
            - (NEW) PyTorch [2.10.0]
            - (NEW) Training & Inference: DeepSpeed, vLLM, SGLang
            - (NEW) Advanced: MAC-Attention, MCR-DL
            - (NEW) Python [3.x]
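
For readers unfamiliar with the ring algorithm named in the feature
list, the following is a minimal pure-Python simulation of a textbook
ring Allreduce (a reduce-scatter phase followed by an allgather
phase). It is only an illustrative sketch: the function name
ring_allreduce is hypothetical, the "ranks" are plain Python lists,
and nothing here reflects how MVAPICH-Plus or the HPC-AI stack
actually implements its GPU-Direct Ring algorithm.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-Allreduce over p ranks arranged in a ring.

    buffers[r] is the local vector of rank r. The result holds, for
    every rank, the element-wise sum across all ranks.
    (Hypothetical helper for illustration; not an HPC-AI API.)
    """
    if not buffers:
        raise ValueError("at least one rank is required")
    p = len(buffers)
    n = len(buffers[0])
    if any(len(b) != n for b in buffers):
        raise ValueError("all ranks must hold buffers of equal length")

    data = [list(b) for b in buffers]  # work on copies
    if p == 1:
        return data

    # Split each vector into p chunks; chunk c covers
    # indices bounds[c] .. bounds[c + 1] - 1 (possibly empty).
    bounds = [(c * n) // p for c in range(p + 1)]

    def chunk(c: int) -> range:
        c %= p
        return range(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s)
    # to its right neighbor, which adds it to its own copy. After
    # p - 1 steps, rank r holds the fully reduced chunk (r + 1).
    for step in range(p - 1):
        # Snapshot outgoing data so all sends happen "simultaneously".
        outgoing = [[data[r][i] for i in chunk(r - step)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            for i, value in zip(chunk(r - step), outgoing[r]):
                data[dst][i] += value

    # Phase 2: allgather. In step s, rank r forwards the completed
    # chunk (r + 1 - s) to its right neighbor, which overwrites its copy.
    for step in range(p - 1):
        outgoing = [[data[r][i] for i in chunk(r + 1 - step)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            for i, value in zip(chunk(r + 1 - step), outgoing[r]):
                data[dst][i] = value

    return data
```

In this scheme each rank sends 2 * (p - 1) messages of roughly n / p
elements, so the data volume per rank stays close to 2 * n regardless
of the number of ranks, which is why ring-style algorithms are
commonly used for large-message Allreduce.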

The HPC-AI 1.0 stack is available as open source at the
following location:
https://github.com/OSU-Nowlab/pytorch/tree/hpc_ai_v1.0

For setting up the HPC-AI stack and the associated user guide, please
visit the following URL:

https://hpc-ai.engineering.osu.edu/

Sample performance numbers for DDP training using the HPC-AI 1.0 stack
on a set of representative systems are available from:
https://hpc-ai.engineering.osu.edu/performance/pytorch-ddp-gpu/

All questions, feedback, and bug reports are welcome. Please post to
hidl-discuss at lists.osu.edu.

Thanks,

The HPC-Accelerated AI (HPC-AI) Team
https://hpc-ai.engineering.osu.edu/