[Mvapich-discuss] Announcing the release of a new HPC-Accelerated AI (HPC-AI) v1.0 software stack
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Tue May 5 21:55:00 EDT 2026
The HPC-Accelerated AI (HPC-AI) team, formerly known as the
High-Performance Deep Learning (HiDL) team, is pleased to announce the
1.0 release of the HPC-AI software stack. The HPC-AI project
introduces a vendor-neutral software stack for high-performance and
scalable distributed training and inference. The stack is built on the
MVAPICH-Plus MPI-based communication library, which supports efficient
scale-up and scale-out on modern CPUs, GPUs, and interconnects. The
objective of the HPC-AI project is to exploit modern HPC technologies
to provide high-performance and scalable solutions for foundation
model training, agentic workflows, and reinforcement learning.
The 1.0 release of the HPC-AI stack introduces the following key
features:
- Full-stack integration of training and inference frameworks
(PyTorch, DeepSpeed, vLLM, and SGLang) with MVAPICH-Plus
- Native PyTorch Distributed Data Parallel (DDP) training
with MPI backend
- Advanced decoding method (MAC-Attention) and communication
runtime (MCR-DL)
- Efficient large-message collectives (e.g., Allreduce) on
various CPUs and GPUs
- GPU-Direct Ring and Two-level multi-leader algorithms for
Allreduce operations
- Support for fork safety in distributed training and
inference environments
- Exploits efficient large-message collectives in MVAPICH-Plus 4.1
and later
- Open-source framework builds with advanced MPI backend support
- Vendor-neutral stack with performance competitive with GPU-based
collective libraries (e.g., NCCL, RCCL)
- Battle-tested on modern HPC clusters (e.g., OLCF Frontier,
TACC Vista, SDSC Cosmos) with recent accelerator generations
from AMD and NVIDIA
- Compatible with
  - InfiniBand Networks: Mellanox InfiniBand adapters (EDR, FDR, HDR, NDR)
  - Slingshot Networks: HPE Slingshot
  - GPU & CPU Support:
    - NVIDIA A100, H100, GH200
    - AMD MI250X, MI300A GPUs
  - Software Stack:
    - CUDA [12.x] and latest cuDNN
    - (NEW) ROCm [7.x]
    - (NEW) PyTorch [2.10.0]
    - (NEW) Training & Inference: DeepSpeed, vLLM, SGLang
    - (NEW) Advanced: MAC-Attention, MCR-DL
    - (NEW) Python [3.x]
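For readers less familiar with the two Allreduce algorithm families named in the feature list, here is a pure-Python sketch that simulates them inside a single process. It is not the MVAPICH-Plus implementation. It uses no MPI and does not model GPU-Direct transfers; it only shows the data-movement schedule. `ring_allreduce` is the textbook ring algorithm (reduce-scatter followed by allgather). `two_level_multi_leader_allreduce` is a generic hierarchical variant that rests on two assumptions: every local rank on a node acts as the leader for one slice of the buffer, and ranks are numbered node by node. Both function names are hypothetical.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Sum-Allreduce over len(buffers) simulated ranks using a ring schedule.

    Phase 1 (reduce-scatter): in p - 1 steps, each rank sends one chunk to its
    right neighbour, which adds it into its own copy of that chunk.
    Phase 2 (allgather): in p - 1 more steps, the fully reduced chunks are
    forwarded around the ring until every rank holds all of them.
    """
    p = len(buffers)
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated
    if p <= 1:
        return data
    n = len(data[0])
    bounds = [c * n // p for c in range(p + 1)]  # chunk c = [bounds[c], bounds[c+1])

    def chunk(rank: int, c: int) -> List[float]:
        return data[rank][bounds[c]:bounds[c + 1]]

    # Phase 1: at step s, rank r sends chunk (r - s) mod p to rank r + 1.
    for s in range(p - 1):
        outgoing = [chunk(r, (r - s) % p) for r in range(p)]  # all sends of this step
        for r in range(p):
            src = (r - 1) % p
            lo = bounds[(src - s) % p]
            for i, v in enumerate(outgoing[src]):
                data[r][lo + i] += v
    # Now rank r owns the fully reduced chunk (r + 1) mod p.
    # Phase 2: at step s, rank r forwards chunk (r + 1 - s) mod p to rank r + 1.
    for s in range(p - 1):
        outgoing = [chunk(r, (r + 1 - s) % p) for r in range(p)]
        for r in range(p):
            src = (r - 1) % p
            c = (src + 1 - s) % p
            data[r][bounds[c]:bounds[c + 1]] = outgoing[src]
    return data


def two_level_multi_leader_allreduce(
    buffers: List[List[float]], ranks_per_node: int
) -> List[List[float]]:
    """Hypothetical two-level Allreduce where every local rank leads one slice.

    Level 1: inside each node, local rank j reduces slice j of all local buffers.
    Level 2: the leaders of slice j on different nodes run a ring Allreduce.
    Finally each node gathers its reduced slices back to all of its ranks.
    """
    p = len(buffers)
    if p % ranks_per_node != 0:
        raise ValueError("number of ranks must be a multiple of ranks_per_node")
    num_nodes, k = p // ranks_per_node, ranks_per_node
    n = len(buffers[0])
    bounds = [j * n // k for j in range(k + 1)]  # slice j = [bounds[j], bounds[j+1])

    # Level 1: intra-node reduce; slices[m][j] is node m's partial sum of slice j.
    slices = [
        [
            [
                sum(buffers[m * k + local][i] for local in range(k))
                for i in range(bounds[j], bounds[j + 1])
            ]
            for j in range(k)
        ]
        for m in range(num_nodes)
    ]
    # Level 2: inter-node ring Allreduce among the leaders of each slice.
    for j in range(k):
        reduced = ring_allreduce([slices[m][j] for m in range(num_nodes)])
        for m in range(num_nodes):
            slices[m][j] = reduced[m]
    # Intra-node allgather: every rank on node m receives all k reduced slices.
    return [[x for j in range(k) for x in slices[r // k][j]] for r in range(p)]


if __name__ == "__main__":
    # 4 simulated ranks (2 nodes x 2 ranks per node), 5 elements each.
    ranks = [[float(10 * r + i) for i in range(5)] for r in range(4)]
    print(ring_allreduce(ranks)[0])  # [60.0, 64.0, 68.0, 72.0, 76.0]
    print(two_level_multi_leader_allreduce(ranks, 2)[3])  # [60.0, 64.0, 68.0, 72.0, 76.0]
```

In the ring schedule each rank sends 2(p-1) chunks of n/p elements, roughly 2n in total regardless of p. That is why ring algorithms suit large messages. The two-level variant keeps most of the traffic inside a node and runs the inter-node ring only among leaders.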
The HPC-AI 1.0 stack is available as open source from the
following location:
https://github.com/OSU-Nowlab/pytorch/tree/hpc_ai_v1.0
For instructions on setting up the HPC-AI stack and the associated
user guide, please visit the following URL:
https://hpc-ai.engineering.osu.edu/
Sample performance numbers for DDP training with the HPC-AI 1.0 stack
on a set of representative systems are available from:
https://hpc-ai.engineering.osu.edu/performance/pytorch-ddp-gpu/
All questions, feedback, and bug reports are welcome. Please post to
hidl-discuss at lists.osu.edu.
Thanks,
The HPC-Accelerated AI (HPC-AI) Team
https://hpc-ai.engineering.osu.edu/