From panda at cse.ohio-state.edu Tue May 5 21:55:00 2026
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Wed, 6 May 2026 01:55:00 +0000
Subject: [Hidl-discuss] Announcing the release of a new HPC-Accelerated AI (HPC-AI) v1.0 software stack
Message-ID:

The HPC-Accelerated AI (HPC-AI) team, formerly known as the
High-Performance Deep Learning (HiDL) team, is pleased to announce the
1.0 release of the HPC-AI software stack.

The HPC-AI project introduces a vendor-neutral software stack for
high-performance, scalable distributed training and inference. It is
built on the MVAPICH-Plus MPI-based communication library, which
supports scale-up and scale-out on modern CPUs, GPUs, and
interconnects. The objective of the HPC-AI project is to exploit modern
HPC technologies to provide high-performance and scalable solutions for
foundational model training, agentic workflows, and reinforcement
learning.

The 1.0 release of the HPC-AI stack introduces the following key
features:

- Full-stack integration of training and inference frameworks
  (PyTorch, DeepSpeed, vLLM, and SGLang) with MVAPICH-Plus
- Native PyTorch Distributed Data Parallel (DDP) training with the MPI
  backend
- Advanced decoding method (MAC-Attention) and communication runtime
  (MCR-DL)
- Efficient large-message collectives (e.g., Allreduce) on various CPUs
  and GPUs
- GPU-Direct Ring and Two-level multi-leader algorithms for Allreduce
  operations
- Support for fork safety in distributed training and inference
  environments
- Exploits efficient large-message collectives in MVAPICH-Plus 4.1 and
  later
- Open-source framework builds with advanced MPI backend support
- Vendor-neutral stack with performance competitive with GPU-based
  collective libraries (e.g., NCCL, RCCL)
- Battle-tested on modern HPC clusters (e.g., OLCF Frontier, TACC
  Vista, SDSC Cosmos) with up-to-date accelerator generations (e.g.,
  AMD, NVIDIA)
- Compatible with
    - InfiniBand Networks: Mellanox InfiniBand adapters (EDR, FDR, HDR,
      NDR)
    - Slingshot Networks: HPE Slingshot
    - GPU & CPU Support:
        - NVIDIA A100, H100, GH200 GPUs
        - AMD MI250X, MI300A GPUs
    - Software Stack:
        - CUDA [12.x] and latest cuDNN
        - (NEW) ROCm [7.x]
        - (NEW) PyTorch [2.10.0]
        - (NEW) Training & Inference: DeepSpeed, vLLM, SGLang
        - (NEW) Advanced: MAC-Attention, MCR-DL
        - (NEW) Python [3.x]

The HPC-AI 1.0 stack is available as open source at:

https://github.com/OSU-Nowlab/pytorch/tree/hpc_ai_v1.0

For instructions on setting up the HPC-AI stack and the associated user
guide, please visit:

https://hpc-ai.engineering.osu.edu/

Sample performance numbers for DDP training using the HPC-AI 1.0 stack
on a set of representative systems are available at:

https://hpc-ai.engineering.osu.edu/performance/pytorch-ddp-gpu/

All questions, feedback, and bug reports are welcome. Please post to
hidl-discuss at lists.osu.edu.

Thanks,

The HPC-Accelerated AI (HPC-AI) Team
https://hpc-ai.engineering.osu.edu/