High-Performance Computing for AI

HPC-AI Features

The 1.0 release of the HPC-AI stack introduces the following key features:

Full-stack integration of Training and Inference Frameworks: PyTorch, DeepSpeed, vLLM, and SGLang with MVAPICH-Plus
The NOWLAB PyTorch features

Native PyTorch Distributed Data Parallel (DDP) training with MPI backend
Efficient large-message collectives (e.g., Allreduce) on various CPUs and GPUs
GPU-Direct Ring and Two-level multi-leader algorithms for Allreduce operations
Support for fork safety in distributed training and inference environments
Efficient large-message collectives in MVAPICH-Plus 4.1 and later
Open-source framework builds with advanced MPI backend support

Advanced inference decoding methods (MAC-Attention) and communication runtimes (MCR-DL)
Vendor-neutral stack with performance competitive with GPU-based collective libraries (e.g., NCCL, RCCL)
Battle-tested on modern HPC clusters (e.g., OLCF Frontier, TACC Vista, SDSC Cosmos) with current accelerator generations from AMD and NVIDIA
Compatible with

InfiniBand Networks: Mellanox InfiniBand adapters (EDR, FDR, HDR, NDR)
Slingshot Networks: HPE Slingshot
GPU & CPU Support:

NVIDIA GPU A100, H100, GH200
AMD MI250X, MI300A GPUs

Software Stack:

Python [3.x]
CUDA [12.x] and latest cuDNN
(NEW) ROCm [7.x]
(NEW) PyTorch [2.10.0]

Training & Inference Frameworks:

(NEW) DeepSpeed, vLLM, SGLang
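
The Ring algorithm for Allreduce named in the feature list above can be illustrated with a small single-process simulation. This is a minimal sketch of the textbook ring schedule (reduce-scatter followed by allgather), not the MVAPICH-Plus implementation; the function name `ring_allreduce` and the list-of-lists representation of per-rank buffers are assumptions made for illustration only.

```python
def ring_allreduce(buffers):
    """Simulate a sum Allreduce over equal-length per-rank buffers using a ring.

    buffers[r] is the input vector held by rank r. Returns the per-rank
    buffers after the operation; every rank ends with the element-wise sum.
    """
    p = len(buffers)
    n = len(buffers[0])
    if any(len(b) != n for b in buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched
    if p == 1:
        return data

    # Split the vector into p contiguous chunks; chunk c is [bounds[c], bounds[c+1]).
    bounds = [(c * n) // p for c in range(p + 1)]

    def span(c):
        c %= p
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (reduce-scatter): in each step rank r sends one chunk to rank r+1,
    # which adds it to its own copy. After p-1 steps rank r holds the fully
    # reduced chunk (r+1) mod p.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            idx = span(r - step)
            msgs.append((idx, [data[r][i] for i in idx]))  # snapshot before delivery
        for r in range(p):
            idx, payload = msgs[r]
            dst = (r + 1) % p
            for i, v in zip(idx, payload):
                data[dst][i] += v

    # Phase 2 (allgather): the reduced chunks travel once more around the ring,
    # this time overwriting instead of accumulating.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            idx = span(r + 1 - step)
            msgs.append((idx, [data[r][i] for i in idx]))
        for r in range(p):
            idx, payload = msgs[r]
            dst = (r + 1) % p
            for i, v in zip(idx, payload):
                data[dst][i] = v

    return data
```

Each rank sends and receives roughly 2(p-1)/p times the vector size in total, which is why ring schedules are commonly used for large messages. A real GPU-aware library would move the chunks between device buffers rather than Python lists.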

MPI4DL Features

Based on PyTorch
(NEW) Support for training very high-resolution images

Distributed training support for:

Layer Parallelism (LP)
Pipeline Parallelism (PP)
Spatial Parallelism (SP)
Spatial and Layer Parallelism (SP+LP)
Spatial and Pipeline Parallelism (SP+PP)
(NEW) Bidirectional and Layer Parallelism (GEMS+LP)
(NEW) Bidirectional and Pipeline Parallelism (GEMS+PP)
(NEW) Spatial, Bidirectional and Layer Parallelism (SP+GEMS+LP)
(NEW) Spatial, Bidirectional and Pipeline Parallelism (SP+GEMS+PP)

(NEW) Support for AmoebaNet and ResNet models
(NEW) Support for different image sizes and custom datasets

Exploits collective features of MVAPICH2-GDR
Compatible with

NVIDIA GPU A100 and V100
CUDA [11.6, 11.7]
Python >= 3.8
PyTorch [1.12.1, 1.13.1]
MVAPICH2-GDR = 2.3.7
MVAPICH-Plus = 3.0b
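
Spatial Parallelism (SP), listed above, splits a single high-resolution image across workers so that each one processes only a tile. The sketch below shows one plausible way to compute such tiles together with the halo (overlap) region that a convolution needs from neighbouring tiles. It is an assumption-laden illustration, not MPI4DL's code: the `Tile` class, the `spatial_partition` function, and the `halo` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    rank: int      # worker that owns this tile (row-major order)
    owned: tuple   # (row_start, row_stop, col_start, col_stop), half-open
    padded: tuple  # owned region grown by the halo, clipped to the image


def spatial_partition(height, width, parts_h, parts_w, halo=1):
    """Split a height x width image into a parts_h x parts_w grid of tiles."""
    if not (1 <= parts_h <= height and 1 <= parts_w <= width):
        raise ValueError("partition grid must fit inside the image")
    if halo < 0:
        raise ValueError("halo must be non-negative")

    # Boundaries chosen so tile sizes differ by at most one row or column.
    row_bounds = [(i * height) // parts_h for i in range(parts_h + 1)]
    col_bounds = [(j * width) // parts_w for j in range(parts_w + 1)]

    tiles = []
    for i in range(parts_h):
        for j in range(parts_w):
            r0, r1 = row_bounds[i], row_bounds[i + 1]
            c0, c1 = col_bounds[j], col_bounds[j + 1]
            padded = (
                max(r0 - halo, 0), min(r1 + halo, height),
                max(c0 - halo, 0), min(c1 + halo, width),
            )
            tiles.append(Tile(i * parts_w + j, (r0, r1, c0, c1), padded))
    return tiles
```

For a 3x3 convolution with stride 1, a halo of one pixel is enough for each worker to compute its owned outputs; in a distributed run the halo rows and columns would be exchanged between neighbouring ranks rather than read from a shared image.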

MPI4cuML Features

Based on cuML 22.02.00

Includes ready-to-use examples for KMeans, Linear Regression, Nearest Neighbors, and tSVD

MVAPICH2 support for RAFT 22.02.00

Enabled cuML’s communication engine, RAFT, to use the MVAPICH2-GDR backend for Python and C++ cuML applications
KMeans, PCA, tSVD, RF, LinearModels
Added a switch between the available communication backends (MVAPICH2 and NCCL)

Built on top of mpi4py over the MVAPICH2-GDR library
Tested with

Mellanox InfiniBand adapters (FDR and HDR)
Various x86-based multi-core platforms (AMD and Intel)
NVIDIA GPU A100, V100, and P100

OSU-Caffe 0.9 Features

OSU-Caffe derives from Caffe, a deep learning framework that provides the flexibility to design and enhance DL models. All features available in NVIDIA's fork of BVLC Caffe are available in this release. OSU-Caffe adds features and mechanisms that take advantage of HPC resources. It is an MPI-based distributed version that scales out on multi-GPU nodes and uses optimized CUDA-aware MPI to boost performance on GPU clusters. OSU-Caffe redesigns the DL workflow to overlap computation with communication. Further, it uses efficient large-message MPI collective communication operations on GPU buffers, which exploit GPUDirect RDMA, CUDA IPC, CUDA kernels, and Core-Direct features.

Features for supporting distributed and large-scale DL frameworks:

Based on NVIDIA's Caffe fork (caffe-0.14)
MPI-based distributed training support
Efficient scale-out support for multi-GPU node systems
New workflow that overlaps compute layers with communication
Efficient parallel file readers to optimize I/O and data movement

Takes advantage of the Lustre parallel file system

Exploits efficient large message collectives in MVAPICH2-GDR 2.2
Tested with

Various CUDA-aware MPI libraries
CUDA 7.5
Various HPC Clusters with K80 GPUs, varying number of GPUs/node, and InfiniBand (FDR and EDR) adapters
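
The overlap of computation and communication described above can be pictured as follows: as soon as the gradients of one layer are ready, their reduction is started in the background while the backward pass moves on to the next layer. The sketch below is a schematic illustration only and does not reflect OSU-Caffe's internals. `allreduce_mean` is a local stand-in for an MPI Allreduce, and `backward_with_overlap` and its argument format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def allreduce_mean(per_worker_grads):
    """Stand-in for an MPI Allreduce: element-wise mean across workers."""
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]


def backward_with_overlap(layer_grad_fns):
    """Run a backward pass that overlaps gradient reduction with compute.

    layer_grad_fns is a list, in forward order, of zero-argument callables.
    Each returns the per-worker gradients of one layer as a list of vectors.
    Returns the reduced gradient of every layer, in forward order.
    """
    reduced = [None] * len(layer_grad_fns)
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = []
        # Backward pass visits layers from last to first.
        for idx in reversed(range(len(layer_grad_fns))):
            grads = layer_grad_fns[idx]()  # compute gradients of layer idx
            # Hand the reduction to the communication thread and move on
            # to the next layer without waiting for it to finish.
            pending.append((idx, comm.submit(allreduce_mean, grads)))
        # Wait for outstanding reductions before the weight update.
        for idx, future in pending:
            reduced[idx] = future.result()
    return reduced
```

The thread pool here only demonstrates the scheduling pattern; in a real framework the background work would be a non-blocking collective on GPU buffers, and the compute would run on the device.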

RDMA-TensorFlow 0.9.1 Features

Based on Google TensorFlow 1.3.0
Built with Python 2.7, CUDA 8.0, cuDNN 5.0, gcc 4.8.5, and glibc 2.17
Compliant with TensorFlow 1.3.0 APIs and applications
High-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow

RDMA-based data communication
Adaptive communication protocols
Dynamic message chunking and accumulation
Support for RDMA device selection

Easily configurable for native InfiniBand and for traditional sockets-based transports (Ethernet and InfiniBand with IPoIB)
Tested with

Mellanox InfiniBand adapters (e.g., EDR)
NVIDIA K80 GPGPU
CUDA 8.0 and cuDNN 5.0
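
Message chunking and accumulation, listed among the features above, can be sketched generically: large payloads are cut into bounded pieces, and small messages are grouped so that several travel in one transfer. The code below is a hedged illustration of that general idea, not the AR-gRPC design. The function names and the `chunk_size` and `threshold` parameters are assumptions, and a real runtime would choose these values dynamically.

```python
def chunk_message(payload, chunk_size):
    """Split one large payload into pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def accumulate_messages(messages, threshold):
    """Group consecutive messages into batches of at most threshold bytes.

    A message larger than threshold is placed in a batch of its own.
    Message boundaries are preserved: each batch is a list of messages.
    """
    batches, current, size = [], [], 0
    for msg in messages:
        # Flush the current batch if adding this message would exceed the limit.
        if current and size + len(msg) > threshold:
            batches.append(current)
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(current)
    return batches
```

For example, `chunk_message(b"abcdefghij", 4)` yields three pieces of 4, 4, and 2 bytes, while `accumulate_messages` with a 4-byte threshold packs two 2-byte messages into a single batch.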
