- INFaaS: Automated Model-less Inference Serving | ATC '21
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI '22
- Pathways: Asynchronous Distributed Dataflow for ML | MLSys '22
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | OSDI '23
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | ICML '22
- ZeRO-Offload: Democratizing Billion-Scale Model Training | ATC '21
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | SC '21
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | SC '20
- Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors | MobiSys '22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing | ATC '22
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access | EuroSys '23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud | NSDI '22
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models | TPDS '23
- SHEPHERD: Serving DNNs in the Wild | NSDI '23
- Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning | MLSys '23
- AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs | MLSys '23
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- Channel Permutations for N:M Sparsity | NeurIPS '21
- Welder: Scheduling Deep Learning Memory Access via Tile-graph | OSDI '23
- Optimizing Dynamic Neural Networks with Brainstorm | OSDI '23
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup | NSDI '23
- Breadth-First Pipeline Parallelism | MLSys '23
- MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms | OSDI '23
- Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters | OSDI '23
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning | OSDI '23
- BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models | ICML '23
- Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects | SC '24
- Revisiting Reliability in Large-Scale Machine Learning Research Clusters
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications | EuroSys '24
- Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation | EuroSys '24
- Model Selection for Latency-Critical Inference Serving | EuroSys '24
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving | SOSP '24
- Usher: Holistic Interference Avoidance for Resource Optimized ML Inference | OSDI '24