📰 Report Link [here]
📫 Contact me at qhliu26@gmail.com
Abstract: The increasing scale of models and the continuous improvement in their performance herald the arrival of the Big Model era. In this report, we explore what big model training is and how it works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, while training methodologies, which build on distributed training, describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be categorized into data, pipeline, and tensor parallelism according to the dimension along which parallelism takes place. Memory-saving technologies are orthogonal and complementary to training parallelism. Model sparsity design further scales up the model size at a constant computational cost.
- PyTorch: https://github.com/pytorch/pytorch
- TensorFlow: https://github.com/tensorflow/tensorflow
- Mesh TensorFlow: https://github.com/tensorflow/mesh
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- DeepSpeed: https://github.com/microsoft/DeepSpeed
- Fairscale: https://github.com/facebookresearch/fairscale
- Colossal-AI: https://github.com/hpcaitech/ColossalAI
- OneFlow: https://github.com/Oneflow-Inc/oneflow
- BMTrain: https://github.com/OpenBMB/BMTrain
Year | Title | Intro |
---|---|---|
2017 | Deep Learning Scaling is Predictable, Empirically | empirical characterization of generalization error and model size growth as training sets grow |
2020 | Scaling Laws for Neural Language Models | Performance depends strongly on scale, weakly on model shape |
2021 | On the Opportunities and Risks of Foundation Models | A foundation model is any model that is trained on broad data at scale and can be adapted to a wide range of downstream tasks |
2022 | The 2022 AI Index | Language models are more capable than ever, but also more biased |
Year | Name | Param | From |
---|---|---|---|
2018 | GPT | 110M | OpenAI |
2018 | BERT | 340M | Google |
2019 | GPT-2 | 1.5B | OpenAI |
2019 | Megatron-LM | 8.3B | Nvidia |
2020 | Turing-NLG | 17B | Microsoft |
2020 | GPT-3 | 175B | OpenAI |
2021 | Switch Transformer | 1.6T | Google |
2021 | BaGuaLu | 174T | BAAI |
2022 | PaLM | 540B | Google |
Year | Title | Intro |
---|---|---|
2009 | Bandwidth optimal all-reduce algorithms for clusters of workstations | All-Reduce Architecture |
2011 | HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent | Asynchronous SGD |
2014 | Scaling Distributed Machine Learning with the Parameter Server | Traditional Centralized Architecture |
2016 | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | offload temporarily unused parameters back to CPU |
2020 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | PyTorch DDP implementation (see the sketch after this table) |
2020 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters | BytePS |
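For the data-parallel line of work above, the minimal sketch below shows synchronous data-parallel training with PyTorch `DistributedDataParallel`, where gradients are all-reduced across ranks during the backward pass. The model, data, and hyperparameters are toy placeholders, and a `torchrun` launch (which sets `RANK`/`LOCAL_RANK`/`WORLD_SIZE`) is assumed.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=N train.py`, which sets
# RANK / LOCAL_RANK / WORLD_SIZE; model and data are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])       # gradients are all-reduced in backward
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # each rank sees a different data shard
        loss = model(x).square().mean()
        loss.backward()                               # DDP overlaps all-reduce with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
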
Year | Title | Intro |
---|---|---|
2019 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 1D tensor parallelism for transformer MLP and self-attention (see the sketch after this table) |
2021 | An Efficient 2D Method for Training Super-Large Deep Learning Models | 2D TP based on SUMMA |
2021 | 2.5-dimensional distributed model training | 2.5D TP |
2021 | Maximizing Parallelism in Distributed Training for Huge Neural Networks | 3D TP |
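To make the tensor-parallel splits above concrete, here is a toy sketch of a Megatron-style row-parallel linear layer: each rank keeps one slice of the weight along the input dimension and the partial products are summed with an all-reduce. It illustrates the 1D partitioning idea under an assumed, already-initialized process group; it is not Megatron-LM's actual implementation.

```python
# Toy 1D (row-parallel) linear layer in the Megatron-LM style.
# Each rank holds W_shard of shape (out_features, in_features // world_size);
# its slice of the input is multiplied locally and the partial results are
# summed across ranks with an all-reduce.
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        self.in_per_rank = in_features // world_size
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, self.in_per_rank) * 0.02
        )

    def forward(self, x):
        # x is assumed to be already split along its last dimension,
        # e.g. the output of a column-parallel layer on the same rank.
        partial = torch.nn.functional.linear(x, self.weight)
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # sum partial products
        return partial
```
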
Year | Title | Intro |
---|---|---|
2018 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | pipeline parallelism with microbatching, from Google (see the sketch after this table) |
2019 | PipeDream: Generalized Pipeline Parallelism for DNN Training | 1F1B microbatch scheduling |
2020 | Memory-Efficient Pipeline-Parallel DNN Training | PipeDream-flush and PipeDream-2BW |
2021 | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Interleaved 1F1B pipeline schedule |
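The sketch below illustrates the GPipe-style microbatching that the pipeline-parallel papers above build on: the minibatch is split into microbatches, each flows through the stages in order, and gradients are accumulated before a single optimizer step. For simplicity it runs both stages in one process; a real pipeline engine places stages on different devices and overlaps microbatches across them (e.g. the 1F1B schedules above).

```python
# GPipe-style microbatching illustrated in a single process:
# split the minibatch into microbatches, accumulate gradients,
# then take one optimizer step. A real pipeline engine would place
# stage0/stage1 on different devices and overlap their work (1F1B).
import torch

stage0 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
stage1 = torch.nn.Linear(512, 10)
params = list(stage0.parameters()) + list(stage1.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

x = torch.randn(64, 512)          # full minibatch
y = torch.randint(0, 10, (64,))
num_microbatches = 4

optimizer.zero_grad()
for mx, my in zip(x.chunk(num_microbatches), y.chunk(num_microbatches)):
    logits = stage1(stage0(mx))   # forward through both stages
    loss = torch.nn.functional.cross_entropy(logits, my) / num_microbatches
    loss.backward()               # gradients accumulate across microbatches
optimizer.step()
```
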
Year | Title | Intro |
---|---|---|
2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ensembling implemented with a gating mechanism connecting multiple experts (see the sketch after this table) |
2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | replaces transformer FFN with MoE layer |
2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | scales the model size up to trillions of parameters |
2021 | Go Wider Instead of Deeper | WideNet uses individual LayerNorms to transform semantic representations |
2022 | Mixture-of-Experts with Expert Choice Routing | let experts select the top-k tokens |
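As a minimal illustration of the sparsely-gated MoE idea above, the sketch below routes each token to its top-k experts and mixes the expert outputs with renormalized gate probabilities. Capacity limits, load-balancing losses, and expert parallelism, which the papers above rely on, are omitted.

```python
# Minimal top-k token-choice MoE layer (no capacity limits or
# load-balancing loss): a router picks k experts per token and the
# expert outputs are mixed with the renormalized gate probabilities.
import torch

class TinyMoE(torch.nn.Module):
    def __init__(self, d_model=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, num_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        topk_p, topk_idx = scores.topk(self.k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# usage: TinyMoE()(torch.randn(16, 256)).shape -> (16, 256)
```
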
Year | Title | Intro |
---|---|---|
2016 | Training Deep Nets with Sublinear Memory Cost | trade computation for memory and train an n-layer network with O(√n) memory cost (see the sketch after this table) |
2019 | Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | formalize the problem of training time and memory requirements trading-off as the tensor rematerialization optimization problem |
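The rematerialization idea above is available off the shelf in PyTorch as activation checkpointing: activations inside a wrapped segment are not stored and are recomputed during the backward pass. A minimal sketch, with the blocks as placeholders:

```python
# Activation (gradient) checkpointing: the wrapped segment does not keep
# its intermediate activations; they are recomputed during the backward
# pass, trading extra compute for memory (the O(sqrt(n)) idea above).
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    for _ in range(8)
)

def forward(x):
    for block in blocks:
        # use_reentrant=False selects the non-reentrant implementation
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
forward(x).sum().backward()
```
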
Year | Title | Intro |
---|---|---|
2019 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | Zero Redundancy Optimizer (see the sketch after this table) |
2021 | ZeRO-Offload: Democratizing Billion-Scale Model Training | offloading data and compute to CPU |
2021 | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | heterogeneous system technology leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale |
2022 | PatrickStar: Parallel Training of Pre-Trained Models Via Chunk-Based Dynamic Memory Management | heterogeneous system technology leverages GPU, CPU memory in a more efficient way |
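As a hands-on counterpart to the ZeRO line of work, PyTorch ships an optimizer-state-sharding wrapper, `torch.distributed.optim.ZeroRedundancyOptimizer`, which roughly corresponds to ZeRO stage 1. The sketch below assumes a DDP setup like the one after the data-parallelism table; full ZeRO-2/3 sharding and CPU/NVMe offloading come from libraries such as DeepSpeed.

```python
# Optimizer-state sharding in the spirit of ZeRO stage 1, using PyTorch's
# ZeroRedundancyOptimizer: each DDP rank stores and updates only its shard
# of the Adam states instead of a full replica.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def build(model: torch.nn.Module, local_rank: int):
    assert dist.is_initialized()              # same setup as the DDP sketch above
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.Adam,     # Adam states dominate optimizer memory
        lr=1e-4,
    )
    return model, optimizer
```
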
Year | Title | Intro |
---|---|---|
2017 | Mixed Precision Training | speed up training and save memory with FP16/FP32 mixed precision (see the sketch after this table) |
2017 | Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks | a replacement for the 32-bit floating-point format in training and inference that supports modern deep network topologies without modification |
2018 | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes | trains AlexNet for 95 epochs within 4 minutes |
2020 | Ultra-low precision 4-bit training of deep neural networks | scales training precision down to 4 bits |
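Mixed-precision training as described in the first paper above is exposed in PyTorch through autocast plus a gradient scaler, where loss scaling keeps small FP16 gradients from underflowing. A minimal sketch with a placeholder model and random data:

```python
# FP16/FP32 mixed-precision training with autocast + loss scaling:
# forward/backward run (mostly) in FP16, master weights and the update
# stay in FP32, and GradScaler prevents gradient underflow.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()   # scale the loss, backprop scaled grads
    scaler.step(optimizer)          # unscale, skip the step if inf/nan found
    scaler.update()                 # adjust the scale factor
```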