# Dive into Big Model Training

📰 Report Link [here]

📫 Contact me: qhliu26@gmail.com

Abstract: The increasing scale of model size and the continuous improvement of performance herald the arrival of the Big Model era. In this report, we explore what big model training is and how it works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, while training methodologies, built on distributed training, describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be categorized into data, pipeline, and tensor parallelism according to the dimension along which parallelism takes place. Memory-saving technologies are orthogonal and complementary to training parallelism, and model sparsity design further scales up the model size at a constant computational cost.


## Useful Repositories

## BM Background

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2017 | Deep Learning Scaling is Predictable, Empirically | Empirical characterization of generalization error and model-size growth as training sets grow |
| 2020 | Scaling Laws for Neural Language Models | Performance depends strongly on scale, weakly on model shape |
| 2021 | On the Opportunities and Risks of Foundation Models | A foundation model is any model trained on broad data at scale that can be adapted to a wide range of downstream tasks |
| 2022 | The 2022 AI Index | Language models are more capable than ever, but also more biased |
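
For intuition, the scaling-law work above fits the test loss of autoregressive language models as a power law in the non-embedding parameter count $N$; roughly (constants are the paper's approximate reported fits):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}$$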

## Glance at Big Model

| Year | Name | Params | From |
| ---- | ---- | ------ | ---- |
| 2018 | GPT | 110M | OpenAI |
| 2018 | BERT | 349M | Google |
| 2019 | GPT-2 | 1.5B | OpenAI |
| 2019 | Megatron-LM | 8.3B | Nvidia |
| 2020 | Turing-NLG | 17B | Microsoft |
| 2020 | GPT-3 | 175B | OpenAI |
| 2021 | Switch Transformer | 1.6T | Google |
| 2021 | BaGuaLu | 174T | BAAI |
| 2022 | PaLM | 540B | Google |

## Training Parallelism

### Data Parallelism

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2009 | Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations | All-reduce architecture |
| 2011 | HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent | Asynchronous SGD |
| 2014 | Scaling Distributed Machine Learning with the Parameter Server | Traditional centralized (parameter-server) architecture |
| 2016 | GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server | Offloads temporarily unused parameters back to the CPU |
| 2020 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | PyTorch DDP implementation |
| 2020 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters | BytePS |
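
The papers above mostly differ in how gradients are synchronized (parameter server vs. all-reduce, synchronous vs. asynchronous). As a minimal sketch of the synchronous all-reduce style, here is a hedged PyTorch DDP example; the model, data, and hyperparameters are placeholders, and it assumes one process per GPU launched with `torchrun`:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # one process per GPU, launched e.g. with torchrun --nproc_per_node=<gpus>
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)   # stand-in for a real model
    model = DDP(model, device_ids=[rank])            # wraps gradient all-reduce
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 1024, device=rank)       # each rank reads a different data shard
        loss = model(x).pow(2).mean()                # toy loss
        loss.backward()                              # DDP overlaps all-reduce with backward
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```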

### Tensor Parallelism

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2019 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 1D tensor parallelism for the transformer MLP and self-attention |
| 2021 | An Efficient 2D Method for Training Super-Large Deep Learning Models | 2D TP based on SUMMA |
| 2021 | 2.5-Dimensional Distributed Model Training | 2.5D TP |
| 2021 | Maximizing Parallelism in Distributed Training for Huge Neural Networks | 3D TP |
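
To make the 1D (Megatron-style) scheme concrete, below is a rough sketch of a tensor-parallel transformer MLP: the first weight is split column-wise and the second row-wise, so the forward pass needs only one all-reduce. The class name, shapes, and initialization are assumptions, and the default process group is used:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class ParallelMLP(torch.nn.Module):
    """Sketch of a Megatron-style split: w1 column-parallel, w2 row-parallel."""
    def __init__(self, hidden, ffn, world_size):
        super().__init__()
        assert ffn % world_size == 0
        shard = ffn // world_size
        # each rank holds one column slice of the first weight ...
        self.w1 = torch.nn.Parameter(torch.randn(hidden, shard) * 0.02)
        # ... and the matching row slice of the second weight
        self.w2 = torch.nn.Parameter(torch.randn(shard, hidden) * 0.02)

    def forward(self, x):
        h = F.gelu(x @ self.w1)   # [batch, shard]; no communication needed here
        y = h @ self.w2           # partial sum of the full [batch, hidden] output
        dist.all_reduce(y)        # one all-reduce combines the partial sums
        # a full implementation wraps this collective in an autograd.Function
        # so the backward pass also communicates, as Megatron-LM does
        return y
```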

### Pipeline Parallelism

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2018 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | Pipeline parallelism from Google |
| 2019 | PipeDream: Generalized Pipeline Parallelism for DNN Training | 1F1B microbatch scheduling |
| 2020 | Memory-Efficient Pipeline-Parallel DNN Training | PipeDream-Flush and PipeDream-2BW |
| 2021 | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Interleaved 1F1B pipeline schedule |
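
The core idea shared by these schedules is to split a mini-batch into microbatches and accumulate gradients across them so that pipeline stages can be kept busy. The single-process sketch below only illustrates the microbatching and gradient accumulation; a real pipeline places the stages on different devices, sends activations between them, and overlaps microbatches according to a schedule such as GPipe or 1F1B. All names and sizes are placeholders:

```python
import torch
import torch.nn.functional as F

stage0 = torch.nn.Linear(512, 512)   # would live on device 0 in a real pipeline
stage1 = torch.nn.Linear(512, 10)    # would live on device 1
params = list(stage0.parameters()) + list(stage1.parameters())
opt = torch.optim.SGD(params, lr=1e-3)

def train_step(batch, targets, num_microbatches=4):
    opt.zero_grad()
    for xb, yb in zip(batch.chunk(num_microbatches), targets.chunk(num_microbatches)):
        act = stage0(xb)                     # stage-0 forward (activation sent downstream)
        out = stage1(act)                    # stage-1 forward
        loss = F.cross_entropy(out, yb) / num_microbatches
        loss.backward()                      # gradients accumulate across microbatches
    opt.step()                               # single synchronous update, as in GPipe

train_step(torch.randn(64, 512), torch.randint(0, 10, (64,)))
```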

### Mixture-of-Experts

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | Ensembling implemented with a gating mechanism connecting multiple experts |
| 2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | Replaces the transformer FFN with an MoE layer |
| 2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Scales the model size up to trillions of parameters |
| 2021 | Go Wider Instead of Deeper | WideNet uses individual LayerNorms to transform semantic representations |
| 2022 | Mixture-of-Experts with Expert Choice Routing | Lets experts select the top-k tokens |
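
A hedged sketch of the basic sparsely-gated MoE idea with Switch-style top-1 routing is shown below: a learned gate scores the experts per token, and each token is processed only by its best-scoring expert. The load-balancing auxiliary losses, capacity factors, and expert parallelism from the papers above are omitted, and all names and sizes are placeholders:

```python
import torch
import torch.nn.functional as F

class MoELayer(torch.nn.Module):
    def __init__(self, hidden, num_experts):
        super().__init__()
        self.gate = torch.nn.Linear(hidden, num_experts)   # learned router
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                                torch.nn.GELU(),
                                torch.nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: [tokens, hidden]
        scores = F.softmax(self.gate(x), dim=-1)
        prob, idx = scores.max(dim=-1)          # top-1 expert per token (Switch-style)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # weight by the gate probability so the routing stays differentiable
                out[mask] = expert(x[mask]) * prob[mask].unsqueeze(-1)
        return out

layer = MoELayer(hidden=256, num_experts=8)
y = layer(torch.randn(32, 256))                 # 32 tokens routed across 8 experts
```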

## Memory-Saving Design

### Activation Checkpointing

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2016 | Training Deep Nets with Sublinear Memory Cost | Trades computation for memory, training an $n$-layer network with $O(\sqrt{n})$ memory cost |
| 2019 | Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | Formalizes the trade-off between training time and memory as a tensor rematerialization optimization problem |
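
In PyTorch, the sublinear-memory idea is exposed through `torch.utils.checkpoint`. The brief sketch below (a toy stack of linear layers standing in for transformer blocks) drops intermediate activations during the forward pass and recomputes them segment by segment during backward:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# toy 24-layer stack standing in for transformer blocks
layers = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(24)])

x = torch.randn(8, 1024, requires_grad=True)
# split into 4 segments: only segment-boundary activations are kept in memory
y = checkpoint_sequential(layers, 4, x)
y.sum().backward()   # inner activations are recomputed segment by segment here
```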

### ZeRO

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2019 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | Zero Redundancy Optimizer |
| 2021 | ZeRO-Offload: Democratizing Billion-Scale Model Training | Offloads data and compute to the CPU |
| 2021 | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | Heterogeneous system technology that leverages GPU, CPU, and NVMe memory to enable unprecedented model scale |
| 2022 | PatrickStar: Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management | Heterogeneous system technology that uses GPU and CPU memory more efficiently |
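
The sketch below is a rough, pure-PyTorch illustration of the ZeRO stage-1 idea: gradients are still averaged across ranks, but each rank stores optimizer state for, and updates, only the parameters it owns, then broadcasts its updated shard. Real ZeRO partitions flattened buffers, uses reduce-scatter/all-gather, and in stages 2 and 3 also shards gradients and parameters; the round-robin ownership rule here is just an assumption for illustration:

```python
import torch
import torch.distributed as dist

def make_zero1_optimizer(params, rank, world_size, lr=1e-3):
    # each rank builds Adam state only for the parameters it "owns"
    owned = [p for i, p in enumerate(params) if i % world_size == rank]
    return torch.optim.Adam(owned, lr=lr)

def zero1_step(params, local_opt, world_size):
    # 1) average gradients across ranks, exactly as in plain data parallelism
    for p in params:
        dist.all_reduce(p.grad)
        p.grad /= world_size
    # 2) update only the owned shard; its optimizer state lives on this rank alone
    local_opt.step()
    local_opt.zero_grad()
    # 3) owners broadcast their updated parameters so all replicas stay in sync
    for i, p in enumerate(params):
        dist.broadcast(p.data, src=i % world_size)
```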

### Mixed Precision Training

| Year | Title | Intro |
| ---- | ----- | ----- |
| 2017 | Mixed Precision Training | Speeds up training and saves memory |
| 2017 | Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks | A replacement for the 32-bit floating-point format that supports training and inference of modern deep network topologies without modification |
| 2018 | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes | Trains AlexNet for 95 epochs within 4 minutes |
| 2020 | Ultra-Low Precision 4-bit Training of Deep Neural Networks | Scales the precision of training systems down to 4 bits |
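
As a minimal sketch of mixed-precision training in PyTorch, the example below uses `torch.cuda.amp`: the forward pass runs eligible ops in float16 while a gradient scaler guards against underflow. Model, data, and hyperparameters are placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()             # dynamic loss scaling

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.cuda.amp.autocast():              # run eligible ops in float16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                # scale the loss so small grads survive fp16
    scaler.step(opt)                             # unscales grads, skips the step on inf/nan
    scaler.update()                              # adjust the scale factor
    opt.zero_grad()
```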
