📰 Report Link [here]
📫 Contact me at qhliu26@gmail.com
Abstract: The increasing scale of models and the continuous improvement in their performance herald the arrival of the Big Model era. In this report, we explore what big model training is and how it works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, while training methodologies, which build on distributed training, describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be categorized into data, pipeline, and tensor parallelism according to the dimension along which parallelism takes place. Memory-saving technologies are orthogonal and complementary to training parallelism. Model sparsity design further scales up the model size at a constant computational cost.
- PyTorch: https://github.com/pytorch/pytorch
- TensorFlow: https://github.com/tensorflow/tensorflow
- Mesh TensorFlow: https://github.com/tensorflow/mesh
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- DeepSpeed: https://github.com/microsoft/DeepSpeed
- Fairscale: https://github.com/facebookresearch/fairscale
- Colossal-AI: https://github.com/hpcaitech/ColossalAI
- OneFlow: https://github.com/Oneflow-Inc/oneflow
- BMTrain: https://github.com/OpenBMB/BMTrain
Year | Title | Intro |
---|---|---|
2017 | Deep Learning Scaling is Predictable, Empirically | empirical characterization of generalization error and model size growth as training sets grow |
2020 | Scaling Laws for Neural Language Models | Performance depends strongly on scale, weakly on model shape |
2021 | On the Opportunities and Risks of Foundation Models | A foundation model is any model that is trained on broad data at scale and can be adapted to a wide range of downstream tasks |
2022 | The 2022 AI Index | Language models are more capable than ever, but also more biased |
Year | Name | Param | From |
---|---|---|---|
2018 | GPT | 110M | OpenAI |
2018 | BERT | 340M | Google |
2019 | GPT-2 | 1.5B | OpenAI |
2019 | Megatron-LM | 8.3B | Nvidia |
2020 | Turing-NLG | 17B | Microsoft |
2020 | GPT-3 | 175B | OpenAI |
2021 | Switch Transformer | 1.6T | Google |
2021 | BaGuaLu | 174T | BAAI |
2022 | PaLM | 540B | Google |
Year | Title | Intro |
---|---|---|
2009 | Bandwidth optimal all-reduce algorithms for clusters of workstations | All-Reduce Architecture |
2011 | HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent | Asynchronous SGD |
2014 | Scaling Distributed Machine Learning with the Parameter Server | Traditional Centralized Architecture |
2016 | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | offload temporarily unused parameters back to CPU |
2020 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | PyTorch DDP implementation (see the sketch after this table) |
2020 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters | BytePS |
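For the data-parallel line of work above, the minimal sketch below shows synchronous data-parallel training with PyTorch `DistributedDataParallel`, where gradients are all-reduced across ranks during the backward pass. The model, data, and hyperparameters are toy placeholders, and a `torchrun` launch (which sets `RANK`/`LOCAL_RANK`/`WORLD_SIZE`) is assumed.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=N train.py`, which sets
# RANK / LOCAL_RANK / WORLD_SIZE; model and data are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])       # gradients are all-reduced in backward
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # each rank sees a different data shard
        loss = model(x).square().mean()
        loss.backward()                               # DDP overlaps all-reduce with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
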
Year | Title | Intro |
---|---|---|
2019 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 1D tensor parallelism for transformer MLP and self-attention (see the sketch after this table) |
2021 | An Efficient 2D Method for Training Super-Large Deep Learning Models | 2D TP based on SUMMA |
2021 | 2.5-dimensional distributed model training | 2.5D TP |
2021 | Maximizing Parallelism in Distributed Training for Huge Neural Networks | 3D TP |
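To make the tensor-parallel splits above concrete, here is a toy sketch of a Megatron-style row-parallel linear layer: each rank keeps one slice of the weight along the input dimension and the partial products are summed with an all-reduce. It illustrates the 1D partitioning idea under an assumed, already-initialized process group; it is not Megatron-LM's actual implementation.

```python
# Toy 1D (row-parallel) linear layer in the Megatron-LM style.
# Each rank holds W_shard of shape (out_features, in_features // world_size);
# its slice of the input is multiplied locally and the partial results are
# summed across ranks with an all-reduce.
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        self.in_per_rank = in_features // world_size
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, self.in_per_rank) * 0.02
        )

    def forward(self, x):
        # x is assumed to be already split along its last dimension,
        # e.g. the output of a column-parallel layer on the same rank.
        partial = torch.nn.functional.linear(x, self.weight)
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # sum partial products
        return partial
```
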
Year | Title | Intro |
---|---|---|
2018 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | pipeline parallelism with microbatching, from Google (see the sketch after this table) |
2019 | PipeDream: Generalized Pipeline Parallelism for DNN Training | 1F1B microbatch scheduling |
2020 | Memory-Efficient Pipeline-Parallel DNN Training | PipeDream-flush and PipeDream-2BW |
2021 | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Interleaved 1F1B pipeline schedule |
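The sketch below illustrates the GPipe-style microbatching that the pipeline-parallel papers above build on: the minibatch is split into microbatches, each flows through the stages in order, and gradients are accumulated before a single optimizer step. For simplicity it runs both stages in one process; a real pipeline engine places stages on different devices and overlaps microbatches across them (e.g. the 1F1B schedules above).

```python
# GPipe-style microbatching illustrated in a single process:
# split the minibatch into microbatches, accumulate gradients,
# then take one optimizer step. A real pipeline engine would place
# stage0/stage1 on different devices and overlap their work (1F1B).
import torch

stage0 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
stage1 = torch.nn.Linear(512, 10)
params = list(stage0.parameters()) + list(stage1.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

x = torch.randn(64, 512)          # full minibatch
y = torch.randint(0, 10, (64,))
num_microbatches = 4

optimizer.zero_grad()
for mx, my in zip(x.chunk(num_microbatches), y.chunk(num_microbatches)):
    logits = stage1(stage0(mx))   # forward through both stages
    loss = torch.nn.functional.cross_entropy(logits, my) / num_microbatches
    loss.backward()               # gradients accumulate across microbatches
optimizer.step()
```
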
Year | Title | Intro |
---|---|---|
2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ensembling implemented with a gating mechanism connecting multiple experts (see the sketch after this table) |
2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | replaces transformer FFN with MoE layer |
2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | scales the model size up to trillions of parameters |
2021 | Go Wider Instead of Deeper | WideNet uses individual LayerNorms to transform semantic representations |
2022 | Mixture-of-Experts with Expert Choice Routing | let experts select the top-k tokens |
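As a minimal illustration of the sparsely-gated MoE idea above, the sketch below routes each token to its top-k experts and mixes the expert outputs with renormalized gate probabilities. Capacity limits, load-balancing losses, and expert parallelism, which the papers above rely on, are omitted.

```python
# Minimal top-k token-choice MoE layer (no capacity limits or
# load-balancing loss): a router picks k experts per token and the
# expert outputs are mixed with the renormalized gate probabilities.
import torch

class TinyMoE(torch.nn.Module):
    def __init__(self, d_model=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, num_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        topk_p, topk_idx = scores.topk(self.k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# usage: TinyMoE()(torch.randn(16, 256)).shape -> (16, 256)
```
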
Year | Title | Intro |
---|---|---|
2016 | Training Deep Nets with Sublinear Memory Cost | trade computation for memory and train an n-layer network with O(√n) memory cost (see the sketch after this table) |
2019 | Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | formalize the problem of training time and memory requirements trading-off as the tensor rematerialization optimization problem |
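The rematerialization idea above is available off the shelf in PyTorch as activation checkpointing: activations inside a wrapped segment are not stored and are recomputed during the backward pass. A minimal sketch, with the blocks as placeholders:

```python
# Activation (gradient) checkpointing: the wrapped segment does not keep
# its intermediate activations; they are recomputed during the backward
# pass, trading extra compute for memory (the O(sqrt(n)) idea above).
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    for _ in range(8)
)

def forward(x):
    for block in blocks:
        # use_reentrant=False selects the non-reentrant implementation
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
forward(x).sum().backward()
```
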
Year | Title | Intro |
---|---|---|
2019 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | Zero Redundancy Optimizer (see the sketch after this table) |
2021 | ZeRO-Offload: Democratizing Billion-Scale Model Training | offloading data and compute to CPU |
2021 | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | heterogeneous system technology leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale |
2022 | PatrickStar: Parallel Training of Pre-Trained Models Via Chunk-Based Dynamic Memory Management | heterogeneous system technology leverages GPU, CPU memory in a more efficient way |
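As a hands-on counterpart to the ZeRO line of work, PyTorch ships an optimizer-state-sharding wrapper, `torch.distributed.optim.ZeroRedundancyOptimizer`, which roughly corresponds to ZeRO stage 1. The sketch below assumes a DDP setup like the one after the data-parallelism table; full ZeRO-2/3 sharding and CPU/NVMe offloading come from libraries such as DeepSpeed.

```python
# Optimizer-state sharding in the spirit of ZeRO stage 1, using PyTorch's
# ZeroRedundancyOptimizer: each DDP rank stores and updates only its shard
# of the Adam states instead of a full replica.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def build(model: torch.nn.Module, local_rank: int):
    assert dist.is_initialized()              # same setup as the DDP sketch above
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.Adam,     # Adam states dominate optimizer memory
        lr=1e-4,
    )
    return model, optimizer
```
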
Year | Title | Intro |
---|---|---|
2017 | Mixed Precision Training | speed up training and save memory with FP16/FP32 mixed precision (see the sketch after this table) |
2017 | Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks | a replacement for the 32-bit floating-point format in training and inference that supports modern deep network topologies without modification |
2018 | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes | trains AlexNet for 95 epochs within 4 minutes |
2020 | Ultra-low precision 4-bit training of deep neural networks | scales training precision down to 4 bits |
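Mixed-precision training as described in the first paper above is exposed in PyTorch through autocast plus a gradient scaler, where loss scaling keeps small FP16 gradients from underflowing. A minimal sketch with a placeholder model and random data:

```python
# FP16/FP32 mixed-precision training with autocast + loss scaling:
# forward/backward run (mostly) in FP16, master weights and the update
# stay in FP32, and GradScaler prevents gradient underflow.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()   # scale the loss, backprop scaled grads
    scaler.step(optimizer)          # unscale, skip the step if inf/nan found
    scaler.update()                 # adjust the scale factor
```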