1. Papers

ML Papers:

1.1. Convolution Neural Networks

CNN Concepts Gradient-Based Learning Applied to Document Recognition (1998): Notes. In OCR, this approach no longer needs the heavy manual intervention/design/preprocessing of earlier methods and is robust to variations in character size, style, and so on. It relies on learning automatically from data rather than hand-crafted, task-specific heuristics, so a single model with a unified structure can replace several independently designed modules.

ImageNet Classification with Deep Convolutional Neural Networks (2012): classified the 1.2 million high-resolution (for the time) images of the ImageNet LSVRC-2010 contest with a deep convolutional neural network. The network has 60 million parameters and 650,000 neurons, and introduced dropout to prevent overfitting. Notes

Going Deeper with Convolutions (ILSVRC 14) Notes

Rich feature hierarchies for accurate object detection (2014) Notes

Deformable Convolutional Networks (ICCV 2017) Notes

1.2. Sparsity

Accelerating Sparse Approximate Matrix Multiplication on GPUs, Notes

1.3. Transformers

Attention Is All You Need (Google Brain 2017) My Notes

ViT: Vision Transformer: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2021.6.3) My Notes

lightseq: Accelerated Training for Transformer-based Models on GPUs (2021.10): implements training and inference acceleration for transformers from an engineering perspective. My Notes, Source Code Notes

Masked Autoencoders Are Scalable Vision Learners [My Notes](./papers/transformer/Masked-AutoEncoders-Are-Scalable Vision Learners.md)

Self-Attention Does Not Need O(n^2) Memory(2021.12) My Notes

Deformable DETR: Deformable Transformers for End-to-End Object Detection (ICLR 2021)

Competition-Level Code Generation with AlphaCode. My Notes

FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness. (2022) My Notes

SimA: Simple Softmax-free Attention for Vision Transformers:

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning My Notes

1.4. Attribute Recognition

Hierarchical Feature Embedding for Attribute Recognition (2020.5.23) My Notes

1.5. Classification

AIGC(AI-Generated Content)

Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015): the earliest proposal of the diffusion approach

Denoising Diffusion Probabilistic Models (DDPM) (2020): took diffusion models to a new level

InstructPix2Pix (2022): given an input image and a text description of what to do, the model follows the instruction to edit the image

OpenAI DALL-E 2 and Google Imagen are both only semi-open

High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022, the Stable Diffusion paper, v1.4). Runway (runwayml) is one of the companies behind it: a video company that offers erase-and-replace, with AIGC being only one part of its capability chain. My Notes

Stable Diffusion code: https://huggingface.co/CompVis/stable-diffusion, backed by the high-quality LAION dataset. Supports txt2image, image+txt2image, and erase-and-replace

LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets(2022.5) My Notes

What are Diffusion Models? My Notes

Hugging Face's introduction to Stable Diffusion. My Notes

Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (2022.5). Evaluates VQGAN+CLIP, Latent Diffusion Models, DALL-E 2, and others on DrawBench

Prompt-to-Prompt Image Editing with Cross-Attention Control

Classifier-Free Diffusion Guidance(2022.7.26)

Imagic: Text-Based Real Image Editing with Diffusion Models

Character-Aware Models Improve Visual Text Rendering (2022.12.20): mainly addresses visual spelling, improving image generation models' ability to render high-quality visual text

Muse: Text-To-Image Generation via Masked Generative Transformers: high scores on FID (quality/diversity) and CLIP (text-image alignment); faster than other models; a single model supports inpainting and outpainting. [My Notes](papers/AIGC/muse.md)

What are the similarities and differences between CLIP (Contrastive Language-Image Pretraining), DALL-E, and diffusion models?

CLIP: Contrastive Language-Image Pretraining (2021.1.5): My Notes, Intro

Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: My Notes (Google 2023.4.21)

ImageBind: One Embedding Space To Bind Them All(2023.5.9) My Notes

Q-Diffusion: Quantizing Diffusion Models (2023.2.10) My Notes

Decompose and Realign: Tackling Condition Misalignment in Text-to-Image Diffusion Models (2023.6.26)

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation (2023.3.17) My Notes

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths(2023.5.29) My Notes

ControlNet 1.1 My Notes

TryOnDiffusion: A Tale of Two UNets() My Notes

Apple's Stable Diffusion with Core ML: https://github.com/apple/ml-stable-diffusion My Notes

Misc for on-device (edge) deployment

Qualcomm's technology stack

Stable Diffusion on Qualcomm

mlc-llm: Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

1.6. Object Detection

Concepts

Fast R-CNN (2015): proposes training R-CNN with a single unified network rather than the previous three separately designed stages. My Notes

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016): proposes the Region Proposal Network (RPN), which shares image convolutional features with the detection network, making region proposals nearly cost-free. My Notes

YOLO : (2016): My Notes

End-to-End Object Detection with Transformers : (2020 FaceBook Research) My Notes

Focal Loss for Dense Object Detection (2017): My Notes. The first to fix the accuracy gap of one-stage detectors; to showcase the loss the authors built RetinaNet, which surpasses Faster R-CNN in accuracy.
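
As a rough illustration of the idea (not the paper's reference code), a minimal focal-loss sketch in PyTorch might look like this; the defaults alpha=0.25 and gamma=2.0 follow the paper, but the function name and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: down-weights easy examples so training focuses
    on hard ones. `targets` are float 0/1 labels with the same shape as `logits`."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: 8 anchors with random labels.
loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```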

DETR(DEtection TRansformer): End to end object detection with transformers (2020) My Notes

Model Compression

Model Compression (2006): the knowledge of the multiple models in an ensemble can be compressed into a single model

Distillation can be seen as a further development of the model compression idea above

1.7. Knowledge Distillation

Contrastive Representation Distillation (2020) My Notes

Search to Distill: Pearls are Everywhere but not the Eyes (2020): besides parameters, also considers distilling the network architecture; finds that a good teacher cannot teach a student that excels in every respect. My Notes

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning (2017, Korea): My Notes

Distilling the Knowledge in a Neural Network, Geoffrey Hinton and Jeff Dean (2015): the first to propose the concept and method of model distillation; distillation can greatly improve a small model's capability (in a speech setting). Also proposes a new ensemble method. My Notes
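
A minimal sketch of the temperature-softened distillation loss from that paper (PyTorch); the temperature value and tensor shapes here are assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Match the student's softened distribution to the teacher's via KL divergence.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

# Example with random logits for a batch of 4 samples over 10 classes.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10))
```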

1.8. Network Communication

Flare: Flexible In-Network Allreduce (2021-6-29): SwitchML is inflexible, so they design a switch by using PsPIN as a building block, a RISC-V architecture implementing the sPIN programming model.

In-Network Aggregation, slides, Video notes (from 2018)

paper: Scaling Distributed Machine Learning with In-Network Aggregation

NVIDIA SHARP: looks very similar to the open-source approach above; see the comparison here. The paper dates back to 2016

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration: from Huawei; requires a hardware implementation, with the benefit that no NIC or switch modifications are needed.

MPI: Message Passing Interface, a programming model for parallel programming

PMI: the Process Management System used in parallel programming

InfiniBand, Remote Direct Memory Access(RDMA), Socket, Ethernet

1.8.1. Gradient Compression

GRACE: A Compressed Communication Framework for Distributed Machine Learning (2021): We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and computational cost of compression / decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open-source.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training (2017.12, modified 2020-6-23): In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant. It is one family of algorithms built on top of the GRACE framework
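
A minimal sketch of the top-k gradient sparsification idea behind Deep Gradient Compression; the 0.1% ratio is an assumption, and the actual method also adds momentum correction and local accumulation of the residual:

```python
import torch

def topk_sparsify(grad, ratio=0.001):
    """Keep only the largest-magnitude fraction of gradient values; the rest
    becomes a residual that would be accumulated locally for later rounds."""
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    residual = flat - sparse
    return sparse.view_as(grad), residual.view_as(grad)

sparse_grad, residual = topk_sparsify(torch.randn(1024, 1024))
```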

Code: GRACE: A Compressed Communication Framework for Distributed Machine Learning: a unified framework for all sorts of compressed distributed training algorithms

GC3: An Optimizing Compiler for GPU Collective Communication (ASPLOS 23). My Notes. The corresponding library is called msccl, My Notes

1.9. OPs in Network

1.9.1. Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015): with BN, a larger learning rate can be used and training is more robust to initialization. It also acts as a regularizer and can to some extent replace dropout. Faster training and higher performance. My Notes

Layer Normalization (2016.7): My Notes: aims to bring a BN-like method to RNNs, speeding up training and making it converge faster

Norm matters: efficient and accurate normalization schemes in deep networks (2019): suggests several alternatives to the widely used L2 batch norm, using normalization in L1 and L∞ space

Sync Batch Normalization(syncbn): Source Code Notes

1.9.2. Group Normalization

Sys Papers:

1.10. Compute Efficiency

AdderNet: Do We Really Need Multiplications in Deep Learning

Rammer My Notes

1-bit Adam () My Notes: implements error-compensated, Adam-preconditioned momentum SGD on GPUs. It reduces GPU-CPU communication and also speeds up the optimizer computation

1.11. Memory Efficiency

Backprop with approximate activations for memory-efficient network training (arXiv 2019)

Don't waste your bits! squeeze activations and gradients for deep neural networks via tinyscript (ICML, 2020)

Both of the above apply the same quantization method to every element

Gist: Efficient data encoding for deep neural network training 2018

ZeRO: a standard component of large-model training; it shards the memory that is redundant under DDP and fetches shards from other ranks when needed. My Notes
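
As a rough, ZeRO-1-style illustration (not DeepSpeed itself), PyTorch's ZeroRedundancyOptimizer shards optimizer states across data-parallel ranks; the model, sizes, and torchrun launch below are assumptions, and gradient all-reduce (e.g. via DDP) is omitted for brevity:

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group("nccl")
device = dist.get_rank() % torch.cuda.device_count()
model = torch.nn.Linear(1024, 1024).cuda(device)

# Each rank stores and updates the Adam states for only its own parameter shard.
opt = ZeroRedundancyOptimizer(model.parameters(),
                              optimizer_class=torch.optim.Adam, lr=1e-3)

loss = model(torch.randn(32, 1024, device=device)).sum()
loss.backward()            # gradient synchronization (e.g. via DDP) omitted here
opt.step()                 # updated shards are broadcast so all ranks stay in sync
```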

FSDP in fairscale: lightning used this

ZeRO-Offload (2021.1.18) My Notes: based on ZeRO-2, implements the Adam optimizer used in NLP on the CPU, so that the optimizer's memory and computation (the dominant part) can be offloaded

Capuchin: Tensor-based GPU Memory Management for Deep Learning (ASPLOS 2020), My notes: the goal is to save memory so that larger batch sizes can be trained. Memory is managed at the tensor level and is model/computation-graph agnostic. At runtime it profiles whether swapping or recomputation is the better choice, with prefetch/evict mechanisms. It does not support dynamic graphs (dynamic models) or dynamic shapes, however

Dynamic Tensor Rematerialization (2020): at its core is an online checkpointing algorithm that, without knowing the model in advance, produces a very good checkpointing scheme on the fly. This online approach handles both static and dynamic graphs, and can be seen as a follow-up to Tianqi Chen's checkpointing method: done dynamically, and better. My Notes, Source Code Notes

Pushing deep learning beyond the gpu memory limit via smart swapping. (2020)

Tensor-based gpu memory management for deep learning. (2020)

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training (2021-7-6) My notes, source code notes

Low-Memory Neural Network Training: A Technical Report(2019-4-24)

Gradient/Activation Checkpointing

Checkpoint in fairscale

Visual Gifs to show gradient checkpointing
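
A minimal gradient-checkpointing sketch with torch.utils.checkpoint (the module and sizes are assumptions): intermediate activations inside the wrapped function are not stored during forward and are recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Activations of self.ff are recomputed in backward instead of being stored.
        return x + checkpoint(self.ff, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
Block()(x).sum().backward()
```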

Binaryconnect: Training deep neural networks with binary weights during propagations. (2015)

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (2021.4.16), My Notes: builds on ZeRO and extends memory swapping to NVMe disks. The advantage of the ZeRO family is that no user code changes are needed; the series mainly tackles the memory wall

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management: implements an efficient swapping mechanism for Transformer-based NLP pre-training, making training more efficient by maximally reusing chunks that never coexist. My Notes, Source Code Notes

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training(2023.6.16)My Notes

1.12. Compression

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. (ICLR 2016, Song Han)

Deep gradient compression: Reducing the communication bandwidth for distributed training. (ICLR 2018, Song Han)

MONeT: Memory Optimization for Deep Networks (ICLR 2021): roughly contemporaneous with DTR. Its advantage is that most of the logic is implemented in Python, but because the scheduler directly generates the Python calling code for forward and backward to chain the ops together, and not every piece of PyTorch Python syntax is supported, it is cumbersome to use

MONeT: Memory Optimization for Deep Networks (ICLR 2021): published at ICLR (Learning Representations), which suggests it is an algorithmic rather than a systems improvement. My Notes

  1. MONeT: Memory Optimization for Deep Networks: https://github.com/utsaslab/MONeT

Elastic Training

Elan: Towards Generic and Efficient Elastic Training for Deep Learning(2020) My Notes

Agents

LLM Powered Autonomous Agents My Notes

Generative AI

Generative Agents: Interactive Simulacra of Human Behavior(2023.4.7) My Notes

1.13. Parallelism

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2020.3): My Notes

Efficient Large-Scale Language Model Training on GPU Clusters (2021.4, NVIDIA, Stanford, Microsoft): following Megatron-LM, mainly shows how to combine multiple forms of parallelism so that large transformer models at a given batch size run with high throughput. Reading Notes

Reducing Activation Recomputation in Large Transformer Models (Megatron-v3: 2022.5.10) My Notes

1.13.1. Data Parallel

PyTorch Distributed: Experiences on Accelerating Data Parallel Training(2020.6.28) My Notes
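
A minimal data-parallel sketch with PyTorch DDP in the spirit of the paper above (bucketed gradient all-reduce overlapped with backward); the model, sizes, and torchrun launch are assumptions:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
model = torch.nn.Linear(128, 10).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])   # buckets gradients for all-reduce

x = torch.randn(32, 128, device=f"cuda:{local_rank}")
ddp_model(x).sum().backward()   # gradient all-reduce overlaps with backward compute
```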

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training (2020-4-28): proposes automatic sharding of the weight update, synchronized via efficient communication primitives; uses static analysis of the computation graph and applies to Adam or SGD

GSPMD: General and Scalable Parallelization for ML Graphs (2021-5-10) (for transformers): an extension built on top of XLA, targeted at transformers

1.13.2. Pipeline Parallelism

First came model parallelism; pipeline parallelism then emerged to improve its efficiency

PipeDream: Generalized Pipeline Parallelism for DNN Training(SOSP'19)

GPipe: Efficient training of giant neural networks using pipeline parallelism(NIPS 2019):

Memory Efficient Pipeline-Parallel DNN Training (ICML 2021): mainly proposes PipeDream-2BW: automatic model partitioning with double buffering of weights; an improvement on top of PipeDream

PipeDream Source Code

fairscale pipeline parallelism source code

torchgpipe

1.13.3. Parallelization Strategies

Beyond Data and Model Parallelism for Deep Neural Networks(2018)

Defined a more comprehensive search space of parallelization strategies for DNNs called SOAP, which includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. Proposed FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy's performance. By Zhihao Jia; it includes a simulator that can predict execution speed.

1.14. Quantization

8-bit inference with tensorrt(2017.5.8)

And the bit goes down: Revisiting the quantization of neural networks. (ICLR 2020)

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark (NeurIPS 2021 workshop): proposes a benchmark suite for verifying the reproducibility and deployability of quantization algorithms in inference, and points out the current gap between academia and industry. My Notes

Google, gemmlowp: building a quantization paradigm from first principles; describes methods for quantization-aware training

1.15. Network

GradientFlow: Optimizing Network Performance for Distributed DNN Training on GPU Clusters(cs.DC 2019)

Proposed a communication backend named GradientFlow for distributed DNN training. First, we integrate ring-based all-reduce, mixed-precision training, and computation/communication overlap. Second, we propose lazy allreduce to improve network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication to reduce network traffic by transmitting important gradient chunks.

1.16. Resource Management

Optimus on top of K8s

1.17. Parameter Server

Scaling Distributed Machine Learning with the Parameter Server (OSDI 2014)

1.18 Pruning

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (ICLR 2019 Best Paper): the famous lottery ticket hypothesis; presents a new pruning approach in which the pruned network can be retrained from scratch without losing accuracy. My Notes

1.19 FasterMoE

My Notes

NLP

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers Notes

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model(2022)

Few-Shot Learning: Query samples are never seen before. Query samples are from unknown classes.

GPT-1 (Generative Pretrained Transformer) (2018.6): 0.117B parameters, 5GB of data; showed that the generative pretraining approach works

GPT-2 (Generative Pretrained Transformer) (2019.2): 1.5B parameters, 40GB of data; generalizes well

Language Models are Few-Shot Learners(GPT-3(Generative Pretrained Transformer))(2020.5): 175B, 45TB data

(GPT 3(Generative Pretrained Transformer))(2020)

PaLM(2022.4) My Notes

Language Models are Unsupervised Multitask Learners (2019)

Language Models are Few-Shot Learners(2020.7.22) My Notes

InstructGPT (2022.1): has a paper

LaMDA (Google 2022):

ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022.11.30). No paper. My Notes

LLaMA: Open and Efficient Foundation Language Models: no proprietary datasets needed; trained only on open data, the 13B model beats the 175B GPT-3

The Road to AGI: Essentials of Large Language Model (LLM) Technology (通向AGI之路:大型语言模型(LLM)技术精要) My Notes

MetaGPT: Multi-Agent Meta Programming Framework(Assign different roles to GPTs to form a collaborative software entity for complex tasks.)

Auto-GPT: An experimental open-source attempt to make GPT-4 fully autonomous.

QLoRA: Efficient Finetuning of Quantized LLMs (2023.5.23): an efficient finetuning method that dramatically reduces memory use: a 65B-parameter model can be finetuned on a single 48GB GPU while preserving full 16-bit finetuning task performance.

LMQL: is a programming language for language model interaction: My Notes

Guidance (Microsoft): constrained prompting, making the model follow a specific output format more reliably

llama.cpp: a C/C++ port of Facebook's LLaMA model: My Notes

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention My Notes

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency(2023 6.23) My Notes

Hugging Face Text Generation Inference My Notes

Numbers every LLM Developer should know My Notes

Llama 2: Open Foundation and Fine-Tuned Chat Models(2023.7.18) My Notes

An Initial Exploration of Theoretical Support for Language Model Data Engineering Part1 My Notes

Checkpoint

Microsoft CheckFreq : [My Notes](papers/

Recommenders

PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters(2021) Notes, Siyu Wang's Notes

DataSet

TaoBao AD Notes

Bamboo: Notes

2. Network Operators

Dropout vs Droppath

Conv

3. Course:

CS294-AI-Sys (UC Berkeley)

Notes

Week 4 Lec 10: cuBLAS Intro. My Notes

Some classic papers are mentioned in it; finding someone to discuss them with works even better

This course is a deep dive into the details of deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement and train their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. Additionally, the final assignment will give them the opportunity to train and apply multi-million parameter networks on real-world vision problems of their choice. Through multiple hands-on assignments and the final course project, students will acquire the toolset for setting up deep learning tasks and practical engineering tricks for training and fine-tuning deep neural networks.

[Vector, Matrix, and Tensor Derivatives (Erik Learned-Miller)](http://cs231n.stanford.edu/vecDerivs.pdf) My Notes

System aspect of deep learning: faster training, efficient serving, lower memory consumption

There are quite a few parts that involve TVM, such as Memory Optimization and Parallel Scheduling; the parts I am interested in:

Evolution of ConvNets:

LeNet (1998), AlexNet (2012), GoogLeNet (2014): multiple independent pathways (sparse weight matrix)

Inception BN (2015): Batch normalization

ResNet (2015): residual pathway

Convolution = Spatial Locality + Sharing

ReLU: y = max(x, 0).

Why ReLU?

  • Cheap to compute
  • It is roughly linear

Dropout Regularization: Randomly zero out neurons with probability 0.5

Why Dropout? Overfitting prevention.

Batch Normalization: Stabilize the Magnitude. (A minimal sketch of ReLU, dropout, and BN follows the lists below.)

  • Subtract mean
  • Divide by standard deviation
  • Output is invariant to input scale!
    • Scale input by a constant
    • Output of BN remains the same

Impact:

  • Easy to tune the learning rate. ?
  • Less sensitive to initialization. ?
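
A minimal sketch of the ReLU, dropout, and batch-norm notes above (shapes, the dropout probability, and eps are assumptions):

```python
import torch

x = torch.randn(32, 64)                      # a mini-batch of activations

relu_out = torch.clamp(x, min=0)             # ReLU: y = max(x, 0)

keep = (torch.rand_like(x) > 0.5).float()    # Dropout: zero neurons with p = 0.5
dropout_out = x * keep / 0.5                 # inverted dropout preserves the expectation

def batch_norm(x, eps=1e-5):
    # Subtract the batch mean, divide by the batch standard deviation.
    return (x - x.mean(dim=0)) / (x.std(dim=0, unbiased=False) + eps)

# Output is invariant to scaling the input by a constant (up to eps).
print(torch.allclose(batch_norm(x), batch_norm(10.0 * x), atol=1e-4))   # True
```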

Computational Graph Optimization and Execution

Runtime Parallel Scheduling / Networks

AI-Sys Spring 2019(UC Berkeley)

4. Books:

Neural Networks and Deep Learning

4.1. Deep Learning

Deep Learning: Ian Goodfellow and Yoshua Bengio and Aron Courville

intended to help students and practitioners enter the field of machine learning in general and deep learning in particular. The online version of the book is now complete and will remain available online for free.

4.1.1. Backpropation

How the backpropagation algorithm works

Customize backward function in pytorch

5. OpenSource Frameworks And Libs

5.1. TIMM

PyTorch Image Models (TIMM) is a library for state-of-the-art image classification. With this library you can:

  • Choose from 300+ pre-trained state-of-the-art image classification models.
  • Train models afresh on research datasets such as ImageNet using provided scripts.
  • Finetune pre-trained models on your own datasets, including the latest cutting-edge models.

Pytorch Image Models

5.2. Tensor Comprehensions 2018.2.13 Facebook AI Research Technical Report

My notes

5.3. TensorFlow RunTime (TFRT)

aims to provide a unified, extensible infrastructure layer with best-in-class performance across a wide variety of domain specific hardware.

TFRT DeepDive

Still at a fairly early stage; not yet general-purpose

5.4. OneFlow

OneFlow: Redesign the Distributed Deep Learning Framework from Scratch. The main goal is to make operator implementations identical in distributed and single-device settings, freeing researchers from that burden, and then to enable automatic parallelism. Its complete dependency management and its actor-based pipelining and back-pressure mechanisms are worth studying. My Notes

5.5. Pytorch

Functions/APIs in PyTorch

Reading Source Code Snippets

ProcessGroup ProcessGroupNCCL

PyTorch Lazy Tensor Core

LazyTensor Paper

Pytorch CUDA Memory Caching Allocator

A collection of PyTorch debugging techniques

5.5.1. JIT

Tensorflow: a system for large-scale machine learning(OSDI 2016)

Automatic differentiation in PyTorch (2017)

Mesh-tensorflow: Deep learning for supercomputers(NIPS 2018)

Agents at different scales of both complexity and computation -- including distributed versions.

Lets RL scale painlessly from a single-machine prototype to distributed multi-machine training.

From Acme Paper:

Roughly speaking, Launchpad provides a mechanism for creating a distributed program as a graph consisting of nodes and edges. Nodes exactly correspond to the modules -- represented as class instances as described above -- whereas the edges represent a client/server channel allowing communication between two modules. The key innovation of Launchpad is that it handles the creation of these edges in such a way that from the perspective of any module there is no distinction between local and remote communication, e.g. for an actor retrieving parameters from a learner, in both instances this just looks like a method call.

My intuition is that these fields need some basic frameworks and libraries that handle distributed communication, abstract the problem, and decouple it from concrete scenarios, so that researchers can reuse these wheels. This is similar to RPC frameworks in the internet era, and to the service registries and related facilities of microservices in recent years.

5.9. Reverb(2020)

The dataset in RL can be backed by a low-level data storage system.

It enables efficient insertion and routing of items and a flexible sampling mechanism that allows FIFO, LIFO, uniform, and weighted sampling schemes.

5.10. Weld (Stanford)

Fast Rust parallel code generation for data analytics frameworks. Developed at Stanford University.

Weld is a language and runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing the core computations in libraries using a common intermediate representation, and optimizing across each framework. Modern analytics applications combine multiple functions from different libraries and frameworks to build complex workflows. Weld's take on solving this problem is to lazily build up a computation for the entire workflow, and then optimizing and evaluating it only when a result is needed.

In other words, it addresses the efficiency problem of building complex workflows from different libraries and frameworks by introducing an intermediate representation layer and then implementing it for the different frameworks, enabling joint optimization.

Paper: Weld: Rethinking the Interface Between Data-Intensive Applications.

To address this problem, we propose Weld, a new interface between data-intensive libraries that can optimize across disjoint libraries and functions. Weld can be integrated into existing frameworks such as Spark, TensorFlow, Pandas and NumPy without changing their user-facing APIs.

Related research papers: PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP'19)

5.11. Table Data Storage

Reverb is an efficient and easy-to-use data storage and transport system designed for machine learning research (DeepMind, 2021-4-22)

Used as an experience replay system for distributed reinforcement learning algorithms

Pathways: Asynchronous Distributed Dataflow for ML (MLSys 2022)

Analyzing and Mitigating Data Stalls in DNN Training(2021.1.19) My Notes

NumPy

The NumPy array: a structure for efficient numerical computation(IEEE Computing in Science and Engineering, 2011) My Notes

XLA

XLA concepts

6. Characterizing ML Training Workloads

Characterizing Deep Learning Training Workloads on Alibaba-PAI (2019-10-14): mainly to see what methods they use for profiling and whether they are worth borrowing

7. Data Preprocessing

ImageNet-21K Pretraining for the Masses (2021.8.5) My Notes

tf.data: A Machine Learning Data Processing Framework: (2021) My Notes

Faster Neural Network Training with Data Echoing (2020): if data I/O takes too long, the data can be reused (echoed)

8. Videos

Flexible systems are the next frontier of machine learning: Stanford invited Jeff Dean and Chris Ré to discuss the latest advances (2019). My Notes

The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design (Jeff Dean): covers many applications of ML across domains and also explains why specialized hardware is worth building

9. Hardware

TPU

In-Datacenter Performance Analysis of a Tensor Processing Unit

Tensorflow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Tensorflow: A system for large-scale machine learning

Implementation of Control Flow In TensorFlow

Dynamic Control Flow in Large-Scale Machine Learning

GPU

Optimization

NVLink vs PCIe

Mixed Precision Training (2018 ICLR) My Notes

Training Deep Neural Networks with 8-bit Floating Point Numbers(2018 NeurIPS) My Notes

NVIDIA Automatic Mixed Precision training My Notes
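
A minimal mixed-precision training loop with PyTorch AMP, in the spirit of the entries above; the model, data, and hyper-parameters are assumptions, and a CUDA device is required:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid fp16 underflow

for _ in range(10):
    x = torch.randn(64, 512, device="cuda")
    with torch.cuda.amp.autocast():         # eligible ops run in half precision
        loss = model(x).float().pow(2).mean()
    opt.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(opt)                        # unscales grads; skips the step on inf/NaN
    scaler.update()
```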

What Every Programmer Should Know About Floating-Point Arithmetic My Notes

fp16

Fractional GPUs: Software-based Compute and Memory Bandwidth Reservation for GPUs: a better way than NVIDIA's own MPS to partition a GPU and run multiple programs on it. The earlier implementation in Tencent's GaiaStack has some limitations: it essentially throttles the number of calls, so it only works well when many kernels are launched, and there is no physical isolation between programs. The implementation is open source. My Notes

Optimizing Convolutional Layers: NVIDIA's official guide to optimizing convolutions. Notes

NVIDIA's GPU Performance Background guide. My Notes

Nsight Compute Notes

CUDA

CUTLASS: My Notes: 1, 2

Efficient GEMM in CUTLASS

DL Performance: Matrix Multiplication

PyTorch Lightning Profiler. My Notes

cuBLAS: My Notes

pytorch cuda api

CUDA API: My Notes

Cooperative Groups

CUB: Provides state of the art, reusable software components for every layer of the CUDA programming model

An efficient CUDA implementation of Softmax

How is the global thread ID computed in CUDA?

CUDA Graphs: can save the cudaLaunchKernel overhead of the traditional stream approach; well suited to small, static kernels
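
A minimal CUDA graph capture/replay sketch using PyTorch's torch.cuda.CUDAGraph (requires a GPU); the model, shapes, and warm-up iterations are assumptions, and inputs must keep static shapes and memory addresses:

```python
import torch

model = torch.nn.Linear(256, 256).cuda()
static_in = torch.randn(64, 256, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)           # kernels are captured, not executed

static_in.copy_(torch.randn(64, 256, device="cuda"))
g.replay()                                  # re-launches all captured kernels at once
```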

Shared Memory in CUDA

Global Memory in CUDA

Tensor Cores: what they are, how to use them, how to confirm Tensor Cores are actually used, and the prerequisites for enabling them

Streams and Events

NCCL Performance : nccltest

NCCL

NCCL Environments

CUDA Scalable Parallel Programming(John Nickolls 2008)

Google FastSocket & Reduction Server : My Notes

Triton vs Numba vs Taichi

A detailed explanation of cudnn benchmark

Triton inference server

python_backend My Notes

Segmentation

Segment Anything Model(2023.4) My Notes

DINOv2: State-of-the-art computer vision models with self-supervised learning(2023.4.17)My Notes

10. ML Compilers

JAX: Compiling machine learning programs via high-level tracing (2018), My Notes. JAX essentially lets researchers tap into XLA's acceleration through very easy-to-use Python

10.1. Rammer

10.2. IREE(Intermediate Representation Execution Environment)

It's an MLIR-based end-to-end compiler and runtime that lowers ML models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments. Currently supports TensorFlow and JAX.

The following are all about inference:

IOS: Inter-Operator Scheduler For CNN Acceleration (MLSys 2021): open-sourced. Mainly uses dynamic programming to schedule between operators at inference time, since conv ops become smaller under multi-branch structures while newer GPUs have more and more SMs

Automatic Horizontal Fusion for GPU Kernels

ROLLER(OSDI 22)

TensorRT

TensorRT Model Accuracy

TensorRT Performance Best Practices

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale(2022.6) My Notes

Training-related:

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning (NeurIPS 2020), My Notes: mainly eliminates scheduling overhead via CUDA graphs and then uses multiple streams to maximize concurrency. However, the paper and the open-source code only work with static graphs and static inputs; the closed-source version reportedly supports dynamic ones

PyTorch NVFuser vs NNC vs TorchInductor

Alpa: Automating Inter and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022), My Notes, Source Code Notes

Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization (OSDI 2022). My Notes. Built on FlexFlow; the author is Zhihao Jia of Stanford, who also wrote FlexFlow, TASO, etc. Unity adds algebraic transformations and intra-op parallelism strategies on top of the earlier DP+PP. FlexFlow is a new framework, similar in spirit to PyTorch, that supports both TF and PyTorch frontends

pytorch/tau(PiPPy): My Notes

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (2019, Machine Learning and Programming Languages). My Notes, Introducing Triton

TensorIR: An Abstraction for Automatic Tensorized Program Optimization My Notes

11. Profiling

11.1. Memory Profiling

Capuchin: Tensor-based GPU Memory Management for Deep Learning(2020 ASPLOS) , My notes

11.2. Communication Profiling

Designing-a-profiling-and-visualization-tool-for-scalable-and-in-depth-analysis-of-high-performance-gpu-clusters(2019 HiPC), My notes

Bagua (Kuaishou): a PyTorch plugin similar to Horovod, supporting decentralized/centralized, quantized (or 1-bit), and sync/async optimizers and communication. Mainly optimized for specific scenarios: overlapping gradient all-reduce with computation, hierarchical TCP communication acceleration (inter-host communication), a generic fused optimizer (for the many-small-tensors case), and a load-balanced data loader for NLP. My Notes. Source code reading takeaways

BytePS: A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters(OSDI 2020) My Notes

12. Parallel Programming

Persistent threads style gpu programming for gpgpu workloads (2012 IEEE) My Notes

Persistent Threads Block

Besides the persistent thread blocks above, there is also the SM-centric approach:

Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations (2015): unfortunately not open-sourced

Whether it is persistent thread blocks, SM-centric, or elastic kernels, all of these aim at fine-grained control over GPU resource usage

14. Visualizing Neural Networks

Deep Learning in your browser

15. Cluster Scheduler

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning My notes

AntMan

Bamboo: (2022): A system for running large pipeline-parallel DNNs affordably, reliably, and efficiently using spot instances. My Notes

Mixture Of Experts

Sparsely-Gated Mixture-of-Experts (Shazeer 2017): can increase network parameters a thousandfold without increasing compute, via sparse activation; the seminal paper of this area. My notes
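
A minimal top-2 gating sketch to illustrate the sparse-activation idea (the names, shapes, and k=2 are assumptions; the paper's gate also adds noise and load-balancing losses):

```python
import torch
import torch.nn.functional as F

def top2_gate(x, w_gate):
    """Each token is routed to its top-2 experts, so compute scales with k,
    not with the total number of experts."""
    logits = x @ w_gate                                    # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, experts = torch.topk(probs, k=2, dim=-1)      # chosen experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the two weights
    return weights, experts

w, e = top2_gate(torch.randn(16, 512), torch.randn(512, 8))
```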

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity My notes

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding(ICLR 2021) My Notes

FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

TUTEL: Adaptive Mixture-of-Experts at Scale(2022.6)

SE-MoE: A Scalable And Efficient Mixture-of-Experts Distributed Training and Inference System(2022.5 Paddle) My Notes

Reinforcement Learning

MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge(NeurIPS 2022)

DreamerV3: Mastering Diverse Domains through World Models

Chips

The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design. My Notes

Active Learning

The goal is to get the best possible model performance while annotating as few samples as possible

A Survey of Deep Active Learning

Active Learning for Convolutional Neural Networks: A Core-Set Approach (ICLR 2018)

Large Model Inference

BMInf: An Efficient Toolkit for Big Model Inference and Tuning My Notes

ModelMesh

ModelMesh and KServe bring extreme-scale standardized model inferencing on Kubernetes

Dependency and Package Management

C/C++ Dependency and Package Manager

Conan: designed and optimized to speed up development and continuous integration for C and C++ projects; it can manage binary packages. My Notes

Python

Python Coroutines

Python Developing with asyncio

Python error handling

Conda

pytest-explanation

httpx

gunicorn

Asynchronous vs Synchronous programming


16. Misc

The Modern History of Object Recognition -- Infographic

HPC-oriented Latency Numbers Every Programmer Should Know

Curriculum Learning for Natural Language Understanding()

On The Power of Curriculum Learning in Training Deep Networks (2019): traditional training requires mini-batches to be sampled uniformly from the data, whereas this mimics human learning: start with easy tasks and gradually increase the difficulty. It solves two problems: 1) sorting the dataset by training difficulty, and 2) computing a sequence of mini-batches of gradually increasing difficulty.
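
A minimal sketch of the two steps just described: sort samples by an assumed difficulty score, then yield mini-batches from a gradually growing (harder) prefix of the sorted data. The function and parameter names are illustrative only:

```python
import random

def curriculum_batches(samples, difficulty, batch_size=32, num_stages=5):
    """Yield mini-batches whose pool of allowed samples grows from easy to hard."""
    ordered = sorted(samples, key=difficulty)
    for stage in range(1, num_stages + 1):
        pool = ordered[: len(ordered) * stage // num_stages]  # easiest prefix first
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield pool[i : i + batch_size]

# Example: the "difficulty" of an integer sample is just its magnitude here.
for batch in curriculum_batches(list(range(100)), difficulty=abs, batch_size=16):
    pass
```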

What every computer scientist should know about floating-point arithmetic

System Tuning

AMD 2nd Gen EPYC CPU Tuning Guide for InfiniBand HPC. My Notes

Mother Board Architecture. My Notes

Image Steganography

StegaStamp: Invisible Hyperlinks in Physical Photographs(2019 Berkeley) My notes

Chip Design

Circuit Training: An open-source framework for generating chip floor plans with distributed deep reinforcement learning.

A Graph Placement Methodology for Fast Chip Design. Peer Review File

Stronger Baselines for Evaluating Deep Reinforcement Learning in Chip Placement

ISPD(International Symposium on Physical Design) 2022 TOC

Text-Image Synthesis

Unprecedented photorealism and deep language understanding. Introduction, My Notes

High Availability

Netflix Hystrix