GPU Optimization Workshop (hosted by @ChipHuyen)
YouTube video, Google Doc notes, Slides, and more notes
- PyTorch uses eager evaluation
- Pros: Easy to debug (print statements to inspect values would work)
- Cons: GPU memory bandwidth becomes the bottleneck because of the many data transfers between the CPU and GPU contexts.
- A 2022 article by Horace He that goes into more detail: Making Deep Learning Go Brrrr.
- The PyTorch profiler can be used to examine such bottlenecks (a minimal sketch follows below).
- Also see PyTorch profiler recipe.
- Also see `torch.utils.bottleneck`.
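A minimal sketch of profiling a toy model with the PyTorch profiler (the model and sort key here are just for illustration):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Toy model purely for illustration.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("forward_pass"):
        model(x)

# Show the ops that spent the most time on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```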
- `torch.compile` can be used to fuse operators and generate OpenAI Triton kernels (a minimal sketch follows below).
- Check if using `mode="reduce-overhead"` leads to better performance.
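A minimal sketch of trying `torch.compile` and the `reduce-overhead` mode (the toy model is a placeholder; benchmark both variants on your own workload):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).cuda()
x = torch.randn(64, 1024, device="cuda")

# Default mode: Inductor fuses ops and emits Triton kernels for the GPU.
compiled = torch.compile(model)

# "reduce-overhead" additionally uses CUDA graphs to cut kernel-launch overhead,
# which tends to help small models and small batch sizes the most.
compiled_ro = torch.compile(model, mode="reduce-overhead")

out = compiled(x)
out_ro = compiled_ro(x)
```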
- When using the CUDA backend on supported GPUs, matrix multiplication can be sped up significantly using Tensor Cores, but with precision trade-offs.
- To enable the use of Tensor Cores in PyTorch, set `float32_matmul_precision` to "high" or "medium" (the default is "highest"); see the snippet below.
- Note that this could cause accuracy regressions; see this discussion.
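For example (this trades a little float32 matmul precision for Tensor Core throughput; verify accuracy on your own workload):

```python
import torch

# "highest" (default) keeps full FP32; "high" allows TF32 Tensor Cores;
# "medium" may additionally allow bfloat16-based matmuls.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # now eligible to run on Tensor Cores, with a small precision trade-off
```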
- Quantization helps not only speed-up compute-bound workflows, but also memory-bandwidth-bound workflows!
- This means weights-only quantization not only makes the model size-on-disk smaller, it could also make the model faster on GPU.
- See PyTorch Architecture Optimization (torchao) for a library of many quantization algorithms (a hedged usage sketch follows below).
- Also see the `torch.ao` module, and how it differs from torchao.
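A hedged sketch of weights-only int8 quantization with torchao; the import path and the `quantize_` / `int8_weight_only` names are assumptions based on recent torchao releases, so check the torchao docs for the API in your version:

```python
import torch
# Assumed torchao API; the exact import path may differ between versions.
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).cuda().eval()

# Weights-only quantization: weights are stored in int8 and dequantized on the fly.
# This shrinks the model and reduces memory traffic, which can speed up
# memory-bandwidth-bound inference even though compute stays in higher precision.
quantize_(model, int8_weight_only())

x = torch.randn(1, 4096, device="cuda")
with torch.no_grad():
    y = model(x)
```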
- A great example of GPU optimization relevant to LLMs: FlashAttention; and to get better intuition behind "softmax scaling", read Online Normalizer.
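To build intuition for that scaling trick, here is a small single-pass ("online") softmax sketch in plain Python: a running maximum is maintained, and the partial normalizer is rescaled whenever the maximum changes, which is the same correction FlashAttention applies block by block:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: keep a running max and a rescaled running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # matches the usual two-pass softmax
```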
- Learn OpenAI Triton.
- Official documentation
- cuda-mode/triton-index lists a few examples and links to resources.
- Inspect and learn from the Triton kernels generated by `torch.compile` (see the sketch after this list).
- Using user-defined Triton kernels with `torch.compile`.
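One way to look at the generated kernels (a sketch; `torch._logging.set_logs` and the `TORCH_LOGS` environment variable are the PyTorch 2.x logging knobs, but double-check the exact options against the docs for your version):

```python
import torch
import torch._logging

# Ask the compiler stack to log the generated (Triton) kernel source.
# Equivalently, run the script with the environment variable TORCH_LOGS="output_code".
torch._logging.set_logs(output_code=True)

def f(x):
    return torch.nn.functional.relu(x) * 2.0

compiled = torch.compile(f)
compiled(torch.randn(1024, device="cuda"))  # generated kernel source is printed to the log
```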
- Learn CUDA.
- Book recommendation: Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (e-book available)
- cuda-mode/resource-stream lists many links to resources.
- CUDA kernels can be loaded directly from source using `torch.utils.cpp_extension.load_inline` (see lecture & notes, and the sketch after this list).
- Consider joining the CUDA Mode Discord server.
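A hedged sketch of `load_inline` compiling a tiny element-wise CUDA kernel from Python strings (the kernel and function names are made up for illustration; this needs a local CUDA toolchain and takes a while to build the first time):

```python
import torch
from torch.utils.cpp_extension import load_inline

# Toy element-wise square kernel plus a C++ wrapper that launches it.
cuda_src = r"""
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# load_inline compiles the sources with nvcc and exposes the listed functions.
ext = load_inline(
    name="toy_square",
    cpp_sources="torch::Tensor square(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["square"],
    verbose=True,
)

x = torch.randn(1 << 20, device="cuda")
print(torch.allclose(ext.square(x), x * x))
```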
- More resources:
- Nvidia TensorRT Model Optimizer is a library of model optimization techniques such as quantization. Mainly aimed at LLMs and diffusion models.
- The talk discusses memory-management inefficiencies in LLM inference, and ideas on how they could be improved.
- Also quickly mentions speculative decoding.
- Triton is a higher-level (and simpler) abstraction than low-level compute libraries (such as CUDA), but more expressive (and more complicated) than graph compilers (such as PyTorch).
- Discusses the Machine Model and Programming Model.
- Although Triton code is written in Python, `triton.jit` analyzes the syntax tree and generates code from it.
- Shows how to implement vector addition and softmax in Triton (a vector-addition sketch follows this list).
- `torch.compile` should produce good Triton kernels by default.
- Triton has experimental support for AMD GPUs too. (Link to issue)
- Explains the kinds of optimizations that you get when using Triton.
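For reference, the vector-addition kernel from the official Triton tutorial is small; this sketch follows it closely (the block size of 1024 is an arbitrary choice here):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(torch.allclose(add(x, y), x + y))
```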
- CPUs are good for row-wise processing, as often found in Online Transactional Processing (OLTP).
- GPUs are good for column-wise processing, as often found in Online Analytical Processing (OLAP).
- Apache Arrow is the most popular format for columnar data storage and transfer (a tiny example follows below).
- Mentions many of the GPU-accelerated data-processing libraries out there.
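As a tiny illustration of the columnar data model with pyarrow (independent of any particular GPU engine):

```python
import pyarrow as pa

# Each column is stored contiguously, which is what OLAP-style and
# GPU-accelerated engines want to scan.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "amount": pa.array([9.99, 3.50, 12.00], type=pa.float64()),
})
print(table.schema)
print(table.column("amount"))
```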
- GPUs are not universally good. GPUs may not make sense when:
- Data-processing is latency-bound or I/O-bound.
- Throughput is not very important.
- The amount of data is not large enough.
- The processing pipeline needs to switch between CPU and GPU contexts too many times, becoming bottlenecked on transfer bandwidth.
- Moving to a distributed setup (CPU or GPU) quickly becomes network-bound.
- InfiniBand or RoCEv2 can allow distributed GPUs to use remote direct memory access (RDMA) over the network, bypassing CPU contexts.
- Tools that can use GPU-RDMA: Dask+OpenUCX, Spark Rapids, etc.
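A hedged sketch of starting a Dask-CUDA cluster over UCX (the `LocalCUDACluster` flags shown are assumptions about the dask-cuda API and require matching hardware plus a UCX-enabled build; check the dask-cuda and UCX-Py docs):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # assumed API; see the dask-cuda docs

# UCX lets workers communicate over InfiniBand / NVLink with RDMA instead of TCP.
cluster = LocalCUDACluster(
    protocol="ucx",           # use UCX instead of the default TCP protocol
    enable_infiniband=True,   # assumption: needs InfiniBand hardware
    enable_nvlink=True,       # assumption: needs NVLink-connected GPUs
)
client = Client(cluster)
print(client)
```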