GPU Optimization Workshop (hosted by @ChipHuyen)
YouTube video, Google Doc notes, Slides, and more notes
- PyTorch uses eager evaluation
- Pros: Easy to debug (print statements to inspect values would work)
- Cons: GPU memory bandwidth becomes the bottleneck because of the many data transfers between the CPU and GPU contexts.
- A 2022 article by Horace He that goes into more detail: Making Deep Learning Go Brrrr.
- The PyTorch profiler can be used to examine such bottlenecks (a minimal sketch follows below).
- Also see PyTorch profiler recipe.
- Also see `torch.utils.bottleneck`.
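A minimal sketch of profiling a toy model with the PyTorch profiler (the model and sort key here are just for illustration):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Toy model purely for illustration.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("forward_pass"):
        model(x)

# Show the ops that spent the most time on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```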
- `torch.compile` can be used to fuse operators and generate OpenAI Triton kernels (a minimal sketch follows below).
- Check if using `mode="reduce-overhead"` leads to better performance.
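A minimal sketch of trying `torch.compile` and the `reduce-overhead` mode (the toy model is a placeholder; benchmark both variants on your own workload):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).cuda()
x = torch.randn(64, 1024, device="cuda")

# Default mode: Inductor fuses ops and emits Triton kernels for the GPU.
compiled = torch.compile(model)

# "reduce-overhead" additionally uses CUDA graphs to cut kernel-launch overhead,
# which tends to help small models and small batch sizes the most.
compiled_ro = torch.compile(model, mode="reduce-overhead")

out = compiled(x)
out_ro = compiled_ro(x)
```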
- When using the CUDA backend on supported GPUs, matrix multiplication can be sped up significantly using Tensor Cores, but with precision trade-offs.
- To enable the use of Tensor Cores in PyTorch, set `float32_matmul_precision` to "high" or "medium" (the default is "highest"); see the snippet below.
- Note that this could cause accuracy regressions; see this discussion.
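For example (this trades a little float32 matmul precision for Tensor Core throughput; verify accuracy on your own workload):

```python
import torch

# "highest" (default) keeps full FP32; "high" allows TF32 Tensor Cores;
# "medium" may additionally allow bfloat16-based matmuls.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # now eligible to run on Tensor Cores, with a small precision trade-off
```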
- Quantization helps not only speed-up compute-bound workflows, but also memory-bandwidth-bound workflows!
- This means weights-only quantization not only makes the model size-on-disk smaller, it could also make the model faster on GPU.
- See PyTorch Architecture Optimization (torchao) for a library of many quantization algorithms (a hedged usage sketch follows below).
- Also see the `torch.ao` module, and how it differs from torchao.
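A hedged sketch of weights-only int8 quantization with torchao; the import path and the `quantize_` / `int8_weight_only` names are assumptions based on recent torchao releases, so check the torchao docs for the API in your version:

```python
import torch
# Assumed torchao API; the exact import path may differ between versions.
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).cuda().eval()

# Weights-only quantization: weights are stored in int8 and dequantized on the fly.
# This shrinks the model and reduces memory traffic, which can speed up
# memory-bandwidth-bound inference even though compute stays in higher precision.
quantize_(model, int8_weight_only())

x = torch.randn(1, 4096, device="cuda")
with torch.no_grad():
    y = model(x)
```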
- A great example of GPU optimization relevant to LLMs: FlashAttention; and to get better intuition behind "softmax scaling", read Online Normalizer.
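To build intuition for that scaling trick, here is a small single-pass ("online") softmax sketch in plain Python: a running maximum is maintained, and the partial normalizer is rescaled whenever the maximum changes, which is the same correction FlashAttention applies block by block:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: keep a running max and a rescaled running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # matches the usual two-pass softmax
```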
- Learn OpenAI Triton.
- Official documentation
- cuda-mode/triton-index lists a few examples and links to resources.
- Inspect and learn from the Triton kernels generated by `torch.compile` (see the sketch after this list).
- Using user-defined Triton kernels with `torch.compile`.
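One way to look at the generated kernels (a sketch; `torch._logging.set_logs` and the `TORCH_LOGS` environment variable are the PyTorch 2.x logging knobs, but double-check the exact options against the docs for your version):

```python
import torch
import torch._logging

# Ask the compiler stack to log the generated (Triton) kernel source.
# Equivalently, run the script with the environment variable TORCH_LOGS="output_code".
torch._logging.set_logs(output_code=True)

def f(x):
    return torch.nn.functional.relu(x) * 2.0

compiled = torch.compile(f)
compiled(torch.randn(1024, device="cuda"))  # generated kernel source is printed to the log
```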
- Learn CUDA.
- Book recommendation: Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (e-book available)
- cuda-mode/resource-stream lists many links to resources.
- CUDA kernels can be loaded directly from source using `torch.utils.cpp_extension.load_inline` (see lecture & notes, and the sketch after this list).
- Consider joining the CUDA Mode Discord server.
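A hedged sketch of `load_inline` compiling a tiny element-wise CUDA kernel from Python strings (the kernel and function names are made up for illustration; this needs a local CUDA toolchain and takes a while to build the first time):

```python
import torch
from torch.utils.cpp_extension import load_inline

# Toy element-wise square kernel plus a C++ wrapper that launches it.
cuda_src = r"""
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# load_inline compiles the sources with nvcc and exposes the listed functions.
ext = load_inline(
    name="toy_square",
    cpp_sources="torch::Tensor square(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["square"],
    verbose=True,
)

x = torch.randn(1 << 20, device="cuda")
print(torch.allclose(ext.square(x), x * x))
```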
- More resources:
- Nvidia TensorRT Model Optimizer is a library of model optimization techniques such as quantization. Mainly aimed at LLMs and diffusion models.
- The talk discusses memory-management inefficiencies in LLM inference, and ideas on how they could be improved.
- Also quickly mentions speculative decoding.
- Triton is a higher-level (and simpler) abstraction than low-level compute libraries (such as CUDA), but more expressive (and more complicated) than graph compilers (such as PyTorch).
- Discusses the Machine Model and Programming Model.
- Although Triton code is written in Python, `triton.jit` analyzes the syntax tree and generates code from it.
- Shows how to implement vector addition and softmax in Triton (a vector-addition sketch follows this list).
- `torch.compile` should produce good Triton kernels by default.
- Triton has experimental support for AMD GPUs too. (Link to issue)
- Explains the kinds of optimizations that you get when using Triton.
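For reference, the vector-addition kernel from the official Triton tutorial is small; this sketch follows it closely (the block size of 1024 is an arbitrary choice here):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(torch.allclose(add(x, y), x + y))
```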
- CPUs are good for row-wise processing, as often found in Online Transactional Processing (OLTP).
- GPUs are good for column-wise processing, as often found in Online Analytical Processing (OLAP).
- Apache Arrow is the most popular format for columnar data storage and transfer (a tiny example follows below).
- Mentions many of the GPU-accelerated data-processing libraries out there.
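As a tiny illustration of the columnar data model with pyarrow (independent of any particular GPU engine):

```python
import pyarrow as pa

# Each column is stored contiguously, which is what OLAP-style and
# GPU-accelerated engines want to scan.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "amount": pa.array([9.99, 3.50, 12.00], type=pa.float64()),
})
print(table.schema)
print(table.column("amount"))
```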
- GPUs are not universally good. GPUs may not make sense when:
- Data-processing is latency-bound or I/O-bound.
- Throughput is not very important.
- The amount of data is not large enough.
- The processing pipeline needs to switch between CPU and GPU contexts too many times, becoming bottlenecked on transfer bandwidth.
- Moving to a distributed setup (CPU or GPU) quickly becomes network-bound.
- InfiniBand or RoCEv2 can allow distributed GPUs to use remote direct memory access (RDMA) over the network, bypassing CPU contexts.
- Tools that can use GPU-RDMA: Dask+OpenUCX, Spark Rapids, etc.
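A hedged sketch of starting a Dask-CUDA cluster over UCX (the `LocalCUDACluster` flags shown are assumptions about the dask-cuda API and require matching hardware plus a UCX-enabled build; check the dask-cuda and UCX-Py docs):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # assumed API; see the dask-cuda docs

# UCX lets workers communicate over InfiniBand / NVLink with RDMA instead of TCP.
cluster = LocalCUDACluster(
    protocol="ucx",           # use UCX instead of the default TCP protocol
    enable_infiniband=True,   # assumption: needs InfiniBand hardware
    enable_nvlink=True,       # assumption: needs NVLink-connected GPUs
)
client = Client(cluster)
print(client)
```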