May 2024

GPU Optimization Workshop (hosted by @ChipHuyen)

YouTube video, Google Doc notes, Slides and more notes

Crash Course to GPU Optimization (Mark Saroufim, Meta)

High Performance LLM Serving on NVIDIA GPUs (Sharan Chetlur, NVIDIA)

Block-based GPU Programming with Triton (Philippe Tillet, OpenAI)

  • Triton is a higher-level (and simpler) abstraction than low-level compute libraries such as CUDA, but more expressive (and more involved) than graph compilers such as PyTorch.
  • Discusses the Machine Model and Programming Model.
  • Although Triton kernels are written in Python, the triton.jit decorator analyzes the function's syntax tree and generates GPU code from it.
  • Shows how to implement vector addition and softmax in Triton (a minimal vector-addition sketch follows after this list).
  • torch.compile should produce good Triton kernels by default.
  • Triton has experimental support for AMD GPUs too. (Link to issue)
  • Explains the kinds of optimizations that you get when using Triton.
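To make the vector-addition example concrete, here is a minimal sketch along the lines of the official Triton tutorial; the kernel and helper names (add_kernel, add) and the BLOCK_SIZE of 1024 are illustrative choices, not taken from the talk.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch a 1D grid with one program per block of elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
torch.testing.assert_close(add(x, y), x + y)
```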

Scaling from CPUs to distributed GPUs (William Malpica, Voltron Data)

  • CPUs are good for row-wise processing, as often found in Online Transactional Processing (OLTP).
  • GPUs are good for column-wise processing, as often found in Online Analytical Processing (OLAP).
  • Apache Arrow is the most popular in-memory format for representing and transferring columnar data.
  • Mentions many GPU-accelerated libraries out there (a small cudf sketch follows after this list):
    • cupy is like numpy + scipy.
    • cudf is like pandas.
    • cuml is like scikit-learn.
    • cugraph is like networkx.
    • arrayfire is a general-purpose GPU library that overlaps with the above-mentioned libraries.
    • bend is a GPU-friendly programming language.
  • GPUs are not universally good. GPUs may not make sense when:
    • Data-processing is latency-bound or I/O-bound.
    • Throughput is not very important.
    • The amount of data is not large enough.
    • The processing pipeline makes too many switches between CPU and GPU contexts, so memory bandwidth (host-device data movement) becomes the bottleneck.
  • Moving to a distributed setup (CPU or GPU), the pipeline quickly becomes network-bound.
    • InfiniBand or RoCEv2 enable remote direct memory access (RDMA) between distributed GPUs over the network, bypassing CPU contexts.
    • Tools that can use GPU-RDMA: Dask+OpenUCX, Spark RAPIDS, etc. (see the sketches below).
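To illustrate the "drop-in" flavor of the libraries listed above, here is a small sketch contrasting pandas and cudf on the same groupby; the column names and data are made up for illustration.

```python
import pandas as pd
import cudf  # GPU DataFrame library from RAPIDS

data = {"store": ["a", "b", "a", "b"], "sales": [10.0, 20.0, 30.0, 40.0]}

# CPU: pandas
cpu_df = pd.DataFrame(data)
cpu_result = cpu_df.groupby("store")["sales"].sum()

# GPU: cudf exposes (most of) the same API, but runs on the GPU.
gpu_df = cudf.DataFrame(data)
gpu_result = gpu_df.groupby("store")["sales"].sum()

# Results can be moved back to pandas for comparison or downstream CPU code.
print(cpu_result)
print(gpu_result.to_pandas())
```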
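And a rough sketch of the distributed-GPU point: dask-cuda's LocalCUDACluster can be asked to communicate over UCX so that GPU buffers move via RDMA/NVLink rather than bouncing through CPU contexts. The protocol flag is real dask-cuda API, but the file path and column names are hypothetical, and the exact options depend on the dask-cuda version and cluster hardware.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One worker per local GPU, communicating over UCX (InfiniBand / NVLink capable).
cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)

# dask_cudf partitions a cudf DataFrame across the GPU workers.
df = dask_cudf.read_parquet("data/*.parquet")  # hypothetical path
print(df.groupby("store")["sales"].sum().compute())
```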