GPU Optimization Workshop (May 2024)

Slides, notes, and materials for the workshop

Pre-event note

The talks are pretty technical, given that this is a workshop on GPU optimization. The speakers try their best to make their topics accessible, but you’ll get more out of the workshop if you familiarize yourself with the basic concepts in advance (see Reading materials).
The event will be livestreamed on YouTube, but questions should be asked on Discord, not YouTube.
With 2000+ people signed up for the event, we expect a lot of interesting live discussion on Discord.
Workshop TAs who will be helping us run the workshop:

Schedule

[12:00] Crash course on GPU optimization (Mark Saroufim @ Meta)

Mark is a PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS LLM Efficiency challenge last year. Previously, he was at Graphcore and Microsoft.

Mark will give an overview of why we use GPUs, the metrics that matter, and different GPU programming models (thread-based CUDA and block-based Triton). He promises this will be a painless guide to writing CUDA/Triton kernels! This talk will give us the basics to understand the rest of the workshop.

[12:45] High-performance LLM serving on GPUs (Sharan Chetlur @ NVIDIA)

Sharan is a principal engineer working on TensorRT-LLM at NVIDIA. He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU to full data center scale. Previously, he was the Director of Engineering at Cerebras.

Sharan will discuss how to build performant, flexible solutions to optimize LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different batching strategies, and caching.
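To make "different strategies for batching" a little more concrete, here is a toy comparison of two commonly discussed approaches: static batching, where a batch runs until its longest request finishes, and continuous (in-flight) batching, where a freed slot is refilled right away. Whether the talk covers exactly these two is an assumption, and the function names and the one-token-per-step model are hypothetical simplifications, not TensorRT-LLM behavior.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request is done."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # short requests sit idle until the longest ends
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)  # requests waiting for a slot
    active = []             # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill any free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that finished.
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


# Hypothetical workload: one long request and three short ones (output tokens).
lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With this workload the static schedule needs 10 steps and the continuous one needs 8, because the short requests no longer wait behind the long one. Real serving systems also have to account for prefill cost and memory limits, which this sketch ignores.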

[13:20] Block-based GPU Programming with Triton (Philippe Tillet @ OpenAI)

Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers including NVIDIA, AMD, Intel, and Nervana.

Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and Torch-Inductor.
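The difference between the two programming models can be sketched with a CPU-only emulation in plain Python. This is not CUDA or Triton code, and the function names are hypothetical; it only illustrates what the kernel author writes in each model: code for one thread and one element (SIMT), or code for one program and a whole tile of elements (block-based).

```python
def add_thread_based(x, y, threads_per_block=4):
    """SIMT style: the kernel body describes ONE thread handling ONE element."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block

    def kernel(block_idx, thread_idx):
        i = block_idx * threads_per_block + thread_idx  # global element index
        if i < n:  # guard for the last, partially filled block
            out[i] = x[i] + y[i]

    # On a GPU these iterations would run in parallel; here they run in a loop.
    for b in range(num_blocks):
        for t in range(threads_per_block):
            kernel(b, t)
    return out


def add_block_based(x, y, block_size=4):
    """Block style: the program body describes ONE program handling a tile."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size

    def program(pid):
        offsets = [pid * block_size + k for k in range(block_size)]
        mask = [o < n for o in offsets]  # which offsets are in bounds
        # Masked "loads" of a whole tile, then one tile-wide add.
        xs = [x[o] if m else 0.0 for o, m in zip(offsets, mask)]
        ys = [y[o] if m else 0.0 for o, m in zip(offsets, mask)]
        zs = [a + b for a, b in zip(xs, ys)]
        # Masked "store" of the tile.
        for o, m, z in zip(offsets, mask, zs):
            if m:
                out[o] = z

    for pid in range(num_programs):
        program(pid)
    return out


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
thread_result = add_thread_based(x, y)
block_result = add_block_based(x, y)
```

Both functions compute the same result. In the block-based version the author reasons about tiles and masks, and how a tile is mapped onto individual threads is left to the compiler.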

[14:00] Scaling data processing from CPU to distributed GPUs (William Malpica @ Voltron Data)

William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!

Most people today use GPUs for training and inference. A category of workloads that GPUs excel at but are underutilized for is data processing. In this talk, William will discuss why large-scale data processing should be done on GPUs instead of CPUs and how different tools like cuDF, RAPIDS, and Theseus leverage GPUs for data processing.

Reading materials

Please read the schedule above carefully. If there are terms you’re not familiar with, you might want to look them up in advance. Examples:

Memory bound vs. compute bound: whether the bottleneck is the GPU’s memory bandwidth (how fast data can be moved) or its compute throughput (how fast it can do arithmetic).
Thread-based vs. block-based: different programming models for GPU programming. CUDA is thread-based and Triton is block-based.
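One common way to reason about memory bound vs. compute bound is arithmetic intensity: the number of floating-point operations performed per byte moved to or from memory. The sketch below compares that ratio against a hardware "ridge point". The peak numbers are made-up round figures for a hypothetical GPU, not the specs of any real device, and the function names are illustrative.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Label a kernel as memory bound or compute bound (simple roofline view)."""
    # Ridge point: the intensity at which compute and bandwidth limits meet.
    ridge = peak_flops / peak_bandwidth
    if arithmetic_intensity(flops, bytes_moved) >= ridge:
        return "compute bound"
    return "memory bound"


# Hypothetical GPU: 100 TFLOP/s of compute and 1 TB/s of memory bandwidth,
# which gives a ridge point of 100 FLOPs per byte.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

# fp32 vector add of n elements: n FLOPs, 3 arrays * 4 bytes * n of traffic.
n = 1 << 20
vector_add = classify(n, 3 * 4 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# fp32 matmul of two m x m matrices: 2*m^3 FLOPs and, assuming each matrix
# is read or written exactly once, 3 * 4 * m^2 bytes of traffic.
m = 4096
matmul = classify(2 * m**3, 3 * 4 * m**2, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions the vector add is memory bound (about 0.08 FLOPs per byte) and the large matmul is compute bound (about 683 FLOPs per byte). Real kernels rarely achieve ideal data reuse, so measured behavior can differ from this estimate.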

Tools that will be discussed in the workshop:

Recommended resources:

How CUDA Programming Works - Stephen Jones, NVIDIA (great lecture)
The Best GPUs for Deep Learning in 2023 — An In-depth Analysis (Tim Dettmers)
CUDA MODE Discord. They have a great lecture series on GPU optimization.
