Wave January 2025 Release #303

Open
8 of 47 tasks
harsh-nod opened this issue Dec 2, 2024 · 0 comments

Milestones

IGEMM

  • Add to iree-kernel-benchmark
  • Contiguous loads optimization
  • Shared memory data shuffle
  • Scalarizing the gather
  • Move gather from global to shared
  • Reduce the number of shared memory barriers
  • Add support for buffer loads (masked loads/stores, gather/scatter)
  • Improve test case coverage (different dtypes, mfma intrinsics, shapes, etc.)
  • Enable scheduling
  • BF16 MFMA Intrinsics
  • Fix failures on main
  • Fixes to upstream amdgpu.raw_buffer_load/store lowering
  • Perf-ci
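
For context, IGEMM here is implicit-GEMM convolution: the convolution is computed as a GEMM whose input patches are gathered on the fly instead of being materialized by an explicit im2col buffer. A minimal NumPy sketch (the function name and NHWC/HWCF layout are illustrative assumptions, not the Wave API) shows where the gather and contiguous-load items above come from:

```python
import numpy as np

def igemm_conv2d(x, w, stride=1):
    """Implicit-GEMM convolution (NHWC input x HWCF filter -> NHWF output).

    Each output element is a dot product over a slice gathered directly from
    the input tensor; no im2col buffer is built. The on-the-fly gather is
    exactly what the contiguous-load / gather-to-shared items target.
    """
    N, H, W, C = x.shape
    KH, KW, C2, F = w.shape
    assert C == C2
    HO = (H - KH) // stride + 1
    WO = (W - KW) // stride + 1
    # GEMM view: M = N*HO*WO rows, K = KH*KW*C reduction dim, N = F columns.
    w_mat = w.reshape(KH * KW * C, F)
    out = np.empty((N, HO, WO, F), dtype=np.result_type(x, w))
    for n in range(N):
        for ho in range(HO):
            for wo in range(WO):
                # Gather the K-dim slice (the "implicit" im2col row).
                patch = x[n,
                          ho * stride: ho * stride + KH,
                          wo * stride: wo * stride + KW, :].reshape(-1)
                out[n, ho, wo] = patch @ w_mat
    return out
```

In the real kernel, the inner gather is what the "move gather from global to shared" and buffer-load items above optimize.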

FlashDecoding

  • Broadcasting dynamic offset for paged attention
  • Chained Indirect Reads
  • Flash Decoding Kernel without paged attention with 16x16x16
  • Flash Decoding Kernel with paged attention
  • Flash Decoding Kernel without paged attention with 32x32x8
  • Flash Decoding Kernel without paged attention with Dynamic Dims
  • Performance optimizations
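
For context on the FlashDecoding items: flash decoding splits the KV sequence into chunks, computes a partial attention result per chunk (as separate workgroups would), and then combines the partials with a softmax rescaling step. A NumPy sketch of that split-and-combine structure (names and shapes are illustrative, not the Wave kernel):

```python
import numpy as np

def flash_decode(q, K, V, num_splits=4):
    """Flash-decoding sketch for a single decode-time query.

    q: (d,); K, V: (seq, d). Each split produces (local max, sum of exps,
    exp-weighted V); the combine step rescales every partial to the global
    max so the result matches one monolithic softmax over the whole sequence.
    """
    seq, d = K.shape
    partials = []
    for idx in np.array_split(np.arange(seq), num_splits):
        s = K[idx] @ q / np.sqrt(d)     # partial attention scores
        m = s.max()
        p = np.exp(s - m)
        partials.append((m, p.sum(), p @ V[idx]))
    # Combine: rescale each partial by exp(m_i - m_global) before summing.
    m_glob = max(m for m, _, _ in partials)
    denom = sum(l * np.exp(m - m_glob) for m, l, _ in partials)
    numer = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return numer / denom
```

Paged attention changes only where `K[idx]`/`V[idx]` come from: the rows are gathered through a page table, which is why the dynamic-offset-broadcast and chained-indirect-read items precede the paged kernels.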

EvoFormer

  • Vectorized writes
  • Kernel caching
  • Max3 instructions
  • Larger global loads
  • Vectorized reads
  • FAv3 - FP8
  • FAv3 - Scheduling
  • FAv3 - Wave specialization (set_prio)
  • Scalar support
  • Transpose using Shuffles
  • Performance nightly CI (which machine, and what is being tested?); add to iree-kernel-benchmark?
  • Performance evaluation to identify low-hanging fruit (preliminary tuning, waves_per_eu)?
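
The "Transpose using Shuffles" item refers to transposing a register tile with lane shuffles instead of staging it through shared memory. A Python simulation of the rotated access pattern (the `shuffle_transpose` helper is hypothetical; real code would use wave shuffle intrinsics):

```python
def shuffle_transpose(tile):
    """Simulate a wave-level transpose of a W x W tile held in registers,
    with lane i owning row i and only lane-to-lane shuffles allowed.

    In round j, lane i reads its own column element (element i) from lane
    (i + j) % W and stores it at column (i + j) % W. Each round's sources
    form a rotation, so all W lanes exchange distinct data every round.
    """
    W = len(tile)
    out = [[None] * W for _ in range(W)]
    for j in range(W):
        for lane in range(W):                 # lanes execute in lockstep
            src = (lane + j) % W
            out[lane][src] = tile[src][lane]  # shuffle-read from lane `src`
    return out
```

After W rounds, `out[i][k] == tile[k][i]` for every element, i.e. the tile is transposed without touching shared memory.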

Multi-Buffering

  • Fully functional multi-buffering approach for GEMMs
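
The multi-buffering goal above is to rotate several shared-memory buffers so the global-to-shared load of a future K-tile overlaps with the MFMA work on the current one. A control-flow sketch (`load_tile` and `compute_tile` are hypothetical callbacks standing in for the copy and the matrix work; this is the loop structure only, not the Wave implementation):

```python
def multibuffered_k_loop(load_tile, compute_tile, num_k_tiles, num_buffers=2):
    """Multi-buffered GEMM K-loop: prefetch into the free buffer slot while
    computing on the current one; num_buffers=2 is classic double buffering."""
    buffers = [None] * num_buffers
    # Prologue: fill all but one buffer before the first compute.
    for k in range(min(num_buffers - 1, num_k_tiles)):
        buffers[k % num_buffers] = load_tile(k)
    acc = 0
    for k in range(num_k_tiles):
        nxt = k + num_buffers - 1
        if nxt < num_k_tiles:
            # Prefetch the next tile into the slot not in use this iteration.
            buffers[nxt % num_buffers] = load_tile(nxt)
        acc = compute_tile(acc, buffers[k % num_buffers])
    return acc
```

The point of the rotation is visible in the issue order: the load of tile k+1 is issued before the compute on tile k, so the two can overlap on hardware.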

Benchmarking

  • Fix benchmarking for bf16
  • Benchmarking using GitHub Actions

De-Prioritized

  • Packed Shuffles
  • Linear offset has to be added (linear offset = 1.0 / max representable number in fp format)
  • Extend Attention (split-k vs warp reduction)
  • Prefill Attention
  • Update Paper
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • GEMM Non-temporal loads
  • GEMM + SiLU fusion kernel
  • MoE Kernel
  • Parallel compile and then run

Week 1

Performance comparison?

  • 5 shapes, Wave baseline performance, Tensile without PGR and with PGR
  • With and without PGR can be in the range [-5% to +9%]
  • Tensile performance is poor; possibly fixable with appropriate hyperparameters
  • 5 shapes, best performance with all Tensile knobs turned on
  • Proposed plan for how this would be implemented in Wave

IGEMM

  • Contiguous Loads PR
  • Vectorized reads/writes

EvoFormer

  • Land Evoformer PR on main
  • Address comments on PR
  • Performance evaluation of Evoformer

FlashDecoding

  • Flash Decoding kernel with and without paged attention

How does PGR2 fit into the big picture?

Week 2

Establishing a target reference kernel for Wave

  • Try the tile sizes from the best kernels that were shared and re-evaluate performance
  • With overall performance improved, are you still seeing an improvement with multi-buffering?
  • Identify the core pieces of the Tensile kernel that are responsible for the performance difference
  • Assembly vs. OpenCL kernel (evaluate the effects of inline asm and whether it is required)
  • How to implement?

Ivan

  • Attention dynamic index broadcast

Harsh

  • Flash Decoding Kernel with paged attention

Stan

  • Kernel Caching
  • set_prio 10% improvement
  • vectorized loads
  • scheduling

Week 3

Ivan

  • Fix IGEMM on main
  • iree-kernel-benchmark
  • Add support for buffer loads
  • Moving gather from global to shared

Stan

  • Increase load width

Harsh

  • Flash Decoding Kernel without paged attention
  • Flash Decoding Kernel with paged attention

Week 4

  • Finish implementation of multi-buffering & performance evaluations
  • 5-10% performance gain
