Wave January 2025 Release #303

Open
8 of 47 tasks
harsh-nod opened this issue Dec 2, 2024 · 0 comments

Milestones

IGEMM

  • Add to iree-kernel-benchmark
  • Contiguous loads optimization
  • Shared memory data shuffle
  • Scalarizing the gather
  • Move gather from global to shared
  • Reduce the number of shared memory barriers
  • Add support for buffer loads (masked loads/stores, gather/scatter)
  • Improve test case coverage (different dtypes, mfma intrinsics, shapes, etc.)
  • Enable scheduling
  • BF16 MFMA Intrinsics
  • Fix failures on main
  • Fixes to upstream amdgpu.raw_buffer_load/store lowering
  • Perf-ci
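
For context, IGEMM here is implicit-GEMM convolution: the convolution is computed as a GEMM whose input patches are gathered on the fly instead of being materialized by an explicit im2col buffer. A minimal NumPy sketch (the function name and NHWC/HWCF layout are illustrative assumptions, not the Wave API) shows where the gather and contiguous-load items above come from:

```python
import numpy as np

def igemm_conv2d(x, w, stride=1):
    """Implicit-GEMM convolution (NHWC input x HWCF filter -> NHWF output).

    Each output element is a dot product over a slice gathered directly from
    the input tensor; no im2col buffer is built. The on-the-fly gather is
    exactly what the contiguous-load / gather-to-shared items target.
    """
    N, H, W, C = x.shape
    KH, KW, C2, F = w.shape
    assert C == C2
    HO = (H - KH) // stride + 1
    WO = (W - KW) // stride + 1
    # GEMM view: M = N*HO*WO rows, K = KH*KW*C reduction dim, N = F columns.
    w_mat = w.reshape(KH * KW * C, F)
    out = np.empty((N, HO, WO, F), dtype=np.result_type(x, w))
    for n in range(N):
        for ho in range(HO):
            for wo in range(WO):
                # Gather the K-dim slice (the "implicit" im2col row).
                patch = x[n,
                          ho * stride: ho * stride + KH,
                          wo * stride: wo * stride + KW, :].reshape(-1)
                out[n, ho, wo] = patch @ w_mat
    return out
```

In the real kernel, the inner gather is what the "move gather from global to shared" and buffer-load items above optimize.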

FlashDecoding

  • Broadcasting dynamic offset for paged attention
  • Chained Indirect Reads
  • Flash Decoding Kernel without paged attention with 16x16x16
  • Flash Decoding Kernel with paged attention
  • Flash Decoding Kernel without paged attention with 32x32x8
  • Flash Decoding Kernel without paged attention with Dynamic Dims
  • Performance optimizations
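
For context on the FlashDecoding items: flash decoding splits the KV sequence into chunks, computes a partial attention result per chunk (as separate workgroups would), and then combines the partials with a softmax rescaling step. A NumPy sketch of that split-and-combine structure (names and shapes are illustrative, not the Wave kernel):

```python
import numpy as np

def flash_decode(q, K, V, num_splits=4):
    """Flash-decoding sketch for a single decode-time query.

    q: (d,); K, V: (seq, d). Each split produces (local max, sum of exps,
    exp-weighted V); the combine step rescales every partial to the global
    max so the result matches one monolithic softmax over the whole sequence.
    """
    seq, d = K.shape
    partials = []
    for idx in np.array_split(np.arange(seq), num_splits):
        s = K[idx] @ q / np.sqrt(d)     # partial attention scores
        m = s.max()
        p = np.exp(s - m)
        partials.append((m, p.sum(), p @ V[idx]))
    # Combine: rescale each partial by exp(m_i - m_global) before summing.
    m_glob = max(m for m, _, _ in partials)
    denom = sum(l * np.exp(m - m_glob) for m, l, _ in partials)
    numer = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return numer / denom
```

Paged attention changes only where `K[idx]`/`V[idx]` come from: the rows are gathered through a page table, which is why the dynamic-offset-broadcast and chained-indirect-read items precede the paged kernels.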

EvoFormer

  • Vectorized writes
  • Kernel caching
  • Max3 instructions
  • Larger global loads
  • Vectorized reads
  • FAv3 - FP8
  • FAv3 - Scheduling
  • FAv3 - Wave specialization (set_prio)
  • Scalar support
  • Transpose using Shuffles
  • Performance nightly CI (which machine, and what is being tested?); add to iree-kernel-benchmark?
  • Performance evaluation to identify low-hanging fruit (preliminary tuning, waves_per_eu)?
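
The "Transpose using Shuffles" item refers to transposing a register tile with lane shuffles instead of staging it through shared memory. A Python simulation of the rotated access pattern (the `shuffle_transpose` helper is hypothetical; real code would use wave shuffle intrinsics):

```python
def shuffle_transpose(tile):
    """Simulate a wave-level transpose of a W x W tile held in registers,
    with lane i owning row i and only lane-to-lane shuffles allowed.

    In round j, lane i reads its own column element (element i) from lane
    (i + j) % W and stores it at column (i + j) % W. Each round's sources
    form a rotation, so all W lanes exchange distinct data every round.
    """
    W = len(tile)
    out = [[None] * W for _ in range(W)]
    for j in range(W):
        for lane in range(W):                 # lanes execute in lockstep
            src = (lane + j) % W
            out[lane][src] = tile[src][lane]  # shuffle-read from lane `src`
    return out
```

After W rounds, `out[i][k] == tile[k][i]` for every element, i.e. the tile is transposed without touching shared memory.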

Multi-Buffering

  • Fully functional multi-buffering approach for GEMMs
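
The multi-buffering goal above is to rotate several shared-memory buffers so the global-to-shared load of a future K-tile overlaps with the MFMA work on the current one. A control-flow sketch (`load_tile` and `compute_tile` are hypothetical callbacks standing in for the copy and the matrix work; this is the loop structure only, not the Wave implementation):

```python
def multibuffered_k_loop(load_tile, compute_tile, num_k_tiles, num_buffers=2):
    """Multi-buffered GEMM K-loop: prefetch into the free buffer slot while
    computing on the current one; num_buffers=2 is classic double buffering."""
    buffers = [None] * num_buffers
    # Prologue: fill all but one buffer before the first compute.
    for k in range(min(num_buffers - 1, num_k_tiles)):
        buffers[k % num_buffers] = load_tile(k)
    acc = 0
    for k in range(num_k_tiles):
        nxt = k + num_buffers - 1
        if nxt < num_k_tiles:
            # Prefetch the next tile into the slot not in use this iteration.
            buffers[nxt % num_buffers] = load_tile(nxt)
        acc = compute_tile(acc, buffers[k % num_buffers])
    return acc
```

The point of the rotation is visible in the issue order: the load of tile k+1 is issued before the compute on tile k, so the two can overlap on hardware.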

Benchmarking

  • Fix benchmarking for bf16
  • Benchmarking using GitHub Actions

De-Prioritized

  • Packed Shuffles
  • Linear offset has to be added (linear offset = 1.0 / max representable number in fp format)
  • Extend Attention (split-k vs warp reduction)
  • Prefill Attention
  • Update Paper
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • GEMM Non-temporal loads
  • GEMM + SiLU fusion kernel
  • MoE Kernel
  • Parallel compile and then run

Week 1

Performance comparison?

  • 5 shapes, Wave baseline performance, Tensile without PGR and with PGR
  • With and without PGR can be in the range [-5% to +9%]
  • Tensile performance is poor; possibly fixable with appropriate hyperparameters
  • 5 shapes, best performance with all Tensile knobs turned on
  • Proposed plan for how this would be implemented in Wave

IGEMM

  • Contiguous Loads PR
  • Vectorized reads/writes

EvoFormer

  • Land Evoformer PR on main
  • Address comments on PR
  • Performance evaluation of Evoformer

FlashDecoding

  • Flash Decoding kernel with and without paged attention

How does PGR2 fit into the big picture?

Week 2

Establishing a target reference kernel for Wave

  • Try the tile sizes from the best kernels that were shared and re-evaluate performance
  • With overall performance improved, are you still seeing an improvement with multi-buffering?
  • Identify the core pieces of the Tensile kernel that are responsible for the performance difference
  • Assembly vs. OpenCL kernel (evaluate the effects of inline asm and whether it is required)
  • How to implement?

Ivan

  • Attention dynamic index broadcast

Harsh

  • Flash Decoding Kernel with paged attention

Stan

  • Kernel Caching
  • set_prio 10% improvement
  • vectorized loads
  • scheduling

Week 3

Ivan

  • Fix IGEMM on main
  • iree-kernel-benchmark
  • Add support for buffer loads
  • Moving gather from global to shared

Stan

  • Increase load width

Harsh

  • Flash Decoding Kernel without paged attention
  • Flash Decoding Kernel with paged attention

Week 4

  • Finish implementation of multi-buffering & performance evaluations
  • 5-10% performance gain
