
Conversation

@chengyupku
Contributor

This pull request includes significant updates to the examples/deepseek_mla project, focusing on enhancing the documentation and optimizing the example_mla_decode.py script. The most important changes include the addition of a detailed README for MLA, the introduction of an argument parser for better script configurability, and various performance optimizations in the kernel implementation.

Documentation improvements:

  • examples/deepseek_mla/README.md: Added a comprehensive guide on writing high-performance kernels with TileLang, using MLA as an example. The guide covers MLA introduction, benchmark results, implementation details, and various optimization techniques like threadblock swizzling, shared memory swizzling, warp-specialization, and pipelining.
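
To make the swizzling idea concrete, below is a minimal, library-free Python sketch of threadblock swizzling as the README describes it. The grid shape and panel size are illustrative parameters, not values taken from the MLA kernel.

```python
# Minimal, library-free sketch of threadblock swizzling.
# Blocks are regrouped into "panels" of panel_size rows so that
# consecutively scheduled blocks reuse the same tile columns (or the
# same KV tiles in attention), improving L2 cache hit rate.
def swizzle_block_index(linear_bid: int, grid_m: int, grid_n: int, panel_size: int):
    panel_blocks = panel_size * grid_n          # blocks per panel
    panel_id = linear_bid // panel_blocks
    within = linear_bid % panel_blocks
    rows_in_panel = min(panel_size, grid_m - panel_id * panel_size)
    # Column-major order inside the panel: walk down a short column of
    # blocks before moving to the next column.
    bid_m = panel_id * panel_size + within % rows_in_panel
    bid_n = within // rows_in_panel
    return bid_m, bid_n

if __name__ == "__main__":
    grid_m, grid_n, panel = 8, 8, 4
    order = [swizzle_block_index(i, grid_m, grid_n, panel) for i in range(grid_m * grid_n)]
    print(order[:8])  # the first eight blocks all touch a small set of tile columns
```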

Code enhancements:

  • examples/deepseek_mla/example_mla_decode.py: Added a command-line argument parser for flexible configuration and applied kernel-level performance optimizations such as threadblock swizzling, shared-memory swizzling, warp specialization, and pipelining.

These changes aim to improve the performance and usability of the MLA decoding example, making it easier to configure and more efficient to run.
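
The argument parser itself is not reproduced in this description; a minimal sketch of what such a CLI for example_mla_decode.py could look like is below. The flag names (--batch, --heads, --kv_ctx, --dim) and their defaults are assumptions for illustration, not the script's actual interface.

```python
# Hypothetical CLI sketch for example_mla_decode.py; the real script may
# use different flag names and defaults.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="MLA decode example")
    parser.add_argument("--batch", type=int, default=64, help="batch size")                 # assumed flag
    parser.add_argument("--heads", type=int, default=128, help="number of query heads")     # assumed flag
    parser.add_argument("--kv_ctx", type=int, default=8192, help="KV cache length")         # assumed flag
    parser.add_argument("--dim", type=int, default=512, help="latent (compressed KV) dim")  # assumed flag
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"batch={args.batch}, heads={args.heads}, kv_ctx={args.kv_ctx}, dim={args.dim}")
```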

…HA WGMMA pipelined example (FA3-like scheduling)

This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp-group matrix multiply-accumulate (WGMMA) operations in the TileLang compiler:

- Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
- Added pass registration for `RewriteWgmmaSync`
- Updated `tilelang/engine/phase.py` to include the new transformation pass
- Updated `tilelang/transform/__init__.py` to expose the new pass

The rewriter tracks dependencies between WGMMA operations and places the required synchronization, improving pipeline efficiency for complex matrix multiplication kernels.
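
As a rough sketch of how the pass could be invoked from Python, assuming it is exposed as `tilelang.transform.RewriteWgmmaSync()` (per the `__init__.py` change) and chained with TVM's standard pass infrastructure; the surrounding passes shown here are placeholders, not the actual contents of `tilelang/engine/phase.py`.

```python
# Illustrative only: how a lowering pipeline could include the new pass.
# The neighbouring passes and their ordering are assumptions.
import tvm
import tilelang.transform as tl_transform

def optimize_for_hopper(mod: tvm.IRModule) -> tvm.IRModule:
    pipeline = tvm.transform.Sequential([
        # ... earlier TileLang/TIR lowering passes ...
        tl_transform.RewriteWgmmaSync(),  # reorder/insert sync for WGMMA pipelines
        # ... later target-specific passes ...
    ])
    return pipeline(mod)
```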
Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
- Add more precise checks for threadIdx.y and threadIdx.z
- Validate thread extent to ensure only single-extent thread bindings are allowed
- Prevent warp specialization for multi-extent thread bindings in y and z dimensions
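
The check itself lives in the C++ warp-specialized rewriter; as a hedged Python approximation of the same condition (not the actual implementation), one could scan the TIR for `thread_extent` bindings of threadIdx.y / threadIdx.z:

```python
# Python sketch of the validation logic; the real check is in C++.
import tvm
from tvm import tir

def can_warp_specialize(func: tir.PrimFunc) -> bool:
    """Return False if threadIdx.y or threadIdx.z is bound with extent > 1."""
    ok = True

    def visit(node):
        nonlocal ok
        if isinstance(node, tir.AttrStmt) and node.attr_key == "thread_extent":
            iv = node.node  # the bound IterVar
            if iv.thread_tag in ("threadIdx.y", "threadIdx.z"):
                extent = node.value
                if not (isinstance(extent, tir.IntImm) and extent.value == 1):
                    ok = False

    tir.stmt_functor.post_order_visit(func.body, visit)
    return ok
```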
…lash Attention Implementations

- Add new `flash_attn` macro for non-split flash attention implementation
- Add swizzled layout for tile in shared memory
- Use threadblock swizzle to improve L2 cache hit rate
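
The shared-memory swizzle is a standard bank-conflict-avoidance trick; the index-arithmetic sketch below illustrates the general idea and is not TileLang's actual layout primitive (the tile stride and group size are assumed values).

```python
# Minimal sketch of an XOR-based shared-memory swizzle.
# Permuting the column index by a function of the row spreads a warp's
# accesses across banks, avoiding bank conflicts when the tile is read
# column-wise (e.g. as one operand of a GEMM).
def swizzled_offset(row: int, col: int, row_stride: int = 64, group: int = 8) -> int:
    """Map a logical (row, col) element of a shared-memory tile to a
    physical offset. `group` elements stay contiguous (one vectorized
    load), and groups are XOR-permuted per row."""
    group_id = col // group
    swizzled_group = group_id ^ (row % (row_stride // group))
    return row * row_stride + swizzled_group * group + col % group
```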
…nce Benchmarks

- Add detailed README.md explaining MLA (Multi-Head Latent Attention) implementation
- Include performance benchmark images for batch sizes 64 and 128
- Add layout visualization images for QK and PV operations
- Implement torch reference implementations in torch_refs.py
- Update example_mla_decode.py with command-line argument support and flexible configuration
- Add performance benchmarking and comparison with other implementations
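
The contents of torch_refs.py are not reproduced in this description; for flavour, a generic single-step decode attention reference in PyTorch might look roughly like the sketch below, which deliberately ignores MLA-specific details such as the compressed latent KV.

```python
# Generic torch reference for one decode step of multi-head attention.
# This is an illustrative stand-in, not the contents of torch_refs.py.
import torch

def ref_attention_decode(q, k_cache, v_cache):
    """q: [batch, heads, dim]; k_cache/v_cache: [batch, kv_len, heads, dim]."""
    dim = q.shape[-1]
    scale = dim ** -0.5
    # scores: [batch, heads, kv_len]
    scores = torch.einsum("bhd,bshd->bhs", q.float(), k_cache.float()) * scale
    probs = torch.softmax(scores, dim=-1)
    # output: [batch, heads, dim]
    out = torch.einsum("bhs,bshd->bhd", probs, v_cache.float())
    return out.to(q.dtype)
```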
@chengyupku requested a review from LeiWang1999 on March 3, 2025 at 17:20
Member

@LeiWang1999 left a comment


Overall LGTM, Merged :)

@chengyupku merged commit 0bbd063 into tile-ai:main on Mar 3, 2025
3 checks passed