[Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks #134
This pull request includes significant updates to the `examples/deepseek_mla` project, focusing on enhancing the documentation and optimizing the `example_mla_decode.py` script. The most important changes include the addition of a detailed README for MLA, the introduction of an argument parser for better script configurability, and various performance optimizations in the kernel implementation.

Documentation improvements:

- `examples/deepseek_mla/README.md`: Added a comprehensive guide on writing high-performance kernels with TileLang, using MLA as an example. The guide covers MLA introduction, benchmark results, implementation details, and various optimization techniques like threadblock swizzling, shared memory swizzling, warp-specialization, and pipelining.

Code enhancements:

- `examples/deepseek_mla/example_mla_decode.py`: Introduced an argument parser to allow dynamic configuration of `batch`, `heads`, `kv_heads`, `kv_ctx`, `dim`, and `pe_dim`.
- `examples/deepseek_mla/example_mla_decode.py`: Implemented threadblock swizzling and shared memory swizzling optimizations to improve memory access patterns and reduce bank conflicts.
- `examples/deepseek_mla/example_mla_decode.py`: Replaced fragment allocation with local allocation for `lse_max_local` to optimize memory usage.
- `examples/deepseek_mla/example_mla_decode.py`: Added a new `main_no_split` function to handle cases where `num_split` is set to 1, improving flexibility in kernel execution.

These changes aim to enhance the performance and usability of the MLA decoding example, making it easier to configure and more efficient in execution.
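For readers who want a feel for the configurability described above, here is a minimal sketch of how such an argument parser and the `num_split` dispatch might look. It is not the code from this PR: the flag spellings, the default values, the helper names `build_parser` and `select_variant`, and the variant label `"main_split"` are all assumptions made for illustration. Only the option names (`batch`, `heads`, `kv_heads`, `kv_ctx`, `dim`, `pe_dim`) and `main_no_split` come from the description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the six shape options named in the PR description.

    Defaults are illustrative placeholders, not necessarily the script's.
    """
    parser = argparse.ArgumentParser(description="MLA decode example (sketch)")
    parser.add_argument("--batch", type=int, default=128, help="batch size")
    parser.add_argument("--heads", type=int, default=128, help="number of query heads")
    parser.add_argument("--kv_heads", type=int, default=1, help="number of KV heads")
    parser.add_argument("--kv_ctx", type=int, default=8192, help="KV context length")
    parser.add_argument("--dim", type=int, default=512, help="head dimension")
    parser.add_argument("--pe_dim", type=int, default=64, help="positional-embedding dimension")
    return parser


def select_variant(num_split: int) -> str:
    """Pick a kernel variant by name (hypothetical helper).

    With a single split there is nothing to combine afterwards, so the
    no-split path is chosen; "main_split" is an assumed label for the other path.
    """
    if num_split < 1:
        raise ValueError("num_split must be a positive integer")
    return "main_no_split" if num_split == 1 else "main_split"


# Parse an explicit argument list so the sketch behaves the same anywhere.
args = build_parser().parse_args(["--batch", "4", "--kv_ctx", "1024"])
variant = select_variant(num_split=1)
```

In this sketch, options that are not passed keep their placeholder defaults, and any `num_split` greater than 1 selects the split path.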