
Conversation

@chengyupku
Contributor

This pull request makes significant changes to the deepseek_mla examples and the tilelang library. The primary focus is enhancing the MLA decoding process and refactoring the tilelang library for clarity and consistency. The most important changes:

Enhancements to MLA decoding:

  • Adjusted the computation logic to avoid precision loss when casting acc_s from float to float16, added a DeepSeek MLA decoding (paged + varlen) kernel with an accompanying performance benchmark script, and refactored the paged decoding kernel's block handling (detailed in the commit messages below).

Refactoring of the tilelang library:

  • tilelang/autotuner/__init__.py: Replaced all instances of the tl alias with tilelang to improve readability and keep naming consistent across the module, including class definitions, function parameters, and decorators.
  • tilelang/engine/phase.py: Updated the LowerAndLegalize and OptimizeForTarget functions to call transformations via tilelang instead of tl, keeping the transformation pipeline consistent with the rest of the codebase.

@chengyupku chengyupku merged commit d248f96 into tile-ai:main Mar 6, 2025
1 of 2 checks passed
LeiWang1999 pushed a commit to LeiWang1999/tilelang that referenced this pull request Jul 18, 2025
… (tile-ai#158)

* [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16

- Remove redundant `acc_s_0` fragment in flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add Hopper-specific synchronization method in CUDA reduce template
- Update reduce operation to use architecture-specific synchronization
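
The precision issue above can be illustrated in miniature. This is a hedged sketch (plain Python, not the tilelang kernel): exponentiated attention scores can exceed float16 range, which is why rescaling by the running max (flash-attention style) happens before narrowing from float to float16. The score values are hypothetical.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision (struct 'e' format)."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except (OverflowError, struct.error):
        return math.inf  # value does not fit in float16

scores = [11.5, 12.25, 88.0]  # hypothetical pre-softmax logits
m = max(scores)

# Naive order: exponentiate first, then cast down — exp(88) ~ 1.7e38
# (and even exp(11.5) ~ 9.9e4) overflows float16's max of 65504.
naive = [to_fp16(math.exp(s)) for s in scores]

# Rescaled order: subtract the running max before exp, so every value
# lies in (0, 1] and narrows to float16 safely.
rescaled = [to_fp16(math.exp(s - m)) for s in scores]

print(any(math.isinf(v) for v in naive))     # True  — fp16 overflow
print(any(math.isinf(v) for v in rescaled))  # False — values in (0, 1]
```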

* [Dev] Add DeepSeek MLA Decoding (Paged+Varlen) kernel and Performance Benchmark Script

- Implement comprehensive MLA (Multi-Head Latent Attention) decoding benchmark script
- Add support for multiple implementations: Torch, TileLang, FlashMLA, FlashInfer, and Triton
- Create flexible configuration for benchmarking different batch sizes, sequence lengths, and head configurations
- Implement performance comparison and CSV output for detailed performance analysis
- Add command-line argument support for targeted benchmarking and comparison
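
The benchmark pattern described above (timing several implementations across configurations and emitting a CSV) can be sketched minimally as follows. This is an illustration only: the config fields, file name, and stand-in "implementations" are placeholders, not the real script's Torch/TileLang/FlashMLA/FlashInfer/Triton kernels.

```python
import csv
import time

def bench(fn, warmup=3, iters=10):
    """Return mean wall-clock time per call in milliseconds."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# Stand-in implementations; the real script benchmarks attention kernels.
impls = {"ref": lambda: sum(range(10_000)),
         "fast": lambda: 10_000 * 9_999 // 2}

# Hypothetical configuration grid (batch sizes, sequence lengths, ...).
configs = [{"batch": 1, "seqlen": 1024}, {"batch": 8, "seqlen": 4096}]

with open("mla_bench.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["batch", "seqlen", "impl", "ms"])
    writer.writeheader()
    for cfg in configs:
        for name, fn in impls.items():
            writer.writerow({**cfg, "impl": name, "ms": f"{bench(fn):.4f}"})
```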

* [Dev] Refactor MLA Paged Decoding Kernel with Improved Block Handling and Precision

- Replace `d` parameter with `dv` to clarify value dimension in MLA decoding
- Enhance block distribution logic for split KV processing
- Improve handling of remaining blocks in split KV computation
- Add initialization of `lse_max_local` to prevent potential precision issues
- Optimize block start and range calculations for more accurate sequence processing
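
The block-distribution idea above can be sketched as follows, under the assumption that "handling remaining blocks" means the first few splits each take one extra block when the count does not divide evenly; names like `split_ranges` are illustrative, not the kernel's actual variables.

```python
import math

def split_ranges(num_blocks: int, num_split: int):
    """Yield (start, end) block ranges, one per split, covering all blocks.

    The first `rem` splits take one extra block so remainders are spread
    evenly instead of piling up on the last split.
    """
    base, rem = divmod(num_blocks, num_split)
    start = 0
    for i in range(num_split):
        count = base + (1 if i < rem else 0)
        yield (start, start + count)
        start += count

# Each split keeps a running log-sum-exp max; initializing it to -inf
# (rather than leaving it undefined or zero) guarantees the first real
# score replaces it — the motivation for the `lse_max_local` init above.
lse_max_local = -math.inf

print(list(split_ranges(10, 4)))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```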

* lint