Feature(MInference): support SGLang and vLLM vertical_and_slash flash attention and index kernels #153
Conversation
Pull Request Overview
This PR adds support for SGLang and vLLM in the vertical-and-slash flash attention kernels. Key changes include replacing a debugging breakpoint() with an assertion in the softmax fusion block's error handling, adjusting the pad computation in the vertical sparse attention function, and adding a new sglang_vs function for SGLang-based flash attention.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| minference/ops/xattention_fa.py | Updated error handling by replacing a breakpoint() with an assert. |
| minference/ops/pit_sparse_flash_attention_v2.py | Adjusted pad calculation and added try/except imports with a new sglang_vs function for SGLang integration. |
```diff
     except:
-        breakpoint()
+        assert False, f"xAttention error, k_len: {k_len}, segment size: {segment_size}"
```
Copilot AI · May 26, 2025
Using 'assert False' for error handling may be less informative in production; consider raising a specific exception (e.g., RuntimeError) with the same message.
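A minimal sketch of that suggestion, assuming `k_len` and `segment_size` are the variables in scope at the except site (the values below are placeholders):

```python
# Placeholder values standing in for the variables in scope at the except site.
k_len, segment_size = 4096, 256

try:
    # ... softmax fusion block (elided) ...
    raise ValueError("demo failure")
except Exception as exc:
    # `assert False` is stripped under `python -O`; raising RuntimeError keeps
    # the message in production and chains the original cause.
    raise RuntimeError(
        f"xAttention error, k_len: {k_len}, segment size: {segment_size}"
    ) from exc
```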
```python
    block_size_M: int = 64,
    block_size_N: int = 64,
):
    batch_size, num_heads, context_size, head_dim = query.shape
```
Copilot AI · May 26, 2025
The new pad computation uses a bitwise operation that assumes block_size_M is a power of two; adding a clarifying comment or an explicit check could improve code clarity.
Suggested change:

```diff
 batch_size, num_heads, context_size, head_dim = query.shape
+# Ensure block_size_M is a power of two, as required for the bitwise operation below.
+if block_size_M & (block_size_M - 1) != 0 or block_size_M <= 0:
+    raise ValueError("block_size_M must be a power of two and greater than zero.")
+# Compute padding size. The bitwise operation assumes block_size_M is a power of two.
```
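The exact pad expression is not visible in this excerpt, but a common form of the power-of-two idiom the comment refers to looks like the sketch below (`pad_to_block` is a hypothetical name, not the PR's code):

```python
def pad_to_block(context_size: int, block_size_M: int = 64) -> int:
    # Guard from the suggestion above: x & (x - 1) == 0 holds only for
    # powers of two (and zero), so reject anything else up front.
    if block_size_M <= 0 or block_size_M & (block_size_M - 1) != 0:
        raise ValueError("block_size_M must be a power of two and greater than zero.")
    # (-n) & (m - 1) gives the padding that rounds n up to the next multiple
    # of m; this is only correct when m is a power of two.
    return (-context_size) & (block_size_M - 1)

assert pad_to_block(100, 64) == 28   # 100 + 28 == 128, a multiple of 64
assert pad_to_block(128, 64) == 0    # already block-aligned
```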
```python
    return out[..., :context_size, :head_dim]


def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):
```
Copilot AI · May 26, 2025
[nitpick] Consider adding a docstring for the new sglang_vs function to explain its purpose, parameter expectations, and how it differs from vertical_slash_sparse_attention.
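One possible shape for that docstring, sketched from the signature alone; the described semantics of `v_idx` and `s_idx` are assumptions based on the vertical-and-slash pattern, not taken from the PR:

```python
def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):
    """Vertical-and-slash sparse flash attention entry point for SGLang.

    Args:
        query, key, value: attention inputs, assumed to be shaped
            (batch_size, num_heads, context_size, head_dim).
        v_idx: indices of the vertical (column) tokens to keep per head (assumed).
        s_idx: indices of the slash (diagonal) lines to keep per head (assumed).
        block_size_M: query block size.
        block_size_N: key/value block size.

    Returns:
        Attention output with the same shape as `query`.

    Unlike `vertical_slash_sparse_attention`, this variant targets the
    SGLang serving path (per the PR description).
    """
```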
What does this PR do?
Before submitting
Who can review?
@iofu728, @Starmys