
Conversation

@iofu728 (Collaborator) commented on May 26, 2025

What does this PR do?

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@iofu728, @Starmys

@iofu728 requested a review from Copilot on May 26, 2025 at 09:31
@iofu728 added the feature label on May 26, 2025

Copilot AI left a comment

Pull Request Overview

This PR introduces a feature update to support SGLang and vLLM in the flash attention kernels. Key changes include updating error handling in the softmax fusion block, modifying pad computation in the vertical sparse attention function, and adding a new function (sglang_vs) to support SGLang-based flash attention.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
minference/ops/xattention_fa.py | Updated error handling by replacing a breakpoint() with an assert.
minference/ops/pit_sparse_flash_attention_v2.py | Adjusted the pad calculation and added try/except imports plus a new sglang_vs function for SGLang integration.
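
For orientation, here is a minimal call-site sketch of the new sglang_vs entry point. The signature is taken from the snippet quoted further down in this review, and the tensor layout follows the (batch_size, num_heads, context_size, head_dim) unpacking shown there; the shapes and dtype of the v_idx/s_idx index tensors and the CUDA/fp16 setup are assumptions for illustration, not part of this page.

import torch
from minference.ops.pit_sparse_flash_attention_v2 import sglang_vs

# Shapes follow the (batch_size, num_heads, context_size, head_dim) unpacking
# quoted in the review below; fp16 on CUDA is an assumption for a Triton kernel.
batch_size, num_heads, context_size, head_dim = 1, 32, 4096, 128
query = torch.randn(batch_size, num_heads, context_size, head_dim, device="cuda", dtype=torch.float16)
key = torch.randn_like(query)
value = torch.randn_like(query)

# Placeholder vertical/slash indices. In practice these come from the sparse-pattern
# search step; their exact shapes are not shown in this review, so (batch, heads, nnz)
# int32 tensors are assumed here purely for illustration.
nnz = 1024
v_idx = torch.arange(nnz, dtype=torch.int32, device="cuda").expand(batch_size, num_heads, nnz).contiguous()
s_idx = torch.arange(nnz, dtype=torch.int32, device="cuda").expand(batch_size, num_heads, nnz).contiguous()

out = sglang_vs(query, key, value, v_idx, s_idx, block_size_M=64, block_size_N=64)
print(out.shape)  # expected: (batch_size, num_heads, context_size, head_dim)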

Comment on lines 248 to +249
    except:
-       breakpoint()
+       assert False, f"xAttention error, k_len: {k_len}, segment size: {segment_size}"

Copilot AI May 26, 2025

Using 'assert False' for error handling may be less informative in production; consider raising a specific exception (e.g., RuntimeError) with the same message.
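
A minimal sketch of the pattern this comment suggests; compute_block_softmax is a hypothetical stand-in for the real try body, which is not visible in this review:

def fused_softmax_block(scores, k_len, segment_size):
    try:
        # Hypothetical stand-in for the real computation inside the try block.
        return compute_block_softmax(scores, k_len, segment_size)
    except Exception as err:
        # An explicit exception survives `python -O` (which strips asserts) and
        # preserves the original traceback via exception chaining.
        raise RuntimeError(
            f"xAttention error, k_len: {k_len}, segment size: {segment_size}"
        ) from err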

    block_size_M: int = 64,
    block_size_N: int = 64,
):
    batch_size, num_heads, context_size, head_dim = query.shape

Copilot AI May 26, 2025

The new pad computation uses a bitwise operation that assumes block_size_M is a power of two; adding a clarifying comment or an explicit check could improve code clarity.

Suggested change
-    batch_size, num_heads, context_size, head_dim = query.shape
+    batch_size, num_heads, context_size, head_dim = query.shape
+    # Ensure block_size_M is a power of two, as required for the bitwise operation below.
+    if block_size_M & (block_size_M - 1) != 0 or block_size_M <= 0:
+        raise ValueError("block_size_M must be a power of two and greater than zero.")
+    # Compute padding size. The bitwise operation assumes block_size_M is a power of two.
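
To make the power-of-two assumption concrete, here is a small self-contained sketch of the usual bitwise round-up-to-block-size trick; the exact pad expression used in pit_sparse_flash_attention_v2.py is not visible in this review, so the formula below is illustrative rather than the PR's code:

# Round context_size up to the next multiple of block_size_M without division.
# The identity only holds when block_size_M is a power of two.
block_size_M = 64
context_size = 4103  # example length that is not block-aligned

assert block_size_M > 0 and block_size_M & (block_size_M - 1) == 0, "block_size_M must be a power of two"

pad = -context_size & (block_size_M - 1)   # 57 in this example
padded_size = context_size + pad           # 4160, i.e. 65 * 64

assert padded_size % block_size_M == 0 and 0 <= pad < block_size_M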

    return out[..., :context_size, :head_dim]

def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):

Copilot AI May 26, 2025

[nitpick] Consider adding a docstring for the new sglang_vs function to explain its purpose, parameter expectations, and how it differs from vertical_slash_sparse_attention.
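
For illustration, a possible docstring along the lines this comment asks for; the parameter descriptions are inferred from the signature above and the PR overview, not confirmed by the diff:

def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):
    """Vertical-and-slash sparse attention entry point for the SGLang backend.

    Sketch of a possible docstring; parameter semantics are inferred, not
    taken from the PR itself.

    Args:
        query, key, value: attention inputs, assumed to be laid out as
            (batch_size, num_heads, context_size, head_dim), matching the
            existing vertical_slash_sparse_attention path.
        v_idx: per-head indices of the vertical (column) pattern to keep.
        s_idx: per-head indices of the slash (diagonal) pattern to keep.
        block_size_M: query-block size used by the kernel (default 64).
        block_size_N: key/value-block size used by the kernel (default 64).

    Returns:
        The attention output, trimmed back to the unpadded
        (batch_size, num_heads, context_size, head_dim) shape.
    """
    ...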

@iofu728 changed the title from "Feature(MInference): add SGLang and vLLM vertical_and_slash flash attention" to "Feature(MInference): support SGLang and vLLM vertical_and_slash flash attention and index kernels" on May 30, 2025
@iofu728 merged commit b46eea0 into main on May 30, 2025
1 check passed
@iofu728 deleted the hjiang/use_sglang_vllm_kernel branch on May 30, 2025 at 06:44