Feature(MInference): support SGLang and vLLM vertical_and_slash flash attention and index kernels #153
Conversation
Pull Request Overview
This PR adds support for SGLang and vLLM in the vertical-and-slash flash attention kernels. Key changes include replacing a debugging breakpoint() with an assertion in the softmax fusion block's error handling, adjusting the pad computation in the vertical sparse attention function, and adding a new sglang_vs function for SGLang-based flash attention.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| minference/ops/xattention_fa.py | Updated error handling by replacing a breakpoint() with an assert. |
| minference/ops/pit_sparse_flash_attention_v2.py | Adjusted pad calculation and added try/except imports with a new sglang_vs function for SGLang integration. |
```diff
     except:
-        breakpoint()
+        assert False, f"xAttention error, k_len: {k_len}, segment size: {segment_size}"
```
Copilot AI · May 26, 2025
Using 'assert False' for error handling may be less informative in production; consider raising a specific exception (e.g., RuntimeError) with the same message.
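A minimal sketch of that suggestion, assuming `k_len` and `segment_size` are the variables in scope at the except site (the values below are placeholders):

```python
# Placeholder values standing in for the variables in scope at the except site.
k_len, segment_size = 4096, 256

try:
    # ... softmax fusion block (elided) ...
    raise ValueError("demo failure")
except Exception as exc:
    # `assert False` is stripped under `python -O`; raising RuntimeError keeps
    # the message in production and chains the original cause.
    raise RuntimeError(
        f"xAttention error, k_len: {k_len}, segment size: {segment_size}"
    ) from exc
```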
```python
    block_size_M: int = 64,
    block_size_N: int = 64,
):
    batch_size, num_heads, context_size, head_dim = query.shape
```
Copilot AI · May 26, 2025
The new pad computation uses a bitwise operation that assumes block_size_M is a power of two; adding a clarifying comment or an explicit check could improve code clarity.
Suggested change:

```diff
 batch_size, num_heads, context_size, head_dim = query.shape
+# Ensure block_size_M is a power of two, as required for the bitwise operation below.
+if block_size_M & (block_size_M - 1) != 0 or block_size_M <= 0:
+    raise ValueError("block_size_M must be a power of two and greater than zero.")
+# Compute padding size. The bitwise operation assumes block_size_M is a power of two.
```
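The exact pad expression is not visible in this excerpt, but a common form of the power-of-two idiom the comment refers to looks like the sketch below (`pad_to_block` is a hypothetical name, not the PR's code):

```python
def pad_to_block(context_size: int, block_size_M: int = 64) -> int:
    # Guard from the suggestion above: x & (x - 1) == 0 holds only for
    # powers of two (and zero), so reject anything else up front.
    if block_size_M <= 0 or block_size_M & (block_size_M - 1) != 0:
        raise ValueError("block_size_M must be a power of two and greater than zero.")
    # (-n) & (m - 1) gives the padding that rounds n up to the next multiple
    # of m; this is only correct when m is a power of two.
    return (-context_size) & (block_size_M - 1)

assert pad_to_block(100, 64) == 28   # 100 + 28 == 128, a multiple of 64
assert pad_to_block(128, 64) == 0    # already block-aligned
```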
```python
    return out[..., :context_size, :head_dim]


def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):
```
Copilot AI · May 26, 2025
[nitpick] Consider adding a docstring for the new sglang_vs function to explain its purpose, parameter expectations, and how it differs from vertical_slash_sparse_attention.
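One possible shape for that docstring, sketched from the signature alone; the described semantics of `v_idx` and `s_idx` are assumptions based on the vertical-and-slash pattern, not taken from the PR:

```python
def sglang_vs(query, key, value, v_idx, s_idx, block_size_M: int = 64, block_size_N: int = 64):
    """Vertical-and-slash sparse flash attention entry point for SGLang.

    Args:
        query, key, value: attention inputs, assumed to be shaped
            (batch_size, num_heads, context_size, head_dim).
        v_idx: indices of the vertical (column) tokens to keep per head (assumed).
        s_idx: indices of the slash (diagonal) lines to keep per head (assumed).
        block_size_M: query block size.
        block_size_N: key/value block size.

    Returns:
        Attention output with the same shape as `query`.

    Unlike `vertical_slash_sparse_attention`, this variant targets the
    SGLang serving path (per the PR description).
    """
```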
What does this PR do?
Before submitting
Who can review?
@iofu728, @Starmys