Add batch invariant kernel override for FlashInfer backend [2/n] #25769
Conversation
Code Review
This pull request introduces an optional batch-invariant mode, controlled by the VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT environment variable, to ensure deterministic outputs regardless of batch size. This is achieved by overriding several performance-optimized but non-deterministic kernels with deterministic alternatives, including custom Triton kernels for matmul, log_softmax, and mean, and by forcing deterministic configurations in attention backends like FlashInfer and FlexAttention. The changes are well-integrated and include a comprehensive test suite to validate the batch invariance. My review focuses on improving the robustness of how the controlling environment variable is parsed in both C++ and Python code to handle common boolean string values and prevent potential issues.
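For context, the kind of robust boolean parsing the review asks for might look like the sketch below. The helper name and the accepted spellings are assumptions for illustration, not vLLM's actual implementation.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable robustly (hypothetical helper).

    Treats common truthy spellings ("1", "true", "yes", "on") as True and
    everything else as False, so values like "TRUE" or "Yes" don't silently
    disable the feature.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


# Example: gate the batch-invariant kernel overrides on the flag.
BATCH_INVARIANT = env_flag("VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT")
```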
It seems support wasn't added for the trtllm path in flashinfer. Should we update supports_trtllm_attention to also check against this environment variable so we force flashinfer?
Lines 184 to 192 in 984d184:

```python
@functools.cache
def supports_trtllm_attention() -> bool:
    """
    TRTLLM attention is supported if the platform is SM100 and
    NVIDIA artifactory is accessible
    """
    # Requires SM100 and NVIDIA artifactory to be accessible to download cubins
    return current_platform.is_device_capability(
        100) and has_nvidia_artifactory()
```
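For illustration, the reviewer's suggestion might look roughly like the sketch below. The `vllm_kernel_override_batch_invariant()` helper name is hypothetical, and `current_platform` / `has_nvidia_artifactory` are assumed to come from the surrounding module; per the discussion that follows, this change was ultimately judged unnecessary.

```python
import functools

# current_platform, has_nvidia_artifactory, and the batch-invariant flag
# helper are assumed to exist in the surrounding module.


@functools.cache
def supports_trtllm_attention() -> bool:
    # Hypothetical: skip the TRTLLM path entirely when batch-invariant
    # mode is requested, forcing the plain FlashInfer kernels.
    if vllm_kernel_override_batch_invariant():
        return False
    # Requires SM100 and NVIDIA artifactory to be accessible to download cubins
    return current_platform.is_device_capability(
        100) and has_nvidia_artifactory()
```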
hmm, not sure I follow. are you suggesting we force trtllm on top of forcing flashinfer (in the case of batch_invariant=1)?
from what I gather -- trtllm is supported quite cleanly as an option independent of the batch invariance:
https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/flashinfer.py#L540-L541
Oh okay maybe I misunderstood. I saw that you only used the new parameters in `plan` in the `if not attn_metadata.prefill_use_trtllm:` case, so I assumed that this only works for the non-trtllm backend. If it works for both backends, then my comment can be disregarded
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 6760e2d to bf4df9e
Force-pushed bf4df9e to 68024d1
LGTM, just a few nits
Force-pushed 35d4192 to 64930d4
Thanks for the work!
Please also fix the pre-commit issue
Signed-off-by: Bram Wasti <bwasti@meta.com>
addressed all comments in the latest!
Thanks for the work! A few more thoughts
```diff
@@ -42,6 +45,7 @@
 from vllm.v1.kv_cache_interface import AttentionSpec

 FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024
+FLASHINFER_WORKSPACE_BUFFER_SIZE_BATCH_INVARIANT = 2048 * 1024 * 1024
```
Add a comment for the number here
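One way the requested comment could read is sketched below; the stated rationale is an assumption for illustration, not taken from the PR.

```python
# 256 MiB default FlashInfer workspace.
FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024
# Larger (2 GiB) workspace for batch-invariant mode, which uses fixed,
# batch-size-independent scheduling and therefore needs more scratch space.
FLASHINFER_WORKSPACE_BUFFER_SIZE_BATCH_INVARIANT = 2048 * 1024 * 1024
```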
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Bram Wasti <bwasti@fb.com>
LGTM, thanks for the work!
This PR is causing fullgraph test to fail on main: https://buildkite.com/vllm/ci/builds/33518/steps/canvas?sid=0199ad61-7880-4598-9503-66481c15c00c Reverting for now.
…m-project#25769) Signed-off-by: Bram Wasti <bwasti@meta.com> Signed-off-by: Bram Wasti <bwasti@fb.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…m-project#25769) Signed-off-by: Bram Wasti <bwasti@meta.com> Signed-off-by: Bram Wasti <bwasti@fb.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
…m-project#25769) Signed-off-by: Bram Wasti <bwasti@meta.com> Signed-off-by: Bram Wasti <bwasti@fb.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Bram Wasti <bwasti@meta.com>
…m-project#25769) Signed-off-by: Bram Wasti <bwasti@meta.com> Signed-off-by: Bram Wasti <bwasti@fb.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Continuing from #25603, this patch extends the batch-invariant kernel overrides to the much faster FlashInfer backend.
(This might look like a big change, but I am going to rebase onto #25603 and most of it will go away; mostly just look at the flashinfer.py file.)
Purpose
Add optional determinism to the FlashInfer backend.
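A minimal usage sketch, assuming the environment variable from this PR and vLLM's FlashInfer backend selector; the model name and sampling settings are placeholders.

```python
import os

# Enable the batch-invariant kernel overrides and select the FlashInfer
# backend before vLLM is initialized.
os.environ["VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=64)

# With the override active, the same prompt should produce identical output
# regardless of how many other requests are batched alongside it.
out = llm.generate(["What makes a kernel batch-invariant?"], params)
print(out[0].outputs[0].text)
```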
Test Plan
Test Result
Pass.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.