
Conversation


@bwasti (Contributor) commented on Sep 26, 2025

Continuing from #25603, this patch extends batch-invariant kernel overrides to the much faster FlashInfer backend.
(This might look like a big change, but I am going to rebase onto #25603 and most of it will go away; the relevant changes are in flashinfer.py.)

Purpose

Add optional determinism to the FlashInfer backend.

Test Plan

VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT=1 pytest -s -v tests/v1/generation/test_batch_invariance.py -k test_logprobs_bitwise_batch_invariance_bs1_vs_bsN

Test Result

Pass.
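
For context, the test above checks that greedy decoding is batch-invariant: the sampled tokens and their logprobs must be bitwise identical whether prompts are processed one at a time or all together in a batch. A minimal sketch of that idea (not the actual vLLM test code; the model and prompts here are arbitrary):

# Conceptual sketch of a bs=1 vs bs=N bitwise-invariance check.
# Not the vLLM test; model, prompts, and structure are illustrative.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "1 + 1 =", "vLLM is a"]
params = SamplingParams(temperature=0.0, max_tokens=8, logprobs=1)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any small model works here

single = [llm.generate([p], params)[0] for p in prompts]  # batch size 1
batched = llm.generate(prompts, params)                    # batch size N

for s, b in zip(single, batched):
    assert s.outputs[0].token_ids == b.outputs[0].token_ids
    for lp_single, lp_batch in zip(s.outputs[0].logprobs, b.outputs[0].logprobs):
        for tok_id, lp in lp_single.items():
            # bitwise equality of logprobs, not approximate closeness
            assert lp.logprob == lp_batch[tok_id].logprob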


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces an optional batch-invariant mode, controlled by the VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT environment variable, to ensure deterministic outputs regardless of batch size. This is achieved by overriding several performance-optimized but non-deterministic kernels with deterministic alternatives, including custom Triton kernels for matmul, log_softmax, and mean, and by forcing deterministic configurations in attention backends like FlashInfer and FlexAttention. The changes are well-integrated and include a comprehensive test suite to validate the batch invariance. My review focuses on improving the robustness of how the controlling environment variable is parsed in both C++ and Python code to handle common boolean string values and prevent potential issues.
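
In rough pseudocode, the override pattern described above looks like the following (an illustrative sketch, not the actual vLLM code; the helper names are hypothetical):

# Sketch of an env-var-gated deterministic override. Hypothetical names;
# not the vLLM implementation.
import os
import torch


def batch_invariant_enabled() -> bool:
    # Accept common truthy strings, per the parsing-robustness note above.
    return os.getenv("VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT", "0").lower() in ("1", "true", "yes")


def deterministic_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Stand-in for a Triton kernel that uses a fixed tiling/reduction order,
    # so results do not depend on how many rows happen to be in the batch.
    return (a.double() @ b.double()).to(a.dtype)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    if batch_invariant_enabled():
        return deterministic_matmul(a, b)
    return a @ b  # fast path; reduction order may vary with batch shape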

Member left a comment

It seems support wasn't added for the trtllm path in flashinfer. Should we update supports_trtllm_attention to also check against this environment variable so we force flashinfer?

@functools.cache
def supports_trtllm_attention() -> bool:
    """
    TRTLLM attention is supported if the platform is SM100 and
    NVIDIA artifactory is accessible
    """
    # Requires SM100 and NVIDIA artifactory to be accessible to download cubins
    return current_platform.is_device_capability(
        100) and has_nvidia_artifactory()
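
For illustration, the suggested change might look roughly like this (a sketch of the reviewer's idea, not code from the PR; is_batch_invariant_mode is a hypothetical helper):

@functools.cache
def supports_trtllm_attention() -> bool:
    """
    TRTLLM attention is supported if the platform is SM100 and
    NVIDIA artifactory is accessible
    """
    # Hypothetical: if the TRTLLM path were not covered by the deterministic
    # overrides, fall back to the plain FlashInfer path whenever the
    # batch-invariant mode is requested.
    if is_batch_invariant_mode():
        return False
    # Requires SM100 and NVIDIA artifactory to be accessible to download cubins
    return current_platform.is_device_capability(
        100) and has_nvidia_artifactory()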

@bwasti (Contributor Author) replied

hmm, not sure I follow. are you suggesting we force trtllm on top of forcing flashinfer (in the case of batch_invariant=1)?

from what I gather -- trtllm is supported quite cleanly as an option independent of the batch invariance:
https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/flashinfer.py#L540-L541

Member replied

Oh okay maybe I misunderstood. I saw that you only used the new parameters in plan in the if not attn_metadata.prefill_use_trtllm: case, so I assumed that this only works for the non-trtllm backend. If it works for both backends, then my comment can be disregarded

@mergify bot commented on Sep 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bwasti.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Sep 29, 2025
@mergify bot removed the needs-rebase label on Sep 29, 2025
@bwasti changed the title from "Add batch invariant kernel override for FlashInfer backend" to "Add batch invariant kernel override for FlashInfer backend [2/n]" on Sep 30, 2025
@mgoin (Member) left a comment

LGTM, just a few nits

@bwasti force-pushed the det_flashinfer branch 2 times, most recently from 35d4192 to 64930d4 on October 1, 2025 at 15:26
@yewentao256 (Member) left a comment

Thanks for the work!
Please also fix the pre-commit issue

Signed-off-by: Bram Wasti <bwasti@meta.com>
@bwasti (Contributor Author) commented on Oct 3, 2025

addressed all comments in the latest!

@yewentao256 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 3, 2025
@yewentao256 (Member) left a comment

Thanks for the work! A few more thoughts

@@ -42,6 +45,7 @@
from vllm.v1.kv_cache_interface import AttentionSpec

FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024
FLASHINFER_WORKSPACE_BUFFER_SIZE_BATCH_INVARIANT = 2048 * 1024 * 1024

Add a comment for the number here
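
For instance, the kind of inline comment being requested might look like this (illustrative wording only; the stated rationale is an assumption, not text from the PR):

FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024
# Batch-invariant mode uses a fixed planning configuration rather than the
# usual split-KV heuristics, which can need more scratch space for long
# sequences, so reserve a larger 2 GiB workspace in that mode.
FLASHINFER_WORKSPACE_BUFFER_SIZE_BATCH_INVARIANT = 2048 * 1024 * 1024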

bwasti and others added 3 commits October 3, 2025 16:08
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
@yewentao256 (Member) left a comment

LGTM, thanks for the work!

@simon-mo merged commit 2f7dbc9 into vllm-project:main on Oct 4, 2025
48 of 51 checks passed
@DarkLight1337 (Member) commented

This PR is causing the fullgraph test to fail on main: https://buildkite.com/vllm/ci/builds/33518/steps/canvas?sid=0199ad61-7880-4598-9503-66481c15c00c

Reverting for now

tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
bwasti added a commit to bwasti/vllm that referenced this pull request Oct 7, 2025
bwasti added a commit to bwasti/vllm that referenced this pull request Oct 7, 2025
bwasti added a commit to bwasti/vllm that referenced this pull request Oct 7, 2025
Signed-off-by: Bram Wasti <bwasti@meta.com>
bwasti added a commit to bwasti/vllm that referenced this pull request Oct 7, 2025
Signed-off-by: Bram Wasti <bwasti@meta.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
@yewentao256 moved this from In Progress to Done in Batch-invariant Inference on Oct 13, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…m-project#25769)

Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants