Conversation

@benchislett (Collaborator) commented Sep 23, 2025

Purpose

I've been hitting rare illegal memory accesses while developing with the trtllm-gen FlashInfer kernels.

I believe the main issue comes down to the trtllm-gen and non-trtllm-gen kernels needing separate workspaces. Here is a merged FlashInfer PR that updates the tests to avoid this issue:

flashinfer-ai/flashinfer#1643

Detailed summary

Flashinfer's wrapper-based kernels (both prefill and decode) use the workspace buffer as a scratch-space for storing intermediate results (such as split-k accumulation data). They do not require it to be zero-initialized and might not clean it up after writing data into it.

On the other hand, trtllm-gen kernels require their workspace buffer to be zero-initialized, and they clean it up after use so that the all-zero invariant is maintained between calls.

vLLM currently uses the same workspace buffer for all four combinations (trtllm-gen or non-trtllm-gen, prefill or decode). This leads to rare illegal accesses when one kernel corrupts the state another expects. This PR adds a dedicated, zero-initialized buffer for the trtllm-gen kernels. With this change applied, I stress-tested my development deployment and no longer see any crashes.
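
For illustration, a minimal sketch of the separation this PR introduces (buffer sizes, variable names, and device are illustrative, not the PR's exact values):

    import torch

    # Scratch workspace for the wrapper-based (non-trtllm-gen) kernels:
    # they may leave stale intermediate data behind, which is harmless
    # to them, so an uninitialized buffer is sufficient.
    flashinfer_workspace_buffer = torch.empty(
        256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

    # Dedicated workspace for the trtllm-gen kernels: it must start
    # zeroed, and the kernels zero it again after use, so it cannot
    # safely be shared with kernels that leave garbage behind.
    trtllm_gen_workspace_buffer = torch.zeros(
        256 * 1024 * 1024, dtype=torch.uint8, device="cuda")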

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett requested a review from mgoin as a code owner September 23, 2025 22:04
@benchislett benchislett added the bug Something isn't working label Sep 23, 2025
@mergify mergify bot added the v1 label Sep 23, 2025
@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request correctly identifies and fixes a memory corruption issue by introducing a separate, zero-initialized workspace buffer for trtllm-gen FlashInfer kernels. This prevents state corruption between different kernel types. My review focuses on improving the implementation of this fix by addressing a potential race condition. I've suggested making the initialization of the new global workspace buffer thread-safe using a lock to prevent issues in multi-threaded environments.
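
For reference, a minimal sketch of the thread-safe lazy initialization the bot suggests (the lock name, buffer size, and device here are illustrative assumptions, not the PR's code):

    import threading

    import torch

    _workspace_lock = threading.Lock()
    trtllm_gen_workspace_buffer = None

    def _get_trtllm_gen_workspace_buffer():
        global trtllm_gen_workspace_buffer
        # Fast path: skip the lock once the buffer exists. The re-check
        # under the lock stops two threads from both allocating when
        # they race on first use.
        if trtllm_gen_workspace_buffer is None:
            with _workspace_lock:
                if trtllm_gen_workspace_buffer is None:
                    trtllm_gen_workspace_buffer = torch.zeros(
                        256 * 1024 * 1024, dtype=torch.uint8, device="cuda")
        return trtllm_gen_workspace_buffer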

@mgoin (Member) left a comment

Seems reasonable. Is it fine for prefill and decode to share a workspace?

@benchislett (Collaborator, Author) commented

@mgoin yes, it seems fine. They both use the workspace in a similar way.

This is implied by the FlashInfer tests, where all cases, prefill and decode alike, use the same global workspace buffer.
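
That sharing pattern looks roughly like this, following FlashInfer's documented wrapper usage (buffer size and layout argument are illustrative):

    import torch
    import flashinfer

    # Prefill and decode wrappers constructed over one shared scratch
    # buffer, as in the FlashInfer tests.
    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    prefill_wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")
    decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")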

@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 23, 2025
@benchislett (Collaborator, Author) commented

See also this PR comment, which notes that it is expected behaviour for the buffer to be re-used between tests.

@mgoin mgoin enabled auto-merge (squash) September 23, 2025 23:25
@mgoin mgoin merged commit 1983609 into vllm-project:main Sep 24, 2025
52 of 55 checks passed
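
The inline review thread below is attached to the new lazy-allocation helper in the diff (the excerpt is truncated mid-call; the tail shown here, including the FLASHINFER_WORKSPACE_BUFFER_SIZE name and the dtype/device arguments, is an assumption rather than the PR's exact code):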
def _get_trtllm_gen_workspace_buffer():
    global trtllm_gen_workspace_buffer
    if trtllm_gen_workspace_buffer is None:
        # trtllm-gen needs a zeroed workspace, hence torch.zeros rather
        # than torch.empty. (Arguments below are assumptions; the
        # excerpt is truncated here in the original diff.)
        trtllm_gen_workspace_buffer = torch.zeros(
            FLASHINFER_WORKSPACE_BUFFER_SIZE, dtype=torch.uint8, device="cuda")
    return trtllm_gen_workspace_buffer
A collaborator left an inline comment:

> From the FI PR it says trtllm-gen requires zero-init buffer but flashinfer doesn't need it

No. The trtllm-gen kernel and the FI kernel should each use their own workspace, since the FI kernel does not require a zero-init workspace.

Can we make it always a zero-init buffer? My only concern is that if we run both flavors of kernels for perf, we'd end up occupying double the workspace.

A collaborator replied:

I suppose FI not cleaning up the buffer is the concern here, and we want to separate the two?
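
For discussion's sake, a hypothetical single-buffer variant (not what this PR does) would have to restore the zero-init invariant before every trtllm-gen launch, trading memory for an extra memset:

    import torch

    # Hypothetical alternative: share one workspace and re-zero it
    # before each trtllm-gen launch, because the wrapper-based kernels
    # may leave stale data behind. This trades the memory of a second
    # buffer for a device memset on every trtllm-gen call.
    shared_workspace = torch.zeros(
        256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

    def workspace_for_trtllm_gen() -> torch.Tensor:
        shared_workspace.zero_()  # restore the all-zero invariant in place
        return shared_workspace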

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025