[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen #25520
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request correctly identifies and fixes a memory corruption issue by introducing a separate, zero-initialized workspace buffer for trtllm-gen FlashInfer kernels. This prevents state corruption between different kernel types. My review focuses on improving the implementation of this fix by addressing a potential race condition. I've suggested making the initialization of the new global workspace buffer thread-safe using a lock to prevent issues in multi-threaded environments.
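A minimal sketch of the thread-safe lazy initialization the review suggests, assuming an illustrative buffer size and variable names (not necessarily those used in vLLM):

```python
import threading

import torch

# Illustrative constants and names only; the actual size and names in vLLM may differ.
_WORKSPACE_SIZE_BYTES = 128 * 1024 * 1024

_trtllm_gen_workspace_buffer = None
_workspace_lock = threading.Lock()


def _get_trtllm_gen_workspace_buffer(device: str = "cuda") -> torch.Tensor:
    global _trtllm_gen_workspace_buffer
    if _trtllm_gen_workspace_buffer is None:
        with _workspace_lock:
            # Double-checked locking: only one thread performs the allocation,
            # and the buffer is zero-initialized as trtllm-gen requires.
            if _trtllm_gen_workspace_buffer is None:
                _trtllm_gen_workspace_buffer = torch.zeros(
                    _WORKSPACE_SIZE_BYTES, dtype=torch.uint8, device=device)
    return _trtllm_gen_workspace_buffer
```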
mgoin left a comment:
Seems reasonable. Is it fine for prefill and decode to share the workspace?
@mgoin yes, it seems fine. They both use the workspace in a similar way. It's implied by the FlashInfer tests, since all cases use the same global workspace buffer for both prefill and decode.
See also this PR comment, which notes that re-using the buffer between tests is expected behaviour.
```python
def _get_trtllm_gen_workspace_buffer():
    global trtllm_gen_workspace_buffer
    if trtllm_gen_workspace_buffer is None:
        trtllm_gen_workspace_buffer = torch.zeros(
```
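The diff excerpt above is cut off by the review view; a plausible continuation, assuming a hypothetical size constant `_WORKSPACE_SIZE_BYTES` (the actual constant and size in the PR may differ), is:

```python
        # Assumed continuation of the truncated excerpt above; the exact
        # size constant used in the PR may differ.
        trtllm_gen_workspace_buffer = torch.zeros(
            _WORKSPACE_SIZE_BYTES, dtype=torch.uint8, device="cuda")
    return trtllm_gen_workspace_buffer
```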
From the FlashInfer PR, it says that trtllm-gen requires a zero-initialized buffer, but the FlashInfer kernels do not:
"No. trtllm-gen kernel and fi kernel should re-use individual workspace as fi kernel does not require zero-init workspace."
Can we make it always a zero-init buffer? My only concern is that if we run both flavors of kernels for performance, we'd end up occupying double the workspace.
I suppose the concern here is that FlashInfer does not clean up the buffer, and we want to keep the two separate?
Purpose
I've been getting some rare illegal memory accesses while developing with the trtllm-gen FlashInfer kernels.
I believe the main issue comes down to the fact that the trtllm-gen and non-trtllm-gen kernels need separate workspaces. Here is a FlashInfer PR (merged) that updates the tests to avoid this issue.
flashinfer-ai/flashinfer#1643
Detailed summary
FlashInfer's wrapper-based kernels (both prefill and decode) use the workspace buffer as scratch space for storing intermediate results (such as split-k accumulation data). They do not require it to be zero-initialized and may not clean it up after writing data into it.
The trtllm-gen kernels, on the other hand, require their workspace buffer to be zero-initialized, and they clean it up after use to maintain this invariant.
vLLM currently uses the same workspace buffer for all four combinations (trtllm-gen or not, prefill or decode). This leads to rare illegal memory accesses when one kernel corrupts the state expected by another. This PR adds a dedicated, zero-initialized buffer for the trtllm-gen kernels. With this change in place, I stress-tested my development deployment and no longer see any crashes.
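A minimal sketch of the buffer-selection idea described above; the function name, size constant, and the flag used to detect trtllm-gen are illustrative assumptions, not the actual vLLM call sites:

```python
import torch

_WORKSPACE_SIZE_BYTES = 128 * 1024 * 1024  # assumed size, in bytes

_shared_workspace_buffer = None       # scratch space for wrapper-based kernels
_trtllm_gen_workspace_buffer = None   # zero-initialized, trtllm-gen only


def get_workspace(use_trtllm_gen: bool, device: str = "cuda") -> torch.Tensor:
    """Return the workspace matching the kernel flavor being launched."""
    global _shared_workspace_buffer, _trtllm_gen_workspace_buffer
    if use_trtllm_gen:
        if _trtllm_gen_workspace_buffer is None:
            # trtllm-gen expects a zero-initialized workspace and restores it
            # after use, so it must not be shared with the wrapper kernels.
            _trtllm_gen_workspace_buffer = torch.zeros(
                _WORKSPACE_SIZE_BYTES, dtype=torch.uint8, device=device)
        return _trtllm_gen_workspace_buffer
    if _shared_workspace_buffer is None:
        # The wrapper-based prefill/decode kernels only need scratch space and
        # may leave arbitrary data behind, so zero-init is not required.
        _shared_workspace_buffer = torch.empty(
            _WORKSPACE_SIZE_BYTES, dtype=torch.uint8, device=device)
    return _shared_workspace_buffer
```

Giving the trtllm-gen path its own lazily allocated buffer keeps the extra memory cost to a single additional workspace, and only when those kernels are actually used.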