Fix uniform_decode=True in prefill when using CUDA Graph with single-token prompt #26892
base: main
Conversation
…se when using CUDA Graph and the prompt contains only a single token. Signed-off-by: Sugar-zsg <952242923@qq.com>
Code Review
This pull request aims to fix an issue where uniform_decode is incorrectly enabled for single-token prompts in CUDA graph mode, which can prevent the use of encoder outputs. The approach is to add a helper function _has_prefill_tokens_scheduled to detect if any request is still in the prefill phase and disable uniform_decode accordingly.
My review found a critical issue in the implementation. The new helper function is called with an incorrect argument, which makes the fix ineffective. I've provided a detailed comment and a code suggestion to resolve this bug. Once fixed, the change should correctly address the described problem.
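To make the approach concrete, here is a minimal, self-contained sketch of what a prefill-detection helper and the resulting uniform_decode veto could look like. The data shapes (a per-request map of scheduled token counts, cached prompt lengths, and computed-token counters) are loosely modeled on vLLM's GPU model runner and are assumptions for illustration, not the actual diff in this PR:

```python
from dataclasses import dataclass


@dataclass
class CachedRequest:
    """Minimal stand-in for a cached request state (illustrative only)."""
    prompt_token_ids: list[int]
    num_computed_tokens: int = 0


def _has_prefill_tokens_scheduled(
    requests: dict[str, CachedRequest],
    num_scheduled_tokens: dict[str, int],
) -> bool:
    """Return True if any scheduled request is still computing its prompt."""
    for req_id in num_scheduled_tokens:
        req = requests[req_id]
        # Still in prefill while the prompt is not fully computed, even if
        # only a single token is scheduled (the single-token-prompt case).
        if req.num_computed_tokens < len(req.prompt_token_ids):
            return True
    return False


def is_uniform_decode(
    requests: dict[str, CachedRequest],
    num_scheduled_tokens: dict[str, int],
    uniform_decode_query_len: int = 1,
) -> bool:
    """Length-based uniform-decode check, vetoed by the prefill check above."""
    max_query_len = max(num_scheduled_tokens.values())
    return (
        max_query_len == uniform_decode_query_len
        and not _has_prefill_tokens_scheduled(requests, num_scheduled_tokens)
    )
```

With a veto like this, the first step of a single-token prompt is still treated as prefill (so encoder outputs remain usable), while genuine decode-only steps continue to qualify for the decode-only CUDA graphs.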
Could you please review this PR when you have time? Thanks. @russellb
When using the second config (with a single-token prompt and …
This issue was discovered while testing a previous PR (#25208).

When running inference with the Whisper model using CUDAGraphMode=FULL_DECODE_ONLY, I observed the following behavior:

- This prompt works correctly and uses CUDA Graph:
- This prompt fails to reuse encoder results (the first decoder step switches to FULL mode):
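The exact prompts are not shown here; as a hedged illustration of the kind of setup involved, the sketch below contrasts a multi-token decoder prompt with a single-token one. The model name, audio handling, prompt structure, and the compilation_config field are assumptions about the vLLM API, not details taken from this PR:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Dummy 16 kHz waveform so the sketch is self-contained; any real audio works.
audio = (np.zeros(16000, dtype=np.float32), 16000)

llm = LLM(
    model="openai/whisper-large-v3",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)

# Multi-token decoder prompt: reported to work and keep using CUDA Graph.
working_prompt = {
    "encoder_prompt": {"prompt": "", "multi_modal_data": {"audio": audio}},
    "decoder_prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
}

# Single-token decoder prompt: reported to switch the first decoder step to
# FULL mode and fail to reuse the encoder outputs.
failing_prompt = {
    "encoder_prompt": {"prompt": "", "multi_modal_data": {"audio": audio}},
    "decoder_prompt": "<|startoftranscript|>",
}

for prompt in (working_prompt, failing_prompt):
    outputs = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=64))
    print(outputs[0].outputs[0].text)
```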
This PR fixes an issue where, when using CUDA Graph, a prompt containing only a single token causes uniform_decode=True during the prefill phase, preventing the use of encoder outputs.

Purpose
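As a purely numerical illustration of why the misclassification happens (variable names here are illustrative, not vLLM's): on the very first decoder step of a single-token prompt, exactly one query token is scheduled per request, which a length-only check cannot distinguish from a decode step:

```python
# First scheduling step for two requests whose decoder prompts are one token each.
num_reqs = 2
tokens_scheduled_per_request = 1      # the entire prompt is a single token
max_query_len = 1
uniform_decode_query_len = 1          # query length of a pure decode step

# A length-only heuristic sees a "uniform decode" batch...
looks_like_uniform_decode = (
    max_query_len == uniform_decode_query_len
    and num_reqs * tokens_scheduled_per_request == num_reqs * max_query_len
)
print(looks_like_uniform_decode)  # True, even though this step is still prefill
```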
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.