Fix uniform_decode=True in prefill when using CUDA Graph with single-token prompt #26892
base: main
Conversation
…se when using CUDA Graph and the prompt contains only a single token. Signed-off-by: Sugar-zsg <952242923@qq.com>
Code Review
This pull request aims to fix an issue where uniform_decode is incorrectly enabled for single-token prompts in CUDA graph mode, which can prevent the use of encoder outputs. The approach is to add a helper function _has_prefill_tokens_scheduled to detect if any request is still in the prefill phase and disable uniform_decode accordingly.
My review found a critical issue in the implementation. The new helper function is called with an incorrect argument, which makes the fix ineffective. I've provided a detailed comment and a code suggestion to resolve this bug. Once fixed, the change should correctly address the described problem.
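To make the approach concrete, here is a minimal, self-contained sketch of what a prefill-detection helper and the resulting uniform_decode veto could look like. The data shapes (a per-request map of scheduled token counts, cached prompt lengths, and computed-token counters) are loosely modeled on vLLM's GPU model runner and are assumptions for illustration, not the actual diff in this PR:

```python
from dataclasses import dataclass


@dataclass
class CachedRequest:
    """Minimal stand-in for a cached request state (illustrative only)."""
    prompt_token_ids: list[int]
    num_computed_tokens: int = 0


def _has_prefill_tokens_scheduled(
    requests: dict[str, CachedRequest],
    num_scheduled_tokens: dict[str, int],
) -> bool:
    """Return True if any scheduled request is still computing its prompt."""
    for req_id in num_scheduled_tokens:
        req = requests[req_id]
        # Still in prefill while the prompt is not fully computed, even if
        # only a single token is scheduled (the single-token-prompt case).
        if req.num_computed_tokens < len(req.prompt_token_ids):
            return True
    return False


def is_uniform_decode(
    requests: dict[str, CachedRequest],
    num_scheduled_tokens: dict[str, int],
    uniform_decode_query_len: int = 1,
) -> bool:
    """Length-based uniform-decode check, vetoed by the prefill check above."""
    max_query_len = max(num_scheduled_tokens.values())
    return (
        max_query_len == uniform_decode_query_len
        and not _has_prefill_tokens_scheduled(requests, num_scheduled_tokens)
    )
```

With a veto like this, the first step of a single-token prompt is still treated as prefill (so encoder outputs remain usable), while genuine decode-only steps continue to qualify for the decode-only CUDA graphs.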
Could you please review this PR when you have time? Thanks. @russellb
When using the second config (with a single-token prompt and …
This issue was discovered while testing a previous PR (#25208).

When running inference with the Whisper model using CUDAGraphMode=FULL_DECODE_ONLY, I observed the following behavior:

- This prompt works correctly and uses CUDA Graph:
- This prompt fails to reuse encoder results (the first decoder step switches to FULL mode):
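The exact prompts are not shown here; as a hedged illustration of the kind of setup involved, the sketch below contrasts a multi-token decoder prompt with a single-token one. The model name, audio handling, prompt structure, and the compilation_config field are assumptions about the vLLM API, not details taken from this PR:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Dummy 16 kHz waveform so the sketch is self-contained; any real audio works.
audio = (np.zeros(16000, dtype=np.float32), 16000)

llm = LLM(
    model="openai/whisper-large-v3",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)

# Multi-token decoder prompt: reported to work and keep using CUDA Graph.
working_prompt = {
    "encoder_prompt": {"prompt": "", "multi_modal_data": {"audio": audio}},
    "decoder_prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
}

# Single-token decoder prompt: reported to switch the first decoder step to
# FULL mode and fail to reuse the encoder outputs.
failing_prompt = {
    "encoder_prompt": {"prompt": "", "multi_modal_data": {"audio": audio}},
    "decoder_prompt": "<|startoftranscript|>",
}

for prompt in (working_prompt, failing_prompt):
    outputs = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=64))
    print(outputs[0].outputs[0].text)
```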
This PR fixes an issue where, when using CUDA Graph, a prompt containing only a single token causes uniform_decode=True during the prefill phase, preventing the use of encoder outputs.

Purpose
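As a purely numerical illustration of why the misclassification happens (variable names here are illustrative, not vLLM's): on the very first decoder step of a single-token prompt, exactly one query token is scheduled per request, which a length-only check cannot distinguish from a decode step:

```python
# First scheduling step for two requests whose decoder prompts are one token each.
num_reqs = 2
tokens_scheduled_per_request = 1      # the entire prompt is a single token
max_query_len = 1
uniform_decode_query_len = 1          # query length of a pure decode step

# A length-only heuristic sees a "uniform decode" batch...
looks_like_uniform_decode = (
    max_query_len == uniform_decode_query_len
    and num_reqs * tokens_scheduled_per_request == num_reqs * max_query_len
)
print(looks_like_uniform_decode)  # True, even though this step is still prefill
```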
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.