Whisper cudagraphs support #25208
Conversation
Force-pushed from d4bfd9f to 1b215ee
Code Review
This pull request aims to fix CUDA graph warmup for Whisper by ensuring encoder lengths are always 1D arrays and guarding against scalar inputs in cross-attention. The changes involve adding a defensive np.atleast_1d call in cross_attention.py and modifying array creation in gpu_model_runner.py. While the changes are functionally correct, I've identified a piece of dead code in gpu_model_runner.py that can be removed to improve code clarity and maintainability.
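As a rough illustration of the guard the review describes (the helper name and arguments below are assumptions for illustration, not the actual `cross_attention.py` code), the idea is to normalize the encoder lengths before building the slot mapping, so a 0-d scalar handed in during cudagraph warmup cannot break the per-request loop:

```python
import numpy as np

def build_cross_slot_mapping(encoder_seq_lens, block_table, block_size):
    # During cudagraph warmup the caller may pass a 0-d scalar instead of a
    # per-request array, so normalize defensively before iterating.
    encoder_seq_lens = np.atleast_1d(np.asarray(encoder_seq_lens))

    slot_mapping = []
    for req_idx, enc_len in enumerate(encoder_seq_lens):
        for pos in range(int(enc_len)):
            block = block_table[req_idx][pos // block_size]
            slot_mapping.append(block * block_size + pos % block_size)
    return np.asarray(slot_mapping, dtype=np.int64)
```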
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Thank you for working on this! I will give it a try tomorrow.
Thank you. I ran some tests and found that the output is always the same default value. Using the script:
With this PR:
Before: transcription result: And the old one pitch on the way to Edgar Martinez swung on the line down the left field line for a base hit. Here comes Joy. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My oh my.
@baonudesifeizhai Thank you for the reply. I tested with the latest code. After removing the `cudagraph_mode` configuration, I was able to get the correct result, but the latency did not change. When using `cudagraph_mode=FULL`, the output issue still exists.
Could you please clarify how the launch parameters should be configured in order to enable CUDA Graph correctly?
During development, that is the output I would get if something broke in the encoder path -- either the encoder didn't run at all, or the output didn't get passed to the decoder properly. Just a tip in case that helps with debugging.
With `cudagraph_mode=FULL` the outputs seem fine. I will find a way to test the token output, since the current wav files are very short.
> @baonudesifeizhai Thank you for the reply. I tested with the latest code. After removing the `cudagraph_mode` configuration, I was able to get the correct result, but the latency did not change. When using `cudagraph_mode=FULL`, the output issue still exists.
>
> Could you please clarify how the launch parameters should be configured in order to enable CUDA Graph correctly?
…desifeizhai/vllm into whisper-cudagraphs-support
I used the same script and configuration as you, but I still cannot get the correct results.
I wasn’t able to run tests today, but I have a couple of questions...
Thanks!
Could you have a look? Thanks.
I've been following the comments. I was hoping to see @Sugar-zsg be able to replicate success. I will try it soon. Please also update all commit messages to include the `Signed-off-by` line.
Whisper does not work with full cudagraphs. That is being worked on in PR vllm-project#25208. The failure can be reproduced reliably via `tests/models/multimodal/generation/test_whisper.py`, at least in my H100 development environment. The tests passed on the PR and I'm not sure why. Regardless, this seems like the right change to make until vllm-project#25208 sorts out exactly what changes are needed. Signed-off-by: Russell Bryant <rbryant@redhat.com>
The default cudagraph mode changed. The Whisper tests are failing for me locally in my H100 environment. They pass in CI, but they also passed in CI on #25444 even though that broke H100 for me. With this PR, the failure is different: it is an accuracy failure instead of failing much earlier. Can you give these tests a try?
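For reference, a minimal sketch of how those tests can be run from a vLLM checkout (the exact command used in this comment is not shown; this assumes test dependencies are installed):

```python
# Invoke the Whisper generation tests referenced in this thread via pytest
# and exit with its return code; equivalent to running pytest on the file
# from the repository root.
import pytest

raise SystemExit(pytest.main(["-v", "tests/models/multimodal/generation/test_whisper.py"]))
```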
Should we fix the whisper fork problem now?
I was busy with other work for a while, but no matter how I tried, I couldn’t reproduce the same results as you reported. I reviewed the related code changes, but I couldn’t understand how this modification makes CUDA Graph take effect. From what I can tell, it seems that you’re trying to cache the encoder inputs to ensure that the decoder receives consistent inputs each time, allowing CUDA Graph to be used. However, I have the following question:
Could you please explain how this modification enables CUDA Graph to work? Thank you!
The original PR only prevented crashes with np.atleast_1d() but didn't solve the root cause.
Force-pushed from 9cfc80e to 2f4e230
After further analysis, I found that when the test prompt contains only a single token, there is no encoder input, which causes abnormal results (this also explains why I was never able to reproduce the same results as you earlier). I've already opened a PR to try to fix this issue. However, during re-testing, I discovered a new problem: when running batch inference, the same code works correctly on A100 GPUs but produces abnormal results for some batch requests when running on H20 GPUs. A100 and H20 results are shown below.







Purpose
Fix Whisper’s CUDA graph warmup by ensuring encoder lengths are always returned as 1-D arrays.
Guard cross-attention slot mapping against scalar inputs so cudagraph capture no longer crashes for encoder-decoder models.
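A minimal sketch of the warmup-side idea, assuming a helper that builds per-request encoder lengths for a dummy batch (the name and signature are illustrative, not the actual `gpu_model_runner.py` code):

```python
import numpy as np

def dummy_encoder_seq_lens(num_reqs: int, max_encoder_len: int) -> np.ndarray:
    # np.full always yields a 1-D array here, even for a single dummy request,
    # so downstream slot-mapping code never sees a bare scalar during capture.
    return np.full(num_reqs, max_encoder_len, dtype=np.int32)

assert dummy_encoder_seq_lens(1, 1500).shape == (1,)
assert dummy_encoder_seq_lens(4, 1500).shape == (4,)
```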
Test Plan
Launch server with CUDA graphs enabled:
python -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --served-model-name whisper-large-v3 --compilation-config '{"cudagraph_mode": "FULL"}'
Issue a transcription via examples/online_serving/openai_transcription_client.py with a valid audio clip.
Run the parallel stress script (using a directory of well-formed .wav files) to observe concurrent throughput.
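A rough sketch of what such a parallel stress run could look like against the server started above (directory name, port, and concurrency level are assumptions):

```python
# Sends concurrent transcription requests to the OpenAI-compatible server
# launched with --served-model-name whisper-large-v3.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def transcribe(path: Path) -> str:
    with path.open("rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-large-v3", file=f).text

wav_files = sorted(Path("wavs").glob("*.wav"))  # directory of well-formed .wav files
with ThreadPoolExecutor(max_workers=8) as pool:
    for text in pool.map(transcribe, wav_files):
        print(text)
```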
Test Result
Server starts, captures CUDA graphs without _get_cross_slot_mapping failures.
Single transcription request returns correct text.
Parallel run completes; invalid SciPy test WAVs trigger expected “No 'data' chunk” errors, while valid files succeed.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.