[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder #25701
Conversation
Code Review
This pull request aims to disable full CUDA graphs for encoder-decoder models, such as Whisper, by forcing them into PIECEWISE mode. While the intent is correct, the implementation contains a critical logic error: the condition `self.model_config.is_encoder_decoder is not None` will always evaluate to `True`, because the `is_encoder_decoder` property returns a boolean. As written, every model would incorrectly fall back to PIECEWISE mode, potentially leading to performance regressions. I have provided a review comment with a suggested fix for this condition.
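To make the pitfall concrete, here is a minimal, runnable sketch; the `ModelConfig` stand-in below is illustrative only, not vLLM's actual class:

```python
class ModelConfig:
    """Illustrative stand-in for vLLM's ModelConfig."""

    def __init__(self, is_encoder_decoder: bool):
        # The real property derives this from the model architecture;
        # the point is that it is a bool, never None.
        self.is_encoder_decoder = is_encoder_decoder


cfg = ModelConfig(is_encoder_decoder=False)  # e.g. a decoder-only model

# Buggy check: a bool is never None, so this is True for every model,
# which would force PIECEWISE mode universally.
assert (cfg.is_encoder_decoder is not None) is True

# Fixed check: test the boolean itself, so only encoder-decoder models
# (such as Whisper) take the PIECEWISE fallback.
assert cfg.is_encoder_decoder is False
```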
Whisper does not work with full cudagraphs. That is being worked on in PR vllm-project#25208.

The failure can be reproduced reliably via `tests/models/multimodal/generation/test_whisper.py`, at least in my H100 development environment. The tests passed on the PR and I'm not sure why.

Regardless, this seems like the right change to make until vllm-project#25208 sorts out exactly what changes are needed.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
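For readers skimming the thread, here is a minimal sketch of the intended fallback behavior. `resolve_cudagraph_mode` is a hypothetical helper and the enum is an illustrative subset, not vLLM's actual API:

```python
from enum import Enum, auto


class CUDAGraphMode(Enum):
    # Illustrative subset of CUDA graph modes.
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()


def resolve_cudagraph_mode(
    requested: CUDAGraphMode, is_encoder_decoder: bool
) -> CUDAGraphMode:
    """Downgrade FULL to PIECEWISE for encoder-decoder models (e.g. Whisper),
    since full CUDA graphs are known to fail for them until #25208 lands."""
    if is_encoder_decoder and requested is CUDAGraphMode.FULL:
        return CUDAGraphMode.PIECEWISE
    return requested


assert resolve_cudagraph_mode(CUDAGraphMode.FULL, True) is CUDAGraphMode.PIECEWISE
assert resolve_cudagraph_mode(CUDAGraphMode.FULL, False) is CUDAGraphMode.FULL
```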
Force-pushed from a36468b to 3e50d7f.
Could we also report the E2E accuracy result to make sure the change is correct?
And possibly the perf numbers, to see whether there is any improvement.
We do have a test that does pretty extensive accuracy testing. It runs a big dataset and measures the word error rate of our transcription, and it will fail if the error rate goes above a certain threshold. It's in …
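For context, a rough, self-contained sketch of how such a WER-threshold check works in general; the threshold, helper names, and sample strings below are hypothetical, not taken from the actual test:

```python
WER_THRESHOLD = 0.05  # hypothetical ceiling; the real test defines its own


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into
    # the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def test_transcription_accuracy() -> None:
    # In the real test the hypothesis comes from the model's transcription
    # of a large audio dataset; a fixed pair keeps this sketch runnable.
    wer = word_error_rate("the quick brown fox", "the quick brown fox")
    assert wer <= WER_THRESHOLD


test_transcription_accuracy()
```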
Thanks @russellb, sorry for the trouble. Is that test failing on main?
No big deal. I'm just confused why it works in CI but not locally. Is there something about the older GPUs in CI that would turn off cudagraphs or something?