[Misc][qwen2_5_vl][torch.compile] Enable supports_torch_compile on generic nn.Module and demonstrate speedup on Qwen Vision model
#23207
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Sorry, I'm not familiar with this area: why do the mm components not have vllm_config set?
This modification may cause recompilation at runtime, which appears to cause timeouts in some scenarios. 🤔
I am honestly not too sure; an alternative would be to forward the vllm_config to these subclasses. That would be a less flexible solution (it requires any nn.Module we want to compile to accept vllm_config in its init), but it is doable.
This is a valid concern, and I will stress test it to check. If there is some easy way to trigger recompiles, can you share examples or a script that would trigger them? In the meanwhile I will continue to test and try to trigger recompiles on my own. Thanks!
From Slack discussion:
Will try running these soon!
@tanruixiang what do you mean by this? The way the vLLM-compile integration works, vLLM deletes the guards that TorchDynamo produces, so there will never be recompilations during inference serving time. |
@zou3519 Thanks for pointing this out. I also saw the Slack discussion. Let's wait for the benchmarks first.
@tanruixiang @zou3519 @ProExpertProg @ywang96 I've updated the PR with some benchmark results for more context on our discussion.
@Lucaskabela do we have a bug for future steps of this effort?
@miladm currently just waiting on feedback here
Hey @Lucaskabela! I actually had a chat with @youkaichao and the conclusion is that this current state is probably not good enough for us to merge this PR.
Generally speaking, while it's feasible for us to add torch.compile support for the encoder on a model-by-model basis (we're already doing model-by-model encoder DP support), there should be clear evidence that this brings a performance benefit, since it adds additional startup time. WDYT?
Hi @ywang96, I think that is a very valid point. I will work on applying this in a few places to show strong benefits.
This pull request has merge conflicts that must be resolved before it can be merged.
output, _ = self.proj(context_layer)
return output
@torch.library.custom_op("mylib::attn_executor", mutates_args=())
NOTE: We push all the compile-breaking things here (such as .item()/.tolist()). This way, we can register this as a custom op and compile around it.
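A minimal sketch of that pattern is below. The tensor layout, the mylib namespace, and the fake-impl shape are assumptions for illustration, not the PR's exact code; the point is only that the data-dependent work lives inside a custom op that torch.compile treats as opaque.

```python
import torch
import torch.nn.functional as F


# Sketch: hide the graph-breaking work (.tolist() over cu_seqlens) inside a
# custom op so torch.compile traces a single opaque call instead of breaking
# on the data-dependent loop. Assumed layout: q/k/v are
# (num_heads, total_seq_len, head_dim); cu_seqlens holds cumulative sequence
# boundaries of the packed batch.
@torch.library.custom_op("mylib::attn_executor", mutates_args=())
def attn_executor(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  cu_seqlens: torch.Tensor) -> torch.Tensor:
    bounds = cu_seqlens.tolist()  # would force a graph break if traced directly
    outputs = [
        F.scaled_dot_product_attention(q[:, s:e], k[:, s:e], v[:, s:e])
        for s, e in zip(bounds[:-1], bounds[1:])
    ]
    return torch.cat(outputs, dim=1)


# A fake (meta) implementation so the compiler can reason about output shapes.
@attn_executor.register_fake
def _(q, k, v, cu_seqlens):
    return torch.empty_like(q)
```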
Hmm, why are we not just allowing unbacked ints in the graph instead? Have we tried?
Does #27705 fix the issue?
Could you provide a repro script?
@DarkLight1337 no; I only reverted #27705 to avoid problems with this PR (#23207): VLLM_ATTENTION_BACKEND=TORCH_SDPA lm_eval
@ZJY0516 You can also try sending an image, for example this one: https://ibb.co/XZVsT7dB. It is this PR (#23207) that has broken my inference, not @lgeiger's.
@JartX what hardware are you testing this on?
@lgeiger My docker run command: docker run -d --name vllm1 --tty --restart unless-stopped --shm-size 48gb -p 80:8000 --device /dev/kfd --device /dev/dri --group-add video --ipc host --network host --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v ./chat_vl.jinja:/chat-template-tools.jinja -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e VLLM_USE_V1=1 -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 -e MIOPEN_USER_DB_PATH=/apps/miopen-cache -e MIOPEN_CUSTOM_CACHE_DIR=/apps/miopen-cache -e OMP_NUM_THREADS=1 -e VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm-rocm-251029_broken_test_1 vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct --gpu-memory-utilization 0.98 --max_model_len 65536 -tp 4 --served-model-name QWEN3 --port 80 --limit-mm-per-prompt '{"image":6,"video":0}' --dtype float16 --enable-log-requests --chat-template /chat-template-tools.jinja
cc @tjtanaa for AMD stuff
@JartX Maybe we need to add back the rocm-only
Hi @lgeiger, could you please add the .contiguous() call on the torch SDPA path in your PR? https://github.com/vllm-project/vllm/pull/27741/files
I will be debugging on MI300X.
@tjtanaa @DarkLight1337 @lgeiger I'm going to open a quick PR adding an if current_platform.is_rocm(): guard. Thanks @lgeiger!
@JartX go ahead and open that bugfix PR. For the other paths on ROCm, we still need to do some fixing.
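For readers following along, here is a rough sketch of the guard being discussed, with an assumed helper name and SDPA-style q/k/v inputs; the actual fix lives in the follow-up PR.

```python
import torch
import torch.nn.functional as F

from vllm.platforms import current_platform


def sdpa_rocm_safe(q: torch.Tensor, k: torch.Tensor,
                   v: torch.Tensor) -> torch.Tensor:
    # On ROCm, non-contiguous q/k/v views were reported to break inference on
    # this path, so make them contiguous before calling SDPA; other platforms
    # keep the existing behavior.
    if current_platform.is_rocm():
        q, k, v = q.contiguous(), k.contiguous(), v.contiguous()
    return F.scaled_dot_product_attention(q, k, v)
```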
…generic nn.Module and demonstrate speedup on Qwen Vision model (vllm-project#23207) Signed-off-by: Lucas Kabela <lucaskabela@meta.com> Signed-off-by: Lucas Kabela <lucasakabela@gmail.com>
### What this PR does / why we need it?
Adapt the vllm-ascend main branch to vLLM releases/v0.11.1:
- fix `forward context not set` in test_vlm.py, caused by vllm-project/vllm#23207
- fix failed import of `cdiv round`, caused by vllm-project/vllm#27188
- fix failed import of `init_cached_hf_modules`, caused by vllm-project/vllm#27567
- adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel`, caused by vllm-project/vllm#27654
  - remove unused code in sigmoid_gating.py: `class FusedRecurrentFunction`, `fused_recurrent_gated_delta_rule`, `fused_recurrent_gated_delta_rule_fwd`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
We enable @supports_torch_compile on generic nn.Modules (as opposed to only top-level model architectures) and demonstrate the application in qwen_2_5_vl.

Purpose
This PR is a first step towards supporting torch.compile for multimodal encoders (such as vision or audio). Since these modality-specific components do not have a vLLM config, we begin by allowing these modules to be compiled.
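For intuition, here is a minimal stand-alone sketch of the idea: a modality-specific submodule, rather than the whole top-level model, gets compiled. Inside vLLM the @supports_torch_compile decorator arranges this; the plain torch.compile call and the toy block below are simplified stand-ins, not this PR's actual code.

```python
import torch
import torch.nn as nn


class VisionBlockSketch(nn.Module):
    """Toy stand-in for a vision encoder block (norm + MLP residual)."""

    def __init__(self, dim: int = 1280):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


# What @supports_torch_compile arranges for a decorated submodule inside vLLM,
# shown here with plain torch.compile purely for illustration.
compiled_block = torch.compile(VisionBlockSketch(), dynamic=True)
out = compiled_block(torch.randn(4, 16, 1280))
```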
Test Plan
Unit Test
E2E Tests
Test Result
BEFORE CHANGES:
AFTER CHANGES:
Benchmarks
Main + Torch (nightly)
This PR + Torch (nightly)
So with this PR, we observe moderate improvements to several key metrics: roughly 5% relative improvement in throughput, 2% in TTFT, 7% in TPOT, and 7% in mean ITL (median ITL shows only a minor improvement).
This requires only a minor rewrite (moving attention into a custom operator) in order to torch.compile the main vision backbone (the VisionBlock).
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.