[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess #24332
Conversation
Pull Request Overview
This PR removes unnecessary CUDA synchronization in the GLM-4.1V model by optimizing the image and video preprocessing methods. The changes prevent the implicit CUDA syncs that occur when calling .tolist() on GPU tensors during size calculations.
- Converts the grid_thw tensor to a list once at the beginning of each method
- Replaces tensor operations with CPU-based calculations for computing split sizes
- Uses the pre-converted list for both visual processing and size calculations
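For readers unfamiliar with the pattern, here is a minimal sketch of what the bullets above describe. The function signature and the visual(...) call are simplified assumptions for illustration, not the exact vLLM code:

```python
import torch

def _process_image_input_sketch(visual, pixel_values: torch.Tensor,
                                grid_thw: torch.Tensor, merge_size: int):
    # Convert the grid tensor to a plain Python list once, before the heavy
    # visual kernel is launched.
    grid_thw_list = grid_thw.tolist()

    # Kick off the GPU-bound visual processing; CUDA launches are asynchronous,
    # so control returns to Python almost immediately.
    image_embeds = visual(pixel_values, grid_thw=grid_thw_list)

    # Split sizes are computed with native Python arithmetic on the CPU while
    # the GPU is still busy, so no .tolist() on a GPU tensor forces a sync here.
    sizes = [t * h * w // (merge_size * merge_size) for t, h, w in grid_thw_list]
    return image_embeds.split(sizes)
```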
Copilot AI · Sep 5, 2025
Creating a new tensor from the list and then converting back to list is inefficient. Consider using numpy operations or native Python calculations since grid_thw_list is already a Python list.
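A rough before/after illustration of this suggestion; the grid values are made up and the commented-out line paraphrases the flagged pattern rather than quoting the PR diff:

```python
import math

grid_thw_list = [[1, 28, 28], [2, 14, 14]]  # illustrative [t, h, w] entries
merge_size = 2

# Flagged pattern: round-trips the Python list through a tensor just to do
# integer arithmetic.
#   sizes = (torch.tensor(grid_thw_list).prod(-1) // (merge_size * merge_size)).tolist()

# Suggested alternative: stay in native Python (or numpy), since grid_thw_list
# is already a plain list.
sizes = [math.prod(thw) // (merge_size * merge_size) for thw in grid_thw_list]
print(sizes)  # [196, 98]
```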
Signed-off-by: Win <chatcharinsang@gmail.com>
Force-pushed from 3dbbbeb to 457da99
Code Review
This pull request introduces a performance optimization in vllm/model_executor/models/glm4_1v.py by removing an unnecessary CUDA synchronization. The change involves pre-computing a list from the grid_thw tensor at the beginning of the _process_image_input and _process_video_input functions. This allows the subsequent calculation of sizes to be performed on the CPU, overlapping with the main GPU-bound visual processing task and thus avoiding a synchronization point after the main kernel launch. The implementation is correct and should lead to improved performance. Additionally, the change improves robustness by using torch.long for calculations, preventing potential overflows, and enhances readability by grouping (merge_size * merge_size).
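A hedged sketch of the size computation this review describes, assuming the names from the PR description (the grid values are illustrative and the real code may differ in detail). The tensor here is created on the CPU from a Python list, so it involves no CUDA synchronization:

```python
import torch

grid_thw_list = [[16, 146, 110]]  # illustrative video grid; real values come from the processor
merge_size = 2

# Keeping the intermediate product in torch.long (int64) avoids overflow for
# large t * h * w products, and grouping (merge_size * merge_size) makes the
# per-token merge factor explicit.
sizes = (torch.tensor(grid_thw_list, dtype=torch.long).prod(-1)
         // (merge_size * merge_size)).tolist()
print(sizes)  # [64240]
```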
This PR removes unnecessary CUDA sync in _process_image_input and _process_video_input of vllm/model_executor/models/glm4_1v.py by utilising grid_thw_list.
Related PR #22792
Related issue #23884