[BugFix] Fix logits repetition penalty cuda check #22592
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request addresses a bug in the apply_repetition_penalties function that caused a runtime error when sampling on the CPU of a system that also has GPUs. The original implementation used current_platform.is_cuda() to dispatch to a CUDA-specific kernel, which failed when the logits tensor was on the CPU. The fix replaces the platform check with logits.is_cuda, so the dispatch depends on the tensor's device rather than on the host platform. This is the correct approach and resolves the reported issue. The change is well targeted, and I see no further issues.
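For context, a minimal sketch of the dispatch in question. The helper names and the pure-PyTorch fallback are invented for this example, and the real code in vLLM may differ in detail:

```python
import torch


def _repetition_penalties_torch(logits: torch.Tensor,
                                prompt_mask: torch.Tensor,
                                output_mask: torch.Tensor,
                                repetition_penalties: torch.Tensor) -> None:
    # Plain-PyTorch fallback: penalize tokens seen in the prompt or the output.
    seen = prompt_mask | output_mask
    pen = torch.where(seen, repetition_penalties.unsqueeze(1),
                      torch.ones_like(logits))
    # Positive logits are divided by the penalty, non-positive ones multiplied.
    logits[:] = torch.where(logits > 0, logits / pen, logits * pen)


def _repetition_penalties_cuda(logits, prompt_mask, output_mask,
                               repetition_penalties) -> None:
    # Stand-in for vLLM's custom CUDA kernel, which is not reproduced here.
    raise NotImplementedError


def apply_repetition_penalties(logits: torch.Tensor,
                               prompt_mask: torch.Tensor,
                               output_mask: torch.Tensor,
                               repetition_penalties: torch.Tensor) -> None:
    # Before the fix, the dispatch was keyed on the platform, roughly:
    #     if current_platform.is_cuda() and logits.is_contiguous():
    # which selects the CUDA kernel even when `logits` lives on the CPU of a
    # machine that also has GPUs. Keying on the tensor itself avoids that.
    if logits.is_cuda and logits.is_contiguous():
        _repetition_penalties_cuda(logits, prompt_mask, output_mask,
                                   repetition_penalties)
    else:
        _repetition_penalties_torch(logits, prompt_mask, output_mask,
                                    repetition_penalties)
```

The PR itself is a one-line change to that condition; both branches are otherwise untouched.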
njhill
left a comment
LGTM thanks
Thanks, next time please sign off your commits according to https://docs.vllm.ai/en/latest/contributing/index.html#dco-and-signed-off-by
Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
…codebase changed (#2560)

### What this PR does / why we need it?
The merge of the upstream vllm-project/vllm#22592 caused a vllm-ascend LoRA inference bug. According to [torch_npu/npu/_stream_check.py](https://github.com/Ascend/pytorch/blob/863b9071cbdf47023c12c246e3efa9c6e2285fc6/torch_npu/npu/_stream_check.py#L74), NPU tensors report both is_cuda=True and is_npu=True. As a result, vLLM's apply_repetition_penalties function takes the "if logits.is_cuda and logits.is_contiguous()" branch and calls the custom op implemented in CUDA, which is not compatible with NPU.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@fe8d7b6

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
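As the Ascend commit message above explains, the tensor-attribute check is still ambiguous on NPU, because torch_npu makes NPU tensors report is_cuda=True. Below is a minimal sketch of a stricter guard based on the device type; it is illustrative only, assumes NPU tensors report device.type == "npu", and is not the actual vllm-ascend fix:

```python
import torch


def use_cuda_repetition_kernel(logits: torch.Tensor) -> bool:
    # On Ascend, torch_npu patches `Tensor.is_cuda`, so an NPU tensor can
    # report is_cuda=True. Comparing the device type string avoids that trap:
    # real CUDA tensors report "cuda", NPU tensors report "npu".
    return logits.device.type == "cuda" and logits.is_contiguous()
```

Whether vllm-ascend routes around the check this way or overrides the op entirely is up to that repository; the sketch only shows why the is_cuda attribute alone is not a reliable signal there.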
Essential Elements of an Effective PR Description Checklist
One-line bug fix for sampling on CPU, on systems with GPUs, as reported in #22591. It really is a simple fix.
Fixes #22591
Purpose
Test Plan
Test Result
(Optional) Documentation Update
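To illustrate the case from #22591, here is a toy driver for the apply_repetition_penalties sketch shown under the code review above (it reuses that sketch rather than vLLM's real function, and the shapes and values are made up):

```python
import torch

# Two sequences over a toy vocabulary of six tokens, all on the CPU even
# though the machine may also have GPUs; this is the case the bug broke.
logits = torch.randn(2, 6)
prompt_mask = torch.zeros(2, 6, dtype=torch.bool)
output_mask = torch.zeros(2, 6, dtype=torch.bool)
prompt_mask[0, 1] = True   # token 1 appeared in sequence 0's prompt
output_mask[1, 4] = True   # token 4 appeared in sequence 1's output
repetition_penalties = torch.tensor([1.2, 1.5])

# With the fix, CPU tensors take the PyTorch fallback instead of erroring
# out in the CUDA-only path.
apply_repetition_penalties(logits, prompt_mask, output_mask,
                           repetition_penalties)
print(logits)
```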