
Conversation

@jeejeelee
Collaborator

While assessing the effectiveness of the RMSNorm operator, I observed that executing it on a non-zero GPU device resulted in a 'RuntimeError: CUDA error: an illegal memory access was encountered.'
Further debugging showed that the cause is the absence of device guards; most CUDA kernels have the same issue.
I have addressed this by adding device guards to all kernels. Additionally, I have augmented the kernel tests to include the device id, as in the provided test_activation.
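The fix can be sketched as follows. This is an illustrative fragment, not the exact vLLM code: the function signature and the elided kernel launch are assumptions, while the `OptionalCUDAGuard` pattern is the standard PyTorch C++ extension API for pinning the current device.

```cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical launcher. Without the guard, the kernel is launched on the
// current device (usually 0) while `input`/`out` may live on another GPU,
// which manifests as an illegal memory access.
void rms_norm(torch::Tensor& out, torch::Tensor& input,
              torch::Tensor& weight, float epsilon) {
  // Switch the current CUDA device to the one holding the tensors
  // for the duration of this scope.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // ... launch rms_norm_kernel<<<grid, block, 0, stream>>>(...) ...
}
```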

@jeejeelee
Collaborator Author

jeejeelee commented Dec 7, 2023

I have completed all the kernel tests on A800 GPUs (x2), and all kernels pass correctly. Can you review this PR? @WoosukKwon

@WoosukKwon WoosukKwon self-requested a review December 7, 2023 16:56
@WoosukKwon WoosukKwon self-assigned this Dec 7, 2023
@jeejeelee
Collaborator Author

@WoosukKwon Are there any issues with this PR?

Collaborator

@WoosukKwon WoosukKwon left a comment


Hi @jeejeelee, thanks for submitting the PR. We hadn't noticed this bug because the device id is always 0 in vLLM. However, I agree that this change would make the kernels more portable.

Comment on lines 1 to 4
@@ -1,21 +1,20 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include "dispatch_utils.h"
Collaborator


style nit:

Suggested change, replacing the duplicated include:

    #include <torch/extension.h>
    #include <ATen/cuda/CUDAContext.h>
    #include <torch/extension.h>
    #include <c10/cuda/CUDAGuard.h>
    #include "dispatch_utils.h"

with:

    #include <torch/extension.h>
    #include <ATen/cuda/CUDAContext.h>
    #include <c10/cuda/CUDAGuard.h>
    #include "dispatch_utils.h"

@jeejeelee
Collaborator Author

jeejeelee commented Dec 28, 2023

@WoosukKwon Thank you for your review. I have completed the following modifications:

  1. Reverted the C++ code formatting.
  2. Tested the kernels on devices 0 and 1.
  3. Added the missing commas to the Python code.

Please review again

Collaborator

@WoosukKwon WoosukKwon left a comment


@jeejeelee LGTM! Thanks for the PR and apologies for the delayed review. We got many PRs last month and didn't have enough bandwidth due to the holidays. 😅

@WoosukKwon WoosukKwon merged commit 77af974 into vllm-project:main Jan 3, 2024
jedibrillo pushed a commit to jedibrillo/vllm that referenced this pull request Jan 5, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@jeejeelee jeejeelee deleted the fix-kernel-bug branch December 26, 2024 01:47
jinyouzhi pushed a commit to jinyouzhi/vllm that referenced this pull request Sep 26, 2025
