[Kernel] Better inf handling for grouped topk cu #24886
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a select subset runs. You can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review

This pull request introduces several good improvements for handling infinity and finiteness checks in the grouped topk CUDA kernel. The changes are well-motivated and correctly implemented. Using `cuda::std::isfinite` directly on half-precision types is more efficient, and the new `neg_inf<T>()` helper function improves type safety. The code is cleaner and more robust. I have one suggestion to improve the long-term maintainability of the `neg_inf` function by scoping the bug workaround to specific CUDA versions.
csrc/moe/grouped_topk_kernels.cu
Outdated
The workaround for the `cuda::std::numeric_limits` bug is noted. To improve long-term maintainability and ensure the code automatically benefits from future bug fixes in the CUDA toolkit, it's best to scope this workaround to the specific compiler versions where the bug is present using preprocessor directives. This will allow the simpler and more direct `cuda::std::numeric_limits<T>::lowest()` to be used once the underlying bug is resolved.

Additionally, using `cuda::std::numeric_limits<float>::lowest()` is more idiomatic for getting negative infinity than `-cuda::std::numeric_limits<float>::infinity()`.
```cuda
#if (defined(__CUDACC_VER_MAJOR__) && __CUDACC_VER_MAJOR__ == 12 && \
     __CUDACC_VER_MINOR__ >= 8)
// Workaround for a bug in libcu++ in CUDA 12.8+ where numeric_limits for
// fp16/bf16 types returns 0.
template <typename T>
__device__ inline T neg_inf() {
  // cuda::std::numeric_limits<T>::infinity() and lowest() return `0` for
  // [T=bf16 or fp16], so we need to cast from fp32.
  return cuda_cast<T, float>(cuda::std::numeric_limits<float>::lowest());
}
#else
// Default implementation for compilers where the bug is not present.
template <typename T>
__device__ inline T neg_inf() {
  return cuda::std::numeric_limits<T>::lowest();
}
#endif
```
Force-pushed from b267663 to afcb182 (Compare)

Signed-off-by: lumina37 <starry.qvq@gmail.com>

Force-pushed from afcb182 to b91cddc (Compare)
@jikunshang @xyang16 May I have a review for this minor change? It won't take much time.

Does this PR improve performance? If so, please provide detailed comparison results.

@mayuyuace I'll run the benchmark later, but I expect the performance difference to be negligible.
@mayuyuace The benchmark result is almost the same.

Command & Env
Run on my 3090*2 instance.

Main Branch

This PR
LGTM. thanks for contributing!
Merge 'model_register' of https://github.com/dsxsteven/vllm_splitPR (138 commits), including [Kernel] Better inf handling for grouped topk cu (vllm-project#24886), among others.
Signed-off-by: lumina37 <starry.qvq@gmail.com>
errors: /root/vllm/csrc/moe/grouped_topk_kernels.cu(536): error: namespace "cuda::sta" has no member "isfinite"
@acdart What's your nvcc version? BTW, is that "cuda::std" instead of "cuda::sta"?
@lumina37 I'm getting the following error in compilation:

Does this PR change the requirement on the nvcc version?
Related fix: #25346
I apologize for the mistake caused by my insufficient testing. The `cuda::std::isfinite` function exists in libcu++ v2.1.0, which was released with CUDA 12.2, but I'm not sure why it's still missing in CUDA 12.6. I will test on more versions next time.
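Given the build failures reported above, one way to keep older toolkits compiling is to gate the `cuda::std::isfinite` call on the toolkit version. The sketch below is illustrative only: the macro name is made up, and the minimum version shown is an assumption (the function reportedly exists from CUDA 12.2 yet failed on 12.6, so the real threshold would need to be confirmed empirically):

```cuda
#include <cuda_fp16.h>
#include <cuda/std/cmath>

// Hypothetical gate: assume cuda::std::isfinite handles __half from CUDA 12.8 on.
#if defined(__CUDACC_VER_MAJOR__) &&  \
    (__CUDACC_VER_MAJOR__ > 12 ||     \
     (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 8))
#define HAS_CUDA_STD_ISFINITE_HALF 1  // made-up macro name
#endif

__device__ inline bool is_finite_half(__half x) {
#ifdef HAS_CUDA_STD_ISFINITE_HALF
  return cuda::std::isfinite(x);       // direct check, no widening
#else
  return ::isfinite(__half2float(x));  // pre-PR fallback: widen to fp32 first
#endif
}
```

This is device-only code, so it can't be exercised on the host; the shape mirrors the version-scoping idea the reviewer suggested for `neg_inf`.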
Signed-off-by: lumina37 <starry.qvq@gmail.com>
Signed-off-by: lumina37 <starry.qvq@gmail.com> Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: lumina37 <starry.qvq@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: lumina37 <starry.qvq@gmail.com>
Purpose

- Use `cuda::std::isfinite` instead of `std::isfinite(__half2float(...))` to check the finiteness of the input without an extra type cast.
- Add `neg_inf<T>` to generate `-inf` for type `T` and simplify the code. However, there seems to be a bug in `cuda::std::numeric_limits<T>::infinity()` and `cuda::std::numeric_limits<T>::lowest()`: they always return `0.0` for [T=bf16 or fp16], so I cast `-inf` from fp32 as the correct impl. My nvcc version is cuda_12.8.r12.8/compiler.35404655_0.
- Use the `cg` namespace for `cg::reduce`.
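The first two points above can be condensed into a sketch like the following. This is illustrative, not the actual kernel: the real kernel body is omitted, and `static_cast` stands in for the PR's `cuda_cast` helper for the fp32-to-fp16 conversion:

```cuda
#include <cuda_fp16.h>
#include <cuda/std/cmath>
#include <cuda/std/limits>

// neg_inf<T>: cast -inf from fp32, since numeric_limits<T>::infinity()
// was observed to return 0 for T = fp16/bf16 on nvcc 12.8.
template <typename T>
__device__ inline T neg_inf() {
  return static_cast<T>(-cuda::std::numeric_limits<float>::infinity());
}

// Example use: mask out non-finite scores before the topk selection.
__global__ void mask_non_finite(__half* scores, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && !cuda::std::isfinite(scores[i])) {  // no __half2float round-trip
    scores[i] = neg_inf<__half>();
  }
}
```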
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist

- (Optional) Update `supported_models.md` and `examples` for a new model.