Disable FlashInfer sampler by default #26859
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request correctly disables the FlashInfer sampler by default, requiring users to opt-in by setting VLLM_USE_FLASHINFER_SAMPLER=1. The change from envs.VLLM_USE_FLASHINFER_SAMPLER is not False to a direct boolean check simplifies the logic and makes the default behavior consistent. The corresponding log message is also appropriately changed from a warning to a debug message. The implementation is sound and improves code clarity.
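The difference between the two checks only shows up when the variable is unset. A minimal sketch, assuming the setting is tri-state (`None` when unset, otherwise `True` or `False`); `parse_flag`, `enabled_old`, and `enabled_new` are hypothetical names for illustration, not vLLM functions:

```python
from typing import Optional


def parse_flag(raw: Optional[str]) -> Optional[bool]:
    """Hypothetical parser: unset -> None, otherwise nonzero integer -> True."""
    if raw is None:
        return None
    return bool(int(raw))


def enabled_old(flag: Optional[bool]) -> bool:
    # Old-style check: anything except an explicit False enables the sampler,
    # so an unset variable (None) turns it on.
    return flag is not False


def enabled_new(flag: Optional[bool]) -> bool:
    # New-style check: only a truthy value enables the sampler,
    # so an unset variable (None) leaves it off.
    return bool(flag)


if __name__ == "__main__":
    for raw in (None, "0", "1"):
        flag = parse_flag(raw)
        print(raw, enabled_old(flag), enabled_new(flag))
```

Under this assumption, the two checks agree for explicit `0` and `1` and differ only for the unset case, which is what flips the default from on to off.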
Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…#26859) (#295) vllm-project/vllm#26859 Signed-off-by: mgoin <mgoin64@gmail.com>
Purpose
There have been increasing reports of correctness issues or illegal memory accesses (IMA) with FlashInfer's top-p & top-k sampling kernel (see #26480 (comment)). For instance, it appears to generate the same output even when the temperature is quite high and no seed is set. Once the kernel is disabled, vLLM generates different results, as expected.
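To make the expected behavior concrete, here is a pure-Python reference sketch of top-k and top-p sampling. It is not FlashInfer's kernel or vLLM's sampler; the function names, the filter ordering (top-k first, then top-p on the unrenormalized probabilities), and the toy logits are all assumptions chosen for illustration. With a high temperature and more than one surviving token, repeated draws should not all be identical:

```python
import math
import random


def sample_top_k_top_p(logits, temperature, top_k, top_p, rng):
    """Draw one token index after temperature scaling and top-k/top-p filtering."""
    # Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # random.choices normalizes the weights of the surviving tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def distinct_samples(n, logits, temperature, top_k, top_p, rng):
    """Return the set of distinct token indices seen over n draws."""
    return {sample_top_k_top_p(logits, temperature, top_k, top_p, rng) for _ in range(n)}


if __name__ == "__main__":
    toy_logits = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]  # illustrative values
    seen = distinct_samples(200, toy_logits, 2.0, 8, 0.95, random.Random())
    print(len(seen))
```

A sampler that returns a single token across many unseeded draws in this regime is showing the kind of symptom described above.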
Since flashinfer-python is now a default dependency of vLLM on CUDA, many more users would be using this kernel by default. Let us disable it by default for now so that users can opt in.
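Opting in then means setting `VLLM_USE_FLASHINFER_SAMPLER=1` in the environment of the vLLM process. A minimal sketch of doing so for a child process; `opt_in_env` is a hypothetical helper, and the launch command in the comment is only an example with a placeholder model name:

```python
import os
from typing import Dict, Mapping, Optional


def opt_in_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the FlashInfer sampler opted in."""
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_FLASHINFER_SAMPLER"] = "1"
    return env


if __name__ == "__main__":
    env = opt_in_env()
    print(env["VLLM_USE_FLASHINFER_SAMPLER"])
    # Example launch (not executed here; <model> is a placeholder):
    #   subprocess.run(["vllm", "serve", "<model>"], env=env)
```

Copying the environment rather than mutating `os.environ` keeps the opt-in scoped to the process being launched.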
Test Plan
Test Result