[CI] Add E2E Blackwell Quantized MoE Test #25723
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request introduces end-to-end tests for quantized Mixture-of-Experts (MoE) models on Blackwell GPUs, which is a valuable addition for ensuring hardware-specific features work correctly.
My review identified two main issues:
- A critical issue in the CI configuration (.buildkite/test-pipeline.yaml), where the new test job has an incomplete and partially incorrect list of file dependencies. This could prevent the test from running when relevant code, including the test file itself, is modified.
- A high-severity issue in the new test file (tests/quantization/test_blackwell_moe.py), where the GPU capability check should be more flexible to allow running on future GPU architectures.
I have provided specific suggestions to address these points. Overall, the changes are a good step towards validating vLLM on new hardware, and with these fixes, the CI setup and tests will be more robust.
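On the capability-check point, a minimal sketch of the suggested relaxation is shown below. It assumes the test gates on torch.cuda.get_device_capability(); the helper names here are placeholders and the real code in tests/quantization/test_blackwell_moe.py may use a vLLM platform utility instead. The idea is simply to compare with >= so that architectures newer than Blackwell still run the test.

```python
# Hypothetical, illustrative gate only -- not the exact code in the test file.
import pytest
import torch


def is_blackwell_or_newer() -> bool:
    """True on compute capability 10.x (Blackwell) or any newer architecture."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10


# Marker that skips the test (rather than failing) on older GPUs.
requires_blackwell = pytest.mark.skipif(
    not is_blackwell_or_newer(),
    reason="Requires compute capability >= 10.0 (Blackwell or newer)",
)
```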
Increased the max wait time for the server to 600 seconds due to FlashInfer compilation.
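Conceptually, the wait is a readiness poll against the server's /health endpoint with a deadline; the snippet below only illustrates that loop with the new 600-second budget, assuming the requests library, and is not the actual helper the test suite uses.

```python
# Illustrative readiness poll; the real test helper may be implemented differently.
import time

import requests


def wait_for_server(base_url: str, timeout_s: float = 600.0) -> None:
    """Poll the /health endpoint until the server answers or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return  # server is up
        except requests.RequestException:
            pass  # not listening yet; keep polling
        time.sleep(5)
    raise TimeoutError(f"Server at {base_url} not ready after {timeout_s} seconds")
```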
Thanks for the work!
LGTM, thanks for the work!
Purpose
Adds a new "Blackwell Quantized MoE Test" job that is solely meant to run critical MoE models (Llama 4, Qwen, DeepSeek, GPT-OSS) that we have many ways of running on Blackwell, through various quantization backends.
It loads each model with dummy weights and only a few layers to make sure we can pass process_weights_after_loading, torch.compile, and cudagraph capture, and then serve a request.
Test Plan
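For illustration, a rough sketch of the kind of smoke test this enables is shown below. It assumes the vllm serve CLI with --load-format dummy and --hf-overrides to truncate the model to a couple of layers; the model name, port, and layer count are placeholders, and the actual test drives its model list (Llama 4, Qwen, DeepSeek, GPT-OSS variants) through its own fixtures.

```python
# Hypothetical sketch only: start a truncated, dummy-weight vLLM server and send
# one request. Model name, port, and layer count are placeholders.
import json
import subprocess
import time

import requests

MODEL = "Qwen/Qwen3-30B-A3B-FP8"  # placeholder quantized MoE checkpoint
PORT = 8000

server = subprocess.Popen([
    "vllm", "serve", MODEL,
    "--port", str(PORT),
    "--load-format", "dummy",  # skip loading real weights
    "--hf-overrides", json.dumps({"num_hidden_layers": 2}),  # keep only a few layers
    "--max-model-len", "1024",
])
try:
    # Startup exercises process_weights_after_loading, torch.compile, and cudagraph
    # capture; poll /health until the server is ready (up to ~600 s).
    for _ in range(120):
        try:
            if requests.get(f"http://localhost:{PORT}/health", timeout=5).ok:
                break
        except requests.RequestException:
            pass
        time.sleep(5)
    # Finally, make sure we can actually serve a request.
    resp = requests.post(
        f"http://localhost:{PORT}/v1/completions",
        json={"model": MODEL, "prompt": "Hello", "max_tokens": 8},
        timeout=60,
    )
    resp.raise_for_status()
finally:
    server.terminate()
```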
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.