Fix MXFP4 quantizer to support variable num_local_experts and hidden_size #41795
base: main
Conversation
The quantizer hardcoded 32 experts and 2880 hidden_size in the reshape operations. This caused failures when quantizing models with different numbers of experts (e.g., averaged single-expert models).

Changes:
- Read num_local_experts and hidden_size from model.config
- Use dynamic values in reshape operations instead of hardcoded constants
- Defaults to 32 and 2880 for backward compatibility

This enables quantizing averaged/merged MoE models with fewer experts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
cc @MekkCyber for quantization
run-slow: mxfp4
This comment contains run-slow, running the specified jobs: models: []
Sounds good, thanks!
[For maintainers] Suggested jobs to run (before merge): run-slow: mxfp4
run-slow: mxfp4
This comment contains run-slow, running the specified jobs: models: []
Regarding the tests:
What does this PR do?
This PR replaces the hardcoded values for num_local_experts and hidden_size in MXFP4Config for GPT-OSS-type models. I discovered this when experimenting with non-standard configs of the GPT-OSS architecture, but I'm pretty sure it'll break for openai/gpt-oss-120b as well, since its number of experts differs from the hardcoded value.
The quantizer hardcoded 32 experts and 2880 hidden_size in the reshape operations. This caused failures when quantizing models with different numbers of experts.
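For context, here is a minimal sketch of the kind of call that previously failed. The checkpoint path is a placeholder and the loading arguments are assumptions for illustration, not part of this PR:

```python
from transformers import AutoModelForCausalLM, Mxfp4Config

# Placeholder: a GPT-OSS-style checkpoint whose experts were averaged/merged,
# so its config has num_local_experts != 32.
model_id = "path/to/averaged-gpt-oss"

# Quantize on load with MXFP4. Before this fix, the quantizer's reshape used
# the hardcoded (32 experts, 2880 hidden_size) layout and failed here.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=Mxfp4Config(),
    device_map="auto",
)
```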
Changes:
- Read num_local_experts and hidden_size from model.config
- Use dynamic values in the reshape operations instead of hardcoded constants
- Default to 32 and 2880 for backward compatibility

This enables quantizing averaged/merged MoE models with fewer experts.
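To make the change concrete, here is a hypothetical sketch of the pattern; reshape_expert_blocks and its exact shape arguments are illustrative, not the quantizer's actual code:

```python
import torch

def reshape_expert_blocks(blocks: torch.Tensor, config) -> torch.Tensor:
    """Illustrative only: reshape a packed expert weight tensor using
    config-driven dimensions instead of a hardcoded GPT-OSS-20B layout."""
    # Read from the model config, falling back to the previous constants
    # (32 experts, 2880 hidden_size) for backward compatibility.
    num_local_experts = getattr(config, "num_local_experts", 32)
    hidden_size = getattr(config, "hidden_size", 2880)
    # Before the fix this was effectively blocks.reshape(32, -1, 2880).
    return blocks.reshape(num_local_experts, -1, hidden_size)
```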
Passed all tests that I was able to run locally on 24 GB of VRAM.
Before submitting
- Did you read the contributor guideline, Pull Request section? - yes
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. - I looked and didn't find an issue
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. - likely not necessary
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.