Support compressed-tensors W4A8 MoE checkpoints in GptOssModel weight loader for CPU #29315
Conversation
Code Review
This pull request adds support for loading int4 weights for the GPT-OSS model, specifically for the compressed-tensors format. The changes include new methods to handle fused and unfused expert tensors, as well as specific logic for bias loading. While the implementation is comprehensive, I've identified a logical redundancy in the weight loading dispatch that could lead to inefficiencies and potential bugs. My review comment details this issue and suggests a fix to clarify the logic and improve correctness.
| if "mlp.experts." in name and ( | ||
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | ||
| ): |
The current logic for loading expert weights can redundantly attempt to load bias tensors. An expert bias is first processed by load_int4_bias_gptoss. If that method returns False, the execution falls through to this block, where load_unfused_expert_weight is called, which also contains logic to handle biases. This is inefficient and can lead to incorrect behavior if the generic bias loading logic in load_unfused_expert_weight is not appropriate for this model's int4 biases.
To avoid this redundancy and potential issue, the condition should be modified to explicitly exclude bias tensors from being handled by load_unfused_expert_weight.
| if "mlp.experts." in name and ( | |
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | |
| ): | |
| if "mlp.experts." in name and not name.endswith(".bias") and ( | |
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | |
| ): |
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
use_ep = self.parallel_config.enable_expert_parallel
tp_size, tp_rank = FusedMoEParallelConfig.flatten_tp_across_dp(
    tp_size=get_tensor_model_parallel_world_size(),
    dp_size=get_dp_group().world_size,
    dp_rank=get_dp_group().rank_in_group,
```
Use existing TP flatten helper in int4 weight loader
When compressed-tensors W4A8 weights are loaded, load_weights_int4 calls FusedMoEParallelConfig.flatten_tp_across_dp, but that helper does not exist (only flatten_tp_across_dp_and_pcp is defined). Hitting the W4A8 path will therefore raise an AttributeError before any weights are processed, making int4 compressed-tensors checkpoints unloadable.
- int4 gate, down and up weights are separated
- Combine the gate and up into single w13 tensors
- Also load the bias tensors
- Create a separate utility for loading int4 weights

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
CC @yewentao256
```python
group0 = (qc or {}).get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
```
We have an API to check this here:
vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py, line 382 at 70d5953:

```python
def _is_dynamic_token_w4a8_int(
```

Would it be a good idea to re-use this API?
Good suggestion, but we can't use that since it's private and tied to CompressedTensorsConfig. Here we are just reading the flags from the config file and making the decision, to keep it simpler.
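For reference, this is roughly the shape of quantization_config that the detection reads; the surrounding values (type, symmetric, dynamic) are illustrative placeholders, and only num_bits is consulted:

```python
# Illustrative config shape only; just num_bits drives the W4A8 decision.
quantization_config = {
    "config_groups": {
        "group_0": {
            "weights": {"num_bits": 4, "type": "int", "symmetric": True},
            "input_activations": {"num_bits": 8, "type": "int", "dynamic": True},
        }
    }
}

group0 = quantization_config.get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
assert is_w4a8
```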
```python
tp_rank_start = tp_rank * per_rank_intermediate_size
tp_rank_end = min((tp_rank + 1) * per_rank_intermediate_size, intermediate_size)

# W4A8 detection (int4 weights, int8 activations)
```
Generally speaking, can we apply the logic that this PR applies in process_weights_after_loading instead of needing to change the modeling file? If not, why not?
> Generally speaking, can we apply the logic that this PR applies in process_weights_after_loading instead of needing to change the modeling file? If not, why not?
No, we cannot do this in process_weights_after_loading, because we need to load the weights into the w13_weights tensors first before they can be processed. Here we receive the gate, up, and down tensors separately, then fuse them and load the correct weights.
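As a rough illustration of why the fusion has to happen at load time (shapes and the helper name are placeholders, not the PR's exact code): the gate and up shards are written into disjoint halves of the fused w13 storage as they arrive, so the fused tensor only exists once both shards have been loaded, and only then can any post-processing run on it.

```python
import torch

# Hypothetical sizes: one expert, intermediate size 4, hidden size 8.
intermediate, hidden = 4, 8
w13_weight = torch.empty(2 * intermediate, hidden)  # fused gate+up storage

def copy_unfused_shard(which: str, shard: torch.Tensor) -> None:
    # Gate fills rows [0, intermediate); up fills [intermediate, 2 * intermediate).
    row = 0 if which == "gate" else intermediate
    w13_weight[row : row + intermediate].copy_(shard)

copy_unfused_shard("gate", torch.randn(intermediate, hidden))
copy_unfused_shard("up", torch.randn(intermediate, hidden))
```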
Thanks for your PR @isharif168
```python
        loaded_params.add(name)
        return loaded_params

    def load_per_expert_unfused_w4a8(
```
it would be nice for us not to specialize these functions for specific quantization schemes.
why can't the w4a8 be an argument for this function, instead of baking it into the name/impl?
> It would be nice for us not to specialize these functions for specific quantization schemes. Why can't the w4a8 be an argument for this function, instead of baking it into the name/impl?
I think it's good to have separate weight-specific functions and then call the specific one from load_weights_other, since other weight combinations can need different loading schemes; we can add those if we have any in the future.

Yes, I will make it more descriptive.
Hi @fadara01 @nikhil-arm Thanks.
@jeejeelee / @mgoin could you please take a look at this?
|
Hi @jeejeelee @mgoin Can you please help to review the change once? Here is the summary of the change and its dependencies -> This change takes the separate (gate/up/down) weights and then combines the gate and up weights to create w13_weights since the vllm expects in that format, and then loads the weights and bias (more details in the description) Here is the sample output with int4 === Prompt 0 === Reasoning: medium Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions You are a helpful assistant. <|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant --- Candidate 0 ---
We should provide a short answer.assistantfinalHere are three reasons to use AI:
Thanks. |
Add GptOssModel.load_per_expert_unfused_w4a8 helper to handle per-expert unfused MoE weights (gate_proj, up_proj, down_proj) in W4A8 checkpoints and map them into the fused FusedMoE layout (w13_* and w2_* parameters).
• Handles .weight, .weight_scale, .bias, and .input_scale suffixes.
• For biases, manually slices and writes into the appropriate columns of w13_bias (gate vs up) and w2_bias, supporting both 1D and 2D parameter layouts and using expert_id to pick the correct expert slice when the source tensor has an extra expert dimension (a rough sketch of this slicing follows this list).
• For weights/scales, delegates to a custom weight_loader when present, falling back to default_weight_loader otherwise, and surfaces whether the mapping was successfully handled.
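A minimal sketch of that bias slicing, with made-up shapes and a hypothetical helper name; the real method also routes weights/scales through the parameter's weight_loader:

```python
import torch

# Hypothetical sizes: 2 experts, intermediate size 4, hidden size 8.
num_experts, intermediate, hidden = 2, 4, 8
w13_bias = torch.zeros(num_experts, 2 * intermediate)  # gate columns, then up columns
w2_bias = torch.zeros(num_experts, hidden)

def write_expert_bias(param: torch.Tensor, expert_id: int, shard: str, bias: torch.Tensor) -> None:
    # Write the expert's bias into the correct column range of the fused parameter.
    if shard == "gate":
        param[expert_id, :intermediate] = bias
    elif shard == "up":
        param[expert_id, intermediate:] = bias
    else:  # "down": goes straight into the corresponding w2_bias row
        param[expert_id] = bias

write_expert_bias(w13_bias, expert_id=0, shard="gate", bias=torch.ones(intermediate))
write_expert_bias(w13_bias, expert_id=0, shard="up", bias=2 * torch.ones(intermediate))
write_expert_bias(w2_bias, expert_id=1, shard="down", bias=torch.ones(hidden))
```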
Extend _load_weights_other to:
• Detect W4A8 (int4 weights, int8 activations) from self.config.quantization_config.config_groups["group_0"] and gate the new MoE path on is_w4a8.
• Precompute expert_params_mapping via FusedMoE.make_expert_params_mapping(...) for the MoE gate/up/down projections.
• For W4A8 models, first attempt load_per_expert_unfused_w4a8 for each weight tensor; if handled, mark the target param as loaded and skip the rest of the logic for that tensor (a condensed dispatch sketch follows this list). This allows loading checkpoints that store MoE weights per expert in an unfused form while still using FusedMoE fused storage internally.
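A condensed sketch of that dispatch order; load_per_expert_unfused_w4a8 is stubbed here with a placeholder mapping, while the real method lives on GptOssModel:

```python
# Stub standing in for GptOssModel.load_per_expert_unfused_w4a8: reports whether
# it handled the tensor and which fused parameter it mapped to.
def load_per_expert_unfused_w4a8(name: str, tensor) -> tuple[bool, str]:
    handled = any(p in name for p in (".gate_proj", ".up_proj", ".down_proj"))
    return handled, name.replace(".gate_proj", ".w13_weight")  # placeholder mapping

def dispatch(name: str, tensor, is_w4a8: bool, loaded_params: set[str]) -> bool:
    if is_w4a8 and "mlp.experts." in name:
        handled, target = load_per_expert_unfused_w4a8(name, tensor)
        if handled:
            loaded_params.add(target)
            return True   # skip the rest of the loading logic for this tensor
    return False          # fall through to the generic stacked/renamed path

loaded: set[str] = set()
dispatch("model.layers.0.mlp.experts.3.gate_proj.weight", None, True, loaded)
```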
Harden the generic stacking/renaming path in _load_weights_other:
• After substituting weight_name with param_name, skip if the resulting name is not in params_dict to avoid key errors on non-matching shards.
• Ensure loaded_params.add(name) is only called when a parameter is actually found and loaded (both in the stacked path and the final “other weights” path); a small illustration of this guard follows this list.
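A small illustration of that guard, with a placeholder params_dict and names (not the PR's code):

```python
# Skip shards whose renamed target is absent, and only record a parameter as
# loaded after it has actually been found.
params_dict = {"model.layers.0.mlp.experts.w13_weight": object()}
loaded_params: set[str] = set()

def try_load(name: str, weight_name: str, param_name: str) -> None:
    target = name.replace(weight_name, param_name)
    if target not in params_dict:
        return  # non-matching shard; avoids a KeyError on params_dict[target]
    # ... invoke the parameter's weight_loader here ...
    loaded_params.add(target)

try_load("model.layers.0.mlp.experts.gate_up_proj", "gate_up_proj", "w13_weight")
```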
Update GptOssForCausalLM metadata and mapping to match the new weight layout (the mapping entries are sketched after this list):
• Expand packed_modules_mapping to include:
• "qkv": ["q_proj", "k_proj", "v_proj"] for packed attention projections.
• "gate_up_proj": ["gate_proj", "up_proj"] for packed MoE gate+up projections used by compressed-tensors.
• Extend hf_to_vllm_mapper mappings:
• Map .self_attn. → .attn. (existing behavior).
• Map .qkv. → .qkv_proj. to align HF checkpoints that use a qkv naming convention with vLLM’s QKVParallelLinear.
• Map .mlp.experts.experts. → .mlp.experts. to flatten HF’s expert naming into the fused MoE layout.
• Add suffix mappings for MoE/compressed-tensors artifacts:
• Gate+up and down projection blocks/scales (MXFP4 and other formats) to w13_* and w2_*.
• Bias variants (.gate_up_proj_bias, .down_proj_bias) to w13_bias and w2_bias.
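For readability, the mapping entries listed above correspond roughly to the following; this is a sketch, and the actual code wires them through vLLM's packed_modules_mapping attribute and hf_to_vllm_mapper:

```python
# Sketch of the mappings described above; spellings follow the summary text.
packed_modules_mapping = {
    "qkv": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

hf_to_vllm_substring_map = {
    ".self_attn.": ".attn.",
    ".qkv.": ".qkv_proj.",
    ".mlp.experts.experts.": ".mlp.experts.",
}

hf_to_vllm_suffix_map = {
    ".gate_up_proj_bias": ".w13_bias",
    ".down_proj_bias": ".w2_bias",
}
```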
Purpose
To use dynamic_int4_moe for the gpt_oss model
Test Plan
Tested with a GPT-OSS model quantized with llm-compressor
Test Result
GPT-OSS model passed