Support compressed-tensors W4A8 MoE checkpoints in GptOssModel weight loader for CPU #29315
Conversation
Code Review
This pull request adds support for loading int4 weights for the GPT-OSS model, specifically for the compressed-tensors format. The changes include new methods to handle fused and unfused expert tensors, as well as specific logic for bias loading. While the implementation is comprehensive, I've identified a logical redundancy in the weight loading dispatch that could lead to inefficiencies and potential bugs. My review comment details this issue and suggests a fix to clarify the logic and improve correctness.
| if "mlp.experts." in name and ( | ||
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | ||
| ): |
The current logic for loading expert weights can redundantly attempt to load bias tensors. An expert bias is first processed by load_int4_bias_gptoss. If that method returns False, the execution falls through to this block, where load_unfused_expert_weight is called, which also contains logic to handle biases. This is inefficient and can lead to incorrect behavior if the generic bias loading logic in load_unfused_expert_weight is not appropriate for this model's int4 biases.
To avoid this redundancy and potential issue, the condition should be modified to explicitly exclude bias tensors from being handled by load_unfused_expert_weight.
| if "mlp.experts." in name and ( | |
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | |
| ): | |
| if "mlp.experts." in name and not name.endswith(".bias") and ( | |
| ".gate_proj" in name or ".up_proj" in name or ".down_proj" in name | |
| ): |
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
use_ep = self.parallel_config.enable_expert_parallel
tp_size, tp_rank = FusedMoEParallelConfig.flatten_tp_across_dp(
    tp_size=get_tensor_model_parallel_world_size(),
    dp_size=get_dp_group().world_size,
    dp_rank=get_dp_group().rank_in_group,
```
Use existing TP flatten helper in int4 weight loader
When compressed-tensors W4A8 weights are loaded, load_weights_int4 calls FusedMoEParallelConfig.flatten_tp_across_dp, but that helper does not exist (only flatten_tp_across_dp_and_pcp is defined). Hitting the W4A8 path will therefore raise an AttributeError before any weights are processed, making int4 compressed-tensors checkpoints unloadable.
- int4 gate, down and up weights are separated
- Combine the gate and up into single w13 tensors
- Also load the bias tensors
- Create a separate utility for loading int4 weights

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
CC @yewentao256
```python
group0 = (qc or {}).get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
```
We have an API to check this here:
vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py, line 382 at 70d5953:

```python
def _is_dynamic_token_w4a8_int(
```

Would it be a good idea to re-use this API?
Good suggestion, but we can't use that since it's private and tied to CompressedTensorsConfig. Here we are just reading the flags from the config file and making the decision, to keep it simpler.
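For reference, this is roughly the shape of quantization_config that the detection reads; the surrounding values (type, symmetric, dynamic) are illustrative placeholders, and only num_bits is consulted:

```python
# Illustrative config shape only; just num_bits drives the W4A8 decision.
quantization_config = {
    "config_groups": {
        "group_0": {
            "weights": {"num_bits": 4, "type": "int", "symmetric": True},
            "input_activations": {"num_bits": 8, "type": "int", "dynamic": True},
        }
    }
}

group0 = quantization_config.get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
assert is_w4a8
```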
```python
tp_rank_start = tp_rank * per_rank_intermediate_size
tp_rank_end = min((tp_rank + 1) * per_rank_intermediate_size, intermediate_size)

# W4A8 detection (int4 weights, int8 activations)
```
Generally speaking, can we apply the logic that this PR applies in process_weights_after_loading instead of needing to change the modeling file? If not, why not?
> Generally speaking, can we apply the logic that this PR applies in process_weights_after_loading instead of needing to change the modeling file? If not, why not?
No, we cannot do this in process_weights_after_loading, because we need to load the weights into the w13_weights tensors first before they can be processed. Here we receive the gate, up, and down tensors separately, then fuse them and load the correct weights.
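As a rough illustration of why the fusion has to happen at load time (shapes and the helper name are placeholders, not the PR's exact code): the gate and up shards are written into disjoint halves of the fused w13 storage as they arrive, so the fused tensor only exists once both shards have been loaded, and only then can any post-processing run on it.

```python
import torch

# Hypothetical sizes: one expert, intermediate size 4, hidden size 8.
intermediate, hidden = 4, 8
w13_weight = torch.empty(2 * intermediate, hidden)  # fused gate+up storage

def copy_unfused_shard(which: str, shard: torch.Tensor) -> None:
    # Gate fills rows [0, intermediate); up fills [intermediate, 2 * intermediate).
    row = 0 if which == "gate" else intermediate
    w13_weight[row : row + intermediate].copy_(shard)

copy_unfused_shard("gate", torch.randn(intermediate, hidden))
copy_unfused_shard("up", torch.randn(intermediate, hidden))
```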
Thanks for your PR @isharif168
```python
        loaded_params.add(name)
        return loaded_params

    def load_per_expert_unfused_w4a8(
```
it would be nice for us not to specialize these functions for specific quantization schemes.
why can't the w4a8 be an argument for this function, instead of baking it into the name/impl?
> It would be nice for us not to specialize these functions for specific quantization schemes. Why can't the w4a8 be an argument for this function, instead of baking it into the name/impl?
I think it's good to have separate weight-specific functions and then call the specific one from load_weights_other, since other weight combinations can need different loading schemes; we can add those if we have any in the future.

Yes, I will make it more descriptive.
Hi @fadara01 @nikhil-arm Thanks.
@jeejeelee / @mgoin could you please take a look at this?
|
Hi @jeejeelee @mgoin Can you please help to review the change once? Here is the summary of the change and its dependencies -> This change takes the separate (gate/up/down) weights and then combines the gate and up weights to create w13_weights since the vllm expects in that format, and then loads the weights and bias (more details in the description) Here is the sample output with int4 === Prompt 0 === Reasoning: medium Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions You are a helpful assistant. <|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant --- Candidate 0 ---
We should provide a short answer.assistantfinalHere are three reasons to use AI:
Thanks. |
Add GptOssModel.load_per_expert_unfused_w4a8 helper to handle per-expert unfused MoE weights (gate_proj, up_proj, down_proj) in W4A8 checkpoints and map them into the fused FusedMoE layout (w13_* and w2_* parameters).
• Handles .weight, .weight_scale, .bias, and .input_scale suffixes.
• For biases, manually slices and writes into the appropriate columns of w13_bias (gate vs up) and w2_bias, supporting both 1D and 2D parameter layouts and using expert_id to pick the correct expert slice when the source tensor has an extra expert dimension (a rough sketch of this slicing follows this list).
• For weights/scales, delegates to a custom weight_loader when present, falling back to default_weight_loader otherwise, and surfaces whether the mapping was successfully handled.
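A minimal sketch of that bias slicing, with made-up shapes and a hypothetical helper name; the real method also routes weights/scales through the parameter's weight_loader:

```python
import torch

# Hypothetical sizes: 2 experts, intermediate size 4, hidden size 8.
num_experts, intermediate, hidden = 2, 4, 8
w13_bias = torch.zeros(num_experts, 2 * intermediate)  # gate columns, then up columns
w2_bias = torch.zeros(num_experts, hidden)

def write_expert_bias(param: torch.Tensor, expert_id: int, shard: str, bias: torch.Tensor) -> None:
    # Write the expert's bias into the correct column range of the fused parameter.
    if shard == "gate":
        param[expert_id, :intermediate] = bias
    elif shard == "up":
        param[expert_id, intermediate:] = bias
    else:  # "down": goes straight into the corresponding w2_bias row
        param[expert_id] = bias

write_expert_bias(w13_bias, expert_id=0, shard="gate", bias=torch.ones(intermediate))
write_expert_bias(w13_bias, expert_id=0, shard="up", bias=2 * torch.ones(intermediate))
write_expert_bias(w2_bias, expert_id=1, shard="down", bias=torch.ones(hidden))
```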
Extend _load_weights_other to:
• Detect W4A8 (int4 weights, int8 activations) from self.config.quantization_config.config_groups["group_0"] and gate the new MoE path on is_w4a8.
• Precompute expert_params_mapping via FusedMoE.make_expert_params_mapping(...) for the MoE gate/up/down projections.
• For W4A8 models, first attempt load_per_expert_unfused_w4a8 for each weight tensor; if handled, mark the target param as loaded and skip the rest of the logic for that tensor (a condensed dispatch sketch follows this list). This allows loading checkpoints that store MoE weights per expert in an unfused form while still using FusedMoE fused storage internally.
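A condensed sketch of that dispatch order; load_per_expert_unfused_w4a8 is stubbed here with a placeholder mapping, while the real method lives on GptOssModel:

```python
# Stub standing in for GptOssModel.load_per_expert_unfused_w4a8: reports whether
# it handled the tensor and which fused parameter it mapped to.
def load_per_expert_unfused_w4a8(name: str, tensor) -> tuple[bool, str]:
    handled = any(p in name for p in (".gate_proj", ".up_proj", ".down_proj"))
    return handled, name.replace(".gate_proj", ".w13_weight")  # placeholder mapping

def dispatch(name: str, tensor, is_w4a8: bool, loaded_params: set[str]) -> bool:
    if is_w4a8 and "mlp.experts." in name:
        handled, target = load_per_expert_unfused_w4a8(name, tensor)
        if handled:
            loaded_params.add(target)
            return True   # skip the rest of the loading logic for this tensor
    return False          # fall through to the generic stacked/renamed path

loaded: set[str] = set()
dispatch("model.layers.0.mlp.experts.3.gate_proj.weight", None, True, loaded)
```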
Harden the generic stacking/renaming path in _load_weights_other:
• After substituting weight_name with param_name, skip if the resulting name is not in params_dict to avoid key errors on non-matching shards.
• Ensure loaded_params.add(name) is only called when a parameter is actually found and loaded (both in the stacked path and the final “other weights” path); a small illustration of this guard follows this list.
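A small illustration of that guard, with a placeholder params_dict and names (not the PR's code):

```python
# Skip shards whose renamed target is absent, and only record a parameter as
# loaded after it has actually been found.
params_dict = {"model.layers.0.mlp.experts.w13_weight": object()}
loaded_params: set[str] = set()

def try_load(name: str, weight_name: str, param_name: str) -> None:
    target = name.replace(weight_name, param_name)
    if target not in params_dict:
        return  # non-matching shard; avoids a KeyError on params_dict[target]
    # ... invoke the parameter's weight_loader here ...
    loaded_params.add(target)

try_load("model.layers.0.mlp.experts.gate_up_proj", "gate_up_proj", "w13_weight")
```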
Update GptOssForCausalLM metadata and mapping to match the new weight layout (the mapping entries are sketched after this list):
• Expand packed_modules_mapping to include:
• "qkv": ["q_proj", "k_proj", "v_proj"] for packed attention projections.
• "gate_up_proj": ["gate_proj", "up_proj"] for packed MoE gate+up projections used by compressed-tensors.
• Extend hf_to_vllm_mapper mappings:
• Map .self_attn. → .attn. (existing behavior).
• Map .qkv. → .qkv_proj. to align HF checkpoints that use a qkv naming convention with vLLM’s QKVParallelLinear.
• Map .mlp.experts.experts. → .mlp.experts. to flatten HF’s expert naming into the fused MoE layout.
• Add suffix mappings for MoE/compressed-tensors artifacts:
• Gate+up and down projection blocks/scales (MXFP4 and other formats) to w13_* and w2_*.
• Bias variants (.gate_up_proj_bias, .down_proj_bias) to w13_bias and w2_bias.
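For readability, the mapping entries listed above correspond roughly to the following; this is a sketch, and the actual code wires them through vLLM's packed_modules_mapping attribute and hf_to_vllm_mapper:

```python
# Sketch of the mappings described above; spellings follow the summary text.
packed_modules_mapping = {
    "qkv": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

hf_to_vllm_substring_map = {
    ".self_attn.": ".attn.",
    ".qkv.": ".qkv_proj.",
    ".mlp.experts.experts.": ".mlp.experts.",
}

hf_to_vllm_suffix_map = {
    ".gate_up_proj_bias": ".w13_bias",
    ".down_proj_bias": ".w2_bias",
}
```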
Purpose
To use dynamic_int4_moe for the gpt_oss model
Test Plan
Tested with a GPT-OSS model quantized with llm-compressor
Test Result
GPT-OSS model passed