
Conversation

@JartX
Contributor

@JartX JartX commented Aug 30, 2025

This PR aims to make quantized Qwen3 MoE models (specifically their gate layers) compatible with both AutoGPTQ and AutoRound-GPTQ.

PR #23467 attempted to fix GPTQ quantization for AutoRound-GPTQ, but it prevented AutoGPTQ models from loading, as @Isotr0py pointed out in his PR #23490.

PR #23490 introduced the method _maybe_ignore_quant_config, which returns None if isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig)). This in turn breaks AutoRound-GPTQ models, since their gate layers are quantized.

This PR extends @Isotr0py's logic with an environment variable: when VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1, the gate keeps the original quant_config instead of having it replaced with None.
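
As a rough illustration of the idea, this is roughly what the env-var-gated helper could look like (a minimal sketch only; the helper is referred to as _maybe_not_quantization later in this thread, the import paths follow vLLM's layout at the time, and the approach was ultimately simplified during review):

import os

from vllm.model_executor.layers.quantization.gptq import GPTQConfig
from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinConfig


def _maybe_not_quantization(quant_config):
    # Keep the gate quantized when the user declares an AutoRound-GPTQ
    # checkpoint via the environment variable; otherwise keep the previous
    # behaviour of skipping gate quantization for plain GPTQ checkpoints.
    if os.environ.get("VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ") == "1":
        return quant_config
    if isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig)):
        return None
    return quant_config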

A fallback mechanism is also implemented: if load_weights fails because the gates were initially built with quant_config=None but the checkpoint actually contains quantized gate layers (the AutoRound-GPTQ case), the gates are re-initialized with the quantization config and loading is retried.

The fallback also logs a warning recommending VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1, so that the fallback, and the gate re-initialization it performs, can be avoided on subsequent runs.
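
A minimal sketch of the fallback described above (hypothetical structure; the actual load_weights logic in qwen3_moe.py is more involved, and rebuild_gates_with_quant_config below is a placeholder name, not a real vLLM function):

import logging

logger = logging.getLogger(__name__)


def load_weights_with_fallback(model, weights, quant_config):
    weights = list(weights)  # materialize so a retry can re-iterate
    try:
        # First attempt: gates were built with quant_config=None (AutoGPTQ case).
        return model.load_weights(iter(weights))
    except KeyError:
        # AutoRound-GPTQ checkpoints ship quantized gate weights, which the
        # unquantized gate modules cannot accept. Warn, rebuild the gates with
        # the real quant_config, and retry.
        logger.warning(
            "Weight loading failed for unquantized MoE gates; retrying with "
            "quantized gates. Set VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1 to "
            "skip this fallback.")
        rebuild_gates_with_quant_config(model, quant_config)  # hypothetical helper
        return model.load_weights(iter(weights))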

I've tested both of the following models:

Qwen/Qwen3-30B-A3B-GPTQ-Int4, quantized with AutoGPTQ: loads without the fallback.

jart25/Qwen3-Coder-30B-A3B-Instruct-Int8-gptq, quantized with AutoRound-GPTQ: loads via the fallback. The fallback is skipped when VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1, and it works correctly either way.

JartX added 5 commits August 30, 2025 17:14
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: JartX <sagformas@epdcenter.es>
… variable

Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: JartX <sagformas@epdcenter.es>
@JartX JartX requested a review from sighingnow as a code owner August 30, 2025 15:55
@mergify mergify bot added the qwen Related to Qwen models label Aug 30, 2025
@JartX
Contributor Author

JartX commented Aug 30, 2025

Hi @Isotr0py, would you be so kind as to look at the PR? Thank you very much.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for GPTQ quantization compatibility for Qwen3 MoE models, specifically to support both AutoGPTQ and Autoround-GPTQ. The changes include adding an environment variable to control gate quantization and a fallback mechanism in load_weights to handle cases where weights fail to load due to quantization mismatches. The fallback logic correctly reinitializes the MoE gate layers with the appropriate quantization configuration and retries loading. My main feedback is to make the exception handling in the fallback mechanism more specific to avoid masking unrelated errors.

Signed-off-by: JartX <sagformas@epdcenter.es>
@JartX
Contributor Author

JartX commented Aug 30, 2025

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly enhances vLLM's capability to load and utilize Qwen3 Mixture-of-Experts (MoE) models that have undergone quantization using either AutoGPTQ or AutoRound-GPTQ. It directly addresses and resolves prior compatibility issues between these two distinct quantization approaches, ensuring a more seamless and reliable integration. The changes introduce a flexible loading strategy, incorporating an environment variable for explicit control and a resilient fallback system, which collectively prevent common loading failures and improve the overall experience for users working with quantized MoE models.

Highlights

  • Compatibility Fix: Ensures robust compatibility for Qwen3 MOE models with both AutoGPTQ and AutoRound-GPTQ quantization methods, resolving previous conflicts.
  • Environment Variable Control: Introduces the VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ environment variable to explicitly control how quantization configurations are applied to MoE gate layers for AutoRound-GPTQ models.
  • Robust Fallback Mechanism: Implements a dynamic fallback system in the load_weights method. If initial weight loading fails due to issues with quantized MoE gate layers (specifically for AutoRound-GPTQ), the system re-initializes the gate layers with the correct quantization configuration and retries loading, improving resilience.
  • Code Refactoring: The _maybe_ignore_quant_config method has been renamed and refactored to _maybe_not_quantization to centralize the logic for handling quantization configurations based on the new environment variable.
Changelog
  • vllm/model_executor/models/qwen3_moe.py
    • Added import os to enable environment variable access.
    • Renamed and refactored the _maybe_ignore_quant_config method to _maybe_not_quantization. This updated method now checks for the VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ environment variable to determine if the quant_config should be applied to the gate layer, otherwise it returns None for GPTQConfig and GPTQMarlinConfig.
    • Modified the load_weights method to include a try-except KeyError block. If a KeyError occurs during weight loading (which can indicate an issue with quantized MoE gate layers), a warning is logged, and the moe_block.gate is dynamically re-initialized with the correct quant_config before the weight loading is retried.
    • Removed several redundant blank lines for improved code formatting and readability.
Activity
  • JartX requested a review from @Isotr0py.
  • JartX requested a summary from the gemini bot.
  • The gemini-code-assist[bot] provided a review comment, suggesting to narrow the except clause in load_weights from a generic KeyError to a more specific ValueError to prevent masking unrelated bugs.

@Isotr0py
Member

Hmmm, the problematic checkpoint looks quite weird... AutoRound should use AutoRoundConfig instead of GPTQConfig or GPTQMarlinConfig:

class AutoRoundConfig(QuantizationConfig):
    """Config class for AutoRound.
    Reference: https://arxiv.org/pdf/2309.05516
    """

    SUPPORTED_BITS = {2, 3, 4, 8}
    SUPPORTED_DTYPES = {"int"}
    SUPPORTED_FORMATS = {"auto_round:auto_gptq", "auto_round:auto_awq"}
    SUPPORTED_BACKENDS = {
        "auto",
        "gptq",
        "gptq:marlin",
        "awq",
        "awq:marlin",
        "marlin",
        "ipex",
    }

Seems the correct AutoRound config should look like this (https://huggingface.co/Intel/Qwen3-Coder-30B-A3B-Instruct-int4-AutoRound/blob/main/config.json#L177-L181):

    "group_size": 128,
    "nsamples": 512,
    "packing_format": "auto_round:auto_gptq",
    "quant_method": "auto-round",
    "sym": true

@JartX
Contributor Author

JartX commented Aug 30, 2025

Hi @Isotr0py, thanks for the quick response! On ROCm, the AutoRound-GPTQ method only works when auto_gptq is passed as the format flag to Intel/auto-round, so a model exported that way ends up looking like my quantized model and I don't see a way to tell them apart. There is a key in quantization_config called autoround_version; if you know how to reach it, we could key off that to distinguish the two cases. If that's okay with you, let's work on my repo and team up on it :)
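
For context, the quantization_config of such an AutoRound-exported GPTQ checkpoint looks roughly like this (values are illustrative; the relevant points are that quant_method is plain "gptq", exactly as in an AutoGPTQ export, and that only AutoRound adds the autoround_version key):

quantization_config = {
    "bits": 8,                     # illustrative
    "group_size": 128,             # illustrative
    "sym": True,                   # illustrative
    "quant_method": "gptq",        # same value as a plain AutoGPTQ export
    "autoround_version": "0.4.2",  # extra key written only by AutoRound (version illustrative)
}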

@JartX
Contributor Author

JartX commented Aug 30, 2025

@Isotr0py

Note:
With the auto_gptq flag alone, the quant_method ends up as gptq, which is the only one ROCm accepts for me.

I went around in circles a thousand times before settling on the fallback; maybe you can think of another way, because even the internal flags of AutoGPTQ and AutoRound-GPTQ quants are the same.

@Isotr0py
Member

With the auto_gptq flag alone, the quant_method ends up as gptq, which is the only one ROCm accepts for me.

Oh, I see. Let me look into how to identify GPTQ models quantized by AutoRound.

JartX added 3 commits August 30, 2025 20:34
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: JartX <sagformas@epdcenter.es>
@JartX
Contributor Author

JartX commented Aug 30, 2025

@Isotr0py check again :) please hehe


        from_autoround_gptq = False
        if hasattr(config, "quantization_config"):
            q_config = config.quantization_config
            if (isinstance(q_config, dict)
                    and q_config.get("quant_method") == "gptq"
                    and "autoround_version" in q_config):
                from_autoround_gptq = True

        gate_quant_config = quant_config if from_autoround_gptq else None

        # Load balancing settings.
        vllm_config = get_current_vllm_config()
        eplb_config = vllm_config.parallel_config.eplb_config
        self.enable_eplb = enable_eplb
        self.n_logical_experts = self.n_routed_experts
        self.n_redundant_experts = eplb_config.num_redundant_experts
        self.n_physical_experts = (self.n_logical_experts +
                                   self.n_redundant_experts)
        self.n_local_physical_experts = self.n_physical_experts // self.ep_size
        self.physical_expert_start = (self.ep_rank *
                                      self.n_local_physical_experts)
        self.physical_expert_end = (self.physical_expert_start +
                                    self.n_local_physical_experts)

        self.experts = FusedMoE(num_experts=self.n_routed_experts,
                                top_k=config.num_experts_per_tok,
                                hidden_size=config.hidden_size,
                                intermediate_size=config.moe_intermediate_size,
                                reduce_results=True,
                                renormalize=config.norm_topk_prob,
                                quant_config=quant_config,
                                prefix=f"{prefix}.experts",
                                enable_eplb=self.enable_eplb,
                                num_redundant_experts=self.n_redundant_experts)
        self.gate = ReplicatedLinear(
            config.hidden_size,
            config.num_experts,
            bias=False,
            quant_config=gate_quant_config,
            prefix=f"{prefix}.gate")

@JartX
Contributor Author

JartX commented Aug 30, 2025

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request addresses and resolves critical compatibility issues for Qwen3 Mixture-of-Experts (MoE) models when loaded with different GPTQ quantization methods, specifically AutoGPTQ and AutoRound-GPTQ. It introduces a precise way to distinguish between these quantization types, ensuring that the MoE gate layers are correctly handled for each, thereby preventing loading failures and enhancing the overall stability and usability of quantized Qwen3 MoE models within the system.

Highlights

  • Quantization Compatibility: Ensures robust compatibility for Qwen3 Mixture-of-Experts (MoE) models with both AutoGPTQ and AutoRound-GPTQ quantization methods, resolving previous conflicts by correctly handling their gate layer quantization.
  • AutoRound-GPTQ Identification: Implements a new mechanism to accurately identify AutoRound-GPTQ models by checking for the presence of an "autoround_version" key within their quantization configuration, which was previously a challenge due to shared internal flags with AutoGPTQ.
  • Conditional Gate Quantization: Modifies the MoE gate layer's quantization behavior: the gate is now quantized only if the model is specifically identified as an AutoRound-GPTQ model; otherwise, it remains unquantized, which is the expected behavior for AutoGPTQ models.
  • Code Simplification: Removes the now redundant _maybe_ignore_quant_config helper method, streamlining the quantization configuration logic for MoE gate layers.
Changelog
  • vllm/model_executor/models/qwen3_moe.py
    • Removed direct imports for GPTQConfig and GPTQMarlinConfig as their specific handling is now integrated into the new logic.
    • Introduced logic within the Qwen3MoeSparseMoeBlock initialization to detect AutoRound-GPTQ models by inspecting the quantization_config for a specific autoround_version key.
    • Modified the quant_config passed to the ReplicatedLinear gate layer, setting it to the full quant_config only if the model is identified as AutoRound-GPTQ, otherwise setting it to None.
    • Removed the _maybe_ignore_quant_config helper method, as its functionality is superseded by the new conditional gate quantization logic.
    • Applied minor whitespace and formatting adjustments throughout the file for improved readability.
Activity
  • JartX requested a review from Isotr0py.
  • JartX requested a summary from the gemini bot.
  • The gemini-code-assist[bot] provided an initial summary and suggested refining an except clause from a broad Exception to a more specific ValueError in the load_weights method.
  • Isotr0py provided feedback regarding the expected AutoRoundConfig and the difficulty in differentiating AutoRound models when they present as GPTQConfig.
  • JartX explained the constraints with ROCm requiring the auto_gptq flag and the challenge of distinguishing AutoRound-GPTQ models, suggesting the autoround_version key as a differentiator.
  • Isotr0py acknowledged the issue and indicated they would investigate how to identify AutoRound models.
  • JartX provided a code snippet demonstrating the new logic for identifying AutoRound-GPTQ models using the autoround_version key.
  • JartX requested another summary from the gemini bot.

Comment on lines 121 to 129
        from_autoround_gptq = False
        if hasattr(config, "quantization_config"):
            q_config = config.quantization_config
            if (isinstance(q_config, dict)
                    and q_config.get("quant_method") == "gptq"
                    and "autoround_version" in q_config):
                from_autoround_gptq = True

        gate_quant_config = quant_config if from_autoround_gptq else None
Member

How about adding a from_autoround_gptq attr to GPTQConfig and GPTQMarlinConfig, instead of handling it only in the Qwen3MoE model? I think other models like Qwen2.5-VL may suffer from this issue too:

        self.dynamic = dynamic
        self.weight_bits = weight_bits
        self.group_size = group_size
        self.desc_act = desc_act
        self.lm_head_quantized = lm_head_quantized
        self.pack_factor = Fraction(32, self.weight_bits)
        if self.weight_bits not in [2, 3, 4, 8]:
            raise ValueError(
                "Currently, only 2/3/4/8-bit weight quantization is "
                f"supported for GPTQ, but got {self.weight_bits} bits.")

Contributor Author

@Isotr0py Okay, I'm going to try ;)

Contributor Author

@Isotr0py I'm thinking... if we add it there, the combined auto-round / auto-gptq case may fail, because it needs quant_config=None. I went through gptq.py as well as the rest of the GPTQ path (I can't test it, since I can't run AutoRound on ROCm). There are also MoE models that AutoRound handles differently: for some it quantizes the gates, and for others it does not.

@JartX
Contributor Author

JartX commented Aug 31, 2025

@Isotr0py, I need to unblock this PR so I can keep working against main. Would you mind reviewing it, now that it has been reduced to a minor change? Once it's unblocked, we can address your last comment in another PR, which will be a much bigger change that may affect many more parts of vLLM and require model modifications. Thank you very much for your time.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@JartX
Contributor Author

JartX commented Aug 31, 2025

Thanks @Isotr0py ;)

@Isotr0py
Member

@JartX I just pushed some updates after making sure of compatibility with https://huggingface.co/Qwen/Qwen3-30B-A3B-GPTQ-Int4 and https://huggingface.co/jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq.

Let me double check the compatibility of https://huggingface.co/Intel/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound then.

@JartX
Contributor Author

JartX commented Aug 31, 2025

Thank you very much for helping and testing, @Isotr0py. I didn't want to move forward without being able to test everything properly, since I can't try AutoRound because of the ROCm limitation... and I was worried about breaking GPTQMarlin... 😅 Again, thanks for the support :)

Member

@Isotr0py Isotr0py left a comment


Have confirmed Intel/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound GPTQ model can still work:

(EngineCore_0 pid=1104048) INFO 08-31 23:53:05 [core.py:75] Initializing a V1 LLM engine (v0.10.2.dev187+g5e021b498) with config: model='/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/', speculative_config=None, tokenizer='/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gpu_model_runner.py:1926] Starting to load model /home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/...
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gpu_model_runner.py:1958] Loading model from scratch...
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [cuda.py:328] Using Flash Attention backend on V1 engine.
...
INFO 08-31 23:53:20 [llm.py:283] Supported_tasks: ['generate']
WARNING 08-31 23:53:20 [__init__.py:1671] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 490.96it/s]
Processed prompts:   0%|                                                                                                          | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_0 pid=1104048) WARNING 08-31 23:53:20 [cudagraph_dispatcher.py:102] cudagraph dispatching keys are not initialized. No cudagraph will be used.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.48it/s, est. speed input: 35.65 toks/s, output: 103.71 toks/s]
--------------------------------------------------
Prompt: 'Hello, my name is'
Generated text: ' Rik. I am 21 years old and I have been living in'
--------------------------------------------------
Prompt: 'The president of the United States is'
Generated text: ' the head of state and head of government of the United States, serving as the'
--------------------------------------------------
Prompt: 'The capital of France is'
Generated text: ' Paris. The capital of the United States is Washington, D.C. The capital'
--------------------------------------------------
Prompt: 'The future of AI is'
Generated text: ' not just about technological advancement; it is also about ethical responsibility and societal impact.'
--------------------------------------------------

@Isotr0py Isotr0py enabled auto-merge (squash) August 31, 2025 16:02
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 31, 2025
@Isotr0py Isotr0py merged commit 183a709 into vllm-project:main Sep 1, 2025
56 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
…TQ and AutoRound-GPTQ) (vllm-project#23994)

Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…TQ and AutoRound-GPTQ) (vllm-project#23994)

Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Labels

qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)
