[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) #23994
Conversation
Hi @Isotr0py, would you be so kind as to look at the PR? Thank you very much.
Code Review
This pull request introduces a fix for GPTQ quantization compatibility for Qwen3 MoE models, specifically to support both AutoGPTQ and Autoround-GPTQ. The changes include adding an environment variable to control gate quantization and a fallback mechanism in load_weights to handle cases where weights fail to load due to quantization mismatches. The fallback logic correctly reinitializes the MoE gate layers with the appropriate quantization configuration and retries loading. My main feedback is to make the exception handling in the fallback mechanism more specific to avoid masking unrelated errors.
/gemini summary
Summary of Changes

This pull request significantly enhances vLLM's capability to load and utilize Qwen3 Mixture-of-Experts (MoE) models that have undergone quantization using either AutoGPTQ or AutoRound-GPTQ. It directly addresses and resolves prior compatibility issues between these two distinct quantization approaches, ensuring a more seamless and reliable integration. The changes introduce a flexible loading strategy, incorporating an environment variable for explicit control and a resilient fallback system, which collectively prevent common loading failures and improve the overall experience for users working with quantized MoE models.
Hmmm, the problematic checkpoint looks quite weird... AutoRound should use the config in vllm/model_executor/layers/quantization/auto_round.py (lines 25 to 41 at 038e9be).
Seems the correct AutoRound config should look like this (https://huggingface.co/Intel/Qwen3-Coder-30B-A3B-Instruct-int4-AutoRound/blob/main/config.json#L177-L181):
Hi @Isotr0py, thanks for the quick response. On ROCm, the AutoRound-GPTQ method only works when quantizing with the auto_gptq flag in Intel/auto-round, so a model generated with that flag ends up looking like my quantized model, and I don't see a way to differentiate it. There is a key in the quantization_config called autoround_version; if you know how to reach it, we could branch on that to tell one case from the other. If that's okay with you, let's work on my repo and we can team up :)
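To make the distinction concrete, here is a minimal sketch of the two quantization_config shapes being discussed; the field values are assumptions based on this thread, not copied from any real checkpoint.

```python
# Illustrative only -- field values are assumptions based on this thread,
# not copied from a real checkpoint.

# GPTQ config produced by AutoGPTQ: no AutoRound markers at all.
autogptq_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": False,
}

# GPTQ config exported by AutoRound with the auto_gptq backend: same
# "quant_method", but it additionally records the AutoRound version.
autoround_gptq_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": False,
    "autoround_version": "0.x.y",  # hypothetical version string
}

def looks_like_autoround_gptq(q_config: dict) -> bool:
    """Detection rule discussed in this thread."""
    return (q_config.get("quant_method") == "gptq"
            and "autoround_version" in q_config)

assert not looks_like_autoround_gptq(autogptq_config)
assert looks_like_autoround_gptq(autoround_gptq_config)
```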
Note: I went around in circles a thousand times before settling on the fallback. Maybe you can think of another way, because even the internal flags of the AutoGPTQ and AutoRound-GPTQ quants are the same.
Oh, I see. Let me see how to identify a GPTQ model quantized by AutoRound.
@Isotr0py check again :) please hehe
/gemini summary
Summary of Changes

This pull request addresses and resolves critical compatibility issues for Qwen3 Mixture-of-Experts (MoE) models when loaded with different GPTQ quantization methods, specifically AutoGPTQ and AutoRound-GPTQ. It introduces a precise way to distinguish between these quantization types, ensuring that the MoE gate layers are correctly handled for each, thereby preventing loading failures and enhancing the overall stability and usability of quantized Qwen3 MoE models within the system.
```python
from_autoround_gptq = False
if hasattr(config, "quantization_config"):
    q_config = config.quantization_config
    if (isinstance(q_config, dict)
            and q_config.get("quant_method") == "gptq"
            and "autoround_version" in q_config):
        from_autoround_gptq = True

gate_quant_config = quant_config if from_autoround_gptq else None
```
How about adding a from_autoround_gptq attr to GPTQConfig and GPTQMarlinConfig instead of only to the Qwen3 MoE model? I think other models like Qwen2.5-VL may suffer from this issue too:
vllm/model_executor/layers/quantization/gptq.py, lines 65 to 75 at 749be00:

```python
self.dynamic = dynamic
self.weight_bits = weight_bits
self.group_size = group_size
self.desc_act = desc_act
self.lm_head_quantized = lm_head_quantized
self.pack_factor = Fraction(32, self.weight_bits)
if self.weight_bits not in [2, 3, 4, 8]:
    raise ValueError(
        "Currently, only 2/3/4/8-bit weight quantization is "
        f"supported for GPTQ, but got {self.weight_bits} bits.")
```
@Isotr0py Okay, I'm going to try ;)
@Isotr0py I'm thinking... if we add it there, the combined AutoRound + auto-gptq case itself may fail, because it needs quant_config=None. I went through gptq.py, but I can't test it because on ROCm I can't run AutoRound, only GPTQ. There are also different MoE models that AutoRound handles in different ways: in some it quantizes the gates and in others it does not.
@Isotr0py I need to unblock this PR so I can keep working against main. Would you mind approving it, if you're so kind, now that it has been reduced to a minor change? Once it's unblocked we can work on your last comment in another PR, which will be a much bigger change that possibly affects many more parts of vLLM and requires model modifications. Thank you very much for your time.
Thanks @Isotr0py ;)
@JartX I just pushed some updates after making sure of the compatibility with https://huggingface.co/Qwen/Qwen3-30B-A3B-GPTQ-Int4 and https://huggingface.co/jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq. Let me double check the compatibility of https://huggingface.co/Intel/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound then.
Thank you very much for helping and testing, @Isotr0py. I didn't want to move forward without being able to test everything perfectly, since I can't try AutoRound due to the ROCm limitation... or check whether I break GPTQMarlin... 😅 Again, thanks for the support :)
Isotr0py left a comment:
Have confirmed Intel/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound GPTQ model can still work:
(EngineCore_0 pid=1104048) INFO 08-31 23:53:05 [core.py:75] Initializing a V1 LLM engine (v0.10.2.dev187+g5e021b498) with config: model='/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/', speculative_config=None, tokenizer='/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gpu_model_runner.py:1926] Starting to load model /home/mozf/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound/...
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gpu_model_runner.py:1958] Loading model from scratch...
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1104048) INFO 08-31 23:53:06 [cuda.py:328] Using Flash Attention backend on V1 engine.
...
INFO 08-31 23:53:20 [llm.py:283] Supported_tasks: ['generate']
WARNING 08-31 23:53:20 [__init__.py:1671] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 490.96it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_0 pid=1104048) WARNING 08-31 23:53:20 [cudagraph_dispatcher.py:102] cudagraph dispatching keys are not initialized. No cudagraph will be used.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.48it/s, est. speed input: 35.65 toks/s, output: 103.71 toks/s]
--------------------------------------------------
Prompt: 'Hello, my name is'
Generated text: ' Rik. I am 21 years old and I have been living in'
--------------------------------------------------
Prompt: 'The president of the United States is'
Generated text: ' the head of state and head of government of the United States, serving as the'
--------------------------------------------------
Prompt: 'The capital of France is'
Generated text: ' Paris. The capital of the United States is Washington, D.C. The capital'
--------------------------------------------------
Prompt: 'The future of AI is'
Generated text: ' not just about technological advancement; it is also about ethical responsibility and societal impact.'
--------------------------------------------------
This PR makes quantized Qwen3 MoE model loading compatible with both AutoGPTQ and AutoRound-GPTQ.

PR #23467 attempted to fix GPTQ quantization for AutoRound-GPTQ, but it prevented AutoGPTQ models from loading, as @Isotr0py pointed out in his PR #23490.

PR #23490 introduced the method _maybe_ignore_quant_config, which returns None if isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig)). This in turn breaks AutoRound-GPTQ models, since their gate layers are quantized.

This PR extends @Isotr0py's logic and adds an environment variable that keeps the gate's quantization config (quant_config = quant_config) when VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1.

A fallback system is also implemented: if load_weights fails because the gates were initially configured with quant_config = None but the checkpoint contains quantized gate layers (AutoRound-GPTQ), the gates are reinitialized with the quantization config and loading is retried. The fallback also logs a warning recommending VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1, so the fallback and the gate reinitialization can be avoided up front.
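For readers skimming the description, a hedged sketch of the loading strategy described above; the helper names (_init_moe_gates, _load_all_weights) and the broad exception handling are illustrative placeholders, not the merged implementation (the review above suggests catching a more specific exception).

```python
import logging
import os

logger = logging.getLogger(__name__)

def load_qwen3_moe_weights(model, weights, quant_config):
    # Sketch only: _init_moe_gates/_load_all_weights are hypothetical stand-ins
    # for the model's real gate construction and load_weights paths, and the
    # weights iterable is assumed to be re-creatable for the retry.
    force_quantized_gates = (
        os.environ.get("VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ") == "1")

    # AutoGPTQ checkpoints ship unquantized gates, so start with
    # quant_config=None unless the user forces AutoRound-GPTQ behaviour.
    gate_quant_config = quant_config if force_quantized_gates else None
    model._init_moe_gates(gate_quant_config)

    try:
        return model._load_all_weights(weights)
    except Exception:  # a real implementation should catch a narrower error
        if force_quantized_gates:
            raise
        # Fallback: the checkpoint apparently contains quantized gate tensors
        # (AutoRound-GPTQ), so reinitialize the gates with the quantization
        # config and retry once.
        logger.warning(
            "Weight loading failed with unquantized MoE gates; retrying with "
            "quantized gates. Set VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1 to "
            "skip this fallback.")
        model._init_moe_gates(quant_config)
        return model._load_all_weights(weights)
```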
I've tested both of the following models:

- Qwen/Qwen3-30B-A3B-GPTQ-Int4, quantized with AutoGPTQ: loads without the fallback.
- jart25/Qwen3-Coder-30B-A3B-Instruct-Int8-gptq, quantized with AutoRound-GPTQ: loads with the fallback. It does not fall back if VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ=1, and it works correctly (see the usage sketch below).
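As a usage note, a minimal offline-inference sketch assuming the environment variable described above is honoured when set before the engine is constructed; the checkpoint name is the AutoRound-GPTQ model listed above.

```python
import os

# Assumption: the flag must be set before the engine is constructed so the
# MoE gates are built with the quantization config from the start.
os.environ["VLLM_QUANTIZATION_FROM_AUTOROUND_GPTQ"] = "1"

from vllm import LLM, SamplingParams

# AutoRound-GPTQ checkpoint mentioned in this PR description.
llm = LLM(model="jart25/Qwen3-Coder-30B-A3B-Instruct-Int8-gptq")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```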