[Quantization] Add field to skip unquantized modules for GPTQ config #25455
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request introduces a more robust and cleaner way to handle unquantized modules in GPTQ models by adding the modules_in_block_to_quantize field to the configuration. This replaces the previous hacky approach and resolves existing issues, which is a great improvement for maintainability. The implementation correctly infers the modules to quantize from the model's checkpoint metadata when not explicitly provided. I have one suggestion regarding code duplication between GPTQConfig and GPTQMarlinConfig that would further enhance the maintainability of this new logic.
```python
def apply_vllm_mapper(self, hf_to_vllm_mapper):
    if self.modules_in_block_to_quantize is not None:
        self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
            self.modules_in_block_to_quantize)

def maybe_update_config(self,
                        model_name: str,
                        revision: Optional[str] = None):
    if self.modules_in_block_to_quantize:
        return

    unquant_dtypes = [torch.float16, torch.bfloat16, torch.float32]
    metadata = get_safetensors_params_metadata(model_name,
                                               revision=revision)
    quant_layers: set[str] = {
        param_name.rsplit(".", 1)[0]
        for param_name, info in metadata.items()
        if (dtype := info.get('dtype', None))
        and _SAFETENSORS_TO_TORCH_DTYPE[dtype] not in unquant_dtypes
    }
    self.modules_in_block_to_quantize = list(quant_layers)
```
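For illustration only, a rough sketch of what this metadata-based inference does for a checkpoint whose experts are quantized but whose MoE gate is stored in bf16 (the parameter names and dtype strings below are hypothetical, not taken from the PR):

```python
# Hypothetical illustration of the inference above; not actual vLLM code,
# and the parameter names are made up.
unquant_dtypes_str = {"F16", "BF16", "F32"}  # safetensors names for fp16/bf16/fp32

metadata = {
    "model.layers.0.mlp.experts.0.down_proj.qweight": {"dtype": "I32"},
    "model.layers.0.mlp.experts.0.down_proj.scales": {"dtype": "F16"},
    "model.layers.0.mlp.gate.weight": {"dtype": "BF16"},
}

# A module counts as quantized if it owns at least one non-float parameter,
# mirroring the set comprehension in the diff above.
quant_layers = {
    name.rsplit(".", 1)[0]
    for name, info in metadata.items()
    if info["dtype"] not in unquant_dtypes_str
}
print(quant_layers)  # {'model.layers.0.mlp.experts.0.down_proj'}
```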
The methods apply_vllm_mapper and maybe_update_config are identical to the ones in vllm/model_executor/layers/quantization/gptq.py. This code duplication can lead to maintenance issues, where a change in one file might not be propagated to the other, potentially causing bugs.
To improve maintainability, I recommend refactoring this shared logic into a common base class or a mixin. For example, you could create a GPTQConfigMixin class to house these methods, and then have both GPTQConfig and GPTQMarlinConfig inherit from it. This would ensure that the logic for handling modules_in_block_to_quantize is defined in a single place.
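A minimal sketch of the suggested mixin, assuming both config classes keep their current constructors (the mixin name and the commented-out base classes are illustrative, not part of the PR):

```python
# Hypothetical sketch of the suggested refactor; not the actual vLLM code.
from typing import Optional


class GPTQConfigMixin:
    """Shared handling of modules_in_block_to_quantize for GPTQ-style configs."""

    modules_in_block_to_quantize: Optional[list[str]] = None

    def apply_vllm_mapper(self, hf_to_vllm_mapper) -> None:
        # Remap HF module names to vLLM module names in place.
        if self.modules_in_block_to_quantize is not None:
            self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
                self.modules_in_block_to_quantize)

    def maybe_update_config(self,
                            model_name: str,
                            revision: Optional[str] = None) -> None:
        # Infer quantized modules from safetensors metadata exactly as in the
        # diff above; body omitted to keep the sketch short.
        ...


# Both concrete configs would then inherit the shared logic once, e.g.:
# class GPTQConfig(GPTQConfigMixin, QuantizationConfig): ...
# class GPTQMarlinConfig(GPTQConfigMixin, QuantizationConfig): ...
```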
Can you check and clean up other models that still contain `_maybe_ignore_quant_config`?
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Removed.
jeejeelee
left a comment
Thank you for the improvement. Perhaps we can apply this logic to other quantization methods as well.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Hi @Isotr0py, this PR breaks model loading for jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ. The model was generated using AutoRound-GPTQ, like the previous ones.
Oops, sorry for breaking this. Let me fix it. 😅
@Isotr0py Don't worry, development
@JartX I double checked the model's quantization config; the attention projection layers seem to be missing from `modules_in_block_to_quantize`:

```
"modules_in_block_to_quantize": [
    [
+++     "self_attn.q_proj",
+++     "self_attn.k_proj",
+++     "self_attn.v_proj",
+++     "self_attn.o_proj",
        "linear_attn.in_proj_qkvz",
        "linear_attn.in_proj_ba",
        "linear_attn.out_proj",
```

With the above changes in the config, the model can be loaded again.
@Isotr0py You're right, it works, and AutoRound doesn't declare them. Does that mean it used to work "by chance"? I'm going to open an issue in AutoRound.
That's right.
Purpose
- `Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4` and `Qwen/Qwen3-30B-A3B-GPTQ-Int4` have unquantized layers, but there is no config field to skip creating unquantized weights, so we currently use `_maybe_ignore_quant_config` to patch the model implementations in a hacky way.
- This PR adds `modules_in_block_to_quantize` from Transformers' `GPTQConfig` (https://huggingface.co/docs/transformers/v4.56.2/en/main_classes/quantization#transformers.GPTQConfig) and automatically infers it from checkpoint metadata, so the hacky patching is no longer needed for GPTQ models.

Test Plan
Test Result
Qwen3-MoE models, with or without the MoE gate quantized, all generate normal outputs without the GPTQ patch now.
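For reference, a minimal smoke test along these lines (not the exact commands used for this PR) would exercise the new behavior:

```python
# Hypothetical smoke test; assumes a vLLM build that includes this change.
from vllm import LLM, SamplingParams

# The unquantized MoE gate should now be skipped automatically,
# without any per-model patching.
llm = LLM(model="Qwen/Qwen3-30B-A3B-GPTQ-Int4")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```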