[Quantization] Add field to skip unquantized modules for GPTQ config #25455
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request introduces a more robust and cleaner way to handle unquantized modules in GPTQ models by adding the modules_in_block_to_quantize field to the configuration. This replaces the previous hacky approach and resolves existing issues, which is a great improvement for maintainability. The implementation correctly infers the modules to quantize from the model's checkpoint metadata when not explicitly provided. I have one suggestion regarding code duplication between GPTQConfig and GPTQMarlinConfig that would further enhance the maintainability of this new logic.
```python
def apply_vllm_mapper(self, hf_to_vllm_mapper):
    if self.modules_in_block_to_quantize is not None:
        self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
            self.modules_in_block_to_quantize)

def maybe_update_config(self,
                        model_name: str,
                        revision: Optional[str] = None):
    if self.modules_in_block_to_quantize:
        return

    unquant_dtypes = [torch.float16, torch.bfloat16, torch.float32]
    metadata = get_safetensors_params_metadata(model_name,
                                               revision=revision)
    quant_layers: set[str] = {
        param_name.rsplit(".", 1)[0]
        for param_name, info in metadata.items()
        if (dtype := info.get('dtype', None))
        and _SAFETENSORS_TO_TORCH_DTYPE[dtype] not in unquant_dtypes
    }
    self.modules_in_block_to_quantize = list(quant_layers)
```
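For illustration only, a rough sketch of what this metadata-based inference does for a checkpoint whose experts are quantized but whose MoE gate is stored in bf16 (the parameter names and dtype strings below are hypothetical, not taken from the PR):

```python
# Hypothetical illustration of the inference above; not actual vLLM code,
# and the parameter names are made up.
unquant_dtypes_str = {"F16", "BF16", "F32"}  # safetensors names for fp16/bf16/fp32

metadata = {
    "model.layers.0.mlp.experts.0.down_proj.qweight": {"dtype": "I32"},
    "model.layers.0.mlp.experts.0.down_proj.scales": {"dtype": "F16"},
    "model.layers.0.mlp.gate.weight": {"dtype": "BF16"},
}

# A module counts as quantized if it owns at least one non-float parameter,
# mirroring the set comprehension in the diff above.
quant_layers = {
    name.rsplit(".", 1)[0]
    for name, info in metadata.items()
    if info["dtype"] not in unquant_dtypes_str
}
print(quant_layers)  # {'model.layers.0.mlp.experts.0.down_proj'}
```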
The methods apply_vllm_mapper and maybe_update_config are identical to the ones in vllm/model_executor/layers/quantization/gptq.py. This code duplication can lead to maintenance issues, where a change in one file might not be propagated to the other, potentially causing bugs.
To improve maintainability, I recommend refactoring this shared logic into a common base class or a mixin. For example, you could create a GPTQConfigMixin class to house these methods, and then have both GPTQConfig and GPTQMarlinConfig inherit from it. This would ensure that the logic for handling modules_in_block_to_quantize is defined in a single place.
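A minimal sketch of the suggested mixin, assuming both config classes keep their current constructors (the mixin name and the commented-out base classes are illustrative, not part of the PR):

```python
# Hypothetical sketch of the suggested refactor; not the actual vLLM code.
from typing import Optional


class GPTQConfigMixin:
    """Shared handling of modules_in_block_to_quantize for GPTQ-style configs."""

    modules_in_block_to_quantize: Optional[list[str]] = None

    def apply_vllm_mapper(self, hf_to_vllm_mapper) -> None:
        # Remap HF module names to vLLM module names in place.
        if self.modules_in_block_to_quantize is not None:
            self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
                self.modules_in_block_to_quantize)

    def maybe_update_config(self,
                            model_name: str,
                            revision: Optional[str] = None) -> None:
        # Infer quantized modules from safetensors metadata exactly as in the
        # diff above; body omitted to keep the sketch short.
        ...


# Both concrete configs would then inherit the shared logic once, e.g.:
# class GPTQConfig(GPTQConfigMixin, QuantizationConfig): ...
# class GPTQMarlinConfig(GPTQConfigMixin, QuantizationConfig): ...
```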
Can you check and clean up other models that still contain `_maybe_ignore_quant_config`?
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Removed.
jeejeelee
left a comment
Thank you for the improvement. Perhaps we can apply this logic to other quantization methods as well.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Hi @Isotr0py, this PR breaks model loading for jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ. The model was generated using AutoRound-GPTQ, like the previous ones.
Oops, sorry for breaking this. Let me fix it. 😅
@Isotr0py Don't worry, development
@JartX I double checked the model's quantization config; the attention projection layers seem to be missing from `modules_in_block_to_quantize`:

```
"modules_in_block_to_quantize": [
    [
+++     "self_attn.q_proj",
+++     "self_attn.k_proj",
+++     "self_attn.v_proj",
+++     "self_attn.o_proj",
        "linear_attn.in_proj_qkvz",
        "linear_attn.in_proj_ba",
        "linear_attn.out_proj",
```

With the above changes in the config, the model can be loaded again.
@Isotr0py You're right, it works, and AutoRound doesn't declare them. Does that mean it used to work "by chance"? I'm going to open an issue in AutoRound.
That's right.
Purpose
- `Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4` and `Qwen/Qwen3-30B-A3B-GPTQ-Int4` have unquantized layers, but there is no config field to skip creating unquantized weights, so we currently use `_maybe_ignore_quant_config` to patch the model implementations in a hacky way.
- This PR adds `modules_in_block_to_quantize` from Transformers' `GPTQConfig` (https://huggingface.co/docs/transformers/v4.56.2/en/main_classes/quantization#transformers.GPTQConfig) and automatically infers it from checkpoint metadata, so the hacky patching is no longer needed for GPTQ models.

Test Plan
Test Result
Qwen3-MoE models, with or without the MoE gate quantized, all generate normal outputs without the GPTQ patch now.
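For reference, a minimal smoke test along these lines (not the exact commands used for this PR) would exercise the new behavior:

```python
# Hypothetical smoke test; assumes a vLLM build that includes this change.
from vllm import LLM, SamplingParams

# The unquantized MoE gate should now be skipped automatically,
# without any per-model patching.
llm = LLM(model="Qwen/Qwen3-30B-A3B-GPTQ-Int4")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```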