Reminder
- I have read the above rules and searched the existing issues.
System Info
Intel(R) Xeon(R) Platinum 8461V + RTX 3090 (24 GB VRAM) + 384 GB system RAM
Reproduction
🎯 Starting one-shot quantization...
2025-11-21T01:12:06.210574+0800 | reset | INFO - Compression lifecycle reset
2025-11-21T01:12:06.365277+0800 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/21-11-2025_01.12.06.log
2025-11-21T01:12:06.365710+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-11-21T01:12:16.006866+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-11-21T01:12:16.006997+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:02<00:00, 471.54it/s]
(1/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 88.18it/s]
(1/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 68.51it/s]
(2/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 87.54it/s]
(2/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 72.97it/s]
(3/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 86.87it/s]
(3/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:12<00:00, 84.10it/s]
(4/93): Calibrating: 0%| | 0/1024 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
outputs = forward_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 5, in forward
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
expert_output = expert(expert_input)
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1840, in inner
hook_result = hook(self, args, result)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
return hook(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 230, in calibrate_module
self._hessians[module] = make_empty_hessian(module, device=init_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py", line 30, in make_empty_hessian
return torch.zeros((num_columns, num_columns), device=device, dtype=GPTQ_PRECISION)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 65.19 MiB is free. Process 3964 has 254.00 MiB memory in use. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 22.33 GiB is allocated by PyTorch, and 607.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 376, in <module>
main()
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 360, in main
oneshot(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
one_shot()
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
self.apply_recipe_modifiers(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
pipeline(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
pipeline(model, dataloader, dataset_args)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
subgraph.forward(model, **inputs)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
1
2
3
4 def forward(self, model_layers_2, model_rotary_emb, wrapped_5, getitem_3, getitem_1):
5 model_layers_3 = getattr(self.model.layers, "3")(model_layers_2, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1, position_embeddings = model_rotary_emb); model_layers_2 = wrapped_5 = getitem_3 = getitem_1 = model_rotary_emb = None
6 return {'model_layers_3': model_layers_3}
7
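For context on where the 100.00 MiB figure comes from: per the traceback, GPTQModifier allocates a dense float32 Hessian of shape (num_columns, num_columns) for every module it calibrates. A back-of-envelope check, assuming GPTQ_PRECISION is torch.float32 and num_columns is 5120 (GLM-4.6's published hidden_size; both values are assumptions, not read from this environment):

```python
# Rough size of one GPTQ Hessian, assuming float32 precision and a
# 5120-column module such as an expert gate_proj (assumptions noted above).
num_columns = 5120
hessian_bytes = num_columns ** 2 * 4        # 4 bytes per float32 element
print(f"{hessian_bytes / 2**20:.2f} MiB")   # -> 100.00 MiB, matching the OOM
```

The traceback also shows layer index 3 taking the MoE path (self.moe(...)), so this appears to be the first layer where a Hessian is created for every routed expert's projections rather than for a single dense MLP; those per-module buffers accumulate on GPU 0 until the 24 GB card runs out.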
Others
command: python scripts/convert_gpu_weights.py --model_id /media/data/models/GLM-4.6/ --output_dir /models/ZhipuAI/GLM-4.6-GPTQ8 --force_cpu --trust_remote_code --max_sequence_length 1024 --num_calibration_samples 1024 --quant_type W4A16
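Two mitigations I may try before reaching for bigger hardware; this is a hedged sketch, and offload_hessians is an assumption about my installed llmcompressor version (GPTQModifier's signature should be checked before relying on it):

```python
# Hedged sketch: apply the allocator hint from the OOM message before any
# CUDA allocation happens, and (if the installed llmcompressor supports
# it) keep GPTQ Hessians in CPU RAM instead of on the GPU.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    offload_hessians=True,  # trades GPU memory for extra CPU<->GPU copies
)
```

The allocator setting only helps with fragmentation (the message reports 607.47 MiB reserved but unallocated), so offloading the Hessians seems like the change more likely to matter here.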
I saw that the script contains "# Force all modules to CPU for quantization" just above "if args.force_cpu:".
Does this mean that enabling this flag makes the quantization process use only system RAM, independent of GPU memory? Otherwise, if there were enough GPU memory to load the full weights, I wouldn't need to convert this way.
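My hedged reading of the traceback is that the answer is "partly": a force_cpu-style load keeps the full model resident in system RAM, yet llmcompressor's SequentialPipeline still onloads one layer at a time onto the GPU to run calibration, so GPU memory is still consumed by the active layer plus its Hessians. A minimal sketch of that placement (illustrative only; the actual convert_gpu_weights.py logic may differ):

```python
# Hedged sketch of a CPU-resident load, assuming a standard transformers/
# accelerate device map (hypothetical; not taken from convert_gpu_weights.py).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/media/data/models/GLM-4.6/",
    torch_dtype="auto",
    device_map={"": "cpu"},   # whole model stored in system RAM
    trust_remote_code=True,
)
# oneshot(...) then moves each sequential subgraph to CUDA for its
# calibration pass, which is where the 100 MiB Hessians piled up above.
```

So force_cpu removes the need for enough VRAM to hold the whole model, but not the need for enough VRAM for one layer's weights, activations, and Hessians, and a GLM-4.6 MoE layer can apparently still overflow a 24 GB card, which would be consistent with the failure at layer index 3.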