
convert_gpu_weights.py crashes with CUDA out of memory, even with --force_cpu #1635

@defWorldBetter

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Intel(R) Xeon(R) Platinum 8461V, RTX 3090 (24 GB VRAM), 384 GB system RAM

Reproduction

🎯 Starting one-shot quantization...
2025-11-21T01:12:06.210574+0800 | reset | INFO - Compression lifecycle reset
2025-11-21T01:12:06.365277+0800 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/21-11-2025_01.12.06.log
2025-11-21T01:12:06.365710+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-11-21T01:12:16.006866+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-11-21T01:12:16.006997+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:02<00:00, 471.54it/s]
(1/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 88.18it/s]
(1/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 68.51it/s]
(2/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 87.54it/s]
(2/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 72.97it/s]
(3/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 86.87it/s]
(3/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:12<00:00, 84.10it/s]
(4/93): Calibrating:   0%|                                                                                                                                                                              | 0/1024 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
    outputs = forward_fn(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 5, in forward
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
    hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
    expert_output = expert(expert_input)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                                                                ^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1840, in inner
    hook_result = hook(self, args, result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
    return hook(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 230, in calibrate_module
    self._hessians[module] = make_empty_hessian(module, device=init_device)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py", line 30, in make_empty_hessian
    return torch.zeros((num_columns, num_columns), device=device, dtype=GPTQ_PRECISION)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 65.19 MiB is free. Process 3964 has 254.00 MiB memory in use. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 22.33 GiB is allocated by PyTorch, and 607.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 376, in <module>
    main()
  File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 360, in main
    oneshot(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
    subgraph.forward(model, **inputs)
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

1
2
3
4 def forward(self, model_layers_2, model_rotary_emb, wrapped_5, getitem_3, getitem_1):
5 model_layers_3 = getattr(self.model.layers, "3")(model_layers_2, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1, position_embeddings = model_rotary_emb); model_layers_2 = wrapped_5 = getitem_3 = getitem_1 = model_rotary_emb = None
6 return {'model_layers_3': model_layers_3}
7
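For what it's worth, the 100.00 MiB allocation in the traceback lines up with a single GPTQ Hessian for one expert projection. A rough back-of-the-envelope check, assuming GLM-4.6 uses hidden_size = 5120 (an assumption here; please verify against the checkpoint's config.json):

```python
# Rough check of the failing allocation, assuming hidden_size = 5120 (unverified).
# make_empty_hessian allocates a (num_columns, num_columns) float32 tensor, where
# num_columns is the input dimension of the hooked Linear module.
hidden_size = 5120          # assumed input dim of each expert's gate/up projection
bytes_per_elem = 4          # GPTQ_PRECISION is float32
hessian_bytes = hidden_size * hidden_size * bytes_per_elem
print(f"{hessian_bytes / 2**20:.2f} MiB per Hessian")  # -> 100.00 MiB, matching the OOM message
```

Since each targeted Linear module (every expert's gate/up/down projection in a MoE layer) appears to get its own Hessian, these allocations seem to add up quickly on a 24 GB card even when the weights themselves live elsewhere.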

Others

command: python scripts/convert_gpu_weights.py --model_id /media/data/models/GLM-4.6/ --output_dir /models/ZhipuAI/GLM-4.6-GPTQ8 --force_cpu --trust_remote_code --max_sequence_length 1024 --num_calibration_samples 1024 --quant_type W4A16

I saw that the script contains "# Force all modules to CPU for quantization" followed by "if args.force_cpu:".
Does this mean that enabling this flag should make the quantization process use only system RAM, independent of GPU memory? Otherwise, if I had enough GPU memory to load the full weights, I wouldn't need to convert this way.
