
convert_gpu_weights.py crashes with CUDA out of memory, even with --force_cpu #1635

@defWorldBetter

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Intel(R) Xeon(R) Platinum 8461V, RTX 3090 (24 GB VRAM), 384 GB system RAM

Reproduction

🎯 Starting one-shot quantization...
2025-11-21T01:12:06.210574+0800 | reset | INFO - Compression lifecycle reset
2025-11-21T01:12:06.365277+0800 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/21-11-2025_01.12.06.log
2025-11-21T01:12:06.365710+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-11-21T01:12:16.006866+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-11-21T01:12:16.006997+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:02<00:00, 471.54it/s]
(1/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 88.18it/s]
(1/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 68.51it/s]
(2/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 87.54it/s]
(2/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:14<00:00, 72.97it/s]
(3/93): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:11<00:00, 86.87it/s]
(3/93): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:12<00:00, 84.10it/s]
(4/93): Calibrating:   0%|                                                                                                                                                                              | 0/1024 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
    outputs = forward_fn(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 5, in forward
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
    hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
    expert_output = expert(expert_input)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                                                                ^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1840, in inner
    hook_result = hook(self, args, result)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
    return hook(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 230, in calibrate_module
    self._hessians[module] = make_empty_hessian(module, device=init_device)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py", line 30, in make_empty_hessian
    return torch.zeros((num_columns, num_columns), device=device, dtype=GPTQ_PRECISION)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 65.19 MiB is free. Process 3964 has 254.00 MiB memory in use. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 22.33 GiB is allocated by PyTorch, and 607.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 376, in <module>
    main()
  File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 360, in main
    oneshot(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
    subgraph.forward(model, **inputs)
  File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

1
2
3
4 def forward(self, model_layers_2, model_rotary_emb, wrapped_5, getitem_3, getitem_1):
5 model_layers_3 = getattr(self.model.layers, "3")(model_layers_2, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1, position_embeddings = model_rotary_emb); model_layers_2 = wrapped_5 = getitem_3 = getitem_1 = model_rotary_emb = None
6 return {'model_layers_3': model_layers_3}
7
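For what it's worth, the 100.00 MiB allocation in the traceback lines up with a single GPTQ Hessian for one expert projection. A rough back-of-the-envelope check, assuming GLM-4.6 uses hidden_size = 5120 (an assumption here; please verify against the checkpoint's config.json):

```python
# Rough check of the failing allocation, assuming hidden_size = 5120 (unverified).
# make_empty_hessian allocates a (num_columns, num_columns) float32 tensor, where
# num_columns is the input dimension of the hooked Linear module.
hidden_size = 5120          # assumed input dim of each expert's gate/up projection
bytes_per_elem = 4          # GPTQ_PRECISION is float32
hessian_bytes = hidden_size * hidden_size * bytes_per_elem
print(f"{hessian_bytes / 2**20:.2f} MiB per Hessian")  # -> 100.00 MiB, matching the OOM message
```

Since each targeted Linear module (every expert's gate/up/down projection in a MoE layer) appears to get its own Hessian, these allocations seem to add up quickly on a 24 GB card even when the weights themselves live elsewhere.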

Others

command: python scripts/convert_gpu_weights.py --model_id /media/data/models/GLM-4.6/ --output_dir /models/ZhipuAI/GLM-4.6-GPTQ8 --force_cpu --trust_remote_code --max_sequence_length 1024 --num_calibration_samples 1024 --quant_type W4A16

I saw that the script contains "# Force all modules to CPU for quantization" followed by "if args.force_cpu:".
Does this mean that enabling this flag should make the quantization process use only system RAM, independent of GPU memory? Otherwise, if I had enough GPU memory to load the full weights, I wouldn't need to convert this way.
