
Conversation

@ovowei (Contributor) commented Nov 24, 2025

Fixes #1635

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ovowei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request directly addresses Out-Of-Memory (OOM) issues that can occur during the GPU weight conversion process, particularly when performing GPTQ quantization. It introduces robust memory management features, allowing users to define specific GPU and CPU memory limits, and provides clear instructions for optimizing these settings. The changes also ensure that the model loading process avoids unsupported disk offloading, thereby improving the stability and reliability of the weight conversion script.

Highlights

  • Memory Management for GPU Weight Conversion: Introduces new command-line arguments, --max_gpu_memory and --max_cpu_memory, to the convert_gpu_weights.py script, enabling users to precisely control memory allocation during GPTQ quantization and mitigate Out-Of-Memory (OOM) errors (see the sketch after this list).
  • Enhanced Documentation: Adds a comprehensive 'Memory Management' section to the README.md, providing clear guidance on using the new memory parameters, recommended settings for various GPU VRAM configurations, and practical troubleshooting tips for OOM issues.
  • Prevention of Disk Offloading: Implements explicit checks and configurations within convert_gpu_weights.py to detect and prevent unsupported disk offloading during device map inference and model loading, ensuring that models are loaded entirely into available GPU/CPU memory as required by llmcompressor.
  • Improved Argument Parsing Readability: Refactors the argument parsing in convert_gpu_weights.py to enhance readability and provides more descriptive help messages for various parameters, improving the user experience.
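
For readers who want to see how per-device memory caps and the disk-offload guard fit together, here is a minimal sketch. It is not this PR's actual code: the build_max_memory helper, the loading call, and the error message are assumptions, while --max_gpu_memory and --max_cpu_memory match the new flags listed above and the placement logic uses the public transformers/accelerate API.

```python
# Minimal sketch (not this PR's implementation) of capping per-device memory
# and refusing disk offload when loading a model for GPTQ quantization.
import argparse
import torch
from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True)
parser.add_argument(
    "--max_gpu_memory", default=None,
    help="Maximum GPU memory for model weights per device, e.g. '40GiB'.")
parser.add_argument(
    "--max_cpu_memory", default=None,
    help="Maximum CPU RAM for offloaded weights, e.g. '200GiB'.")
args = parser.parse_args()


def build_max_memory(max_gpu, max_cpu):
    """Hypothetical helper: give every visible GPU (and the CPU) a budget."""
    max_memory = {}
    if max_gpu is not None:
        for gpu_id in range(torch.cuda.device_count()):
            max_memory[gpu_id] = max_gpu
    if max_cpu is not None:
        max_memory["cpu"] = max_cpu
    return max_memory or None


# device_map="auto" lets accelerate fill the GPUs first and spill the rest
# to CPU RAM, but never beyond the budgets given in max_memory.
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    torch_dtype="auto",
    device_map="auto",
    max_memory=build_max_memory(args.max_gpu_memory, args.max_cpu_memory),
)

# llmcompressor cannot quantize disk-offloaded modules, so fail fast with a
# readable message if any layer did not fit into GPU/CPU memory.
if any(device == "disk" for device in model.hf_device_map.values()):
    raise RuntimeError(
        "Model does not fit within the given memory limits; raise "
        "--max_cpu_memory or run on a machine with more RAM."
    )
```

As the help text quoted further down suggests, something like --max_gpu_memory 40GiB on an 80 GB GPU leaves the remaining VRAM free for GPTQ's Hessian computation.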

@ovowei ovowei requested a review from KMSorSMS November 24, 2025 12:41
@gemini-code-assist (bot) left a comment


Code Review

This pull request effectively resolves a potential Out-Of-Memory (OOM) error when converting GPU weights by introducing --max_gpu_memory and --max_cpu_memory arguments. The implementation is robust, preventing unsupported disk offloading and providing clear, user-friendly error messages. The accompanying documentation updates in README.md are comprehensive and will be very helpful for users. The code has also been nicely reformatted for better readability. I have one minor suggestion to correct a typo in a help string.

default=None,
help="Maximum GPU memory for model weights per device (e.g., '40GiB'). "
"GPTQ quantization requires additional GPU memory for Hessian matrix computation, "
"so reserve 40-50%% of total VRAM. For example, use '40GiB' on 80GB GPUs. "

Severity: medium

There's a typo in the help string. The %% should be a single % to represent a percentage.

Suggested change
"so reserve 40-50%% of total VRAM. For example, use '40GiB' on 80GB GPUs. "
"so reserve 40-50% of total VRAM. For example, use '40GiB' on 80GB GPUs. "

@KMSorSMS (Collaborator)

See the issue. The bug still remains.

@ovowei (Contributor, Author) commented Nov 28, 2025

@CodeZ-Hao @KMSorSMS I've updated the script and verified on my machine (1 TB DRAM + L20) that both --force_cpu on and --force_cpu off successfully complete the GLM-4.6 quantization. Could you try again with this PR and see if the issue persists?

@KMSorSMS (Collaborator) left a comment


Ready to merge.

@CodeZ-Hao commented Dec 1, 2025

@ovowei I tested on my machine: with --max_gpu_memory 12GB it still hits CUDA out of memory, and with --force_cpu it reports insufficient memory. Could that be because my machine only has 384 GB of RAM?
My setup: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM
CUDA version 12.6

@KMSorSMS (Collaborator) commented Dec 1, 2025

> @ovowei I tested on my machine: with --max_gpu_memory 12GB it still hits CUDA out of memory, and with --force_cpu it reports insufficient memory. Could that be because my machine only has 384 GB of RAM? My setup: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM, CUDA version 12.6

I think so. We're considering adding a resume operation.

@ovowei (Contributor, Author) commented Dec 1, 2025

@CodeZ-Hao @KMSorSMS The memory requirement comes from needing to hold the entire model in CPU RAM during quantization. GLM-4.6 has roughly 357B full-precision parameters, so at 2 bytes per parameter (bf16) you need roughly 357 × 2 ≈ 714 GB of available system memory to run the quantization pipeline successfully.
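
As a quick back-of-the-envelope check (the 357B parameter figure is the approximation from the paragraph above):

```python
# Rough RAM needed just to hold GLM-4.6's bf16 weights during quantization.
params_in_billions = 357   # approximate GLM-4.6 parameter count
bytes_per_param = 2        # bf16 stores each parameter in 2 bytes
print(f"~{params_in_billions * bytes_per_param} GB of system RAM")  # ~714 GB
```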

This script is based on llmcompressor — we simply call the quantization interfaces it provides. For more details on the underlying workflow, please refer to the official guide:
https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress
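
For orientation, the kind of llm-compressor call such a conversion script wraps looks roughly like the sketch below. It follows the linked getting-started guide rather than this repository's code; the model path, calibration dataset, and quantization scheme are placeholders.

```python
# Sketch of a one-shot GPTQ quantization with llm-compressor,
# following the linked getting-started guide.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize the Linear layers...
    scheme="W4A16",        # ...to 4-bit weights with 16-bit activations
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="path/to/GLM-4.6",     # placeholder model path
    dataset="open_platypus",     # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```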

I think features like resume or disk offloading would need to be supported natively by llm-compressor. Because of this, resume support is not on our roadmap right now.

However, we will upload a pre-quantized GLM-4.6 GPTQ model to HuggingFace/ModelScope soon.

@CodeZ-Hao

ok, tks @ovowei

@ovowei ovowei merged commit fd78fe5 into main Dec 1, 2025
8 of 12 checks passed
@KMSorSMS KMSorSMS deleted the update-scripts-djw branch December 1, 2025 06:20


Development

Successfully merging this pull request may close these issues.

convert_gpu_weights.py crashed by CUDA out of memory, even with --force_cpu
