
Conversation

@ovowei (Contributor) commented Nov 24, 2025

Fixes #1635

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ovowei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request directly addresses Out-Of-Memory (OOM) issues that can occur during the GPU weight conversion process, particularly when performing GPTQ quantization. It introduces robust memory management features, allowing users to define specific GPU and CPU memory limits, and provides clear instructions for optimizing these settings. The changes also ensure that the model loading process avoids unsupported disk offloading, thereby improving the stability and reliability of the weight conversion script.

Highlights

  • Memory Management for GPU Weight Conversion: Introduces new command-line arguments, --max_gpu_memory and --max_cpu_memory, to the convert_gpu_weights.py script, enabling users to precisely control memory allocation during GPTQ quantization and mitigate Out-Of-Memory (OOM) errors (see the sketch after this list).
  • Enhanced Documentation: Adds a comprehensive 'Memory Management' section to the README.md, providing clear guidance on using the new memory parameters, recommended settings for various GPU VRAM configurations, and practical troubleshooting tips for OOM issues.
  • Prevention of Disk Offloading: Implements explicit checks and configurations within convert_gpu_weights.py to detect and prevent unsupported disk offloading during device map inference and model loading, ensuring that models are loaded entirely into available GPU/CPU memory as required by llmcompressor.
  • Improved Argument Parsing Readability: Refactors the argument parsing in convert_gpu_weights.py to enhance readability and provides more descriptive help messages for various parameters, improving the user experience.
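
For readers who want to see how per-device memory caps and the disk-offload guard fit together, here is a minimal sketch. It is not this PR's actual code: the build_max_memory helper, the loading call, and the error message are assumptions, while --max_gpu_memory and --max_cpu_memory match the new flags listed above and the placement logic uses the public transformers/accelerate API.

```python
# Minimal sketch (not this PR's implementation) of capping per-device memory
# and refusing disk offload when loading a model for GPTQ quantization.
import argparse
import torch
from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True)
parser.add_argument(
    "--max_gpu_memory", default=None,
    help="Maximum GPU memory for model weights per device, e.g. '40GiB'.")
parser.add_argument(
    "--max_cpu_memory", default=None,
    help="Maximum CPU RAM for offloaded weights, e.g. '200GiB'.")
args = parser.parse_args()


def build_max_memory(max_gpu, max_cpu):
    """Hypothetical helper: give every visible GPU (and the CPU) a budget."""
    max_memory = {}
    if max_gpu is not None:
        for gpu_id in range(torch.cuda.device_count()):
            max_memory[gpu_id] = max_gpu
    if max_cpu is not None:
        max_memory["cpu"] = max_cpu
    return max_memory or None


# device_map="auto" lets accelerate fill the GPUs first and spill the rest
# to CPU RAM, but never beyond the budgets given in max_memory.
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    torch_dtype="auto",
    device_map="auto",
    max_memory=build_max_memory(args.max_gpu_memory, args.max_cpu_memory),
)

# llmcompressor cannot quantize disk-offloaded modules, so fail fast with a
# readable message if any layer did not fit into GPU/CPU memory.
if any(device == "disk" for device in model.hf_device_map.values()):
    raise RuntimeError(
        "Model does not fit within the given memory limits; raise "
        "--max_cpu_memory or run on a machine with more RAM."
    )
```

As the help text quoted further down suggests, something like --max_gpu_memory 40GiB on an 80 GB GPU leaves the remaining VRAM free for GPTQ's Hessian computation.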

@ovowei ovowei requested a review from KMSorSMS November 24, 2025 12:41
@gemini-code-assist (bot) left a comment


Code Review

This pull request effectively resolves a potential Out-Of-Memory (OOM) error when converting GPU weights by introducing --max_gpu_memory and --max_cpu_memory arguments. The implementation is robust, preventing unsupported disk offloading and providing clear, user-friendly error messages. The accompanying documentation updates in README.md are comprehensive and will be very helpful for users. The code has also been nicely reformatted for better readability. I have one minor suggestion to correct a typo in a help string.

default=None,
help="Maximum GPU memory for model weights per device (e.g., '40GiB'). "
"GPTQ quantization requires additional GPU memory for Hessian matrix computation, "
"so reserve 40-50%% of total VRAM. For example, use '40GiB' on 80GB GPUs. "

Severity: medium

There's a typo in the help string. The %% should be a single % to represent a percentage.

Suggested change
"so reserve 40-50%% of total VRAM. For example, use '40GiB' on 80GB GPUs. "
"so reserve 40-50% of total VRAM. For example, use '40GiB' on 80GB GPUs. "

@KMSorSMS (Collaborator)

See the issue. The bug still remains.

@ovowei (Contributor, Author) commented Nov 28, 2025

@CodeZ-Hao @KMSorSMS I've updated the script and verified on my machine (1 TB DRAM + L20) that both --force_cpu on and --force_cpu off successfully complete the GLM-4.6 quantization. Could you try again with this PR and see if the issue persists?

@KMSorSMS (Collaborator) left a comment


Ready to merge.

@CodeZ-Hao commented Dec 1, 2025

@ovowei I tested on my machine: with --max_gpu_memory 12GB it still hits CUDA out of memory, and with --force_cpu it reports insufficient memory. Could that be because my machine only has 384 GB of RAM?
My setup: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM
CUDA version 12.6

@KMSorSMS (Collaborator) commented Dec 1, 2025

> @ovowei I tested on my machine: with --max_gpu_memory 12GB it still hits CUDA out of memory, and with --force_cpu it reports insufficient memory. Could that be because my machine only has 384 GB of RAM? My setup: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM, CUDA version 12.6

I think so. We're considering adding a resume operation.

@ovowei (Contributor, Author) commented Dec 1, 2025

@CodeZ-Hao @KMSorSMS The memory requirement comes from needing to hold the entire model in CPU RAM during quantization. GLM-4.6 has roughly 357B full-precision parameters, so at 2 bytes per parameter (bf16) you need roughly 357 × 2 ≈ 714 GB of available system memory to run the quantization pipeline successfully.
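
As a quick back-of-the-envelope check (the 357B parameter figure is the approximation from the paragraph above):

```python
# Rough RAM needed just to hold GLM-4.6's bf16 weights during quantization.
params_in_billions = 357   # approximate GLM-4.6 parameter count
bytes_per_param = 2        # bf16 stores each parameter in 2 bytes
print(f"~{params_in_billions * bytes_per_param} GB of system RAM")  # ~714 GB
```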

This script is based on llmcompressor — we simply call the quantization interfaces it provides. For more details on the underlying workflow, please refer to the official guide:
https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress
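
For orientation, the kind of llm-compressor call such a conversion script wraps looks roughly like the sketch below. It follows the linked getting-started guide rather than this repository's code; the model path, calibration dataset, and quantization scheme are placeholders.

```python
# Sketch of a one-shot GPTQ quantization with llm-compressor,
# following the linked getting-started guide.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize the Linear layers...
    scheme="W4A16",        # ...to 4-bit weights with 16-bit activations
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="path/to/GLM-4.6",     # placeholder model path
    dataset="open_platypus",     # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```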

I think features like resume or disk offloading would need to be supported natively by llm-compressor. Because of this, resume support is not on our roadmap right now.

However, we will upload a pre-quantized GLM-4.6 GPTQ model to HuggingFace/ModelScope soon.

@CodeZ-Hao

ok, tks @ovowei

@ovowei ovowei merged commit fd78fe5 into main Dec 1, 2025
8 of 12 checks passed
@KMSorSMS KMSorSMS deleted the update-scripts-djw branch December 1, 2025 06:20


Development

Successfully merging this pull request may close these issues.

convert_gpu_weights.py crashed by CUDA out of memory, even with --force_cpu
