fix OOM when converting gpu weights #1640
Conversation
Summary of Changes

Hello @ovowei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses Out-Of-Memory (OOM) issues that can occur during GPU weight conversion, particularly when performing GPTQ quantization. It introduces memory-management options that let users set explicit GPU and CPU memory limits, along with instructions for tuning them. The changes also ensure that model loading avoids unsupported disk offloading, improving the stability and reliability of the weight conversion script.
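For context on how such limits are typically wired up, here is a minimal sketch (my own illustration, not the code from this PR) of turning `--max_gpu_memory` / `--max_cpu_memory` style values into a transformers `max_memory` map and refusing to proceed when Accelerate would offload weights to disk. The model id, the helper name `build_max_memory`, and the concrete memory values are assumptions.

```python
# Sketch only: illustrates the general pattern, not this PR's implementation.
import torch
from transformers import AutoModelForCausalLM

def build_max_memory(max_gpu_memory, max_cpu_memory):
    """Map CLI strings like '40GiB' onto a transformers `max_memory` dict."""
    max_memory = {}
    if max_gpu_memory is not None:
        for gpu_id in range(torch.cuda.device_count()):
            max_memory[gpu_id] = max_gpu_memory       # per-device GPU cap
    if max_cpu_memory is not None:
        max_memory["cpu"] = max_cpu_memory            # system RAM cap
    return max_memory or None

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.6",                                # illustrative model id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=build_max_memory("40GiB", "600GiB"),   # illustrative values
)

# Accelerate marks layers that did not fit into the GPU/CPU budgets as "disk".
# Disk offloading is not supported by the quantization flow, so fail early
# with an actionable message instead of crashing later.
if any(v == "disk" for v in getattr(model, "hf_device_map", {}).values()):
    raise RuntimeError(
        "Some weights would be offloaded to disk. Increase --max_gpu_memory / "
        "--max_cpu_memory or run on a machine with more memory."
    )
```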
Code Review
This pull request effectively resolves a potential Out-Of-Memory (OOM) error when converting GPU weights by introducing --max_gpu_memory and --max_cpu_memory arguments. The implementation is robust, preventing unsupported disk offloading and providing clear, user-friendly error messages. The accompanying documentation updates in README.md are comprehensive and will be very helpful for users. The code has also been nicely reformatted for better readability. I have one minor suggestion to correct a typo in a help string.
```python
default=None,
help="Maximum GPU memory for model weights per device (e.g., '40GiB'). "
"GPTQ quantization requires additional GPU memory for Hessian matrix computation, "
"so reserve 40-50%% of total VRAM. For example, use '40GiB' on 80GB GPUs. "
```
See the issue. The bug still remains.
@CodeZ-Hao @KMSorSMS I've updated the script and verified on my machine (1 TB DRAM + an L20 GPU) that the GLM-4.6 quantization completes successfully with --force_cpu both on and off. Could you try again with this PR and see if the issue persists?
KMSorSMS left a comment
Ready to merge.
@ovowei I tested on my device: with --max_gpu_memory 12GB it still hits CUDA out of memory, and with --force_cpu it reports insufficient host memory. Could that be because my machine only has 384 GB of RAM?
I think so. We're considering adding a resume operation.
@CodeZ-Hao @KMSorSMS The memory requirement comes from needing to hold the entire model in CPU RAM during quantization. GLM-4.6 has roughly 357B full-precision parameters, so at 2 bytes per parameter (bf16) you need about 714 GB of available system memory to run the quantization pipeline successfully. This script is based on llm-compressor; we simply call the quantization interfaces it provides. For more details on the underlying workflow, please refer to the official llm-compressor guide. Features like resume or disk offloading would need to be supported natively by llm-compressor, so resume support is not on our roadmap right now. However, we will upload a pre-quantized GLM-4.6 GPTQ model to HuggingFace/ModelScope soon.
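To make the arithmetic above explicit, here is a quick back-of-the-envelope check (the 357B parameter count is taken from the comment above; the headroom note is my own caveat):

```python
# Rough RAM estimate for holding GLM-4.6 in bf16 on the CPU during quantization.
params = 357e9               # ~357B parameters, as quoted above
bytes_per_param = 2          # bf16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB just for the bf16 weights")  # ~714 GB
# On top of that, activations, GPTQ Hessian statistics, and the OS need headroom,
# which is why 384 GB of system RAM is not enough for this pipeline.
```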
ok, thanks @ovowei
Fixes #1635