fix(ci): Reduce the free gpu memory fraction #2433
Conversation
With the new cache_transceiver config, this high free_gpu_memory_fraction can lead to OOM issues.
Does this reserve some GPU memory as well? Are there any guides/tips on estimating the combination of free_gpu_memory_fraction and cache_transceiver_config?
I wonder if we should consider this a bug to raise with TRTLLM, such that free_gpu_memory_fraction should only be considered AFTER all the other things (model weights, cache transceiver config, etc.) so that it doesn't need to be tuned as much based on the others.
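Not an authoritative answer to the estimation question above, but a minimal sketch of how the two settings end up competing for the same memory in one of these engine configs. The cache_transceiver_config field names (backend, max_tokens_in_buffer) follow my reading of TRT-LLM's llmapi and may differ by version, and all numbers in the comments are illustrative, not measured.

```yaml
# Sketch of the interaction on an 80 GB A100 (illustrative numbers, assumed semantics):
# free_gpu_memory_fraction is applied to the GPU memory still free after the model is
# loaded, but it does not account for the cache transceiver's staging buffers, which
# must also fit in whatever remains.
#
# Rough arithmetic: 80 GB total, ~20 GB weights/activations -> ~60 GB free.
#   0.95 * 60 GB = 57 GB of KV cache, ~3 GB left for transceiver buffers (OOM-prone)
#   0.85 * 60 GB = 51 GB of KV cache, ~9 GB of headroom
kv_cache_config:
  free_gpu_memory_fraction: 0.85

cache_transceiver_config:
  backend: DEFAULT            # field names per TRT-LLM llmapi; verify against your version
  max_tokens_in_buffer: 8192  # staging buffer sized in tokens; also consumes GPU memory
```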
Walkthrough
Updated kv_cache_config.free_gpu_memory_fraction from 0.95 to 0.85 across three TRT-LLM engine configuration YAMLs: agg.yaml, decode.yaml, and prefill.yaml. No other configuration fields were modified.
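For reference, the change the walkthrough describes looks roughly like this in each of the three files; the excerpt below shows only the field that changed and omits the rest of the config.

```yaml
# components/backends/trtllm/engine_configs/{agg,decode,prefill}.yaml (excerpt)
kv_cache_config:
  free_gpu_memory_fraction: 0.85   # previously 0.95; lowered to leave headroom for cache transceiver buffers
```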
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
components/backends/trtllm/engine_configs/prefill.yaml (1)
26-27: Document KV-cache change and validate prefill capacity
Quick summary: I searched the repo; the trtllm prefill engine config now sets free_gpu_memory_fraction: 0.85, and no other prefill engine configs are still set to 0.95. Please add the inline rationale comment and run the runtime sanity checks below.
Files to inspect / keep consistent:
- components/backends/trtllm/engine_configs/prefill.yaml — update here (primary)
- Notable repo occurrences found (for context):
- components/backends/trtllm/engine_configs/decode.yaml, agg.yaml — 0.85
- components/backends/trtllm/engine_configs/multimodal/*/prefill.yaml — 0.30
- components/backends/trtllm/engine_configs/deepseek_r1/*/prefill.yaml — 0.75 / 0.30
- components/backends/trtllm/engine_configs/llama4/**/eagle_prefill.yaml — 0.5
Proposed inline comment (apply to components/backends/trtllm/engine_configs/prefill.yaml):
```diff
 kv_cache_config:
-  free_gpu_memory_fraction: 0.85
+  # Lowered free GPU memory fraction to 0.85 to reduce OOM risk for long prefill sequences.
+  # Verify target concurrency and long-context workloads still fit with this KV allocation.
+  free_gpu_memory_fraction: 0.85
```

Sanity checks to run (manual / CI):
- Replay representative prefill-heavy workloads (e.g., max_num_tokens=8192) at expected batch sizes and concurrency; confirm no regression in throughput and no OOMs (see the config sketch below).
- If using MIG, validate MIG profiles remain stable with this fraction.
- If you use the perf_sweeps tools, regenerate/test configs that inject ctx_free_gpu_memory_fraction (scripts in components/backends/trtllm/performance_sweeps/) to ensure automation matches this default.
Repo check performed:
- Searched for free_gpu_memory_fraction across repo; no other prefill engine configs were left at 0.95.
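To make the first sanity check above more concrete, a prefill-stress excerpt along these lines could be replayed; max_num_tokens matches the value mentioned in the check, while the other values are placeholders to be replaced with the expected production settings.

```yaml
# Illustrative settings for exercising the lowered fraction under prefill-heavy load.
# Only max_num_tokens comes from the check above; the rest are placeholders.
max_num_tokens: 8192               # long prefill sequences
max_batch_size: 16                 # set to the expected production batch size
kv_cache_config:
  free_gpu_memory_fraction: 0.85   # the new default under test
```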
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- components/backends/trtllm/engine_configs/agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/prefill.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Overview:
With the new cache_transceiver config, this high free_gpu_memory_fraction can lead to OOM issues on A100 machines.
Updating to reasonable defaults.