fix: Migrating trtllm examples from 1.0.0rc0 to 1.0.4rc4
#2217
Conversation
Walkthrough

This change restructures multiple YAML configuration files for TensorRT-LLM backends by consolidating CUDA graph and KV cache settings into nested dictionaries, standardizing dtype specification, and introducing or modifying nested sections such as `cuda_graph_config` and `kv_cache_config`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ConfigFile
    participant BackendEngine
    User->>ConfigFile: Edits YAML (adds cuda_graph_config, dtype, etc.)
    ConfigFile->>BackendEngine: Loads nested config structure
    BackendEngine->>BackendEngine: Parses cuda_graph_config, kv_cache_config
    BackendEngine->>BackendEngine: Applies settings (max_batch_size, dtype, etc.)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 5
🔭 Outside diff range comments (3)
components/backends/trtllm/engine_configs/agg.yaml (1)
19-20: Duplicate `max_batch_size` – pick one source of truth.

`max_batch_size` is declared both at the top level (Line 19) and inside `cuda_graph_config` (Line 34). Leaving both leads to ambiguous or loader-specific precedence rules. Suggested fix: drop the top-level key and keep the nested value.
```diff
-max_batch_size: 16
```

Also applies to: 33-34
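For reference, a minimal sketch of the deduplicated hunk, assuming the nested value is the one kept (surrounding keys beyond what the diff shows are omitted):

```yaml
# agg.yaml (sketch) – max_batch_size lives only under cuda_graph_config
cuda_graph_config:
  max_batch_size: 16
```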
components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml (1)
38-50: Out-of-date CUDA-graph keys still present – migrate to `cuda_graph_config`.

`use_cuda_graph`, `cuda_graph_padding_enabled`, `cuda_graph_batch_sizes`, and `print_iter_log` were removed from the API.
Keeping them will cause loader failures on ≥ 1.0.4.

```diff
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-print_iter_log: true
+cuda_graph_config:
+  max_batch_size: 256
```

components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml (1)
23-25: Out-of-date keys: migrate to 1.0.4 schema
`kv_cache_dtype` and `autotuner_enabled` are legacy names. The new schema expects:

```yaml
kv_cache_config:
  free_gpu_memory_fraction: 0.5
  enable_block_reuse: false
  dtype: fp8             # <-- migrate here
enable_autotuner: false  # <-- renamed
```

Retaining the old keys will lead to “unknown field” errors.
```diff
-kv_cache_dtype: fp8
-disable_overlap_scheduler: true
-autotuner_enabled: false
+disable_overlap_scheduler: true
+enable_autotuner: false
+
+kv_cache_config:
+  free_gpu_memory_fraction: 0.5
+  enable_block_reuse: false
+  dtype: fp8
```
🧹 Nitpick comments (11)
components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml (1)
36-36: Add trailing newline to satisfy YAML-lint.
The file currently ends without a newline, which triggers the `new-line-at-end-of-file` rule.

```diff
-print_iter_log: true
+print_iter_log: true
+
```

components/backends/trtllm/engine_configs/prefill.yaml (1)
24-30: Minor formatting – missing EOF newline.
Identical linter warning as elsewhere; add one blank line after `backend: default`.

```diff
 backend: default
+
```

components/backends/trtllm/engine_configs/agg.yaml (1)
33-34: Trailing newline missing.
Same YAML-lint error; append one newline.

```diff
 max_batch_size: 16
+
```

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml (1)
14-17: Check FP8 availability comment
`dtype: fp8` is valid only on Hopper/Blackwell or later GPUs and requires `--fp8_kv_cache` enablement at runtime.
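One possible form of that inline note, as a sketch (the `kv_cache_config` nesting here is an assumption carried over from the other DeepSeek configs):

```yaml
kv_cache_config:
  # FP8 KV cache requires Hopper/Blackwell-class or newer GPUs;
  # older hardware should fall back to the default dtype.
  dtype: fp8
```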
Consider adding an inline comment along those lines (as done in other files) or a guard in the launcher script to avoid silent mis-config on older HW.

components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml (1)
30-32: Consistency: add HW note for `dtype: fp8`

Mirror the explanatory comment you placed in other DeepSeek configs so future maintainers know why FP8 is chosen and on which GPUs it is safe.
components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml (1)
34-34: Minor: duplicate HW disclaimer

For parity with other files, add the standard comment explaining FP8 support requirements just above this line.
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml (1)
41-41: No trailing newline — YAML-lint error

Static analysis flagged the missing newline at EOF. Add one to keep linters green.
```diff
-print_iter_log: true
+print_iter_log: true
+
```

components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml (1)
34-34: Add FP8 support comment

Replicate the FP8 hardware support note for clarity.
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml (1)
54-54: Missing trailing newline
YAML-lint flags the absence of a newline at EOF. Add one to keep linters quiet.

```diff
-print_iter_log: true
+print_iter_log: true
+
```

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml (1)
28-39: CUDA-graph block looks good but newline missing
Config is valid; however, the file again lacks a trailing newline – fix to satisfy linters.

```diff
 - 256
+
```

components/backends/trtllm/engine_configs/decode.yaml (1)
30-31: Add missing newline at end of file.

The file is missing a newline character at the end, which violates YAML formatting conventions.
```diff
 cache_transceiver_config:
-  backend: default
+  backend: default
+
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- components/backends/trtllm/engine_configs/agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_prefill.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml (2 hunks)
- components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml (2 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml (1 hunks)
- components/backends/trtllm/engine_configs/prefill.yaml (1 hunks)
- components/backends/trtllm/src/dynamo/trtllm/main.py (1 hunks)
- container/build.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
container/build.sh (2)
Learnt from: ptarasiewiczNV
PR: #2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The --torch-backend=auto flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Learnt from: zaristei
PR: #2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml (1)
Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the resources section under ServiceArgs specifies resources per individual worker, not total resources for all workers. So workers: 8 with gpu: '1' means 8 workers × 1 GPU each = 8 GPUs total.
components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml (1)
Learnt from: ptarasiewiczNV
PR: #2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The --torch-backend=auto flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
🪛 YAMLlint (1.37.1)
components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml
[error] 36-36: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/agg.yaml
[error] 34-34: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/prefill.yaml
[error] 30-30: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml
[error] 41-41: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml
[error] 54-54: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml
[error] 39-39: no new line character at the end of file
(new-line-at-end-of-file)
components/backends/trtllm/engine_configs/decode.yaml
[error] 31-31: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (17)
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_prefill.yaml (1)
28-31: `dtype` nesting matches the 1.0.4rc4 schema – looks good.
No further issues spotted in this hunk.

components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml (1)
27-30: Correct relocation of `dtype` into `kv_cache_config`.
The change is compliant with the new spec.

components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml (1)
42-56: Verify field name correctness

This file uses `enable_padding`, whereas `simple/agg.yaml` uses `enable_graph_padding`.
Please confirm which spelling the trtllm loader accepts in 1.0.4rc4 and normalise across all configs to avoid hard-to-debug mismatches.
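If `enable_padding` turns out to be the accepted spelling, a normalised block could look like this sketch (key names and values still need to be confirmed against the 1.0.4rc4 loader):

```yaml
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
```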
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml (1)

32-35: Changed `free_gpu_memory_fraction` from 0.75 → 0.3

This is a large swing; make sure the reduction is intentional and tested, otherwise you may hit OOM on high-concurrency prefill.
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml (2)
34-34: `dtype` moved under `kv_cache_config` – good migration
Placing `dtype: fp8` inside `kv_cache_config` brings this file in line with the 1.0.4 schema. No issues spotted.
41-53: Double-check CUDA-graph batch list vs. `max_batch_size`
`max_batch_size` (Line 25) is 256, yet the largest graph batch registered is also 256. Any future increase of `max_batch_size` without expanding this list will silently fall back to eager mode and negate CUDA-graph benefits. Consider adding a comment or helper script guarding this invariant, for example as sketched below.
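A lightweight option is an inline note next to the batch-size list; the sketch below assumes the nested `cuda_graph_config` layout and uses placeholder sizes:

```yaml
cuda_graph_config:
  # Keep the largest entry below equal to max_batch_size (currently 256);
  # larger batches silently fall back to eager mode otherwise.
  batch_sizes:
    # ... smaller sizes elided ...
    - 256
```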
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml (2)

6-15: Verify `backend: WIDEEP` spelling
Upstream docs use `WidEEP`/`WideEP`. Confirm that the all-caps `WIDEEP` is what 1.0.4 expects; otherwise MoE routing will fail at load time.
25-26: Dramatic drop of `free_gpu_memory_fraction` to 0.3
Going from 0.7 ➜ 0.3 frees memory for DP attention but can throttle throughput due to smaller KV pages. Make sure this is intentional and tested.

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml (3)
18-21: Backend naming & path sanity
Same `WIDEEP` spelling question as in the agg file. Also ensure the absolute path in `load_balancer` exists inside the container image.
39-39: Nice: KV-cache dtype now scoped
`dtype: fp8` under `kv_cache_config` is the correct 1.0.4 placement.
47-60: CUDA-graph section OK
Padding is enabled and the batch sizes cover the advertised range. No action required.

container/build.sh (1)
97-97: Update to `tensorrt-llm==1.0.0rc4` acknowledged
The version bump matches the configs. Ensure any cached wheel directories or pinned commit hashes still correspond to RC4 to avoid ABI mismatches.

components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml (1)
32-32: Lower-case boolean fixed – thanks
The YAML boolean is now valid.

components/backends/trtllm/src/dynamo/trtllm/main.py (1)
104-107: LGTM! Proper dictionary key access implementation.

The change correctly updates from attribute-style access to dictionary key checking, which aligns with the configuration restructuring in this TensorRT-LLM migration. The logic properly sets the default buffer size only when the key is absent.
components/backends/trtllm/engine_configs/decode.yaml (1)
24-25: LGTM! Proper CUDA graph configuration consolidation.

The restructuring of CUDA graph settings into a nested `cuda_graph_config` dictionary aligns with the API changes in TensorRT-LLM 1.0.4rc4 and improves configuration organization.

components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml (2)
24-24: LGTM! Parameter rename aligns with API update.

The rename from `autotuner_enabled` to `enable_autotuner` follows the TensorRT-LLM 1.0.4rc4 API conventions mentioned in the PR objectives.
39-40: LGTM! CUDA graph configuration consolidation.

The restructuring into nested `cuda_graph_config` is consistent with the migration to TensorRT-LLM 1.0.4rc4 and matches the pattern used across other configuration files.
Overview:
This PR upgrades the TensorRT-LLM examples from `trtllm=1.0.0rc0` to `trtllm=1.0.0rc4`. The migration involves updating configuration files to align with API changes between these release candidates.

Details:
Mainly consists of config changes, such as:
- The `use_cuda_graph` field is now implicitly enabled by the presence of the `cuda_graph_config` field.
- The `autotuner_enabled` parameter has been renamed to `enable_autotuner` (a brief before/after sketch follows below).
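As an illustration of those two changes, a hedged before/after sketch (field values are placeholders rather than copies of any specific config in this PR):

```yaml
# 1.0.0rc0 style
use_cuda_graph: true
autotuner_enabled: true
---
# 1.0.0rc4 style – the presence of cuda_graph_config implies CUDA graphs
cuda_graph_config:
  max_batch_size: 16   # placeholder
enable_autotuner: true
```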
Summary by CodeRabbit

Configuration Updates
Dependency Update