feat: add NIM FE (num_request_max) + runtime config metrics with periodic polling #3107
Conversation
Walkthrough
Adds per-model runtime-configuration and health metrics to HTTP metrics, registers new Prometheus gauges, and starts a background polling task from service initialization to periodically collect ModelManager and optional etcd (MDC) data, updating metrics and marking model health accordingly. The poll interval is set via DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Env as Env
participant Service as HttpService::build
participant Metrics as Metrics
participant Manager as ModelManager
participant Etcd as etcd::Client
participant Prom as Prometheus
note over Service: Startup
Env->>Service: Read DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS
Service->>Metrics: start_runtime_config_polling_task(metrics, manager, etcd?, interval)
activate Metrics
Metrics->>Metrics: Spawn background task (loop every interval)
deactivate Metrics
rect rgb(235, 245, 255)
note right of Metrics: Polling cycle
loop Every interval
Metrics->>Manager: list/get ModelEntry items
Manager-->>Metrics: ModelEntry[]
alt etcd client present
Metrics->>Etcd: Load MDC (context_length, kv_block_size, migration_limit)
Etcd-->>Metrics: MDC values
Metrics->>Metrics: update_mdc_metrics(model_name, ...)
end
Metrics->>Metrics: update_runtime_config_metrics(model_name, ...)
Metrics->>Metrics: set_model_health_status(model_name, healthy=true)
Metrics->>Prom: Update gauge vectors for models
Metrics->>Metrics: mark inactive models healthy=false
Metrics->>Prom: Update health gauges
end
end
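
For illustration, registering one of the per-model gauges described in the walkthrough might look roughly like this with the prometheus crate (the metric name, help text, and registry wiring here are assumptions, not the PR's actual code):

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// Illustrative sketch only: one per-model gauge from the walkthrough,
// labeled solely by "model". Names are assumed, not taken from the PR.
fn register_total_kv_blocks_gauge(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "model_total_kv_blocks",
            "Total KV cache blocks reported for a model's workers",
        ),
        &["model"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}
```

On each poll cycle the background task would then call with_label_values(&[model]).set(value) on gauges like this one, and update the health gauge when marking inactive models unhealthy.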
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (5)
lib/llm/src/http/service/metrics.rs (4)
138-153: Doc nit: unknown health is supported but never emitted by the poller
You expose -1 = unknown and a setter, but the polling path never sets unknown. Consider when/if you want unknown surfaced (e.g., before first successful poll, or when the manager is temporarily unavailable).
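If unknown should be surfaced, a minimal sketch (assuming the existing -1/0/1 encoding and an IntGaugeVec health field; the helper name is illustrative) could be:

```rust
use prometheus::IntGaugeVec;

// Health encoding per the metric docs: 1 = healthy, 0 = unhealthy, -1 = unknown.
const HEALTH_UNKNOWN: i64 = -1;

// Hypothetical helper: mark a model unknown until its first successful poll.
fn set_model_health_unknown(health: &IntGaugeVec, model_name: &str) {
    health.with_label_values(&[model_name]).set(HEALTH_UNKNOWN);
}
```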
262-325: Metric suffix constants: align with repo naming registry for consistency
To keep naming consistent across the codebase (per prior standardization), define suffix constants in prometheus_names::frontend_service and use them here instead of string literals.
Would you like a follow‑up patch adding constants like FRONTEND_MODEL_TOTAL_KV_BLOCKS, etc., and wiring them here?
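For example (a sketch; the module path and constant names below are assumptions mirroring the FRONTEND_MODEL_TOTAL_KV_BLOCKS suggestion above):

```rust
// Hypothetical additions to prometheus_names; names are illustrative only.
pub mod frontend_service {
    pub const MODEL_TOTAL_KV_BLOCKS: &str = "model_total_kv_blocks";
    pub const MODEL_MAX_NUM_SEQS: &str = "model_max_num_seqs";
    pub const MODEL_MAX_NUM_BATCHED_TOKENS: &str = "model_max_num_batched_tokens";
    pub const MODEL_CONTEXT_LENGTH: &str = "model_context_length";
    pub const MODEL_KV_CACHE_BLOCK_SIZE: &str = "model_kv_cache_block_size";
    pub const MODEL_MIGRATION_LIMIT: &str = "model_migration_limit";
}
```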
470-491: Normalize model label for MDC metrics
Keep label casing consistent with other metrics to avoid duplicate series.
Apply:
```diff
 pub fn update_mdc_metrics(
     &self,
     model_name: &str,
     context_length: u32,
     kv_cache_block_size: u32,
     migration_limit: u32,
 ) {
-    self.model_context_length
-        .with_label_values(&[model_name])
+    let model = model_name.to_lowercase();
+    self.model_context_length
+        .with_label_values(&[&model])
         .set(context_length as i64);
-    self.model_kv_cache_block_size
-        .with_label_values(&[model_name])
+    self.model_kv_cache_block_size
+        .with_label_values(&[&model])
         .set(kv_cache_block_size as i64);
-    self.model_migration_limit
-        .with_label_values(&[model_name])
+    self.model_migration_limit
+        .with_label_values(&[&model])
         .set(migration_limit as i64);
 }
```
553-623: Poller robustness: missed ticks + initial run semantics
- Set MissedTickBehavior::Delay to avoid burst catch‑up if the loop runs long.
- Confirm whether the first tick is immediate in your Tokio version; if not, perform an initial poll before the loop.
Apply:
```diff
 tokio::spawn(async move {
-    let mut interval = tokio::time::interval(poll_interval);
+    let mut interval = tokio::time::interval(poll_interval);
+    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
```

Optional: run one poll before the loop to avoid waiting a full interval on startup.
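A self-contained sketch combining both points, with the per-cycle work abstracted behind a placeholder closure (not the PR's actual poller):

```rust
use std::time::Duration;

// Illustrative only: delay missed ticks and rely on Tokio's interval yielding
// its first tick immediately, so the first poll happens at startup.
async fn run_poll_loop<F, Fut>(poll_interval: Duration, mut poll_once: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    let mut interval = tokio::time::interval(poll_interval);
    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
    loop {
        interval.tick().await; // completes immediately on the first call
        poll_once().await;
    }
}
```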
lib/llm/src/http/service/service_v2.rs (1)
327-334: Background task lifecycle: consider graceful shutdown
The JoinHandle is dropped; the task runs past service shutdown and can't be cancelled. Store the handle (e.g., on HttpService or State) and abort it when the CancellationToken triggers.
If you want, I can draft a follow‑up patch that threads the handle into HttpService and aborts it in run() when observer is cancelled.
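One possible shape, assuming the service's CancellationToken is available where the handle is stored (the function and names are illustrative, not the actual service_v2 code):

```rust
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// Hypothetical: tie the poller's lifetime to service shutdown instead of
// dropping the JoinHandle and letting the task outlive run().
async fn supervise_poller(handle: JoinHandle<()>, cancel: CancellationToken) {
    cancel.cancelled().await; // fires when the service is shutting down
    handle.abort();           // stop the polling loop
    let _ = handle.await;     // the resulting JoinError::is_cancelled() is expected
}
```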
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- lib/llm/src/http/service/metrics.rs (6 hunks)
- lib/llm/src/http/service/service_v2.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/http/service/metrics.rs
🧬 Code graph analysis (2)
lib/llm/src/http/service/service_v2.rs (3)
- lib/bindings/python/rust/lib.rs (1)
  - etcd_client (368-373)
- lib/runtime/src/transports/etcd.rs (1)
  - etcd_client (131-133)
- lib/llm/src/http/service/metrics.rs (1)
  - start_runtime_config_polling_task (556-623)
lib/llm/src/http/service/metrics.rs (5)
- lib/llm/src/local_model.rs (2)
  - runtime_config (179-182)
  - runtime_config (376-378)
- lib/llm/src/http/service/service_v2.rs (3)
  - new (74-86)
  - etcd_client (114-116)
  - manager (106-108)
- lib/llm/src/discovery/model_manager.rs (1)
  - new (52-60)
- lib/bindings/python/rust/llm/local_model.rs (2)
  - total_kv_blocks (56-58)
  - max_num_seqs (61-63)
- lib/runtime/src/transports/etcd.rs (1)
  - etcd_client (131-133)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (2)
lib/llm/src/http/service/metrics.rs (1)
39-47: Good addition: clear, per-model runtime config and health gauges
Fields look consistent and minimal in label cardinality (only "model"). Nice.
lib/llm/src/http/service/service_v2.rs (1)
296-299: Cloning etcd client once is fine
Local clone for passing into the poller while also storing in State looks correct.
kthui
left a comment
I think it would be nice to have some extra test case(s) on the new metrics - to get a better understanding of how the new metrics would look to end users.
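For example, the per-model series might render on the /metrics endpoint roughly like this (metric names, labels, and values below are purely illustrative, not taken from this PR):

```text
# HELP model_health Model health (1 = healthy, 0 = unhealthy, -1 = unknown)
# TYPE model_health gauge
model_health{model="example-model"} 1
model_workers{model="example-model"} 2
model_total_kv_blocks{model="example-model"} 16384
model_max_num_seqs{model="example-model"} 256
model_context_length{model="example-model"} 32768
model_kv_cache_block_size{model="example-model"} 16
model_migration_limit{model="example-model"} 3
```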
ryan-lempka
left a comment
A few minor comments (non-blocking). Otherwise LGTM.
Is polling required here? Are there use cases where the same MDC changes after initial publishing?
Good point -- once the MDC is registered it's always there. However, I do have one gauge called model_workers that counts the number of workers available serving a particular model. The value can be anywhere between 0 and n. I tested this with multiple workers serving the same model. Other than that, if we can assume that workers that start/stop will always have the exact same MDC, then we don't have to keep polling for MDC values. Are these valid assumptions to make?
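For reference, a per-cycle recount along those lines might look like this (a sketch; the gauge field and input shape are assumptions, not the actual code):

```rust
use std::collections::HashMap;
use prometheus::IntGaugeVec;

// Hypothetical: count workers currently serving each model and publish the
// totals on a model_workers-style gauge each polling cycle.
fn update_model_workers(gauge: &IntGaugeVec, worker_models: &[String]) {
    let mut counts: HashMap<&str, i64> = HashMap::new();
    for model in worker_models {
        *counts.entry(model.as_str()).or_insert(0) += 1;
    }
    for (model, count) in counts {
        gauge.with_label_values(&[model]).set(count);
    }
}
```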
Force-pushed from 8ca98e5 to 392cec1
- Add new Prometheus metrics for model runtime configuration
- Include metrics for total_kv_blocks, max_num_seqs, max_num_batched_tokens
- Add MDC metrics for context_length, kv_cache_block_size, migration_limit
- Implement model health status tracking (healthy/unhealthy/unknown)
- Add background polling task to keep metrics current as backends change
- Preserve all historical data - metrics are never removed, only marked unhealthy
- Configurable poll interval via DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…Prometheus conventions
- Update constant name from MODEL_WORKER_COUNT to MODEL_WORKERS_TOTAL
- Change metric name from model_worker_count to model_workers_total
- Update all references in metrics service implementation
- Update documentation in README to reflect new metric name
- Improve metrics polling task lifecycle management in service_v2
- Follows Prometheus naming convention using _total suffix for counters

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…INTERVAL_SECS
- Add validation to ensure poll interval is greater than 0
- Use filter() to reject zero or negative values
- Update documentation to clarify validation requirement
- Prevents potential issues with zero-duration polling intervals

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
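The validation described in this commit could look roughly like the following (env-var name from the PR; the default interval and function name are placeholders, not necessarily the real values):

```rust
use std::time::Duration;

// Hypothetical sketch: parse the poll interval, rejecting values that are
// unparsable or not strictly positive, and fall back to a placeholder default.
fn runtime_config_poll_interval() -> Duration {
    const DEFAULT_SECS: u64 = 10; // placeholder default
    let secs = std::env::var("DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS")
        .ok()
        .and_then(|v| v.trim().parse::<u64>().ok()) // negative values already fail u64 parsing
        .filter(|&v| v > 0) // reject zero-duration intervals
        .unwrap_or(DEFAULT_SECS);
    Duration::from_secs(secs)
}
```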
…kends
SGLang: get max_num_seqs from server_args since SGLang separates config from runtime stats
TensorRT-LLM: populate max_num_seqs and max_num_batched_tokens from config for metrics consistency
- Fix metric name from model_workers_total to model_workers
- Document model name deduplication behavior in README.md
- Add comments explaining gauge vs counter usage for runtime config metrics
- Clarify that some metrics use gauges because they're synchronized from upstream

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…g intervals

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Force-pushed from 392cec1 to 3151cb1
Discussed this on Slack. Will try to get this in before code-freeze, and then work on migrating polling to the Watcher method.
KrishnanPrash
left a comment
LGTM! Left a note to document future work.
Note: Under the assumption that MDCs are static, we can use the watcher pattern used by the frontend [reference] instead of the polling approach currently being utilized.
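A rough sketch of that direction using the upstream etcd-client crate's watch API (the key prefix, the event handling, and the use of the raw crate rather than the project's own etcd wrapper are all assumptions):

```rust
use etcd_client::{Client, WatchOptions};

// Illustrative only: react to MDC registrations via an etcd watch rather than
// polling. The "models/" prefix is a placeholder, not the real key layout.
async fn watch_mdcs(mut client: Client) -> Result<(), etcd_client::Error> {
    let (_watcher, mut stream) = client
        .watch("models/", Some(WatchOptions::new().with_prefix()))
        .await?;
    while let Some(resp) = stream.message().await? {
        for event in resp.events() {
            // Here one would update the runtime-config gauges on PUT events
            // and mark the corresponding model unhealthy on DELETE events.
            let _ = event.kv();
        }
    }
    Ok(())
}
```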
…odic polling (#3107)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>

…odic polling (#3107)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>

…odic polling (#3107)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Kyle H <kylhuang@nvidia.com>
Overview:
Add runtime configuration metrics for model deployment monitoring across all backend frameworks. Fix missing max_num_seqs metrics in SGLang and TensorRT-LLM backends while maintaining existing vLLM functionality.
Details:
Where should the reviewer start?
Related Issues:
DIS-607 DIS-641
/coderabbit profile chill