
Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Sep 18, 2025

Overview:

Add runtime configuration metrics for model deployment monitoring across all backend frameworks. Fix missing max_num_seqs metrics in SGLang and TensorRT-LLM backends while maintaining existing vLLM functionality.

Details:

  • Add periodic polling system for model runtime config metrics (max_num_seqs, total_kv_blocks, max_num_batched_tokens)
  • vLLM Backend: Maintain existing functionality where max_num_seqs is retrieved from engine.vllm_config.scheduler_config
  • TensorRT-LLM Backend: Add missing runtime config metric population by setting max_num_seqs from config.max_batch_size and max_num_batched_tokens from config.max_num_tokens
  • SGLang Backend: Fix max_num_seqs retrieval by reading server_args instead of scheduler_info, since SGLang keeps configuration parameters separate from runtime statistics
  • Rename MODEL_WORKER_COUNT to MODEL_WORKERS_TOTAL following Prometheus conventions
  • Add configuration safety guard against invalid polling intervals (see the interval-guard sketch after this list)
  • Update metrics documentation with new runtime configuration metrics
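
A minimal sketch, with a hypothetical helper name, of how the interval guard can be expressed: values that are missing, unparsable, or zero fall back to the 30-second default described in the walkthrough below.

```rust
use std::time::Duration;

// Hypothetical helper; the actual wiring lives in service_v2.rs.
fn runtime_config_poll_interval() -> Duration {
    std::env::var("DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        // Safety guard: reject zero (negative values already fail the u64 parse).
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(30))
}
```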

Where should the reviewer start?

  • lib/llm/src/http/service/metrics.rs - Core metrics implementation with periodic polling
  • components/backends/trtllm/src/dynamo/trtllm/main.py - TensorRT-LLM metric consistency improvements
  • components/backends/sglang/src/dynamo/sglang/register.py - SGLang architectural fix for config vs runtime separation

Related Issues:

DIS-607 DIS-641

/coderabbit profile chill

@keivenchang keivenchang requested a review from a team as a code owner September 18, 2025 03:52
@keivenchang keivenchang self-assigned this Sep 18, 2025
@github-actions github-actions bot added the feat label Sep 18, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 18, 2025

Walkthrough

Adds per-model runtime-configuration and health metrics to HTTP metrics, registers new Prometheus gauges, and starts a background polling task from service initialization to periodically collect ModelManager and optional etcd (MDC) data, updating metrics and marking model health accordingly. Poll interval is set via DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Per-model runtime metrics & health<br>lib/llm/src/http/service/metrics.rs | Adds IntGaugeVec fields for per-model runtime config and health; public APIs to update runtime config, MDC values, and health; helpers to sync from ModelEntry; an async helper to fetch MDC via etcd; launches a periodic polling task to refresh metrics and health states; registers the new metrics (a registration sketch follows below). |
| Service wiring & polling<br>lib/llm/src/http/service/service_v2.rs | Clones etcd_client in the builder; reads the poll interval from DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS (default 30s); starts background runtime-config metrics polling via Metrics::start_runtime_config_polling_task; does not retain the JoinHandle. |
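
For context, a minimal sketch of what registering one of these per-model gauges looks like with the prometheus crate; the metric name and helper are assumptions, not the exact metrics.rs code.

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// Hypothetical registration helper illustrating the single "model" label.
fn register_total_kv_blocks(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "model_total_kv_blocks", // assumed name; see the naming discussion below
            "Total KV blocks reported by the model's runtime config",
        ),
        &["model"], // a single label keeps series cardinality low
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}
```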

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Env as Env
  participant Service as HttpService::build
  participant Metrics as Metrics
  participant Manager as ModelManager
  participant Etcd as etcd::Client
  participant Prom as Prometheus

  note over Service: Startup
  Env->>Service: Read DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS
  Service->>Metrics: start_runtime_config_polling_task(metrics, manager, etcd?, interval)
  activate Metrics
  Metrics->>Metrics: Spawn background task (loop every interval)
  deactivate Metrics

  rect rgb(235, 245, 255)
  note right of Metrics: Polling cycle
  loop Every interval
    Metrics->>Manager: list/get ModelEntry items
    Manager-->>Metrics: ModelEntry[]
    alt etcd client present
      Metrics->>Etcd: Load MDC (context_length, kv_block_size, migration_limit)
      Etcd-->>Metrics: MDC values
      Metrics->>Metrics: update_mdc_metrics(model_name, ...)
    end
    Metrics->>Metrics: update_runtime_config_metrics(model_name, ...)
    Metrics->>Metrics: set_model_health_status(model_name, healthy=true)
    Metrics->>Prom: Update gauge vectors for models
    Metrics->>Metrics: mark inactive models healthy=false
    Metrics->>Prom: Update health gauges
  end
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

In burrows of bytes where models dwell,
I twitch my nose at gauges that tell—
Blocks, lengths, health, a metric stew,
With polls that hop each interval anew.
Etcd whispers, Manager hums,
Prometheus hears the gentle drums.
Thump-thump—runtime truth becomes!

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Title Check | ✅ Passed | The title accurately highlights the main change—adding runtime configuration metrics with periodic polling—which matches the new metrics and the background polling task added in lib/llm/src/http/service/metrics.rs and service_v2.rs; the "NIM FE (num_request_max)" fragment refers to branch-level work but is not clearly reflected in the provided file summaries, so it is mildly noisy. Overall the title is specific and relevant to the changeset and not generic or misleading. |
| Description Check | ✅ Passed | The pull request description largely follows the repository template: it contains an Overview, Details, Where should the reviewer start, and Related Issues, and it clearly summarizes the intent (new Prometheus metrics, periodic polling, MDC fields) and points reviewers to the primary files. It documents runtime behavior (historical-data preservation, DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS) and backend-specific fixes so reviewers can assess scope and impact. The Related Issues entry lists internal issue IDs but does not use GitHub action keywords or explicit #issue references as suggested by the template. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
lib/llm/src/http/service/metrics.rs (4)

138-153: Doc nit: unknown health is supported but never emitted by the poller

You expose -1 = unknown and a setter, but the polling path never sets unknown. Consider when/if you want unknown surfaced (e.g., before first successful poll, or when the manager is temporarily unavailable).
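
A small sketch, assuming the -1/0/1 encoding described in this nit and hypothetical field names, of surfacing "unknown" before the first successful poll.

```rust
use prometheus::IntGaugeVec;

// Assumed encoding from the comment above.
const HEALTH_UNKNOWN: i64 = -1;
const HEALTH_UNHEALTHY: i64 = 0;
const HEALTH_HEALTHY: i64 = 1;

fn set_model_health(health_gauge: &IntGaugeVec, model: &str, state: i64) {
    health_gauge.with_label_values(&[model]).set(state);
}

// e.g. at startup, before the poller has run once (hypothetical call site):
// set_model_health(&metrics.model_health, "my-model", HEALTH_UNKNOWN);
```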


262-325: Metric suffix constants: align with repo naming registry for consistency

To keep naming consistent across the codebase (per prior standardization), define suffix constants in prometheus_names::frontend_service and use them here instead of string literals.

Would you like a follow‑up patch adding constants like FRONTEND_MODEL_TOTAL_KV_BLOCKS, etc., and wiring them here?
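
A sketch of that follow-up; the constant names are the reviewer's proposal and the module layout is an assumption about prometheus_names::frontend_service.

```rust
// Assumed shape of the suffix constants in prometheus_names::frontend_service.
pub mod frontend_service {
    pub const FRONTEND_MODEL_TOTAL_KV_BLOCKS: &str = "model_total_kv_blocks";
    pub const FRONTEND_MODEL_MAX_NUM_SEQS: &str = "model_max_num_seqs";
    pub const FRONTEND_MODEL_MAX_NUM_BATCHED_TOKENS: &str = "model_max_num_batched_tokens";
}

// Call sites in metrics.rs would then reference the constants instead of string
// literals, e.g. Opts::new(frontend_service::FRONTEND_MODEL_TOTAL_KV_BLOCKS, "...").
```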


470-491: Normalize model label for MDC metrics

Keep label casing consistent with other metrics to avoid duplicate series.

Apply:

```diff
 pub fn update_mdc_metrics(
     &self,
     model_name: &str,
     context_length: u32,
     kv_cache_block_size: u32,
     migration_limit: u32,
 ) {
-    self.model_context_length
-        .with_label_values(&[model_name])
+    let model = model_name.to_lowercase();
+    self.model_context_length
+        .with_label_values(&[&model])
         .set(context_length as i64);

-    self.model_kv_cache_block_size
-        .with_label_values(&[model_name])
+    self.model_kv_cache_block_size
+        .with_label_values(&[&model])
         .set(kv_cache_block_size as i64);

-    self.model_migration_limit
-        .with_label_values(&[model_name])
+    self.model_migration_limit
+        .with_label_values(&[&model])
         .set(migration_limit as i64);
 }
```

553-623: Poller robustness: missed ticks + initial run semantics

  • Set MissedTickBehavior::Delay to avoid burst catch‑up if the loop runs long.
  • Confirm whether the first tick is immediate in your Tokio version; if not, perform an initial poll before the loop.

Apply:

```diff
         tokio::spawn(async move {
-            let mut interval = tokio::time::interval(poll_interval);
+            let mut interval = tokio::time::interval(poll_interval);
+            interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
```

Optional: run one poll before the loop to avoid waiting a full interval on startup.
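
A minimal sketch of the suggested tick behavior; poll_models is a hypothetical stand-in for the actual polling body. Note that current Tokio completes the first interval tick immediately, which already covers the initial-poll concern.

```rust
use std::time::Duration;
use tokio::time::MissedTickBehavior;

async fn run_poller(poll_interval: Duration) {
    let mut interval = tokio::time::interval(poll_interval);
    // Avoid burst catch-up if one polling cycle runs longer than the interval.
    interval.set_missed_tick_behavior(MissedTickBehavior::Delay);
    loop {
        // In Tokio 1.x the first tick completes immediately, so the first poll
        // runs at startup rather than after one full interval.
        interval.tick().await;
        poll_models().await;
    }
}

async fn poll_models() {
    // Refresh per-model gauges from the ModelManager / etcd here.
}
```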

lib/llm/src/http/service/service_v2.rs (1)

327-334: Background task lifecycle: consider graceful shutdown

The JoinHandle is dropped; the task runs past service shutdown and can’t be cancelled. Store the handle (e.g., on HttpService or State) and abort it when CancellationToken triggers.

If you want, I can draft a follow‑up patch that threads the handle into HttpService and aborts it in run() when observer is cancelled.
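
A minimal sketch of that lifecycle fix, with assumed field and variable names: keep the JoinHandle and abort the poller when the CancellationToken fires.

```rust
use tokio_util::sync::CancellationToken;

// Hypothetical holder for background tasks spawned by the HTTP service.
struct BackgroundTasks {
    runtime_config_poller: tokio::task::JoinHandle<()>,
}

async fn shutdown_on_cancel(tasks: BackgroundTasks, cancel: CancellationToken) {
    cancel.cancelled().await;            // service shutdown requested
    tasks.runtime_config_poller.abort(); // stop the metrics poller with the service
}
```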

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd3326 and 83cb3fe.

📒 Files selected for processing (2)
  • lib/llm/src/http/service/metrics.rs (6 hunks)
  • lib/llm/src/http/service/service_v2.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/http/service/metrics.rs
🧬 Code graph analysis (2)
lib/llm/src/http/service/service_v2.rs (3)
lib/bindings/python/rust/lib.rs (1)
  • etcd_client (368-373)
lib/runtime/src/transports/etcd.rs (1)
  • etcd_client (131-133)
lib/llm/src/http/service/metrics.rs (1)
  • start_runtime_config_polling_task (556-623)
lib/llm/src/http/service/metrics.rs (5)
lib/llm/src/local_model.rs (2)
  • runtime_config (179-182)
  • runtime_config (376-378)
lib/llm/src/http/service/service_v2.rs (3)
  • new (74-86)
  • etcd_client (114-116)
  • manager (106-108)
lib/llm/src/discovery/model_manager.rs (1)
  • new (52-60)
lib/bindings/python/rust/llm/local_model.rs (2)
  • total_kv_blocks (56-58)
  • max_num_seqs (61-63)
lib/runtime/src/transports/etcd.rs (1)
  • etcd_client (131-133)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Build and Test - sglang
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (2)
lib/llm/src/http/service/metrics.rs (1)

39-47: Good addition: clear, per‑model runtime config and health gauges

Fields look consistent and minimal in label cardinality (only "model"). Nice.

lib/llm/src/http/service/service_v2.rs (1)

296-299: Cloning etcd client once is fine

Local clone for passing into the poller while also storing in State looks correct.

@keivenchang keivenchang requested review from a team as code owners September 20, 2025 00:02
@keivenchang keivenchang changed the title feat: add runtime config metrics with periodic polling feat: add NIM args + runtime config metrics with periodic polling Sep 20, 2025
@keivenchang keivenchang changed the title feat: add NIM args + runtime config metrics with periodic polling feat: add NIM FE args + runtime config metrics with periodic polling Sep 20, 2025
Contributor

@kthui kthui left a comment


I think it would be nice to have some extra test case(s) on the new metrics - to get a better understanding of how the new metrics would look to end users.
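
For illustration, a rough sketch of such a test using the prometheus crate's text encoder to render what end users would actually scrape; the function and the metric-name substrings are assumptions based on the PR description, not existing test code.

```rust
use prometheus::{Encoder, Registry, TextEncoder};

fn assert_runtime_config_metrics_exposed(registry: &Registry) {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&registry.gather(), &mut buf)
        .expect("failed to encode metrics");
    let body = String::from_utf8(buf).expect("metrics output was not UTF-8");

    // Assumed metric names; adjust to whatever metrics.rs actually registers.
    assert!(body.contains("total_kv_blocks"));
    assert!(body.contains("max_num_seqs"));
    assert!(body.contains("max_num_batched_tokens"));
}
```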

@keivenchang keivenchang changed the title feat: add NIM FE args + runtime config metrics with periodic polling feat: add NIM FE (num_request_max) + runtime config metrics with periodic polling Sep 22, 2025
Contributor

@ryan-lempka ryan-lempka left a comment


A few minor comments (non-blocking). Otherwise LGTM.

@KrishnanPrash
Contributor

KrishnanPrash commented Sep 23, 2025

Add periodic polling system for model runtime config metrics (max_num_seqs, total_kv_blocks, max_num_batched_tokens)

Is polling required here? Are there use-cases where the same MDC changes after initial publishing?

@keivenchang
Contributor Author

Add periodic polling system for model runtime config metrics (max_num_seqs, total_kv_blocks, max_num_batched_tokens)

Is polling required here? Are there use-cases where the same MDC changes after initial publishing?

Good point-- once MDC is registered it's always there. However, I do have one gauge called model_workers that counts the number of workers available serving a particular model. The value can be anywhere from 0 to n. I tested this with multiple workers serving the same model. Other than that, if we can assume that workers that start/stop will always have the exact same MDC, then we don't have to keep polling for MDC values.

Are these valid assumptions to make?

- Add new Prometheus metrics for model runtime configuration
- Include metrics for total_kv_blocks, max_num_seqs, max_num_batched_tokens
- Add MDC metrics for context_length, kv_cache_block_size, migration_limit
- Implement model health status tracking (healthy/unhealthy/unknown)
- Add background polling task to keep metrics current as backends change
- Preserve all historical data - metrics are never removed, only marked unhealthy
- Configurable poll interval via DYN_RUNTIME_CONFIG_METRICS_POLL_INTERVAL_SECS

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…Prometheus conventions

- Update constant name from MODEL_WORKER_COUNT to MODEL_WORKERS_TOTAL
- Change metric name from model_worker_count to model_workers_total
- Update all references in metrics service implementation
- Update documentation in README to reflect new metric name
- Improve metrics polling task lifecycle management in service_v2
- Follows Prometheus naming convention using _total suffix for counters

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…INTERVAL_SECS

- Add validation to ensure poll interval is greater than 0
- Use filter() to reject zero or negative values
- Update documentation to clarify validation requirement
- Prevents potential issues with zero-duration polling intervals

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…kends

SGLang: get max_num_seqs from server_args since SGLang separates config from runtime stats
TensorRT-LLM: populate max_num_seqs and max_num_batched_tokens from config for metrics consistency
- Fix metric name from model_workers_total to model_workers
- Document model name deduplication behavior in README.md
- Add comments explaining gauge vs counter usage for runtime config metrics
- Clarify that some metrics use gauges because they're synchronized from upstream

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…g intervals

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang
Contributor Author

Add periodic polling system for model runtime config metrics (max_num_seqs, total_kv_blocks, max_num_batched_tokens)

Is polling required here? Are there use-cases where the same MDC changes after initial publishing?

Discussed this on Slack. Will try to get this in before code-freeze, and then work on migrating polling to the Watcher method.

Contributor

@KrishnanPrash KrishnanPrash left a comment


LGTM! Left a note to document future work.

Note: Under the assumption that MDCs are static, we could use the watcher pattern already used by the frontend [reference] instead of the polling approach currently in use.
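
For future reference, a rough sketch of that watcher approach using the etcd_client crate directly; the key prefix and the update hook are assumptions, and the project's own etcd wrapper may expose a different API.

```rust
use etcd_client::{Client, EventType, WatchOptions};

async fn watch_mdc_updates(mut client: Client) -> Result<(), etcd_client::Error> {
    // Hypothetical prefix under which MDC entries are stored.
    let (_watcher, mut stream) = client
        .watch("mdc/", Some(WatchOptions::new().with_prefix()))
        .await?;

    while let Some(resp) = stream.message().await? {
        for event in resp.events() {
            if event.event_type() == EventType::Put {
                // Update the per-model gauges from the new MDC value here,
                // instead of re-reading everything on a fixed interval.
            }
        }
    }
    Ok(())
}
```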

@keivenchang keivenchang merged commit 116b9b4 into main Sep 24, 2025
17 of 19 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS_607__NIM_num_request_max_ branch September 24, 2025 17:19
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
…odic polling (#3107)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
…odic polling (#3107)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
kylehh pushed a commit that referenced this pull request Sep 25, 2025
…odic polling (#3107)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Kyle H <kylhuang@nvidia.com>
