feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint #2008

keivenchang · 2025-07-18T06:20:30Z

Overview:

Added hierarchical Metrics trait for DistributedRuntime, Namespace, Component, and Endpoint.

Details:

Introduced a hierarchical metrics registry framework for Prometheus metrics, enabling auto labels in Prometheus for namespaces, components, and endpoints.
Added a new example project, "system_metrics", demonstrating custom stats endpoints and metrics collection.
Implemented MetricsRegistry struct, and provided trait implementations for drt, Namespace, Component, and Endpoint.

Where should the reviewer start?

lib/runtime/src/metrics.rs
lib/runtime/examples/system_metrics/src/bin/system_server.rs

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-216

Summary by CodeRabbit

New Features
- Introduced a hierarchical metrics registry framework for Prometheus integration, enabling unified metric management across distributed runtime, namespaces, components, and endpoints.
- Added a new example project, "system_stats_endpoint," demonstrating custom stats endpoints and metrics collection.
- Implemented new public structs and traits for metrics, including MetricsRegistry, and provided trait implementations for core entities.
Improvements
- Refactored HTTP server metrics handling to use the new metrics registry, with improved uptime tracking and Prometheus output.
- Enhanced configuration and modularity for metrics and profiling in the distributed runtime.
Documentation
- Added a README for the new "system_metrics" example.
Tests
- Added comprehensive tests for hierarchical metric registration, uptime initialization, and Prometheus output formatting.
Chores
- Updated workspace configuration to include new examples and dependencies.

…mespace now has its own metrics registry Others: Added NamespaceMetrics struct for namespace-level metrics management Refactored HTTP server - removed RuntimeMetrics, simplified to HttpServerState with direct registry access Updated metric naming - changed from single underscore (test_metric) to double underscore (test__metric) for Prometheus compatibility Enhanced hierarchical registries - improved parent-child registry relationships with create_child_registry() method Updated system stats example - aligned component/endpoint names and registry usage patterns

copy-pr-bot · 2025-07-18T06:20:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-07-18T06:24:41Z

Walkthrough

This update introduces a hierarchical metrics registry framework to the runtime, enabling Prometheus metrics collection and registration at multiple levels (runtime, namespace, component, endpoint). New example projects and a system stats endpoint are added, along with trait implementations to support hierarchical metric naming and registration. HTTP server metrics and uptime logic are refactored to use the new registry.

Changes

File(s)	Change Summary
Cargo.toml, lib/runtime/examples/Cargo.toml	Updated workspace members to include new example crates; added Prometheus as a workspace dependency; added comments about Prometheus and opentelemetry dependencies.
lib/runtime/examples/system_stats_endpoint/Cargo.toml, README.md, src/*	Added new example project "system_stats_endpoint" with metadata, dependencies, README, library, server, and client binaries. The server demonstrates metrics collection and custom stats handling; the client interacts with the endpoint and prints stats.
lib/runtime/examples/system_stats_endpoint/src/lib.rs	Introduced `DEFAULT_NAMESPACE` constant and `MyStats` struct with serde traits for demonstration purposes.
lib/runtime/src/profiling.rs	New module implementing a hierarchical `MetricsRegistry` trait for Prometheus metrics, with macro-assisted registration and full test coverage.
lib/runtime/src/component.rs, component/namespace.rs, distributed.rs	Implemented `MetricsRegistry` for `Component`, `Endpoint`, `Namespace`, and `DistributedRuntime` to support hierarchical metrics. Adjusted trait bounds and method calls for distributed runtime access.
lib/runtime/src/http_server.rs	Refactored HTTP server to use the new metrics registry, with explicit uptime initialization, Prometheus gauge integration, and updated endpoints. Added new tests for metrics and uptime handling.
lib/runtime/src/lib.rs	Added public `profiling` module; updated `DistributedRuntime` struct to remove `start_time` and add `prometheus_registries_by_prefix` map. Added placeholder fields for health/liveness.
lib/runtime/src/traits.rs	Implemented `DistributedRuntimeProvider` for `DistributedRuntime` to satisfy trait bounds for metrics registry.
lib/runtime/src/component/endpoint.rs	Removed commented-out health/liveness assignments; added explanatory TODO.
components/metrics/src/main.rs	Added new imports and comments clarifying event subscription is for illustration only.

Sequence Diagram(s)

sequenceDiagram
    participant Client as system_stats_client
    participant Runtime as DistributedRuntime
    participant Namespace
    participant Component
    participant Endpoint
    participant Server as system_stats_server

    Client->>Runtime: Initialize from config
    Client->>Namespace: Access namespace "system"
    Client->>Component: Access component
    Client->>Endpoint: Get client for endpoint
    Client->>Endpoint: Wait for endpoint instances
    Client->>Endpoint: Send "hello world" via PushRouter
    Endpoint-->>Client: Stream responses (Annotated<String>)
    Client->>Component: Scrape stats periodically
    Component-->>Client: Return MyStats
    Client->>Runtime: Shutdown

    Server->>Runtime: Initialize from config
    Server->>Namespace: Create namespace
    Server->>Component: Create component
    Server->>Endpoint: Create endpoint, attach ingress handler
    Endpoint->>Server: Handle requests, collect metrics
    Endpoint->>Server: Serve stats via handler

Poem

In the meadow of metrics, a registry grew,
With prefixes, counters, and histograms too.
Endpoints and namespaces now track with delight,
As Prometheus gathers their stats day and night.
New examples are hopping, the system feels bright—
This rabbit approves, and metrics are right!
🐇📈

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (12)

lib/runtime/examples/system_stats_endpoint/README.md (1)

1-3: README is only a stub – please add minimal usage & build instructions

A single-line heading is not sufficient for an example crate that ships in the repo. At minimum, outline:

• how to run the server & client binaries
• the metrics/Prometheus endpoints they expose
• any required env-vars (e.g. DYN_NAMESPACE)

Completing this now saves future devs from hunting through source code.
components/metrics/src/main.rs (2)
34-36: PushRouter and Annotated imports are unused – will trigger dead_code warnings

Unless follow-up code uses them, remove these imports (or #[allow(unused_imports)] behind a feature flag) to keep cargo check clean.
-use dynamo_runtime::{
-    error, logging,
-    pipeline::PushRouter,
-    protocols::annotated::Annotated,
+use dynamo_runtime::{
+    error, logging,
178-179: Clarify comment vs. implementation

The added comment states KV hit-rate events are “not used in production”, yet the subscription is compiled unconditionally.
Either guard the block behind a feature (e.g. #[cfg(feature = "metrics-dev")]) or drop the subscription to avoid unnecessary NATS traffic in prod.
lib/runtime/examples/Cargo.toml (1)

36-37: Duplicate prometheus version declarations across workspaces

prometheus is added here and also at the root Cargo.toml (line 69). Keep a single declaration to avoid version drift; using only the root-level [workspace.dependencies] is sufficient.
Cargo.toml (2)
23-27: Development-only example crates are committed as active workspace members

The inline note says “Do not add this for production”, yet the paths are uncommented.
If crates.io publishing is planned, they’ll be bundled and increase compile time. Consider:
[workspace]
exclude = ["lib/runtime/examples/*"]
# or
members = ["lib/runtime", "..."] # omit examples
Alternatively wrap with cfg(workspace_example) and enable only in CI.

68-70: Prometheus dependency: follow-up TODO should have an issue link

Moving Prometheus into lib/runtime will be a breaking change across multiple crates. Create a tracking issue and reference it here to avoid this TODO becoming stale.
lib/runtime/src/distributed.rs (1)
219-231: Consider using tracing instead of println! for debug output.

Debug methods in library code should use the logging framework rather than printing directly to stdout.
-/// Debug function to print out all the keys of prometheus_registries_by_prefix
-pub fn debug_prometheus_registry_keys(&self) {
+/// Debug function to log all the keys of prometheus_registries_by_prefix
+pub fn debug_prometheus_registry_keys(&self) {
     let registries = self.prometheus_registries_by_prefix.lock().unwrap();
-    println!("=== Prometheus Registry Keys ===");
+    tracing::debug!("=== Prometheus Registry Keys ===");
     if registries.is_empty() {
-        println!("No registries found");
+        tracing::debug!("No registries found");
     } else {
         for (key, _registry) in registries.iter() {
-            println!("Registry key: '{}'", key);
+            tracing::debug!("Registry key: '{}'", key);
         }
     }
-    println!("=== End Prometheus Registry Keys ===");
+    tracing::debug!("=== End Prometheus Registry Keys ===");
 }
Alternatively, consider returning the keys as a Vec<String> for programmatic access:
pub fn get_prometheus_registry_keys(&self) -> Vec<String> {
    let registries = self.prometheus_registries_by_prefix.lock().unwrap();
    registries.keys().cloned().collect()
}
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)
126-133: Consider using tracing instead of println! in the stats handler.

For consistency with logging practices, the stats handler should use the tracing framework.
         .stats_handler(|stats| {
-            println!("stats: {:?}", stats);
+            tracing::debug!("stats: {:?}", stats);
             let stats = MyStats { val: 10 };
             serde_json::to_value(stats).unwrap()
         })
lib/runtime/src/profiling.rs (2)
192-204: Consider reducing mutex lock scope for better performance.

The current implementation holds the mutex lock while cloning the registry. Consider releasing the lock earlier to reduce contention.
 fn prometheus_metrics_fmt(&self) -> anyhow::Result<String> {
-    let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap();
-    let prometheus_registry = registry
-        .entry(self.prefix())
-        .or_insert(prometheus::Registry::new())
-        .clone();
+    let prometheus_registry = {
+        let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap();
+        registry
+            .entry(self.prefix())
+            .or_insert(prometheus::Registry::new())
+            .clone()
+    }; // Lock is released here
     let metric_families = prometheus_registry.gather();
     let encoder = prometheus::TextEncoder::new();
     let mut buffer = Vec::new();
     encoder.encode(&metric_families, &mut buffer)?;
     Ok(String::from_utf8(buffer)?)
 }
269-288: Consider reducing println! usage in tests.

While println! statements can be helpful during test development, they can clutter test output in CI. Consider using them only when tests fail or behind a debug flag.
-// Print only the histogram section for inspection
-println!("Histogram section:");
-for line in prometheus_output.lines() {
-    if line.contains("my_histogram")
-        || line.contains("# HELP testprefix_my_histogram")
-        || line.contains("# TYPE testprefix_my_histogram")
-    {
-        println!("{}", line);
-    }
-}
+// Optionally print histogram section for debugging
+if std::env::var("DEBUG_TESTS").is_ok() {
+    println!("Histogram section:");
+    for line in prometheus_output.lines() {
+        if line.contains("my_histogram")
+            || line.contains("# HELP testprefix_my_histogram")
+            || line.contains("# TYPE testprefix_my_histogram")
+        {
+            println!("{}", line);
+        }
+    }
+}
lib/runtime/src/http_server.rs (2)
107-107: Remove extra forward slash in comment

The comment has an extra forward slash that should be removed.
-// // Initialize the start time
+// Initialize the start time
178-178: Remove commented code referencing old functionality

This commented line references DB_errors which appears to be from the old implementation and should be removed.
-    //let health = state.DB_errors <= 0;

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6f392c and 95d2296.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
lib/runtime/examples/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (16)

Cargo.toml (2 hunks)
components/metrics/src/main.rs (2 hunks)
lib/runtime/examples/Cargo.toml (2 hunks)
lib/runtime/examples/system_stats_endpoint/Cargo.toml (1 hunks)
lib/runtime/examples/system_stats_endpoint/README.md (1 hunks)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (1 hunks)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1 hunks)
lib/runtime/examples/system_stats_endpoint/src/lib.rs (1 hunks)
lib/runtime/src/component.rs (3 hunks)
lib/runtime/src/component/endpoint.rs (1 hunks)
lib/runtime/src/component/namespace.rs (4 hunks)
lib/runtime/src/distributed.rs (5 hunks)
lib/runtime/src/http_server.rs (3 hunks)
lib/runtime/src/lib.rs (2 hunks)
lib/runtime/src/profiling.rs (1 hunks)
lib/runtime/src/traits.rs (1 hunks)

🧰 Additional context used

🧠 Learnings (10)

lib/runtime/src/component/endpoint.rs (3)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/runtime/src/traits.rs (2)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

components/metrics/src/main.rs (9)

Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

lib/runtime/examples/system_stats_endpoint/src/lib.rs (3)

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

lib/runtime/src/component/namespace.rs (5)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.

lib/runtime/src/distributed.rs (4)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/runtime/src/lib.rs (3)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

lib/runtime/src/http_server.rs (4)

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

🧬 Code Graph Analysis (3)

lib/runtime/src/component.rs (2)

lib/runtime/src/component/namespace.rs (2)

basename (86-88)

parent_hierarchy (90-92)

lib/runtime/src/profiling.rs (9)

basename (105-105)

basename (226-228)

basename (310-312)

basename (330-332)

parent_hierarchy (121-121)

parent_hierarchy (229-231)

parent_hierarchy (313-315)

parent_hierarchy (333-335)

prefix (109-118)

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (6)

lib/runtime/src/distributed.rs (4)

new (45-129)

from_settings (131-134)

from_settings (242-248)

runtime (142-144)

lib/runtime/src/service.rs (1)

new (37-39)

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (1)

app (30-54)

lib/runtime/src/protocols/annotated.rs (1)

from_data (56-63)

lib/runtime/src/profiling.rs (3)

drt (220-222)

drt (304-306)

drt (324-326)

lib/runtime/src/pipeline/network.rs (1)

for_engine (308-319)

lib/runtime/src/http_server.rs (8)

lib/runtime/src/traits.rs (4)

drt (26-26)

drt (41-43)

rt (21-21)

rt (30-32)

lib/runtime/src/profiling.rs (11)

drt (220-222)

drt (304-306)

drt (324-326)

basename (105-105)

basename (226-228)

basename (310-312)

basename (330-332)

parent_hierarchy (121-121)

parent_hierarchy (229-231)

parent_hierarchy (313-315)

parent_hierarchy (333-335)

lib/runtime/src/component/namespace.rs (2)

basename (86-88)

parent_hierarchy (90-92)

lib/runtime/src/distributed.rs (4)

basename (35-37)

parent_hierarchy (39-41)

new (45-129)

from_settings_without_discovery (137-140)

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)

new (40-57)

new (76-78)

lib/runtime/src/service.rs (1)

new (37-39)

lib/runtime/src/runtime.rs (1)

handle (123-128)

lib/runtime/src/pipeline/context.rs (1)

registry (251-253)

🔇 Additional comments (27)

lib/runtime/examples/Cargo.toml (1)

18-21: Example crates are part of the default workspace build – contradicting the guidance comment

Including system_stats_endpoint in members means it will be built/published with every workspace action, which the comment in the root Cargo.toml suggests should not happen for production.
Confirm this is intentional; otherwise gate examples behind an env flag or move them to [workspace.exclude].

lib/runtime/examples/system_stats_endpoint/Cargo.toml (1)

1-34: LGTM! Well-structured Cargo.toml for the system stats endpoint example.

The manifest follows standard Rust conventions with proper workspace inheritance and appropriate dependencies for a system stats endpoint demonstration.

lib/runtime/src/traits.rs (1)

35-44: LGTM! Clear trait implementation with excellent documentation.

The implementation correctly provides the distributed runtime instance reference, and the detailed comments explain the necessity for satisfying the MetricsRegistry trait bounds. This follows the expected pattern for provider traits.

lib/runtime/examples/system_stats_endpoint/src/lib.rs (1)

16-24: LGTM! Simple and appropriate example structure.

The constant and struct are well-defined for demonstration purposes. The serde derives are correctly applied for serialization support.

lib/runtime/src/lib.rs (2)

43-43: LGTM! Profiling module addition supports the new metrics framework.

The addition of the profiling module is appropriate for the hierarchical metrics registry functionality being introduced.

103-108: LGTM! Well-structured metrics registry storage with proper thread safety.

The prometheus_registries_by_prefix field uses appropriate thread-safe types for concurrent access. The commented TODO fields are properly documented for future health and liveness check implementations.

lib/runtime/src/component/namespace.rs (4)

22-24: LGTM! Appropriate trait imports for the new metrics framework.

The imports are correctly added to support the new trait-based approach for metrics registry and distributed runtime access.

47-52: LGTM! Consistent trait-based access pattern.

The change to use DistributedRuntimeProvider::drt(self) follows the new trait-based approach consistently with other similar changes in the codebase.

62-67: LGTM! Consistent trait-based access pattern.

The change to use DistributedRuntimeProvider::drt(self) maintains consistency with the new trait-based approach for accessing the distributed runtime.

85-93: LGTM! Appropriate MetricsRegistry implementation for hierarchical metrics.

The implementation correctly provides the namespace name as the basename and establishes the hierarchical relationship by including the distributed runtime prefix in the parent hierarchy. This follows the expected pattern for the metrics registry framework.

lib/runtime/src/component.rs (3)

32-34: Import looks good.

The addition of profiling::MetricsRegistry to the imports is necessary for the trait implementations.

173-185: MetricsRegistry implementation for Component is correct.

The implementation properly defines the component's place in the metrics hierarchy by returning its name as basename and constructing the parent hierarchy from the namespace's hierarchy plus the namespace's prefix.

319-331: MetricsRegistry implementation for Endpoint follows the correct pattern.

The implementation maintains the hierarchical structure by returning the endpoint's name as basename and building the parent hierarchy from the component's hierarchy plus the component's prefix.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)

1-28: Example client structure looks good.

The imports and overall structure are appropriate for a distributed runtime client example.

30-54: Well-structured example demonstrating distributed runtime usage.

The client properly initializes the runtime, waits for endpoint instances, sends a request, handles streamed responses, scrapes statistics, and cleanly shuts down. This provides a clear example of the distributed runtime client APIs.

lib/runtime/src/distributed.rs (3)

34-42: Correct root-level MetricsRegistry implementation.

The implementation properly establishes DistributedRuntime as the root of the metrics hierarchy with an empty basename and no parent hierarchy.

79-88: Excellent handling of Rust ownership rules.

The code correctly extracts the cancel_token from runtime before moving it into the struct, preventing ownership issues. The comment clearly explains why this is necessary.

97-101: Thread-safe registry storage is well-designed.

The prometheus_registries_by_prefix field properly uses Arc<std::sync::Mutex<HashMap>> to enable thread-safe access to Prometheus registries indexed by their hierarchical prefixes.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)

33-58: Well-designed metrics struct demonstrating the MetricsRegistry trait usage.

The implementation properly uses Arc for shared ownership, accepts a generic MetricsRegistry parameter for flexibility, and follows Prometheus naming conventions for metrics.

71-106: Good example of AsyncEngine with metrics instrumentation.

The implementation properly demonstrates request counting, duration measurement, and stream-based response handling. The metrics are recorded at appropriate points in the request lifecycle.

lib/runtime/src/profiling.rs (2)

51-98: Excellent macro design with clear documentation.

The macro elegantly solves the constraints of maintaining dyn compatibility while handling the Box/clone requirements. The extensive comments clearly explain the technical reasoning, making the code maintainable.

103-122: Well-designed trait with proper hierarchical support.

The trait correctly defines the contract for hierarchical metrics with appropriate methods for basename, parent hierarchy, and prefix calculation. The default prefix() implementation properly handles edge cases.

lib/runtime/src/http_server.rs (5)

16-21: Imports look good

The new imports for MetricsRegistry, DistributedRuntimeProvider, OnceLock, and Instant are appropriate for the refactored metrics implementation.

26-44: Well-structured metrics registry implementation

The HttpMetricsRegistry correctly implements the required traits and follows the hierarchical naming pattern established in the codebase.

200-208: Good test utility function

The make_test_drt() helper properly creates a test instance of DistributedRuntime for use in tests.

284-335: Comprehensive test coverage for metrics functionality

The tests thoroughly verify:

Start time initialization and restrictions

Uptime calculation

Metrics formatting with correct namespace prefixes

Gauge value updates

Great job on the test coverage!

354-363: Test confirms panic behavior

This test verifies the current panic behavior when uptime() is called without initialization. If you implement my suggestion to return a Result instead, this test would need to be updated to verify the error case.

lib/runtime/src/component/endpoint.rs

lib/runtime/src/http_server.rs

lib/runtime/src/profiling.rs

… hierarchical labels

lib/runtime/src/profiling.rs

keivenchang · 2025-07-21T16:55:18Z

Addressed the concern-- I removed the hierarchical prefix, and added auto-labeling. Currently working addressing the nitpick items, but do please start the review while I fix them. Thanks.

Cargo.toml

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs

…ver/client

ryanolson

👍

keivenchang and others added 10 commits July 15, 2025 12:42

feat: adding a common Metrics API for metrics endpoints DIS-216

f286e56

added hierarchical registry

3390adb

clean up the code

e5063a7

more cleanup

bd25ea1

more fixes

0d1e99d

paired programming, Ryan w/Keiven for the Prometheus Registry

bcce94f

version that compiles and passes tests

db63c34

completely working MetricsRegistry in Runtime and Components

d3b4a96

fulling working MetricsRegistry

95d2296

pull-request-size bot added the size/XXL label Jul 18, 2025

github-actions bot added the feat label Jul 18, 2025

coderabbitai bot reviewed Jul 18, 2025

View reviewed changes

lib/runtime/src/component/endpoint.rs Outdated Show resolved Hide resolved

lib/runtime/src/http_server.rs Outdated Show resolved Hide resolved

lib/runtime/src/http_server.rs Outdated Show resolved Hide resolved

keivenchang changed the title ~~feat: add a Metrics trait for DistributedRuntime, Namespace, Components, and Endpoint~~ feat: add a hierarchical MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint Jul 18, 2025

cargo fmt

c102ddc

keivenchang changed the title ~~feat: add a hierarchical MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint~~ feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint Jul 18, 2025

rmccorm4 reviewed Jul 18, 2025

View reviewed changes

lib/runtime/src/http_server.rs Outdated Show resolved Hide resolved

rmccorm4 reviewed Jul 18, 2025

View reviewed changes

lib/runtime/src/profiling.rs Outdated Show resolved Hide resolved

keivenchang added 3 commits July 18, 2025 18:47

Use basename instead of prefix

10c0d87

updated such that there is no prefix in the metric name, and now with…

05f7087

… hierarchical labels

modified profiling.rs to hide private functions

ea320f0

keivenchang requested a review from a team as a code owner July 21, 2025 04:22

nnshah1 reviewed Jul 21, 2025

View reviewed changes

lib/runtime/src/profiling.rs Show resolved Hide resolved

rmccorm4 reviewed Jul 21, 2025

View reviewed changes

Cargo.toml Outdated Show resolved Hide resolved

nnshah1 reviewed Jul 21, 2025

View reviewed changes

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs Show resolved Hide resolved

keivenchang added 2 commits July 21, 2025 13:54

renaming from profiling.rs to metrics.rs, and the example metrics ser…

15942ad

…ver/client

add README in the example and fix nitpicks

6a12946

keivenchang added 9 commits July 21, 2025 14:36

very minor spacing reverts, and adding comment

118f850

cargo fmt

ef8af0d

fix vec macros

b2c5235

fix unused var error

a286abe

add ignore NATS during testing

6cd2475

Updated the create_test_drt() methods

76b575c

cargo fmt

bc8283f

namespace to be unique in tests

dcba192

Set etcd/NATS tests to be integration

2926571

ryanolson approved these changes Jul 22, 2025

View reviewed changes

keivenchang merged commit e5a8628 into main Jul 22, 2025
10 checks passed

keivenchang deleted the keivenchang/Observability_Metric-API__DIS-216 branch July 22, 2025 15:39

This was referenced Jul 24, 2025

feat: Base metrics: add generic ingress handler metrics #2090

Merged

fix: Integration tests fixes #2161

Merged

coderabbitai bot mentioned this pull request Aug 6, 2025

feat: graceful drop of recorder #2326

Closed

This was referenced Aug 18, 2025

fix: use tokio spawn / interval.tick(), make nats metric names clearer, fix tests sharing environment variables (temp_env) #2506

Merged

feat: enable dynamo metrics on KVBM #2626

Merged

feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint #2008

feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint #2008

Uh oh!

Conversation

keivenchang commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Jul 18, 2025

Uh oh!

coderabbitai bot commented Jul 18, 2025

Walkthrough

Changes

Sequence Diagram(s)

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

keivenchang commented Jul 21, 2025

Uh oh!

Uh oh!

Uh oh!

ryanolson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

keivenchang commented Jul 18, 2025 •

edited

Loading