Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Jul 18, 2025

Overview:

Added hierarchical Metrics trait for DistributedRuntime, Namespace, Component, and Endpoint.

Details:

  • Introduced a hierarchical metrics registry framework for Prometheus metrics, enabling auto labels in Prometheus for namespaces, components, and endpoints.
  • Added a new example project, "system_metrics", demonstrating custom stats endpoints and metrics collection.
  • Implemented MetricsRegistry struct, and provided trait implementations for drt, Namespace, Component, and Endpoint.

Where should the reviewer start?

lib/runtime/src/metrics.rs
lib/runtime/examples/system_metrics/src/bin/system_server.rs

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-216

Summary by CodeRabbit

  • New Features

    • Introduced a hierarchical metrics registry framework for Prometheus integration, enabling unified metric management across distributed runtime, namespaces, components, and endpoints.
    • Added a new example project, "system_stats_endpoint," demonstrating custom stats endpoints and metrics collection.
    • Implemented new public structs and traits for metrics, including MetricsRegistry, and provided trait implementations for core entities.
  • Improvements

    • Refactored HTTP server metrics handling to use the new metrics registry, with improved uptime tracking and Prometheus output.
    • Enhanced configuration and modularity for metrics and profiling in the distributed runtime.
  • Documentation

    • Added a README for the new "system_metrics" example.
  • Tests

    • Added comprehensive tests for hierarchical metric registration, uptime initialization, and Prometheus output formatting.
  • Chores

    • Updated workspace configuration to include new examples and dependencies.

keivenchang and others added 10 commits July 15, 2025 12:42
…mespace now has its own metrics registry

Others:
Added NamespaceMetrics struct for namespace-level metrics management
Refactored HTTP server - removed RuntimeMetrics, simplified to HttpServerState with direct registry access
Updated metric naming - changed from single underscore (test_metric) to double underscore (test__metric) for Prometheus compatibility
Enhanced hierarchical registries - improved parent-child registry relationships with create_child_registry() method
Updated system stats example - aligned component/endpoint names and registry usage patterns
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jul 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 18, 2025

Walkthrough

This update introduces a hierarchical metrics registry framework to the runtime, enabling Prometheus metrics collection and registration at multiple levels (runtime, namespace, component, endpoint). New example projects and a system stats endpoint are added, along with trait implementations to support hierarchical metric naming and registration. HTTP server metrics and uptime logic are refactored to use the new registry.

Changes

File(s) Change Summary
Cargo.toml, lib/runtime/examples/Cargo.toml Updated workspace members to include new example crates; added Prometheus as a workspace dependency; added comments about Prometheus and opentelemetry dependencies.
lib/runtime/examples/system_stats_endpoint/Cargo.toml, README.md, src/* Added new example project "system_stats_endpoint" with metadata, dependencies, README, library, server, and client binaries. The server demonstrates metrics collection and custom stats handling; the client interacts with the endpoint and prints stats.
lib/runtime/examples/system_stats_endpoint/src/lib.rs Introduced DEFAULT_NAMESPACE constant and MyStats struct with serde traits for demonstration purposes.
lib/runtime/src/profiling.rs New module implementing a hierarchical MetricsRegistry trait for Prometheus metrics, with macro-assisted registration and full test coverage.
lib/runtime/src/component.rs, component/namespace.rs, distributed.rs Implemented MetricsRegistry for Component, Endpoint, Namespace, and DistributedRuntime to support hierarchical metrics. Adjusted trait bounds and method calls for distributed runtime access.
lib/runtime/src/http_server.rs Refactored HTTP server to use the new metrics registry, with explicit uptime initialization, Prometheus gauge integration, and updated endpoints. Added new tests for metrics and uptime handling.
lib/runtime/src/lib.rs Added public profiling module; updated DistributedRuntime struct to remove start_time and add prometheus_registries_by_prefix map. Added placeholder fields for health/liveness.
lib/runtime/src/traits.rs Implemented DistributedRuntimeProvider for DistributedRuntime to satisfy trait bounds for metrics registry.
lib/runtime/src/component/endpoint.rs Removed commented-out health/liveness assignments; added explanatory TODO.
components/metrics/src/main.rs Added new imports and comments clarifying event subscription is for illustration only.

Sequence Diagram(s)

sequenceDiagram
    participant Client as system_stats_client
    participant Runtime as DistributedRuntime
    participant Namespace
    participant Component
    participant Endpoint
    participant Server as system_stats_server

    Client->>Runtime: Initialize from config
    Client->>Namespace: Access namespace "system"
    Client->>Component: Access component
    Client->>Endpoint: Get client for endpoint
    Client->>Endpoint: Wait for endpoint instances
    Client->>Endpoint: Send "hello world" via PushRouter
    Endpoint-->>Client: Stream responses (Annotated<String>)
    Client->>Component: Scrape stats periodically
    Component-->>Client: Return MyStats
    Client->>Runtime: Shutdown

    Server->>Runtime: Initialize from config
    Server->>Namespace: Create namespace
    Server->>Component: Create component
    Server->>Endpoint: Create endpoint, attach ingress handler
    Endpoint->>Server: Handle requests, collect metrics
    Endpoint->>Server: Serve stats via handler
Loading

Poem

In the meadow of metrics, a registry grew,
With prefixes, counters, and histograms too.
Endpoints and namespaces now track with delight,
As Prometheus gathers their stats day and night.
New examples are hopping, the system feels bright—
This rabbit approves, and metrics are right!
🐇📈


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (12)
lib/runtime/examples/system_stats_endpoint/README.md (1)

1-3: README is only a stub – please add minimal usage & build instructions

A single-line heading is not sufficient for an example crate that ships in the repo. At minimum, outline:

• how to run the server & client binaries
• the metrics/Prometheus endpoints they expose
• any required env-vars (e.g. DYN_NAMESPACE)

Completing this now saves future devs from hunting through source code.

components/metrics/src/main.rs (2)

34-36: PushRouter and Annotated imports are unused – will trigger dead_code warnings

Unless follow-up code uses them, remove these imports (or #[allow(unused_imports)] behind a feature flag) to keep cargo check clean.

-use dynamo_runtime::{
-    error, logging,
-    pipeline::PushRouter,
-    protocols::annotated::Annotated,
+use dynamo_runtime::{
+    error, logging,

178-179: Clarify comment vs. implementation

The added comment states KV hit-rate events are “not used in production”, yet the subscription is compiled unconditionally.
Either guard the block behind a feature (e.g. #[cfg(feature = "metrics-dev")]) or drop the subscription to avoid unnecessary NATS traffic in prod.

lib/runtime/examples/Cargo.toml (1)

36-37: Duplicate prometheus version declarations across workspaces

prometheus is added here and also at the root Cargo.toml (line 69). Keep a single declaration to avoid version drift; using only the root-level [workspace.dependencies] is sufficient.

Cargo.toml (2)

23-27: Development-only example crates are committed as active workspace members

The inline note says “Do not add this for production”, yet the paths are uncommented.
If crates.io publishing is planned, they’ll be bundled and increase compile time. Consider:

[workspace]
exclude = ["lib/runtime/examples/*"]
# or
members = ["lib/runtime", "..."] # omit examples

Alternatively wrap with cfg(workspace_example) and enable only in CI.


68-70: Prometheus dependency: follow-up TODO should have an issue link

Moving Prometheus into lib/runtime will be a breaking change across multiple crates. Create a tracking issue and reference it here to avoid this TODO becoming stale.

lib/runtime/src/distributed.rs (1)

219-231: Consider using tracing instead of println! for debug output.

Debug methods in library code should use the logging framework rather than printing directly to stdout.

-/// Debug function to print out all the keys of prometheus_registries_by_prefix
-pub fn debug_prometheus_registry_keys(&self) {
+/// Debug function to log all the keys of prometheus_registries_by_prefix
+pub fn debug_prometheus_registry_keys(&self) {
     let registries = self.prometheus_registries_by_prefix.lock().unwrap();
-    println!("=== Prometheus Registry Keys ===");
+    tracing::debug!("=== Prometheus Registry Keys ===");
     if registries.is_empty() {
-        println!("No registries found");
+        tracing::debug!("No registries found");
     } else {
         for (key, _registry) in registries.iter() {
-            println!("Registry key: '{}'", key);
+            tracing::debug!("Registry key: '{}'", key);
         }
     }
-    println!("=== End Prometheus Registry Keys ===");
+    tracing::debug!("=== End Prometheus Registry Keys ===");
 }

Alternatively, consider returning the keys as a Vec<String> for programmatic access:

pub fn get_prometheus_registry_keys(&self) -> Vec<String> {
    let registries = self.prometheus_registries_by_prefix.lock().unwrap();
    registries.keys().cloned().collect()
}
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)

126-133: Consider using tracing instead of println! in the stats handler.

For consistency with logging practices, the stats handler should use the tracing framework.

         .stats_handler(|stats| {
-            println!("stats: {:?}", stats);
+            tracing::debug!("stats: {:?}", stats);
             let stats = MyStats { val: 10 };
             serde_json::to_value(stats).unwrap()
         })
lib/runtime/src/profiling.rs (2)

192-204: Consider reducing mutex lock scope for better performance.

The current implementation holds the mutex lock while cloning the registry. Consider releasing the lock earlier to reduce contention.

 fn prometheus_metrics_fmt(&self) -> anyhow::Result<String> {
-    let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap();
-    let prometheus_registry = registry
-        .entry(self.prefix())
-        .or_insert(prometheus::Registry::new())
-        .clone();
+    let prometheus_registry = {
+        let mut registry = self.drt().prometheus_registries_by_prefix.lock().unwrap();
+        registry
+            .entry(self.prefix())
+            .or_insert(prometheus::Registry::new())
+            .clone()
+    }; // Lock is released here
     let metric_families = prometheus_registry.gather();
     let encoder = prometheus::TextEncoder::new();
     let mut buffer = Vec::new();
     encoder.encode(&metric_families, &mut buffer)?;
     Ok(String::from_utf8(buffer)?)
 }

269-288: Consider reducing println! usage in tests.

While println! statements can be helpful during test development, they can clutter test output in CI. Consider using them only when tests fail or behind a debug flag.

-// Print only the histogram section for inspection
-println!("Histogram section:");
-for line in prometheus_output.lines() {
-    if line.contains("my_histogram")
-        || line.contains("# HELP testprefix_my_histogram")
-        || line.contains("# TYPE testprefix_my_histogram")
-    {
-        println!("{}", line);
-    }
-}
+// Optionally print histogram section for debugging
+if std::env::var("DEBUG_TESTS").is_ok() {
+    println!("Histogram section:");
+    for line in prometheus_output.lines() {
+        if line.contains("my_histogram")
+            || line.contains("# HELP testprefix_my_histogram")
+            || line.contains("# TYPE testprefix_my_histogram")
+        {
+            println!("{}", line);
+        }
+    }
+}
lib/runtime/src/http_server.rs (2)

107-107: Remove extra forward slash in comment

The comment has an extra forward slash that should be removed.

-// // Initialize the start time
+// Initialize the start time

178-178: Remove commented code referencing old functionality

This commented line references DB_errors which appears to be from the old implementation and should be removed.

-    //let health = state.DB_errors <= 0;
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6f392c and 95d2296.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/runtime/examples/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • Cargo.toml (2 hunks)
  • components/metrics/src/main.rs (2 hunks)
  • lib/runtime/examples/Cargo.toml (2 hunks)
  • lib/runtime/examples/system_stats_endpoint/Cargo.toml (1 hunks)
  • lib/runtime/examples/system_stats_endpoint/README.md (1 hunks)
  • lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (1 hunks)
  • lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1 hunks)
  • lib/runtime/examples/system_stats_endpoint/src/lib.rs (1 hunks)
  • lib/runtime/src/component.rs (3 hunks)
  • lib/runtime/src/component/endpoint.rs (1 hunks)
  • lib/runtime/src/component/namespace.rs (4 hunks)
  • lib/runtime/src/distributed.rs (5 hunks)
  • lib/runtime/src/http_server.rs (3 hunks)
  • lib/runtime/src/lib.rs (2 hunks)
  • lib/runtime/src/profiling.rs (1 hunks)
  • lib/runtime/src/traits.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (10)
lib/runtime/src/component/endpoint.rs (3)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/src/traits.rs (2)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
components/metrics/src/main.rs (9)
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
lib/runtime/examples/system_stats_endpoint/src/lib.rs (3)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
lib/runtime/src/component/namespace.rs (5)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
lib/runtime/src/distributed.rs (4)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/src/lib.rs (3)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (1)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/runtime/src/http_server.rs (4)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
🧬 Code Graph Analysis (3)
lib/runtime/src/component.rs (2)
lib/runtime/src/component/namespace.rs (2)
  • basename (86-88)
  • parent_hierarchy (90-92)
lib/runtime/src/profiling.rs (9)
  • basename (105-105)
  • basename (226-228)
  • basename (310-312)
  • basename (330-332)
  • parent_hierarchy (121-121)
  • parent_hierarchy (229-231)
  • parent_hierarchy (313-315)
  • parent_hierarchy (333-335)
  • prefix (109-118)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (6)
lib/runtime/src/distributed.rs (4)
  • new (45-129)
  • from_settings (131-134)
  • from_settings (242-248)
  • runtime (142-144)
lib/runtime/src/service.rs (1)
  • new (37-39)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (1)
  • app (30-54)
lib/runtime/src/protocols/annotated.rs (1)
  • from_data (56-63)
lib/runtime/src/profiling.rs (3)
  • drt (220-222)
  • drt (304-306)
  • drt (324-326)
lib/runtime/src/pipeline/network.rs (1)
  • for_engine (308-319)
lib/runtime/src/http_server.rs (8)
lib/runtime/src/traits.rs (4)
  • drt (26-26)
  • drt (41-43)
  • rt (21-21)
  • rt (30-32)
lib/runtime/src/profiling.rs (11)
  • drt (220-222)
  • drt (304-306)
  • drt (324-326)
  • basename (105-105)
  • basename (226-228)
  • basename (310-312)
  • basename (330-332)
  • parent_hierarchy (121-121)
  • parent_hierarchy (229-231)
  • parent_hierarchy (313-315)
  • parent_hierarchy (333-335)
lib/runtime/src/component/namespace.rs (2)
  • basename (86-88)
  • parent_hierarchy (90-92)
lib/runtime/src/distributed.rs (4)
  • basename (35-37)
  • parent_hierarchy (39-41)
  • new (45-129)
  • from_settings_without_discovery (137-140)
lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)
  • new (40-57)
  • new (76-78)
lib/runtime/src/service.rs (1)
  • new (37-39)
lib/runtime/src/runtime.rs (1)
  • handle (123-128)
lib/runtime/src/pipeline/context.rs (1)
  • registry (251-253)
🔇 Additional comments (27)
lib/runtime/examples/Cargo.toml (1)

18-21: Example crates are part of the default workspace build – contradicting the guidance comment

Including system_stats_endpoint in members means it will be built/published with every workspace action, which the comment in the root Cargo.toml suggests should not happen for production.
Confirm this is intentional; otherwise gate examples behind an env flag or move them to [workspace.exclude].

lib/runtime/examples/system_stats_endpoint/Cargo.toml (1)

1-34: LGTM! Well-structured Cargo.toml for the system stats endpoint example.

The manifest follows standard Rust conventions with proper workspace inheritance and appropriate dependencies for a system stats endpoint demonstration.

lib/runtime/src/traits.rs (1)

35-44: LGTM! Clear trait implementation with excellent documentation.

The implementation correctly provides the distributed runtime instance reference, and the detailed comments explain the necessity for satisfying the MetricsRegistry trait bounds. This follows the expected pattern for provider traits.

lib/runtime/examples/system_stats_endpoint/src/lib.rs (1)

16-24: LGTM! Simple and appropriate example structure.

The constant and struct are well-defined for demonstration purposes. The serde derives are correctly applied for serialization support.

lib/runtime/src/lib.rs (2)

43-43: LGTM! Profiling module addition supports the new metrics framework.

The addition of the profiling module is appropriate for the hierarchical metrics registry functionality being introduced.


103-108: LGTM! Well-structured metrics registry storage with proper thread safety.

The prometheus_registries_by_prefix field uses appropriate thread-safe types for concurrent access. The commented TODO fields are properly documented for future health and liveness check implementations.

lib/runtime/src/component/namespace.rs (4)

22-24: LGTM! Appropriate trait imports for the new metrics framework.

The imports are correctly added to support the new trait-based approach for metrics registry and distributed runtime access.


47-52: LGTM! Consistent trait-based access pattern.

The change to use DistributedRuntimeProvider::drt(self) follows the new trait-based approach consistently with other similar changes in the codebase.


62-67: LGTM! Consistent trait-based access pattern.

The change to use DistributedRuntimeProvider::drt(self) maintains consistency with the new trait-based approach for accessing the distributed runtime.


85-93: LGTM! Appropriate MetricsRegistry implementation for hierarchical metrics.

The implementation correctly provides the namespace name as the basename and establishes the hierarchical relationship by including the distributed runtime prefix in the parent hierarchy. This follows the expected pattern for the metrics registry framework.

lib/runtime/src/component.rs (3)

32-34: Import looks good.

The addition of profiling::MetricsRegistry to the imports is necessary for the trait implementations.


173-185: MetricsRegistry implementation for Component is correct.

The implementation properly defines the component's place in the metrics hierarchy by returning its name as basename and constructing the parent hierarchy from the namespace's hierarchy plus the namespace's prefix.


319-331: MetricsRegistry implementation for Endpoint follows the correct pattern.

The implementation maintains the hierarchical structure by returning the endpoint's name as basename and building the parent hierarchy from the component's hierarchy plus the component's prefix.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_client.rs (2)

1-28: Example client structure looks good.

The imports and overall structure are appropriate for a distributed runtime client example.


30-54: Well-structured example demonstrating distributed runtime usage.

The client properly initializes the runtime, waits for endpoint instances, sends a request, handles streamed responses, scrapes statistics, and cleanly shuts down. This provides a clear example of the distributed runtime client APIs.

lib/runtime/src/distributed.rs (3)

34-42: Correct root-level MetricsRegistry implementation.

The implementation properly establishes DistributedRuntime as the root of the metrics hierarchy with an empty basename and no parent hierarchy.


79-88: Excellent handling of Rust ownership rules.

The code correctly extracts the cancel_token from runtime before moving it into the struct, preventing ownership issues. The comment clearly explains why this is necessary.


97-101: Thread-safe registry storage is well-designed.

The prometheus_registries_by_prefix field properly uses Arc<std::sync::Mutex<HashMap>> to enable thread-safe access to Prometheus registries indexed by their hierarchical prefixes.

lib/runtime/examples/system_stats_endpoint/src/bin/system_stats_server.rs (2)

33-58: Well-designed metrics struct demonstrating the MetricsRegistry trait usage.

The implementation properly uses Arc for shared ownership, accepts a generic MetricsRegistry parameter for flexibility, and follows Prometheus naming conventions for metrics.


71-106: Good example of AsyncEngine with metrics instrumentation.

The implementation properly demonstrates request counting, duration measurement, and stream-based response handling. The metrics are recorded at appropriate points in the request lifecycle.

lib/runtime/src/profiling.rs (2)

51-98: Excellent macro design with clear documentation.

The macro elegantly solves the constraints of maintaining dyn compatibility while handling the Box/clone requirements. The extensive comments clearly explain the technical reasoning, making the code maintainable.


103-122: Well-designed trait with proper hierarchical support.

The trait correctly defines the contract for hierarchical metrics with appropriate methods for basename, parent hierarchy, and prefix calculation. The default prefix() implementation properly handles edge cases.

lib/runtime/src/http_server.rs (5)

16-21: Imports look good

The new imports for MetricsRegistry, DistributedRuntimeProvider, OnceLock, and Instant are appropriate for the refactored metrics implementation.


26-44: Well-structured metrics registry implementation

The HttpMetricsRegistry correctly implements the required traits and follows the hierarchical naming pattern established in the codebase.


200-208: Good test utility function

The make_test_drt() helper properly creates a test instance of DistributedRuntime for use in tests.


284-335: Comprehensive test coverage for metrics functionality

The tests thoroughly verify:

  • Start time initialization and restrictions
  • Uptime calculation
  • Metrics formatting with correct namespace prefixes
  • Gauge value updates

Great job on the test coverage!


354-363: Test confirms panic behavior

This test verifies the current panic behavior when uptime() is called without initialization. If you implement my suggestion to return a Result instead, this test would need to be updated to verify the error case.

@keivenchang keivenchang changed the title feat: add a Metrics trait for DistributedRuntime, Namespace, Components, and Endpoint feat: add a hierarchical MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint Jul 18, 2025
@keivenchang keivenchang changed the title feat: add a hierarchical MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint feat: add a hierarchical Prometheus MetricsRegistry trait for DistributedRuntime, Namespace, Components, and Endpoint Jul 18, 2025
@keivenchang keivenchang requested a review from a team as a code owner July 21, 2025 04:22
@keivenchang
Copy link
Contributor Author

Addressed the concern-- I removed the hierarchical prefix, and added auto-labeling. Currently working addressing the nitpick items, but do please start the review while I fix them. Thanks.

Copy link
Contributor

@ryanolson ryanolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@keivenchang keivenchang merged commit e5a8628 into main Jul 22, 2025
10 checks passed
@keivenchang keivenchang deleted the keivenchang/Observability_Metric-API__DIS-216 branch July 22, 2025 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants