Conversation

@keivenchang keivenchang commented Aug 18, 2025

Overview:

Follow-up to #2480 with improvements and feedback

Uses drt runtime::spawn and interval.tick(), refactors the NATS metrics organization, and fixes test parallelization issues

Details:

  • Use drt runtime::spawn instead of thread::spawn, and interval.tick() instead of sleep(...). The blocking problem was already fixed in PR #2480, where I removed std::thread::spawn and used tokio::runtime::Handle::try_current(). Graham and Olson suggested using tokio spawn/interval.tick, which is what this PR does (a minimal sketch of the pattern follows this list).
  • Split NATS metrics into separate nats_client and nats_service modules for clearer separation of concerns
  • Enhanced NATS service metrics with service_name and component-specific labels for better observability
  • Fixed test parallelization failures by ensuring uptime_seconds metric is consistently registered
  • Improved duplicate metric registration handling with intelligent error categorization
  • Updated test helpers to use temp_env for proper environment variable isolation
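
A minimal sketch of the spawn + interval pattern described above, assuming a tokio runtime Handle for the DRT's secondary runtime; scrape_once is a hypothetical stand-in for the NATS stats request and gauge updates:

use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

async fn scrape_once() -> anyhow::Result<()> {
    // placeholder: request $SRV.STATS and update the Prometheus gauges
    Ok(())
}

fn start_scraping(handle: tokio::runtime::Handle) {
    handle.spawn(async move {
        let mut ticker = interval(Duration::from_millis(777));
        // skip missed ticks instead of bursting to catch up
        ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
        loop {
            ticker.tick().await;
            if let Err(err) = scrape_once().await {
                tracing::warn!(%err, "metrics scrape failed");
            }
        }
    });
}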

Where should the reviewer start?

  • lib/runtime/src/component.rs - Check the improved error handling for duplicate metrics
  • lib/runtime/src/metrics/prometheus_names.rs - See the new module structure for NATS metrics
  • lib/runtime/src/distributed.rs - Review the uptime_seconds metric registration logic

Related Issues:

Fixes DIS-445

Summary by CodeRabbit

  • New Features

    • Expanded NATS client metrics (bytes, messages, connects, connection state).
    • Introduced NATS service metrics (avg processing time, totals, active services/endpoints) with consistent labels (including service name).
    • Uptime metrics now tracked even when system HTTP endpoints are disabled.
  • Refactor

    • Split metrics into two namespaces: nats_client_* and nats_service_* (old nats_* renamed).
    • Streamlined background metrics scraping with fixed-interval runtime tasks and a shorter polling interval.
  • Bug Fixes

    • More robust handling of duplicate Prometheus metric registrations to prevent startup conflicts.

… test isolation

- Refactored prometheus_names to separate nats_client and nats_service modules
- Added service_name and component labels to NATS service metrics for better observability
- Fixed test parallelization issues by ensuring uptime_seconds metric is always registered
- Updated SystemStatusState to handle duplicate metric registrations gracefully
- Modified create_test_drt helpers to use temp_env for proper environment isolation (see the sketch after this list)
- Updated all tests to use new nats_client/nats_service metric names
- Improved error logging to distinguish between expected duplicate metrics and real errors
- Cleaned up component metric scraping method naming and documentation
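
A hedged sketch of the temp_env pattern used in the test helpers; the variable name here is hypothetical. temp_env serializes concurrent with_var calls behind a global lock and restores the previous value afterwards, so parallel tests no longer race on process-global environment variables:

#[test]
fn reads_isolated_env() {
    // the override is visible only inside the closure
    temp_env::with_var("DYN_SYSTEM_PORT", Some("0"), || {
        assert_eq!(std::env::var("DYN_SYSTEM_PORT").unwrap(), "0");
        // ... build the test DRT and assert on metrics ...
    });
}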

coderabbitai bot commented Aug 18, 2025

Walkthrough

Introduces separate NATS client vs service metric namespaces, renames/extends related metric structs, adds component label accessor, changes component metrics scraping to a periodic async task on the secondary runtime, shortens scrape interval, updates DRT to new client metrics, adds duplicate-metric handling in system status, and adjusts tests accordingly.

Changes

Cohort / File(s) Summary of Changes
NATS metric namespaces split
lib/runtime/src/metrics/prometheus_names.rs, lib/runtime/src/metrics.rs
Replaces single nats namespace with nats_client and nats_service. Updates constants, prefixes, exported metric lists, and tests to reference the split namespaces.
Component scraping + labels
lib/runtime/src/component.rs
Adds labels(&self) -> &[(String, String)]. Renames start method to start_scraping_nats_service_component_metrics. Switches background scraping to DRT secondary runtime task using fixed tokio::time::interval (MissedTickBehavior::Skip). Reduces interval (MAX_DELAY_MS 4700→777). Handles duplicate metrics with targeted logging. Updates to ComponentNatsServerPrometheusMetrics.
Service metrics struct rename and labeling
lib/runtime/src/service.rs
Renames ComponentNatsPrometheusMetrics → ComponentNatsServerPrometheusMetrics. Renames fields with service_ prefix. Constructs metrics under nats_service names and applies labels: service_name plus component.labels(). Adds update_from_service_set(&self, ...). Adjusts reset/update logic.
DRT NATS client metrics wiring
lib/runtime/src/distributed.rs, lib/runtime/src/transports/nats.rs
Renames DRTNatsPrometheusMetrics → DRTNatsClientPrometheusMetrics. Adds gauges: out_bytes, in_messages, out_messages, connects. Uses nats_client names. Updates callback binding to set from client stats. When HTTP endpoints are disabled, still initializes uptime tracking state.
System status duplicate-metric handling
lib/runtime/src/system_status_server.rs
On duplicate registration of uptime_seconds, logs debug and substitutes a dummy Gauge; otherwise propagates errors. Test wiring adjusted to rely on DRT-managed metrics and new NATS prefixes.
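
The dummy-gauge fallback in the last row, as a hedged sketch against the prometheus crate (metric name and help text are illustrative, not the exact strings in the codebase):

use prometheus::{Gauge, Opts, Registry};

fn uptime_gauge(registry: &Registry) -> prometheus::Result<Gauge> {
    let gauge = Gauge::with_opts(Opts::new(
        "dynamo_component_uptime_seconds",
        "Uptime of the component in seconds",
    ))?;
    match registry.register(Box::new(gauge.clone())) {
        Ok(()) => Ok(gauge),
        // a second registration under the same name: keep a detached
        // gauge so callers still hold a handle (it will not be scraped)
        Err(prometheus::Error::AlreadyReg) => {
            tracing::debug!("uptime_seconds already registered; using detached gauge");
            Ok(gauge)
        }
        Err(e) => Err(e),
    }
}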

Sequence Diagram(s)

sequenceDiagram
  participant DRT as DistributedRuntime
  participant Comp as Component
  participant NATS as NATS Server
  participant Prom as Prometheus Registry

  Comp->>DRT: start_scraping_nats_service_component_metrics()
  DRT-->>Comp: secondary().spawn(task)
  loop every interval (MissedTickBehavior=Skip)
    Comp->>NATS: $SRV.STATS.<service_name> request
    NATS-->>Comp: ServiceSet stats
    Comp->>Prom: Update nats_service_* gauges (with labels)
    Note right of Comp: On error: log, zero metrics
  end
sequenceDiagram
  participant DRT as DistributedRuntime
  participant NClient as NATS Client
  participant Prom as Prometheus Registry

  DRT->>NClient: register metrics callback
  loop on client stats update
    NClient-->>DRT: ClientStats snapshot
    DRT->>Prom: Update nats_client_* gauges (in/out bytes, msgs, connects, state)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I twitch my nose at gauges bright,
Two streams now sing by day and night —
Client winds and service seas,
Labeled leaves on metrics trees.
Ticks that skip, we hop along,
No duplicates can stop my song.
Prometheus, count my bunny throng! 🐇📈


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
lib/runtime/src/distributed.rs (1)

31-31: Remove stray import that shadows symbols and may confuse macro calls

This import appears unused and also shadows the error symbol name used elsewhere, increasing the chance of calling the wrong thing.

- use figment::error;
🧹 Nitpick comments (14)
lib/runtime/src/distributed.rs (1)

273-286: Optional: downrank duplicate metric registration to debug to reduce signal noise

You’re already logging errors with context, but PR goals mention better duplicate-metric handling. Consider downranking “duplicate” cases to debug to avoid log spam during test parallelization.

-            Err(e) => {
-                let error_msg = e.to_string();
-                tracing::error!(
-                    hierarchy = ?hierarchy,
-                    error = ?error_msg,
-                    metric_name = ?metric_name,
-                    "Metric registration failed"
-                );
-                Err(e.into())
-            }
+            Err(e) => {
+                let error_msg = e.to_string();
+                // tracing macros require a const level, so branch on the error
+                // instead of passing a runtime Level to event!()
+                if error_msg.contains("Duplicate metrics") || error_msg.contains("already registered") {
+                    tracing::debug!(
+                        hierarchy = ?hierarchy,
+                        error = ?error_msg,
+                        metric_name = ?metric_name,
+                        "Duplicate metric registration"
+                    );
+                } else {
+                    tracing::error!(
+                        hierarchy = ?hierarchy,
+                        error = ?error_msg,
+                        metric_name = ?metric_name,
+                        "Metric registration failed"
+                    );
+                }
+                Err(e.into())
+            }
lib/runtime/src/transports/nats.rs (2)

547-551: Help text vs metric name mismatch (OUT_OVERHEAD_BYTES)

You’re registering OUT_OVERHEAD_BYTES but describing it as “Total number of bytes sent by NATS client.” If the metric truly tracks protocol overhead only, the help should reflect that. If it’s total bytes, the metric name should be OUT_TOTAL_BYTES.

-        let out_bytes = drt.create_intgauge(
-            nats_metrics::OUT_OVERHEAD_BYTES,
-            "Total number of bytes sent by NATS client",
+        let out_bytes = drt.create_intgauge(
+            nats_metrics::OUT_OVERHEAD_BYTES,
+            "Total number of protocol-overhead bytes sent by NATS client",
             &[],
         )?;

If the intent is total bytes (not overhead), rename the constant used here to the appropriate one from nats_client and keep the original help text.


568-571: Align connection_state help with actual states; handle future variants defensively

The help mentions “2=reconnecting” but the code maps only 0/1 and doesn’t emit 2. If async-nats only exposes Connected/Disconnected/Pending, keep it 0/1 and update the help accordingly; otherwise add a mapping for other states.

-        let connection_state = drt.create_intgauge(
-            nats_metrics::CONNECTION_STATE,
-            "Current connection state of NATS client (0=disconnected, 1=connected, 2=reconnecting)",
+        let connection_state = drt.create_intgauge(
+            nats_metrics::CONNECTION_STATE,
+            "Current connection state of NATS client (0=disconnected, 1=connected)",
             &[],
         )?;
-        let connection_state = match self.nats_client.connection_state() {
-            State::Connected => 1,
-            // treat Disconnected and Pending as "down"
-            State::Disconnected | State::Pending => 0,
-        };
+        let connection_state = match self.nats_client.connection_state() {
+            State::Connected => 1,
+            _ => 0, // treat all other states as "down"
+        };

Also applies to: 595-601

lib/runtime/src/system_status_server.rs (1)

81-102: Duplicate-metric fallback creates a gauge that isn’t registered

When create_gauge returns a duplicate, a “dummy” Gauge is constructed and stored. That gauge is not registered in any registry, so update_uptime_gauge() will update a metric that will never be scraped.

If the goal is only to ensure presence of the metric in tests, this is fine. If you intend to keep the uptime value updated when duplicates happen, consider:

  • Storing Option<Gauge> and skipping updates when None (a sketch of this option follows the list), or
  • Centralizing uptime gauge ownership (create once at DRT init and pass a handle), or
  • Providing a shared Arc<Gauge> from the registry (if your registry supports retrieving by name), or
  • Creating a synthetic callback that updates uptime instead of holding a Gauge handle.
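
A sketch of the first option, with a simplified, hypothetical SystemStatusState (field names are illustrative):

struct SystemStatusState {
    uptime_gauge: Option<prometheus::Gauge>,
    start: std::time::Instant,
}

impl SystemStatusState {
    fn update_uptime_gauge(&self) {
        // skip silently when registration hit a duplicate
        if let Some(g) = &self.uptime_gauge {
            g.set(self.start.elapsed().as_secs_f64());
        }
    }
}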
lib/runtime/src/metrics.rs (1)

1270-1271: Note on uptime metric default value

Expected value “0” in the DRT output is fine given it’s set immediately on scrape. If you later change scrape ordering to update before gather, tests might need a small tolerance.

lib/runtime/src/component.rs (2)

274-284: Background scraping uses DRT runtime, interval, and reset-on-error — solid

  • Using the DRT’s secondary runtime avoids “no runtime in context” footguns.
  • interval.tick() with MissedTickBehavior::Skip is the right choice.
  • Resetting metrics to zero on error prevents stale values.

Two small nits:

  • The docstring still mentions “approximately 4.7 seconds” while MAX_DELAY_MS is 777ms.
  • Consider a backoff or log downranking if errors are continuous to avoid log noise.
-    /// and updates the corresponding Prometheus metrics. The scraping interval is set to
-    /// approximately 4.7 seconds to ensure fresh data is available for each Prometheus
-    /// polling cycle (which occurs every 6 seconds by default).
+    /// and updates the corresponding Prometheus metrics. The scraping interval is set to
+    /// ~0.777 seconds to provide fresh data well within typical Prometheus scrape intervals.

Also applies to: 300-309, 313-329


583-604: Improve duplicate-metric error detection robustness

String-matching “Duplicate metrics” is brittle. Consider accepting other common phrasings like “already registered”.

-        if let Err(err) = component.start_scraping_nats_service_component_metrics() {
-            let error_str = err.to_string();
-
-            // Check if this is a duplicate metrics registration (expected in some cases)
-            // or a different error (unexpected)
-            if error_str.contains("Duplicate metrics") {
+        if let Err(err) = component.start_scraping_nats_service_component_metrics() {
+            let error_str = err.to_string();
+            // Handle duplicate metrics registration as a non-critical path
+            if error_str.contains("Duplicate metrics") || error_str.contains("already registered") {
                 // This is not a critical error because it's possible for multiple Components
                 // with the same service_name to register metrics callbacks.
                 tracing::debug!(
                     "Duplicate metrics registration for component '{}' (expected when multiple components share the same service_name): {}",
                     component.service_name(),
                     error_str
                 );
             } else {
                 // This is unexpected and should be more visible
                 tracing::warn!(
                     "Failed to start scraping metrics for component '{}': {}",
                     component.service_name(),
                     err
                 );
             }
         }
lib/runtime/src/service.rs (4)

339-347: Centralize the "service_name" label via a shared constant for consistency

We already have a labels module for standardized label keys. Prefer using a constant (e.g., labels::SERVICE_NAME) instead of a raw "service_name" string literal to avoid drift and enable global updates.

Apply this diff in this file:

-        // Build labels: service_name first, then component's labels
-        let mut labels_vec = vec![("service_name", service_name.as_str())];
+        // Build labels: service_name first, then component's labels
+        let mut labels_vec = vec![(prometheus_names::labels::SERVICE_NAME, service_name.as_str())];

And add the missing constant in prometheus_names.rs labels module (see comment in that file).


399-433: Remove unused processing_time_samples and simplify average calculation

processing_time_samples is never used in the final computation. The guard can be simplified to total_requests > 0, since that already implies at least one non-zero sample.

-        // Variables ordered to match NatsStatsMetrics fields
-        let mut processing_time_samples = 0u64; // for average_processing_time calculation
+        // Variables ordered to match NatsStatsMetrics fields
         let mut total_errors = 0u64; // maps to: num_errors
         let mut total_requests = 0u64; // maps to: num_requests
         let mut total_processing_time_nanos = 0u64; // maps to: processing_time (nanoseconds from NATS)
         let mut endpoint_count = 0u64; // for derived metrics
@@
-                    if stats.num_requests > 0 {
-                        processing_time_samples += 1;
-                    }
+                    // no need to track sample counts; we compute per-request average across all endpoints
@@
-        if processing_time_samples > 0 && total_requests > 0 {
+        if total_requests > 0 {
             let avg_time_nanos = total_processing_time_nanos as f64 / total_requests as f64;
             let avg_time_ms = avg_time_nanos / 1_000_000.0; // Convert nanoseconds to milliseconds
             self.service_avg_processing_ms.set(avg_time_ms);
         } else {
             self.service_avg_processing_ms.set(0.0);
         }

171-205: Consider truncating payload in debug logs to avoid log bloat

When decoding fails, the full payload is logged at debug. If these payloads can be large, truncate or log length to keep logs manageable.

-                    let payload = String::from_utf8_lossy(&message.payload);
-                    tracing::debug!(%err, service_name, %payload, "error decoding service info");
+                    let payload = String::from_utf8_lossy(&message.payload);
+                    // truncate by characters to avoid panicking on a non-char-boundary byte slice
+                    let truncated: String = payload.chars().take(512).collect();
+                    tracing::debug!(%err, service_name, payload_len = payload.len(), %truncated, "error decoding service info");

237-256: Test fixture units are inconsistent with comments

Comments state 0.1ms = 100_000ns for average_processing_time, but processing_time values are set to 100 (ns). Consider aligning processing_time to 100_000 for clarity.

Also applies to: 271-290

lib/runtime/src/metrics/prometheus_names.rs (3)

23-33: Add SERVICE_NAME to standardized label keys

Service-level metrics introduced the "service_name" label. Add it to the shared labels module to prevent string-literal drift.

 pub mod labels {
     /// Label for component identification
     pub const COMPONENT: &str = "dynamo_component";
@@
     /// Label for endpoint identification
     pub const ENDPOINT: &str = "dynamo_endpoint";
+
+    /// Label for NATS service identification (used by nats_service metrics)
+    pub const SERVICE_NAME: &str = "service_name";
 }

35-36: Minor doc nit: stray parenthesis

Drop the trailing parenthesis for cleaner docs.

-/// NATS client metrics. DistributedRuntime contains a NATS client shared by all children)
+/// NATS client metrics. DistributedRuntime contains a NATS client shared by all children.

96-102: Fix unit comments: These constants represent milliseconds, not nanoseconds

The comments on COMPONENT_NATS_METRICS still reference nanoseconds, but the constant names explicitly say “_ms”. Align the doc comments to avoid confusion.

 pub const COMPONENT_NATS_METRICS: &[&str] = &[
-    nats_service::AVG_PROCESSING_MS, // maps to: average_processing_time (nanoseconds)
-    nats_service::TOTAL_ERRORS,      // maps to: num_errors
-    nats_service::TOTAL_REQUESTS,    // maps to: num_requests
-    nats_service::TOTAL_PROCESSING_MS, // maps to: processing_time (nanoseconds)
-    nats_service::ACTIVE_SERVICES,   // derived from ServiceSet.services
-    nats_service::ACTIVE_ENDPOINTS,  // derived from ServiceInfo.endpoints
+    nats_service::AVG_PROCESSING_MS,   // maps to: average_processing_time (milliseconds)
+    nats_service::TOTAL_ERRORS,        // maps to: num_errors
+    nats_service::TOTAL_REQUESTS,      // maps to: num_requests
+    nats_service::TOTAL_PROCESSING_MS, // maps to: processing_time (milliseconds)
+    nats_service::ACTIVE_SERVICES,     // derived from ServiceSet.services
+    nats_service::ACTIVE_ENDPOINTS,    // derived from ServiceInfo.endpoints
 ];
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6297859 and cf27800.

📒 Files selected for processing (7)
  • lib/runtime/src/component.rs (5 hunks)
  • lib/runtime/src/distributed.rs (3 hunks)
  • lib/runtime/src/metrics.rs (13 hunks)
  • lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
  • lib/runtime/src/service.rs (3 hunks)
  • lib/runtime/src/system_status_server.rs (2 hunks)
  • lib/runtime/src/transports/nats.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/runtime/src/service.rs
  • lib/runtime/src/transports/nats.rs
  • lib/runtime/src/metrics.rs
  • lib/runtime/src/distributed.rs
🧬 Code Graph Analysis (7)
lib/runtime/src/system_status_server.rs (1)
lib/runtime/src/distributed.rs (2)
  • new (47-165)
  • nats_client (238-240)
lib/runtime/src/service.rs (2)
lib/runtime/src/component.rs (5)
  • new (564-570)
  • component (426-428)
  • component (573-607)
  • service_name (204-207)
  • labels (226-228)
lib/llm/src/kv_router/metrics_aggregator.rs (2)
  • new (41-55)
  • new (72-86)
lib/runtime/src/metrics/prometheus_names.rs (2)
lib/runtime/src/distributed.rs (1)
  • nats_client (238-240)
lib/runtime/src/metrics.rs (2)
  • DRT_NATS_METRICS (1365-1368)
  • COMPONENT_NATS_METRICS (1433-1436)
lib/runtime/src/component.rs (2)
lib/runtime/src/service.rs (2)
  • new (46-48)
  • new (338-395)
lib/llm/src/local_model.rs (1)
  • service_name (296-298)
lib/runtime/src/transports/nats.rs (2)
lib/runtime/src/pipeline/network.rs (1)
  • metrics (338-340)
lib/runtime/src/distributed.rs (1)
  • nats_client (238-240)
lib/runtime/src/metrics.rs (4)
lib/runtime/src/metrics/prometheus_names.rs (1)
  • build_metric_name (10-12)
lib/runtime/src/distributed.rs (3)
  • nats_client (238-240)
  • runtime (178-180)
  • new (47-165)
lib/runtime/src/transports/nats.rs (2)
  • new (380-394)
  • new (541-582)
lib/runtime/src/system_status_server.rs (3)
  • new (40-45)
  • new (79-109)
  • drt (126-128)
lib/runtime/src/distributed.rs (2)
lib/runtime/src/transports/nats.rs (2)
  • new (380-394)
  • new (541-582)
lib/runtime/src/system_status_server.rs (3)
  • new (40-45)
  • new (79-109)
  • port (55-57)
⏰ Context from checks skipped due to timeout of 90000ms. (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (17)
lib/runtime/src/distributed.rs (1)

98-113: Good move: client metrics callback wiring is clean and side-effect free

Creating DRTNatsClientPrometheusMetrics and registering a lightweight callback that snapshots client stats on scrape aligns with the “snapshot on demand” model and avoids background polling here. Looks solid.

lib/runtime/src/transports/nats.rs (3)

47-47: Namespace swap to nats_client is consistent

Importing metric names from prometheus_names::nats_client matches the client-side metrics split. Good.


523-537: Client metrics surface looks right for snapshotting

Publishing gauges for in/out bytes, messages, connects, and connection_state is appropriate since these are cumulative snapshots from the client.


584-610: Snapshot update logic is simple and correct

Relaxed loads from the client statistics and direct gauge sets are appropriate for scrape-time snapshots. No locking here is good.

lib/runtime/src/system_status_server.rs (2)

270-296: Metrics handler orchestrates a single scrape well

Updating uptime, executing callbacks, then gathering the DRT registry keeps scrape-time computations in one place. Looks good.


407-437: Tests: namespace and filtering updates match the new client/service split

Asserting presence of dynamo_component_uptime_seconds and filtering both nats_client and nats_service prefixes aligns with the refactor and reduces flakiness. Good adjustment.

Also applies to: 415-422, 425-437

lib/runtime/src/metrics.rs (4)

35-38: Import split (nats_client, nats_service) is consistent across helpers and tests

The refactor of imports and constants usage reflects the two-namespace model cleanly.


619-645: Filtering helpers correctly exclude/include both NATS namespaces

These helpers now account for both client and service prefixes and keep the tests readable.


585-591: Single-threaded runtime for tests reduces contention

Using a single-threaded runtime and blocking to construct a DRT improves test isolation and reliability.

Also applies to: 598-602


1459-1473: NATS metrics tests match the new surface and are robust

  • Verifying metric name sets via DRT_NATS_METRICS and COMPONENT_NATS_METRICS is a nice invariance test.
  • Value-range assertions are wide enough to avoid flakiness for byte counts and message counts. Good balance.

Also applies to: 1460-1466, 1534-1559, 1631-1656

lib/runtime/src/component.rs (2)

51-51: Import to server-side metrics type reflects split

Switching to ComponentNatsServerPrometheusMetrics matches the service side namespace.


226-229: labels() accessor is helpful for composable label construction

Exposing labels as a slice enables reuse in service metrics initialization without allocations.

lib/runtime/src/service.rs (3)

351-386: Metric registration pattern and label construction look correct

  • Using nats_service::* constants aligns with the new module split.
  • Label composition with service_name followed by component.labels() is clear and keeps cardinality predictable.

434-440: Aggregations and unit conversions are sound

  • Errors/requests are summed as IntGauge values.
  • processing_time is converted from nanos to ms before setting the IntGauge.
  • Active services/endpoints derived correctly from the ServiceSet.

443-450: Reset API is complete and consistent

All service_* metrics are reset; this prevents stale values during gaps in scraping.

lib/runtime/src/metrics/prometheus_names.rs (2)

59-81: LGTM: Clear separation between client and service namespaces

The split into nats_client and nats_service is clean and improves discoverability. Names are explicit and prefixed consistently.


83-91: Client metrics set looks good

The selected client-side metrics match typical NATS client observability needs and the new naming scheme.

@keivenchang keivenchang changed the title refactor: split NATS metrics into client and service modules fix: use tokio spawn and interval.tick(), make nats metric names clearer, fixed a bunch of tests due to sharing environment variables (temp_env) Aug 18, 2025
@github-actions github-actions bot added fix and removed refactor labels Aug 18, 2025
- Add macro-based prefix generation for nats_client and nats_service metrics (a hypothetical sketch follows this list)
- Update component scraping interval from 777ms to 873ms
- Convert failing integration tests to async tokio::test format
- Improve error handling in distributed runtime start time initialization
- Fix test helper functions to use proper async runtime context
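
A hypothetical sketch of the macro-based prefix generation mentioned above; the actual macro in prometheus_names.rs may differ:

macro_rules! prefixed_metrics {
    ($prefix:literal => $( $name:ident = $suffix:literal ),+ $(,)?) => {
        $( pub const $name: &str = concat!($prefix, "_", $suffix); )+
    };
}

prefixed_metrics!("nats_client" =>
    IN_BYTES = "in_bytes",
    OUT_BYTES = "out_bytes",
    CONNECTS = "connects",
);
// IN_BYTES == "nats_client_in_bytes", and so on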
@rmccorm4 rmccorm4 left a comment

Approving to unblock - looking forward to the other follow-ups from comments on the test case, dummy gauge, max delay number, etc. - thanks for the improvements!

@rmccorm4 rmccorm4 changed the title fix: use tokio spawn and interval.tick(), make nats metric names clearer, fixed a bunch of tests due to sharing environment variables (temp_env) fix: use tokio spawn / interval.tick(), make nats metric names clearer, fix tests sharing environment variables (temp_env) Aug 19, 2025
@keivenchang keivenchang enabled auto-merge (squash) August 19, 2025 03:28
@keivenchang keivenchang merged commit bec1dd5 into main Aug 19, 2025
16 of 18 checks passed
@keivenchang keivenchang deleted the keiven/DIS-445__NATS-metrics-collection-fix2 branch August 19, 2025 03:58
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
…r, fix tests sharing environment variables (temp_env) (#2506)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>