fix: use tokio spawn / interval.tick(), make nats metric names clearer, fix tests sharing environment variables (temp_env) #2506
Conversation
… test isolation
- Refactored prometheus_names to separate nats_client and nats_service modules
- Added service_name and component labels to NATS service metrics for better observability
- Fixed test parallelization issues by ensuring uptime_seconds metric is always registered
- Updated SystemStatusState to handle duplicate metric registrations gracefully
- Modified create_test_drt helpers to use temp_env for proper environment isolation
- Updated all tests to use new nats_client/nats_service metric names
- Improved error logging to distinguish between expected duplicate metrics and real errors
- Cleaned up component metric scraping method naming and documentation
Walkthrough

Introduces separate NATS client vs service metric namespaces, renames/extends related metric structs, adds component label accessor, changes component metrics scraping to a periodic async task on the secondary runtime, shortens scrape interval, updates DRT to new client metrics, adds duplicate-metric handling in system status, and adjusts tests accordingly.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant DRT as DistributedRuntime
    participant Comp as Component
    participant NATS as NATS Server
    participant Prom as Prometheus Registry
    Comp->>DRT: start_scraping_nats_service_component_metrics()
    DRT-->>Comp: secondary().spawn(task)
    loop every interval (MissedTickBehavior=Skip)
        Comp->>NATS: $SRV.STATS.<service_name> request
        NATS-->>Comp: ServiceSet stats
        Comp->>Prom: Update nats_service_* gauges (with labels)
        Note right of Comp: On error: log, zero metrics
    end
```
```mermaid
sequenceDiagram
    participant DRT as DistributedRuntime
    participant NClient as NATS Client
    participant Prom as Prometheus Registry
    DRT->>NClient: register metrics callback
    loop on client stats update
        NClient-->>DRT: ClientStats snapshot
        DRT->>Prom: Update nats_client_* gauges (in/out bytes, msgs, connects, state)
    end
```
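As a rough illustration of the snapshot-on-scrape flow in the second diagram, here is a minimal Rust sketch. The gauge handles and the place this function would be hooked in are assumptions; only the async-nats `statistics()`/`connection_state()` calls and the prometheus `IntGauge` API are taken from the crates this PR uses.

```rust
// Sketch only: snapshot async-nats client statistics into Prometheus gauges.
// The function signature and gauge wiring are illustrative, not the PR's code.
use std::sync::atomic::Ordering;

use async_nats::connection::State;
use prometheus::IntGauge;

fn snapshot_nats_client_stats(
    client: &async_nats::Client,
    in_bytes: &IntGauge,
    connection_state: &IntGauge,
) {
    // The statistics counters are AtomicU64; Relaxed loads are fine for a
    // point-in-time snapshot taken at scrape time.
    let stats = client.statistics();
    in_bytes.set(stats.in_bytes.load(Ordering::Relaxed) as i64);

    connection_state.set(match client.connection_state() {
        State::Connected => 1,
        _ => 0, // treat Disconnected/Pending (and any future states) as "down"
    });
}
```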
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 1
🔭 Outside diff range comments (1)
lib/runtime/src/distributed.rs (1)
31-31: Remove stray import that shadows symbols and may confuse macro calls

This import appears unused and also shadows the `error` symbol name used elsewhere, increasing the chance of calling the wrong thing.

```diff
- use figment::error;
```
🧹 Nitpick comments (14)
lib/runtime/src/distributed.rs (1)
273-286: Optional: downrank duplicate metric registration to debug to reduce signal noise

You're already logging errors with context, but PR goals mention better duplicate-metric handling. Consider downranking "duplicate" cases to debug to avoid log spam during test parallelization.

```diff
-            Err(e) => {
-                let error_msg = e.to_string();
-                tracing::error!(
+            Err(e) => {
+                let error_msg = e.to_string();
+                let is_dup = error_msg.contains("Duplicate metrics") || error_msg.contains("already registered");
+                let level = if is_dup { tracing::Level::DEBUG } else { tracing::Level::ERROR };
+                tracing::event!(
+                    level,
                     hierarchy = ?hierarchy,
                     error = ?error_msg,
                     metric_name = ?metric_name,
                     "Metric registration failed"
                 );
                 Err(e.into())
             }
```

lib/runtime/src/transports/nats.rs (2)
547-551: Help text vs metric name mismatch (OUT_OVERHEAD_BYTES)

You're registering OUT_OVERHEAD_BYTES but describing it as "Total number of bytes sent by NATS client." If the metric truly tracks protocol overhead only, the help should reflect that. If it's total bytes, the metric name should be OUT_TOTAL_BYTES.

```diff
-        let out_bytes = drt.create_intgauge(
-            nats_metrics::OUT_OVERHEAD_BYTES,
-            "Total number of bytes sent by NATS client",
+        let out_bytes = drt.create_intgauge(
+            nats_metrics::OUT_OVERHEAD_BYTES,
+            "Total number of protocol-overhead bytes sent by NATS client",
             &[],
         )?;
```

If the intent is total bytes (not overhead), rename the constant used here to the appropriate one from nats_client and keep the original help text.
568-571: Align connection_state help with actual states; handle future variants defensively

The help mentions "2=reconnecting" but the code maps only 0/1 and doesn't emit 2. If async-nats only exposes Connected/Disconnected/Pending, keep it 0/1 and update the help accordingly; otherwise add a mapping for other states.

```diff
-        let connection_state = drt.create_intgauge(
-            nats_metrics::CONNECTION_STATE,
-            "Current connection state of NATS client (0=disconnected, 1=connected, 2=reconnecting)",
+        let connection_state = drt.create_intgauge(
+            nats_metrics::CONNECTION_STATE,
+            "Current connection state of NATS client (0=disconnected, 1=connected)",
             &[],
         )?;
```

```diff
-        let connection_state = match self.nats_client.connection_state() {
-            State::Connected => 1,
-            // treat Disconnected and Pending as "down"
-            State::Disconnected | State::Pending => 0,
-        };
+        let connection_state = match self.nats_client.connection_state() {
+            State::Connected => 1,
+            _ => 0, // treat all other states as "down"
+        };
```

Also applies to: 595-601
lib/runtime/src/system_status_server.rs (1)
81-102: Duplicate-metric fallback creates a gauge that isn't registered

When create_gauge returns a duplicate, a "dummy" Gauge is constructed and stored. That gauge is not registered in any registry, so update_uptime_gauge() will update a metric that will never be scraped.
If the goal is only to ensure presence of the metric in tests, this is fine. If you intend to keep the uptime value updated when duplicates happen, consider:
- Storing Option<Gauge> and skipping updates when None (sketched below), or
- Centralizing uptime gauge ownership (create once at DRT init and pass a handle), or
- Providing a shared Arc<Gauge> from the registry (if your registry supports retrieving by name), or
- Creating a synthetic callback that updates uptime via a callback instead of a held Gauge handle.
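A minimal sketch of the first option (an optional gauge handle), assuming the prometheus crate's `Gauge`; the struct and field names are illustrative, not the PR's actual SystemStatusState:

```rust
use std::time::Instant;

use prometheus::Gauge;

struct UptimeMetric {
    // None when registration hit a duplicate; updates are skipped in that case
    // because the already-registered gauge has another owner updating it.
    gauge: Option<Gauge>,
    start_time: Instant,
}

impl UptimeMetric {
    fn update(&self) {
        if let Some(g) = &self.gauge {
            g.set(self.start_time.elapsed().as_secs_f64());
        }
    }
}
```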
lib/runtime/src/metrics.rs (1)
1270-1271: Note on uptime metric default value

Expected value "0" in the DRT output is fine given it's set immediately on scrape. If you later change scrape ordering to update before gather, tests might need a small tolerance.
lib/runtime/src/component.rs (2)
274-284: Background scraping uses DRT runtime, interval, and reset-on-error — solid
- Using the DRT’s secondary runtime avoids “no runtime in context” footguns.
- interval.tick() with MissedTickBehavior::Skip is the right choice (see the sketch after this comment).
- Resetting metrics to zero on error prevents stale values.
Two small nits:
- The docstring still mentions “approximately 4.7 seconds” while MAX_DELAY_MS is 777ms.
- Consider a backoff or log downranking if errors are continuous to avoid log noise.
```diff
-    /// and updates the corresponding Prometheus metrics. The scraping interval is set to
-    /// approximately 4.7 seconds to ensure fresh data is available for each Prometheus
-    /// polling cycle (which occurs every 6 seconds by default).
+    /// and updates the corresponding Prometheus metrics. The scraping interval is set to
+    /// ~0.777 seconds to provide fresh data well within typical Prometheus scrape intervals.
```

Also applies to: 300-309, 313-329
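For reference, a minimal sketch of the interval pattern described above, spawned on a runtime handle such as the DRT's secondary runtime; `scrape_once` and the 777 ms constant placement are placeholders, not the PR's exact code:

```rust
use std::time::Duration;

use tokio::time::{interval, MissedTickBehavior};

fn start_periodic_scrape(handle: tokio::runtime::Handle) {
    handle.spawn(async move {
        let mut ticker = interval(Duration::from_millis(777));
        // Skip missed ticks instead of bursting to catch up after a stall.
        ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
        loop {
            ticker.tick().await;
            if let Err(err) = scrape_once().await {
                // On failure: log and let the reset-to-zero path clear stale gauges.
                tracing::warn!(%err, "NATS service stats scrape failed");
            }
        }
    });
}

async fn scrape_once() -> anyhow::Result<()> {
    // Placeholder for the $SRV.STATS request and gauge updates.
    Ok(())
}
```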
583-604: Improve duplicate-metric error detection robustness

String-matching "Duplicate metrics" is brittle. Consider accepting other common phrasings like "already registered".
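If the underlying registry error type is reachable, a typed check would be sturdier still. This is only a sketch, assuming the error is an `anyhow::Error` wrapping `prometheus::Error`, which may not hold here; otherwise the string match below remains the fallback.

```rust
// Sketch only: prefer matching the typed duplicate-registration error when possible.
fn is_duplicate_registration(err: &anyhow::Error) -> bool {
    matches!(
        err.downcast_ref::<prometheus::Error>(),
        Some(prometheus::Error::AlreadyReg)
    )
}
```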
```diff
-    if let Err(err) = component.start_scraping_nats_service_component_metrics() {
-        let error_str = err.to_string();
-
-        // Check if this is a duplicate metrics registration (expected in some cases)
-        // or a different error (unexpected)
-        if error_str.contains("Duplicate metrics") {
+    if let Err(err) = component.start_scraping_nats_service_component_metrics() {
+        let error_str = err.to_string();
+        // Handle duplicate metrics registration as a non-critical path
+        if error_str.contains("Duplicate metrics") || error_str.contains("already registered") {
             // This is not a critical error because it's possible for multiple Components
             // with the same service_name to register metrics callbacks.
             tracing::debug!(
                 "Duplicate metrics registration for component '{}' (expected when multiple components share the same service_name): {}",
                 component.service_name(),
                 error_str
             );
         } else {
             // This is unexpected and should be more visible
             tracing::warn!(
                 "Failed to start scraping metrics for component '{}': {}",
                 component.service_name(),
                 err
             );
         }
     }
```

lib/runtime/src/service.rs (4)
339-347: Centralize the "service_name" label via a shared constant for consistency

We already have a labels module for standardized label keys. Prefer using a constant (e.g., labels::SERVICE_NAME) instead of a raw "service_name" string literal to avoid drift and enable global updates.
Apply this diff in this file:
```diff
-        // Build labels: service_name first, then component's labels
-        let mut labels_vec = vec![("service_name", service_name.as_str())];
+        // Build labels: service_name first, then component's labels
+        let mut labels_vec = vec![(prometheus_names::labels::SERVICE_NAME, service_name.as_str())];
```

And add the missing constant in the prometheus_names.rs labels module (see comment in that file).
399-433: Remove unused processing_time_samples and simplify average calculation

processing_time_samples is never used in the final computation. The guard can be simplified to total_requests > 0, since that already implies at least one non-zero sample.

```diff
-        // Variables ordered to match NatsStatsMetrics fields
-        let mut processing_time_samples = 0u64; // for average_processing_time calculation
+        // Variables ordered to match NatsStatsMetrics fields
         let mut total_errors = 0u64; // maps to: num_errors
         let mut total_requests = 0u64; // maps to: num_requests
         let mut total_processing_time_nanos = 0u64; // maps to: processing_time (nanoseconds from NATS)
         let mut endpoint_count = 0u64; // for derived metrics
@@
-                if stats.num_requests > 0 {
-                    processing_time_samples += 1;
-                }
+                // no need to track sample counts; we compute per-request average across all endpoints
@@
-        if processing_time_samples > 0 && total_requests > 0 {
+        if total_requests > 0 {
             let avg_time_nanos = total_processing_time_nanos as f64 / total_requests as f64;
             let avg_time_ms = avg_time_nanos / 1_000_000.0; // Convert nanoseconds to milliseconds
             self.service_avg_processing_ms.set(avg_time_ms);
         } else {
             self.service_avg_processing_ms.set(0.0);
         }
```
171-205: Consider truncating payload in debug logs to avoid log bloat

When decoding fails, the full payload is logged at debug. If these payloads can be large, truncate or log length to keep logs manageable.

```diff
-                let payload = String::from_utf8_lossy(&message.payload);
-                tracing::debug!(%err, service_name, %payload, "error decoding service info");
+                let payload = String::from_utf8_lossy(&message.payload);
+                let truncated = if payload.len() > 512 { format!("{}…", &payload[..512]) } else { payload.to_string() };
+                tracing::debug!(%err, service_name, payload_len = payload.len(), %truncated, "error decoding service info");
```
237-256: Test fixture units are inconsistent with comments

Comments state 0.1ms = 100_000ns for average_processing_time, but processing_time values are set to 100 (ns). Consider aligning processing_time to 100_000 for clarity.
Also applies to: 271-290
lib/runtime/src/metrics/prometheus_names.rs (3)
23-33: Add SERVICE_NAME to standardized label keys

Service-level metrics introduced the "service_name" label. Add it to the shared labels module to prevent string-literal drift.

```diff
 pub mod labels {
     /// Label for component identification
     pub const COMPONENT: &str = "dynamo_component";
@@
     /// Label for endpoint identification
     pub const ENDPOINT: &str = "dynamo_endpoint";
+
+    /// Label for NATS service identification (used by nats_service metrics)
+    pub const SERVICE_NAME: &str = "service_name";
 }
```
35-36: Minor doc nit: stray parenthesis

Drop the trailing parenthesis for cleaner docs.

```diff
-/// NATS client metrics. DistributedRuntime contains a NATS client shared by all children)
+/// NATS client metrics. DistributedRuntime contains a NATS client shared by all children.
```

96-102: Fix unit comments: These constants represent milliseconds, not nanoseconds

The comments on COMPONENT_NATS_METRICS still reference nanoseconds, but the constant names explicitly say "_ms". Align the doc comments to avoid confusion.

```diff
 pub const COMPONENT_NATS_METRICS: &[&str] = &[
-    nats_service::AVG_PROCESSING_MS,   // maps to: average_processing_time (nanoseconds)
-    nats_service::TOTAL_ERRORS,        // maps to: num_errors
-    nats_service::TOTAL_REQUESTS,      // maps to: num_requests
-    nats_service::TOTAL_PROCESSING_MS, // maps to: processing_time (nanoseconds)
-    nats_service::ACTIVE_SERVICES,     // derived from ServiceSet.services
-    nats_service::ACTIVE_ENDPOINTS,    // derived from ServiceInfo.endpoints
+    nats_service::AVG_PROCESSING_MS,   // maps to: average_processing_time (milliseconds)
+    nats_service::TOTAL_ERRORS,        // maps to: num_errors
+    nats_service::TOTAL_REQUESTS,      // maps to: num_requests
+    nats_service::TOTAL_PROCESSING_MS, // maps to: processing_time (milliseconds)
+    nats_service::ACTIVE_SERVICES,     // derived from ServiceSet.services
+    nats_service::ACTIVE_ENDPOINTS,    // derived from ServiceInfo.endpoints
 ];
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- lib/runtime/src/component.rs (5 hunks)
- lib/runtime/src/distributed.rs (3 hunks)
- lib/runtime/src/metrics.rs (13 hunks)
- lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
- lib/runtime/src/service.rs (3 hunks)
- lib/runtime/src/system_status_server.rs (2 hunks)
- lib/runtime/src/transports/nats.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
- lib/runtime/src/service.rs
- lib/runtime/src/transports/nats.rs
- lib/runtime/src/metrics.rs
- lib/runtime/src/distributed.rs
🧬 Code Graph Analysis (7)
lib/runtime/src/system_status_server.rs (1)
- lib/runtime/src/distributed.rs (2): new (47-165), nats_client (238-240)

lib/runtime/src/service.rs (2)
- lib/runtime/src/component.rs (5): new (564-570), component (426-428), component (573-607), service_name (204-207), labels (226-228)
- lib/llm/src/kv_router/metrics_aggregator.rs (2): new (41-55), new (72-86)

lib/runtime/src/metrics/prometheus_names.rs (2)
- lib/runtime/src/distributed.rs (1): nats_client (238-240)
- lib/runtime/src/metrics.rs (2): DRT_NATS_METRICS (1365-1368), COMPONENT_NATS_METRICS (1433-1436)

lib/runtime/src/component.rs (2)
- lib/runtime/src/service.rs (2): new (46-48), new (338-395)
- lib/llm/src/local_model.rs (1): service_name (296-298)

lib/runtime/src/transports/nats.rs (2)
- lib/runtime/src/pipeline/network.rs (1): metrics (338-340)
- lib/runtime/src/distributed.rs (1): nats_client (238-240)

lib/runtime/src/metrics.rs (4)
- lib/runtime/src/metrics/prometheus_names.rs (1): build_metric_name (10-12)
- lib/runtime/src/distributed.rs (3): nats_client (238-240), runtime (178-180), new (47-165)
- lib/runtime/src/transports/nats.rs (2): new (380-394), new (541-582)
- lib/runtime/src/system_status_server.rs (3): new (40-45), new (79-109), drt (126-128)

lib/runtime/src/distributed.rs (2)
- lib/runtime/src/transports/nats.rs (2): new (380-394), new (541-582)
- lib/runtime/src/system_status_server.rs (3): new (40-45), new (79-109), port (55-57)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (17)
lib/runtime/src/distributed.rs (1)
98-113: Good move: client metrics callback wiring is clean and side-effect free

Creating DRTNatsClientPrometheusMetrics and registering a lightweight callback that snapshots client stats on scrape aligns with the "snapshot on demand" model and avoids background polling here. Looks solid.
lib/runtime/src/transports/nats.rs (3)
47-47: Namespace swap to nats_client is consistent

Importing metric names from prometheus_names::nats_client matches the client-side metrics split. Good.

523-537: Client metrics surface looks right for snapshotting

Publishing gauges for in/out bytes, messages, connects, and connection_state is appropriate since these are cumulative snapshots from the client.

584-610: Snapshot update logic is simple and correct

Relaxed loads from the client statistics and direct gauge sets are appropriate for scrape-time snapshots. No locking here is good.
lib/runtime/src/system_status_server.rs (2)
270-296: Metrics handler orchestrates a single scrape well

Updating uptime, executing callbacks, then gathering the DRT registry keeps scrape-time computations in one place. Looks good.

407-437: Tests: namespace and filtering updates match the new client/service split

Asserting presence of dynamo_component_uptime_seconds and filtering both nats_client and nats_service prefixes aligns with the refactor and reduces flakiness. Good adjustment.
Also applies to: 415-422, 425-437
lib/runtime/src/metrics.rs (4)
35-38: Import split (nats_client, nats_service) is consistent across helpers and tests

The refactor of imports and constants usage reflects the two-namespace model cleanly.

619-645: Filtering helpers correctly exclude/include both NATS namespaces

These helpers now account for both client and service prefixes and keep the tests readable.

585-591: Single-threaded runtime for tests reduces contention

Using a single-threaded runtime and blocking to construct a DRT improves test isolation and reliability (see the sketch below).
Also applies to: 598-602
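A minimal sketch of the single-threaded test-runtime pattern noted above; `build_drt` is a hypothetical stand-in for the DRT constructor used in the tests:

```rust
fn single_thread_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_current_thread()
        .enable_all() // timers + I/O for the NATS/etcd clients
        .build()
        .expect("failed to build test runtime")
}

#[test]
fn builds_drt_on_single_thread() {
    let rt = single_thread_runtime();
    rt.block_on(async {
        // let _drt = build_drt().await.unwrap(); // construct the DRT here in a real test
    });
}
```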
1459-1473: NATS metrics tests match the new surface and are robust
- Verifying metric name sets via DRT_NATS_METRICS and COMPONENT_NATS_METRICS is a nice invariance test.
- Value-range assertions are wide enough to avoid flakiness for byte counts and message counts. Good balance.
Also applies to: 1460-1466, 1534-1559, 1631-1656
lib/runtime/src/component.rs (2)
51-51: Import to server-side metrics type reflects split

Switching to ComponentNatsServerPrometheusMetrics matches the service-side namespace.

226-229: labels() accessor is helpful for composable label construction

Exposing labels as a slice enables reuse in service metrics initialization without allocations.
lib/runtime/src/service.rs (3)
351-386: Metric registration pattern and label construction look correct
- Using nats_service::* constants aligns with the new module split.
- Label composition with service_name followed by component.labels() is clear and keeps cardinality predictable.
434-440: Aggregations and unit conversions are sound
- Errors/requests are summed as IntGauge values.
- processing_time is converted from nanos to ms before setting the IntGauge.
- Active services/endpoints derived correctly from the ServiceSet.
443-450: Reset API is complete and consistent

All service_* metrics are reset; this prevents stale values during gaps in scraping.
lib/runtime/src/metrics/prometheus_names.rs (2)
59-81: LGTM: Clear separation between client and service namespaces

The split into nats_client and nats_service is clean and improves discoverability. Names are explicit and prefixed consistently.

83-91: Client metrics set looks good

The selected client-side metrics match typical NATS client observability needs and the new naming scheme.
- Add macro-based prefix generation for nats_client and nats_service metrics
- Update component scraping interval from 777ms to 873ms
- Convert failing integration tests to async tokio::test format
- Improve error handling in distributed runtime start time initialization
- Fix test helper functions to use proper async runtime context
rmccorm4 left a comment
Approving to unblock - looking forward to the other follow-ups from comments on test case, dummy gauge, max delay number, etc. - thanks for the improvements!
…r, fix tests sharing environment variables (temp_env) (#2506)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Overview:
Follow-up to #2480 with improvements and feedback
Switched to DRT runtime spawn and interval.tick(), refactored NATS metrics organization, and fixed test parallelization issues.
Details:
- nats_client and nats_service modules for clearer separation of concerns
- service_name and component-specific labels for better observability
- uptime_seconds metric is consistently registered
- temp_env for proper environment variable isolation (see the sketch below)

Where should the reviewer start?
- lib/runtime/src/component.rs - Check the improved error handling for duplicate metrics
- lib/runtime/src/metrics/prometheus_names.rs - See the new module structure for NATS metrics
- lib/runtime/src/distributed.rs - Review the uptime_seconds metric registration logic

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Fixes DIS-445
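For readers unfamiliar with temp_env, a minimal sketch of the environment-isolation pattern mentioned in the Details above; the NATS_SERVER variable name and the test body are illustrative, not the PR's actual helpers:

```rust
#[test]
fn drt_reads_isolated_env() {
    temp_env::with_var("NATS_SERVER", Some("nats://127.0.0.1:4222"), || {
        // The variable exists only for the duration of this closure, and temp_env
        // serializes env mutation internally, so parallel tests no longer race on
        // the process-wide environment.
        assert_eq!(
            std::env::var("NATS_SERVER").unwrap(),
            "nats://127.0.0.1:4222"
        );
    });
    // Outside the closure the previous value (or its absence) is restored.
}
```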
Summary by CodeRabbit
New Features
Refactor
Bug Fixes