Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Aug 16, 2025

Overview:

Fix metrics collection timeouts by replacing the synchronous metrics callback with a background scraping task, and resolve various test failures in the metrics system.

Details:

  • Replace metrics callback with background scraping task to prevent timeouts
  • Fix IntGaugeVec cardinality mismatch in Prometheus metrics tests
  • Change metric name from dynamo_component_dynamo_uptime_seconds to dynamo_component_uptime_seconds
  • Update test expectations to match actual Prometheus output format
  • Fix system_metrics integration test method name from system_status_info to system_status_server_info

Where should the reviewer start?

  • lib/runtime/src/distributed.rs - Review the removed duplicate metric registration check
  • lib/runtime/src/component.rs - Look at the new start_scraping_metrics method
  • lib/runtime/src/metrics.rs - Check the fixed test expectations and IntGaugeVec usage

Related Issues:

Closes DIS-445

Summary by CodeRabbit

  • New Features
    • Metrics are now scraped continuously in the background for more timely and resilient updates.
    • Prometheus outputs include automatic labels (namespace, component, endpoint) and expanded metric series.
  • Refactor
    • Metric registration now permits multiple series sharing a name when labels differ.
    • Uptime metric renamed to uptime_seconds for consistent Prometheus naming.

…eouts

- Replace metrics callback with background scraping task to prevent timeouts
- Fix IntGaugeVec cardinality mismatch in Prometheus metrics tests
- Change metric name from dynamo_component_dynamo_uptime_seconds to dynamo_component_uptime_seconds
- Update test expectations to match actual Prometheus output format
- Fix system_metrics integration test method name from system_status_info to system_status_server_info
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 16, 2025

Walkthrough

Refactors component metrics collection to a dedicated background scraping thread, adjusts metrics registration to allow duplicate names with differing labels, updates metric labeling/naming conventions and related tests, and renames/updates system status API usage in an integration test.

Changes

Cohort / File(s) Summary of changes
Component metrics scraping refactor
lib/runtime/src/component.rs
Replaces add_metrics_callback with start_scraping_metrics. Spawns a background thread with its own Tokio runtime to periodically scrape stats, update Prometheus metrics, reset on failures, and apply exponential backoff. Call sites now use start_scraping_metrics.
Metrics registration behavior
lib/runtime/src/distributed.rs
Removes pre-check blocking duplicate metric-name registration; updates docs to clarify duplicates allowed when labels differ. Keeps signature and error handling unchanged.
Unified auto-labeling and test alignment
lib/runtime/src/metrics.rs
Updates test expectations/output parsing for auto-injected labels (namespace, component, endpoint), widened ranges, additional metrics, and histogram buckets; minor test stability tweaks.
System status server metric rename
lib/runtime/src/system_status_server.rs
Renames uptime metric from dynamo_uptime_seconds to uptime_seconds, affecting emitted name (now dynamo_component_uptime_seconds).
Integration test API rename
lib/runtime/examples/system_metrics/tests/integration_test.rs
Switches distributed.system_status_info() to distributed.system_status_server_info(); test logic otherwise unchanged.

Sequence Diagram(s)

sequenceDiagram
  participant App as Namespace/Component init
  participant Comp as Component
  participant Scraper as Background Thread
  participant RT as Local Tokio Runtime
  participant Metrics as Prometheus Metrics

  App->>Comp: start_scraping_metrics()
  Comp->>Scraper: spawn thread with metrics handles
  Scraper->>RT: build local runtime
  loop periodic (delay/backoff)
    Scraper->>Comp: scrape_stats(timeout=300ms)
    alt success
      Scraper->>Metrics: update_from_service_set()
    else failure
      Scraper->>Metrics: reset to zeros
      Scraper->>Scraper: increase backoff (cap MAX_DELAY)
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps the metric keys,
Threads awake with scraping breeze.
Labels hop on every line,
Uptime sings in cleaner rhyme.
Status server, renamed true—
Carrots counted, fresh and new.
Prometheus smiles: “Good chew!” 🥕

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fb10ffb and c2dcdb1.

📒 Files selected for processing (5)
  • lib/runtime/examples/system_metrics/tests/integration_test.rs (1 hunks)
  • lib/runtime/src/component.rs (4 hunks)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/metrics.rs (10 hunks)
  • lib/runtime/src/system_status_server.rs (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
lib/runtime/src/system_status_server.rs (1)
lib/runtime/src/component.rs (3)
  • drt (171-173)
  • drt (385-387)
  • drt (533-535)
lib/runtime/src/metrics.rs (1)
lib/runtime/src/metrics/prometheus_names.rs (1)
  • build_metric_name (10-12)
lib/runtime/src/component.rs (1)
lib/runtime/src/metrics.rs (1)
  • new (1470-1472)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (14)
lib/runtime/src/system_status_server.rs (2)

80-85: LGTM!

The metric name change from dynamo_uptime_seconds to uptime_seconds is appropriate and aligns with the auto-labeling convention where the dynamo_component_ prefix is automatically added during metric creation.


365-368: Test expectations correctly updated.

The expected output properly reflects the new metric name dynamo_component_uptime_seconds with the auto-added prefix.

lib/runtime/src/distributed.rs (1)

250-275: Good documentation clarification and improved metric registration.

The updated documentation clearly explains that metrics with the same name but different labels can be registered, which is a valid Prometheus pattern. Removing the pre-check allows for more flexibility in metric registration while still maintaining proper error handling.

lib/runtime/src/component.rs (1)

580-586: Good non-blocking error handling.

The error handling approach is appropriate - metrics collection failures shouldn't block component creation, and the warning log provides sufficient visibility for debugging.

lib/runtime/examples/system_metrics/tests/integration_test.rs (1)

46-46: Method name change aligns with API updates.

The change from system_status_info() to system_status_server_info() correctly reflects the updated API naming in the distributed runtime.

lib/runtime/src/metrics.rs (9)

1092-1094: Test expectations correctly updated for auto-labeled metrics.

The test expectations properly reflect the new single-brace label format and auto-injected labels (dynamo_namespace, dynamo_component, dynamo_endpoint).


1121-1126: Consistent label formatting in component output.

The component-level metrics correctly show auto-labels without the endpoint label, as expected for component-scoped metrics.


1152-1160: Namespace metrics properly formatted.

The namespace-level metrics show only the namespace label, correctly reflecting the hierarchical auto-labeling approach.


1183-1184: IntGaugeVec label ordering is correct.

The label names ["instance", "status"] for the IntGaugeVec are properly ordered and will combine with auto-labels and const labels as expected.


1223-1258: Comprehensive histogram bucket test coverage.

The test properly validates all histogram buckets with auto-labels, ensuring the histogram metrics work correctly with the new labeling scheme.


1482-1483: Response handling simplified appropriately.

Using data.to_string() directly is cleaner than format!("{}", data) for string conversion.


1507-1508: Consistent parse_prometheus_metric API usage.

The filter_map correctly passes the line as &str to maintain consistency with the parse_prometheus_metric function signature.


1517-1529: Appropriate metric value ranges for testing.

The wide value ranges account for variability in NATS metrics during test execution, making the tests more stable. The ranges are reasonable given the non-deterministic nature of network metrics.

Also applies to: 1603-1616


1597-1600: Good stabilization wait for metrics.

Adding a 1-second wait allows metrics to stabilize before final validation, reducing test flakiness.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

- Remove unnecessary test delays (1000ms, 100ms) improving performance ~14x
- Enhance system status server test helpers with proper runtime management
- Use DRT's built-in system_status_server_info() instead of manual spawning
- Add 200/200 soak test to hit /health endpoint for reliability validation
- All 124 integration tests now passing
@keivenchang keivenchang enabled auto-merge (squash) August 18, 2025 17:07
@keivenchang keivenchang disabled auto-merge August 18, 2025 17:09
@keivenchang keivenchang merged commit 0444217 into main Aug 18, 2025
12 of 14 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS-445__NATS-metrics-collection-timedout branch August 18, 2025 19:38
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
#2480)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants