fix: replace metrics callback with background scraping to prevent tim… #2480
…eouts
- Replace metrics callback with background scraping task to prevent timeouts
- Fix IntGaugeVec cardinality mismatch in Prometheus metrics tests
- Change metric name from dynamo_component_dynamo_uptime_seconds to dynamo_component_uptime_seconds
- Update test expectations to match actual Prometheus output format
- Fix system_metrics integration test method name from system_status_info to system_status_server_info
Walkthrough
Refactors component metrics collection to a dedicated background scraping thread, adjusts metrics registration to allow duplicate names with differing labels, updates metric labeling/naming conventions and related tests, and renames/updates system status API usage in an integration test.
Sequence Diagram(s)
sequenceDiagram
    participant App as Namespace/Component init
    participant Comp as Component
    participant Scraper as Background Thread
    participant RT as Local Tokio Runtime
    participant Metrics as Prometheus Metrics
    App->>Comp: start_scraping_metrics()
    Comp->>Scraper: spawn thread with metrics handles
    Scraper->>RT: build local runtime
    loop periodic (delay/backoff)
        Scraper->>Comp: scrape_stats(timeout=300ms)
        alt success
            Scraper->>Metrics: update_from_service_set()
        else failure
            Scraper->>Metrics: reset to zeros
            Scraper->>Scraper: increase backoff (cap MAX_DELAY)
        end
    end
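The loop in the sequence diagram can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: the `Gauges` struct, `next_delay`, `apply_scrape`, `spawn_scraper`, the `INITIAL_DELAY` constant, and both delay values are hypothetical stand-ins, and resetting the delay after a successful scrape is an assumption the diagram does not state. The local Tokio runtime and the 300ms scrape timeout are omitted; the `scrape` closure stands in for `scrape_stats`.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Assumed values for illustration only; the PR names MAX_DELAY but its value is not shown.
const INITIAL_DELAY: Duration = Duration::from_millis(10);
const MAX_DELAY: Duration = Duration::from_millis(80);

/// Hypothetical stand-in for the Prometheus gauges updated by the scraper.
#[derive(Debug, Default, Clone, PartialEq)]
struct Gauges {
    requests_total: i64,
    inflight: i64,
}

/// On failure, double the delay and cap it at MAX_DELAY.
/// On success, fall back to INITIAL_DELAY (an assumption, see lead-in).
fn next_delay(current: Duration, succeeded: bool) -> Duration {
    if succeeded {
        INITIAL_DELAY
    } else {
        (current * 2).min(MAX_DELAY)
    }
}

/// Apply one scrape result: copy fresh values on success, reset to zeros on failure.
/// Returns the delay to wait before the next scrape.
fn apply_scrape(gauges: &mut Gauges, result: Result<Gauges, String>, delay: Duration) -> Duration {
    match result {
        Ok(fresh) => {
            *gauges = fresh;
            next_delay(delay, true)
        }
        Err(_) => {
            *gauges = Gauges::default();
            next_delay(delay, false)
        }
    }
}

/// Run `rounds` scrape iterations on a background thread and return the final delay.
/// A real scraper would loop until shutdown instead of counting rounds.
fn spawn_scraper<F>(gauges: Arc<Mutex<Gauges>>, mut scrape: F, rounds: usize) -> JoinHandle<Duration>
where
    F: FnMut() -> Result<Gauges, String> + Send + 'static,
{
    thread::spawn(move || {
        let mut delay = INITIAL_DELAY;
        for _ in 0..rounds {
            // Scrape before taking the lock so readers are never blocked on a slow scrape.
            let result = scrape();
            {
                let mut guard = gauges.lock().unwrap();
                delay = apply_scrape(&mut guard, result, delay);
            }
            thread::sleep(delay);
        }
        delay
    })
}
```

Because the scrape runs on its own thread, a request for metrics only reads whatever the last iteration stored, so a slow or failing scrape cannot stall that request. This is the motivation the PR gives for moving away from the callback.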
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
- Remove unnecessary test delays (1000ms, 100ms) improving performance ~14x
- Enhance system status server test helpers with proper runtime management
- Use DRT's built-in system_status_server_info() instead of manual spawning
- Add 200/200 soak test to hit /health endpoint for reliability validation
- All 124 integration tests now passing
#2480) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Overview:
Fix metrics collection timeouts by replacing the synchronous metrics callback with a background scraping task, and resolve various test failures in the metrics system.
Details:
Where should the reviewer start?
Related Issues:
Closes DIS-445