fix: reduce nats stats query frequency #2847
Conversation
Walkthrough

Shifts devcontainer naming from local-dev to dev across docs/config and build stages. Updates container build logic to use TARGET=dev for UID/GID args. Modifies Component::start_scraping_nats_service_component_metrics to use exponential backoff (500ms → max ~9.8s) and a 300ms NATS timeout, replacing the previous fixed-interval constants.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Component
    participant T as Interval (exp backoff)
    participant N as NATS
    Note over C,T: Start scraping NATS service component metrics
    C->>T: init wait = 500ms
    loop until shutdown
        T-->>C: tick
        C->>N: request metrics (timeout=300ms)
        alt success or timeout/error
            N-->>C: response or timeout
            Note over C: compute next wait = min(wait*2, ~9.8s)
            C->>T: reinit interval if wait changed
        end
    end
```
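The backoff arithmetic in the diagram can be sketched in plain Rust. Constant names follow the values discussed in this PR (500ms initial wait, ~9.8s cap), but this is an illustrative sketch, not the actual component.rs implementation:

```rust
use std::time::Duration;

// Illustrative constants mirroring the values described in the PR.
const INITIAL_WAIT: Duration = Duration::from_millis(500);
const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Next scrape wait: double the current wait, capped at MAX_WAIT.
fn next_wait(current: Duration) -> Duration {
    (current * 2).min(MAX_WAIT)
}

fn main() {
    // Walk the schedule from the initial wait until it saturates.
    let mut wait = INITIAL_WAIT;
    let mut schedule = vec![wait];
    while wait < MAX_WAIT {
        wait = next_wait(wait);
        schedule.push(wait);
    }
    let ms: Vec<u128> = schedule.iter().map(|d| d.as_millis()).collect();
    println!("{ms:?}"); // [500, 1000, 2000, 4000, 8000, 9800]
}
```

Note the cap is reached after five doublings, so a long-lived component settles at one scrape per ~9.8s.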
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.devcontainer/README.md (2)
44-46: Update the Mermaid diagram to match the new image/tag.

The diagram still shows dynamo:latest-vllm-local-dev. Suggest:

```diff
- IMAGE["Docker Image<br/>dynamo:latest-vllm-local-dev"]
+ IMAGE["Docker Image<br/>dynamo:latest-vllm"]
```
375-383: Fix the troubleshooting command that still uses --target local-dev.

```diff
- ./container/build.sh --target local-dev
+ ./container/build.sh --target dev
```

container/build.sh (1)
606-610: Tag mismatch for dev builds: produce dynamo:latest-vllm (no -dev) when TARGET=dev.

Devcontainer expects dynamo:latest-vllm. With --target dev, we currently tag latest-vllm-dev. Fix by omitting the suffix when TARGET=dev:

```diff
 LATEST_TAG="--tag dynamo:latest-${FRAMEWORK,,}"
-if [ -n "${TARGET}" ]; then
-    LATEST_TAG="${LATEST_TAG}-${TARGET}"
-fi
+if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+    LATEST_TAG="${LATEST_TAG}-${TARGET}"
+fi
```
🧹 Nitpick comments (7)
.devcontainer/README.md (1)
240-259: Re-check SGLANG instructions for local-dev vs dev naming.If the SGLANG Dockerfile stage has also been renamed to dev, update these commands and the example image accordingly; if not, add a clarifying note that SGLANG still uses local-dev. Avoid mixed conventions across frameworks.
Would you like me to scan the repo and list all remaining “local-dev” references to update?
container/build.sh (1)
353-359: Optional: keep versioned tag without -dev for TARGET=dev as well.If you also want the versioned tag to match devcontainer pull patterns, mirror the LATEST_TAG logic here too:
```diff
-    if [ -n "${TARGET}" ]; then
-        TAG="${TAG}-${TARGET}"
-    fi
+    if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+        TAG="${TAG}-${TARGET}"
+    fi
```

container/Dockerfile.vllm (1)
343-345: Prefer no default for USER_UID/USER_GID to avoid silent permission mismatches.

Team practice (see prior learnings) was to require explicit UID/GID. Consider removing defaults:

```diff
-ARG USER_UID=1000
-ARG USER_GID=1000
+ARG USER_UID
+ARG USER_GID
```

build.sh already injects these for TARGET=dev.
lib/runtime/src/component.rs (4)
276-279: Doc/comment consistency: clarify when we back off.Here it says “regardless of success or failure,” but the PR summary says “after each successful scrape.” Please pick one behavior and document it precisely.
280-283: Nit: names imply milliseconds but values are Durations.Consider dropping the “_MS” suffix to avoid confusion:
```diff
-const NATS_TIMEOUT_MS: std::time::Duration = std::time::Duration::from_millis(300);
-const INITIAL_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(500);
-const MAX_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(9800);
+const NATS_TIMEOUT: std::time::Duration = std::time::Duration::from_millis(300);
+const INITIAL_WAIT: std::time::Duration = std::time::Duration::from_millis(500);
+const MAX_WAIT: std::time::Duration = std::time::Duration::from_millis(9_800);
```
298-301: Stale comment: scraping “every 5 seconds.”This no longer applies with exponential backoff. Update or remove to avoid confusion.
316-333: Optional: add jitter and a reset policy.To avoid thundering herd across many components and to regain responsiveness after idle periods, consider:
- Small random jitter (e.g., ±10%) on current_wait.
- Reset current_wait back to INITIAL_WAIT after N errored scrapes in a row or when stats indicate load changes.
Happy to draft this if you want it in this PR.
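The jitter idea above can be sketched as follows. The ±10% figure comes from this comment; the function name is hypothetical, and the jitter fraction is passed as a parameter (rather than drawn from an RNG) purely to keep the sketch deterministic:

```rust
use std::time::Duration;

const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Apply a jitter fraction (clamped to ±10%) to a wait duration,
/// keeping the result within MAX_WAIT. In real code `frac` would
/// come from an RNG on each tick.
fn jittered(wait: Duration, frac: f64) -> Duration {
    let ms = wait.as_millis() as f64 * (1.0 + frac.clamp(-0.10, 0.10));
    Duration::from_millis(ms as u64).min(MAX_WAIT)
}

fn main() {
    println!("{:?}", jittered(Duration::from_millis(2_000), 0.10)); // 2.2s
    println!("{:?}", jittered(Duration::from_millis(9_800), 0.10)); // clamped: 9.8s
}
```

Spreading each component's wait by a small random factor like this is enough to de-synchronize scrapers that all started at the same time.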
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- .devcontainer/README.md (1 hunks)
- .devcontainer/devcontainer.json (1 hunks)
- container/Dockerfile.vllm (1 hunks)
- container/build.sh (1 hunks)
- lib/runtime/src/component.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
Applied to files:
- container/build.sh
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
container/build.sh (1)
468-471: Good: pass USER_UID/USER_GID only for TARGET=dev.This aligns the UID/GID mapping with dev builds without affecting runtime images.
.devcontainer/devcontainer.json (1)
10-10: Ensure the built tag matches this image reference.With my build.sh patch, --target dev will produce dynamo:latest-vllm as expected. Without it, this would fail to find the image locally (it would be tagged latest-vllm-dev).
container/Dockerfile.vllm (1)
297-297: Stage rename looks good.FROM runtime AS dev matches the rest of this PR’s naming and the build script.
Is my understanding correct that this would always do an interval doubling regardless of success or failure? And eventually the scrape interval would be fixed at 9.8s regardless. Would this be responsive enough for monitoring?

Our Prometheus is polling at a 6 second interval, so it should be acceptable. Let's see if anyone complains and we can adjust this.
Split NATS_TIMEOUT_AND_INITIAL_DELAY_MS into separate constants:

- NATS_TIMEOUT_MS (300ms) for the NATS operation timeout
- INITIAL_WAIT_MS (500ms) for the initial scraping interval
- MAX_WAIT_MS (9800ms) for the maximum interval

The scraping interval now starts at 500ms and doubles after each scrape up to 9.8 seconds, reducing load while maintaining responsiveness.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Remove INITIAL_WAIT_MS constant and exponential backoff
- Use fixed 9.8s interval (MAX_WAIT_MS) for all scrapes
- Perform the first scrape immediately for faster metrics availability
- Inline the NATS timeout constant (500ms)
- Update tests to use 500ms waits since metrics are available immediately

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
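The final behavior — first scrape immediately, then a fixed ~9.8s cadence — can be modeled as a schedule of offsets from task start. The names here are illustrative, not the actual constants in component.rs:

```rust
use std::time::Duration;

// Fixed scrape interval from the final revision of this PR.
const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Offsets (from task start) of the first `n` scrapes: the first fires
/// immediately, each later one a fixed MAX_WAIT after the previous.
fn scrape_offsets(n: u32) -> Vec<Duration> {
    (0..n).map(|i| MAX_WAIT * i).collect()
}

fn main() {
    println!("{:?}", scrape_offsets(3)); // [0ns, 9.8s, 19.6s]
}
```

The zero offset for the first scrape is what lets the updated tests assert metrics availability after only 500ms instead of waiting out a full interval.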
Force-pushed from 7bab88e to d72f48c
grahamking left a comment
Nice and simple, love it.
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
This PR optimizes NATS metrics scraping by implementing a simplified interval strategy that reduces system load while maintaining responsive metrics collection. The changes replace complex exponential backoff logic with a straightforward fixed-interval approach.
Details:
Where should the reviewer start?
lib/runtime/src/component.rs- Main changes to the NATS metrics scraping logic in thestart_scraping_nats_service_component_metricsmethodlib/runtime/src/metrics.rs- Updated test timing to reflect the new immediate metrics availabilityRelated Issues: