fix: reduce nats stats query frequency #2847
Conversation
Walkthrough

Shifts devcontainer naming from local-dev to dev across docs/config and build stages. Updates container build logic to use TARGET=dev for UID/GID args. Modifies Component::start_scraping_nats_service_component_metrics to use exponential backoff (500ms → max ~9.8s) and a 300ms NATS timeout, replacing the previous fixed-interval constants.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Component
    participant T as Interval (exp backoff)
    participant N as NATS
    Note over C,T: Start scraping NATS service component metrics
    C->>T: init wait = 500ms
    loop until shutdown
        T-->>C: tick
        C->>N: request metrics (timeout=300ms)
        alt success or timeout/error
            N-->>C: response or timeout
            Note over C: compute next wait = min(wait*2, ~9.8s)
            C->>T: reinit interval if wait changed
        end
    end
```
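The backoff arithmetic in the diagram can be sketched in plain Rust. Constant names follow the values discussed in this PR (500ms initial wait, ~9.8s cap), but this is an illustrative sketch, not the actual component.rs implementation:

```rust
use std::time::Duration;

// Illustrative constants mirroring the values described in the PR.
const INITIAL_WAIT: Duration = Duration::from_millis(500);
const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Next scrape wait: double the current wait, capped at MAX_WAIT.
fn next_wait(current: Duration) -> Duration {
    (current * 2).min(MAX_WAIT)
}

fn main() {
    // Walk the schedule from the initial wait until it saturates.
    let mut wait = INITIAL_WAIT;
    let mut schedule = vec![wait];
    while wait < MAX_WAIT {
        wait = next_wait(wait);
        schedule.push(wait);
    }
    let ms: Vec<u128> = schedule.iter().map(|d| d.as_millis()).collect();
    println!("{ms:?}"); // [500, 1000, 2000, 4000, 8000, 9800]
}
```

Note the cap is reached after five doublings, so a long-lived component settles at one scrape per ~9.8s.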
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.devcontainer/README.md (2)
44-46: Update the Mermaid diagram to match the new image/tag.

The diagram still shows dynamo:latest-vllm-local-dev. Suggest:

```diff
- IMAGE["Docker Image<br/>dynamo:latest-vllm-local-dev"]
+ IMAGE["Docker Image<br/>dynamo:latest-vllm"]
```
375-383: Fix the troubleshooting command that still uses --target local-dev.

```diff
- ./container/build.sh --target local-dev
+ ./container/build.sh --target dev
```

container/build.sh (1)
606-610: Tag mismatch for dev builds: produce dynamo:latest-vllm (no -dev) when TARGET=dev.

Devcontainer expects dynamo:latest-vllm. With --target dev, we currently tag latest-vllm-dev. Fix by omitting the suffix when TARGET=dev:

```diff
 LATEST_TAG="--tag dynamo:latest-${FRAMEWORK,,}"
-if [ -n "${TARGET}" ]; then
-    LATEST_TAG="${LATEST_TAG}-${TARGET}"
-fi
+if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+    LATEST_TAG="${LATEST_TAG}-${TARGET}"
+fi
```
🧹 Nitpick comments (7)
.devcontainer/README.md (1)
240-259: Re-check SGLANG instructions for local-dev vs dev naming.If the SGLANG Dockerfile stage has also been renamed to dev, update these commands and the example image accordingly; if not, add a clarifying note that SGLANG still uses local-dev. Avoid mixed conventions across frameworks.
Would you like me to scan the repo and list all remaining “local-dev” references to update?
container/build.sh (1)
353-359: Optional: keep versioned tag without -dev for TARGET=dev as well.If you also want the versioned tag to match devcontainer pull patterns, mirror the LATEST_TAG logic here too:
```diff
-    if [ -n "${TARGET}" ]; then
-        TAG="${TAG}-${TARGET}"
-    fi
+    if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+        TAG="${TAG}-${TARGET}"
+    fi
```

container/Dockerfile.vllm (1)
343-345: Prefer no default for USER_UID/USER_GID to avoid silent permission mismatches.

Team practice (see prior learnings) was to require explicit UID/GID. Consider removing defaults:

```diff
-ARG USER_UID=1000
-ARG USER_GID=1000
+ARG USER_UID
+ARG USER_GID
```

build.sh already injects these for TARGET=dev.
lib/runtime/src/component.rs (4)
276-279: Doc/comment consistency: clarify when we back off.Here it says “regardless of success or failure,” but the PR summary says “after each successful scrape.” Please pick one behavior and document it precisely.
280-283: Nit: names imply milliseconds but values are Durations.Consider dropping the “_MS” suffix to avoid confusion:
```diff
-const NATS_TIMEOUT_MS: std::time::Duration = std::time::Duration::from_millis(300);
-const INITIAL_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(500);
-const MAX_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(9800);
+const NATS_TIMEOUT: std::time::Duration = std::time::Duration::from_millis(300);
+const INITIAL_WAIT: std::time::Duration = std::time::Duration::from_millis(500);
+const MAX_WAIT: std::time::Duration = std::time::Duration::from_millis(9_800);
```
298-301: Stale comment: scraping “every 5 seconds.”This no longer applies with exponential backoff. Update or remove to avoid confusion.
316-333: Optional: add jitter and a reset policy.To avoid thundering herd across many components and to regain responsiveness after idle periods, consider:
- Small random jitter (e.g., ±10%) on current_wait.
- Reset current_wait back to INITIAL_WAIT after N errored scrapes in a row or when stats indicate load changes.
Happy to draft this if you want it in this PR.
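The jitter idea above can be sketched as follows. The ±10% figure comes from this comment; the function name is hypothetical, and the jitter fraction is passed as a parameter (rather than drawn from an RNG) purely to keep the sketch deterministic:

```rust
use std::time::Duration;

const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Apply a jitter fraction (clamped to ±10%) to a wait duration,
/// keeping the result within MAX_WAIT. In real code `frac` would
/// come from an RNG on each tick.
fn jittered(wait: Duration, frac: f64) -> Duration {
    let ms = wait.as_millis() as f64 * (1.0 + frac.clamp(-0.10, 0.10));
    Duration::from_millis(ms as u64).min(MAX_WAIT)
}

fn main() {
    println!("{:?}", jittered(Duration::from_millis(2_000), 0.10)); // 2.2s
    println!("{:?}", jittered(Duration::from_millis(9_800), 0.10)); // clamped: 9.8s
}
```

Spreading each component's wait by a small random factor like this is enough to de-synchronize scrapers that all started at the same time.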
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- .devcontainer/README.md (1 hunks)
- .devcontainer/devcontainer.json (1 hunks)
- container/Dockerfile.vllm (1 hunks)
- container/build.sh (1 hunks)
- lib/runtime/src/component.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
Applied to files:
- container/build.sh
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
- .devcontainer/README.md
- .devcontainer/devcontainer.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
container/build.sh (1)
468-471: Good: pass USER_UID/USER_GID only for TARGET=dev.This aligns the UID/GID mapping with dev builds without affecting runtime images.
.devcontainer/devcontainer.json (1)
10-10: Ensure the built tag matches this image reference.With my build.sh patch, --target dev will produce dynamo:latest-vllm as expected. Without it, this would fail to find the image locally (it would be tagged latest-vllm-dev).
container/Dockerfile.vllm (1)
297-297: Stage rename looks good.FROM runtime AS dev matches the rest of this PR’s naming and the build script.
Is my understanding correct that this would always do an interval doubling regardless of success or failure? And eventually the scrape interval would be fixed at 9.8s regardless. Would this be responsive enough for monitoring?

Our Prometheus is polling at a 6 second interval, so it should be acceptable. Let's see if anyone complains and we can adjust this.
Split NATS_TIMEOUT_AND_INITIAL_DELAY_MS into separate constants:

- NATS_TIMEOUT_MS (300ms) for the NATS operation timeout
- INITIAL_WAIT_MS (500ms) for the initial scraping interval
- MAX_WAIT_MS (9800ms) for the maximum interval

The scraping interval now starts at 500ms and doubles after each scrape up to 9.8 seconds, reducing load while maintaining responsiveness.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Remove INITIAL_WAIT_MS constant and exponential backoff
- Use fixed 9.8s interval (MAX_WAIT_MS) for all scrapes
- Perform the first scrape immediately for faster metrics availability
- Inline the NATS timeout constant (500ms)
- Update tests to use 500ms waits since metrics are available immediately

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
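The final behavior — first scrape immediately, then a fixed ~9.8s cadence — can be modeled as a schedule of offsets from task start. The names here are illustrative, not the actual constants in component.rs:

```rust
use std::time::Duration;

// Fixed scrape interval from the final revision of this PR.
const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Offsets (from task start) of the first `n` scrapes: the first fires
/// immediately, each later one a fixed MAX_WAIT after the previous.
fn scrape_offsets(n: u32) -> Vec<Duration> {
    (0..n).map(|i| MAX_WAIT * i).collect()
}

fn main() {
    println!("{:?}", scrape_offsets(3)); // [0ns, 9.8s, 19.6s]
}
```

The zero offset for the first scrape is what lets the updated tests assert metrics availability after only 500ms instead of waiting out a full interval.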
Force-pushed from 7bab88e to d72f48c
grahamking left a comment
Nice and simple, love it.
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
This PR optimizes NATS metrics scraping by implementing a simplified interval strategy that reduces system load while maintaining responsive metrics collection. The changes replace complex exponential backoff logic with a straightforward fixed-interval approach.
Details:
Where should the reviewer start?
lib/runtime/src/component.rs- Main changes to the NATS metrics scraping logic in thestart_scraping_nats_service_component_metricsmethodlib/runtime/src/metrics.rs- Updated test timing to reflect the new immediate metrics availabilityRelated Issues: