@keivenchang
Contributor

@keivenchang keivenchang commented Sep 3, 2025

Overview:

This PR optimizes NATS metrics scraping by implementing a simplified interval strategy that reduces system load while maintaining responsive metrics collection. The changes replace complex exponential backoff logic with a straightforward fixed-interval approach.

Details:

  • Simplified scraping intervals: Removed exponential backoff complexity and implemented a fixed 9.8-second interval (MAX_WAIT_MS) for all NATS metrics scrapes
  • Immediate first scrape: Metrics are now available immediately on startup rather than waiting for the first interval
  • Adjusted timeout: NATS operation timeout increased from 300ms to 500ms (the 500ms value is inlined in the final commit)
  • Updated test timing: Modified test waits from 1 second to 500ms since metrics are now available immediately
  • Code cleanup: Inlined timeout constants and removed unnecessary INITIAL_WAIT_MS constant
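
The fixed-interval strategy described in the list above can be sketched with the standard library alone. This is an illustration, not the code in component.rs: the names `SCRAPE_INTERVAL`, `scrape_offsets`, and `run_scrapes` are hypothetical, and the real method is assumed (not confirmed here) to run on an async timer rather than `std::thread::sleep`.

```rust
use std::time::Duration;

// Hypothetical constant mirroring the 9.8-second value named in the PR description.
const SCRAPE_INTERVAL: Duration = Duration::from_millis(9_800);

/// Offsets from startup at which the first `count` scrapes fire:
/// the first scrape is immediate, later ones follow at a fixed interval.
fn scrape_offsets(count: u32) -> Vec<Duration> {
    (0..count).map(|i| SCRAPE_INTERVAL * i).collect()
}

/// Runs `count` scrapes, sleeping `interval` between them but not before the first.
/// `scrape` returns true on success; the function returns the number of successes.
fn run_scrapes<F: FnMut() -> bool>(count: u32, interval: Duration, mut scrape: F) -> u32 {
    let mut successes = 0;
    for i in 0..count {
        if i > 0 {
            // Fixed wait, independent of whether the previous scrape succeeded.
            std::thread::sleep(interval);
        }
        if scrape() {
            successes += 1;
        }
    }
    successes
}
```

Because the first scrape has offset zero, a test only has to wait long enough for one scrape to complete, which is consistent with shortening the test waits.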

Where should the reviewer start?

  • lib/runtime/src/component.rs - Main changes to the NATS metrics scraping logic in the start_scraping_nats_service_component_metrics method
  • lib/runtime/src/metrics.rs - Updated test timing to reflect the new immediate metrics availability

Related Issues:

@coderabbitai
Contributor

coderabbitai bot commented Sep 3, 2025

Walkthrough

Shifts devcontainer naming from local-dev to dev across docs/config and build stages. Updates container build logic to use TARGET=dev for UID/GID args. Modifies Component::start_scraping_nats_service_component_metrics to use exponential backoff (500ms → max ~9.8s) and a 300ms NATS timeout, replacing previous fixed-interval constants.

Changes

| Cohort / File(s) | Summary of Changes |
|---|---|
| Devcontainer docs & config rename: .devcontainer/README.md, .devcontainer/devcontainer.json | Updated image reference from dynamo:latest-vllm-local-dev to dynamo:latest-vllm; README build target from --target local-dev to --target dev; removed note about local-dev being VLLM-only. |
| Container build target/stage rename: container/Dockerfile.vllm, container/build.sh | Dockerfile stage alias renamed local-dev → dev; build script condition updated to append USER_UID/USER_GID when TARGET=dev (was local-dev). |
| Runtime NATS scraping backoff: lib/runtime/src/component.rs | Replaced fixed-frequency scraping with exponential backoff: initial wait 500ms, doubling per iteration up to ~9.8s; introduced NATS_TIMEOUT_MS = 300; removed old delay constants; interval reinitialized when wait changes; updated doc comment. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Component
  participant T as Interval (exp backoff)
  participant N as NATS

  Note over C,T: Start scraping NATS service component metrics
  C->>T: init wait = 500ms
  loop until shutdown
    T-->>C: tick
    C->>N: request metrics (timeout=300ms)
    alt success or timeout/error
      N-->>C: response or timeout
      Note over C: compute next wait = min(wait*2, ~9.8s)
      C->>T: reinit interval if wait changed
    end
  end
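
The wait computation in the diagram (start at 500ms, then `min(wait*2, ~9.8s)`) can be sketched as follows. This reflects the revision the walkthrough reviewed, not the merged fixed-interval version, and the function names `next_wait` and `backoff_sequence` are hypothetical.

```rust
use std::time::Duration;

// Values taken from the walkthrough; the constant names are illustrative.
const INITIAL_WAIT: Duration = Duration::from_millis(500);
const MAX_WAIT: Duration = Duration::from_millis(9_800);

/// Doubles the current wait and caps it at MAX_WAIT.
fn next_wait(current: Duration) -> Duration {
    (current * 2).min(MAX_WAIT)
}

/// First `n` waits of the backoff schedule, starting from INITIAL_WAIT.
fn backoff_sequence(n: usize) -> Vec<Duration> {
    std::iter::successors(Some(INITIAL_WAIT), |w| Some(next_wait(*w)))
        .take(n)
        .collect()
}
```

Under these values the wait reaches the cap on the sixth scrape and stays there, which is the behavior questioned later in the thread.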

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

I nudge my nose at targets—dev, not local-dev today,
Hop to the build with carrots packed, then bound away.
Scrapes tick slow, then slower still—backoff in the breeze,
NATS replies or not, I wait with twitchy ease.
Ship it, squeak it—thump!—a tidy warren, please.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
.devcontainer/README.md (2)

44-46: Update the Mermaid diagram to match the new image/tag.

The diagram still shows dynamo:latest-vllm-local-dev. Suggest:

-        IMAGE["Docker Image<br/>dynamo:latest-vllm-local-dev"]
+        IMAGE["Docker Image<br/>dynamo:latest-vllm"]

375-383: Fix the troubleshooting command that still uses --target local-dev.

-   ./container/build.sh --target local-dev
+   ./container/build.sh --target dev
container/build.sh (1)

606-610: Tag mismatch for dev builds: produce dynamo:latest-vllm (no -dev) when TARGET=dev.

Devcontainer expects dynamo:latest-vllm. With --target dev, we currently tag latest-vllm-dev. Fix by omitting the suffix when TARGET=dev:

 LATEST_TAG="--tag dynamo:latest-${FRAMEWORK,,}"
-if [ -n "${TARGET}" ]; then
-    LATEST_TAG="${LATEST_TAG}-${TARGET}"
-fi
+if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+    LATEST_TAG="${LATEST_TAG}-${TARGET}"
+fi
🧹 Nitpick comments (7)
.devcontainer/README.md (1)

240-259: Re-check SGLANG instructions for local-dev vs dev naming.

If the SGLANG Dockerfile stage has also been renamed to dev, update these commands and the example image accordingly; if not, add a clarifying note that SGLANG still uses local-dev. Avoid mixed conventions across frameworks.

Would you like me to scan the repo and list all remaining “local-dev” references to update?

container/build.sh (1)

353-359: Optional: keep versioned tag without -dev for TARGET=dev as well.

If you also want the versioned tag to match devcontainer pull patterns, mirror the LATEST_TAG logic here too:

-        if [ -n "${TARGET}" ]; then
-            TAG="${TAG}-${TARGET}"
-        fi
+        if [ -n "${TARGET}" ] && [ "${TARGET}" != "dev" ]; then
+            TAG="${TAG}-${TARGET}"
+        fi
container/Dockerfile.vllm (1)

343-345: Prefer no default for USER_UID/USER_GID to avoid silent permission mismatches.

Team practice (see prior learnings) was to require explicit UID/GID. Consider removing defaults:

-ARG USER_UID=1000
-ARG USER_GID=1000
+ARG USER_UID
+ARG USER_GID

build.sh already injects these for TARGET=dev.

lib/runtime/src/component.rs (4)

276-279: Doc/comment consistency: clarify when we back off.

Here it says “regardless of success or failure,” but the PR summary says “after each successful scrape.” Please pick one behavior and document it precisely.


280-283: Nit: names imply milliseconds but values are Durations.

Consider dropping the “_MS” suffix to avoid confusion:

-const NATS_TIMEOUT_MS: std::time::Duration = std::time::Duration::from_millis(300);
-const INITIAL_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(500);
-const MAX_WAIT_MS: std::time::Duration = std::time::Duration::from_millis(9800);
+const NATS_TIMEOUT: std::time::Duration = std::time::Duration::from_millis(300);
+const INITIAL_WAIT: std::time::Duration = std::time::Duration::from_millis(500);
+const MAX_WAIT: std::time::Duration = std::time::Duration::from_millis(9_800);

298-301: Stale comment: scraping “every 5 seconds.”

This no longer applies with exponential backoff. Update or remove to avoid confusion.


316-333: Optional: add jitter and a reset policy.

To avoid thundering herd across many components and to regain responsiveness after idle periods, consider:

  • Small random jitter (e.g., ±10%) on current_wait.
  • Reset current_wait back to INITIAL_WAIT after N errored scrapes in a row or when stats indicate load changes.

Happy to draft this if you want it in this PR.
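
A minimal sketch of what the jitter and reset suggestion could look like, using only the standard library. Everything here is an assumption made for illustration: `RESET_AFTER_ERRORS`, `apply_jitter`, and the `Backoff` type do not appear in the PR, and the random input is passed in as a `unit` value in [0, 1] so that the example stays deterministic and needs no external crate.

```rust
use std::time::Duration;

const INITIAL_WAIT: Duration = Duration::from_millis(500);
const MAX_WAIT: Duration = Duration::from_millis(9_800);
// Hypothetical threshold: reset after this many failed scrapes in a row.
const RESET_AFTER_ERRORS: u32 = 3;

/// Scales `wait` by a factor in [0.9, 1.1]; `unit` is a caller-supplied
/// random value in [0, 1] (values outside that range are clamped).
fn apply_jitter(wait: Duration, unit: f64) -> Duration {
    let factor = 0.9 + 0.2 * unit.clamp(0.0, 1.0);
    wait.mul_f64(factor)
}

struct Backoff {
    current: Duration,
    consecutive_errors: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { current: INITIAL_WAIT, consecutive_errors: 0 }
    }

    /// Records a scrape outcome and returns the wait before the next scrape.
    fn record(&mut self, ok: bool) -> Duration {
        self.consecutive_errors = if ok { 0 } else { self.consecutive_errors + 1 };
        if self.consecutive_errors >= RESET_AFTER_ERRORS {
            // Too many failures in a row: return to the short initial wait.
            self.current = INITIAL_WAIT;
            self.consecutive_errors = 0;
        } else {
            self.current = (self.current * 2).min(MAX_WAIT);
        }
        self.current
    }
}
```

In use, the caller would sleep for `apply_jitter(backoff.record(ok), r)` with `r` drawn from whatever random source the project already depends on.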

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between c920cbd and 43cbb41.

📒 Files selected for processing (5)
  • .devcontainer/README.md (1 hunks)
  • .devcontainer/devcontainer.json (1 hunks)
  • container/Dockerfile.vllm (1 hunks)
  • container/build.sh (1 hunks)
  • lib/runtime/src/component.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.

Applied to files:

  • container/build.sh
  • .devcontainer/README.md
  • .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • .devcontainer/README.md
  • .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • .devcontainer/README.md
  • .devcontainer/devcontainer.json
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • .devcontainer/README.md
  • .devcontainer/devcontainer.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
container/build.sh (1)

468-471: Good: pass USER_UID/USER_GID only for TARGET=dev.

This aligns the UID/GID mapping with dev builds without affecting runtime images.

.devcontainer/devcontainer.json (1)

10-10: Ensure the built tag matches this image reference.

With my build.sh patch, --target dev will produce dynamo:latest-vllm as expected. Without it, this would fail to find the image locally (it would be tagged latest-vllm-dev).

container/Dockerfile.vllm (1)

297-297: Stage rename looks good.

FROM runtime AS dev matches the rest of this PR’s naming and the build script.

@PeaBrane
Contributor

PeaBrane commented Sep 4, 2025

Is my understanding correct that this would always do an interval doubling regardless of success or failure? And eventually the scrape interval would be fixed at 9.8s regardless. Would this be responsive enough for monitoring?

@keivenchang
Contributor Author

> Is my understanding correct that this would always do an interval doubling regardless of success or failure? And eventually the scrape interval would be fixed at 9.8s regardless. Would this be responsive enough for monitoring?

Our Prometheus polls at a 6-second interval, so this should be acceptable. Let's see if anyone complains; we can adjust it then.

Split NATS_TIMEOUT_AND_INITIAL_DELAY_MS into separate constants:
- NATS_TIMEOUT_MS (300ms) for NATS operation timeout
- INITIAL_WAIT_MS (500ms) for initial scraping interval
- MAX_WAIT_MS (9800ms) for maximum interval

The scraping interval now starts at 500ms and doubles after each
scrape up to 9.8 seconds, reducing load while maintaining responsiveness.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Remove INITIAL_WAIT_MS constant and exponential backoff
- Use fixed 9.8s interval (MAX_WAIT_MS) for all scrapes
- Perform first scrape immediately for faster metrics availability
- Inline NATS timeout constant (500ms)
- Update tests to use 500ms waits since metrics are available immediately

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang force-pushed the keivenchang/reduce-nats-stats-query-frequency branch from 7bab88e to d72f48c Compare September 4, 2025 20:53
@pull-request-size pull-request-size bot added size/S and removed size/M labels Sep 4, 2025
Contributor

@grahamking grahamking left a comment


Nice and simple, love it.

@keivenchang keivenchang merged commit 4df2e2d into main Sep 4, 2025
15 of 16 checks passed
@keivenchang keivenchang deleted the keivenchang/reduce-nats-stats-query-frequency branch September 4, 2025 22:43
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>