
Conversation

@GavinZhu-GMI
Contributor

@GavinZhu-GMI GavinZhu-GMI commented Sep 8, 2025

Overview:

This fix addresses deployments where health probes were failing because services never transitioned to the ready state.

Details:

Files Modified:

  1. lib/runtime/src/config.rs
  • Added HealthTransitionPolicy enum with 4 options:
    • Manual - No auto-transition (original behavior)
    • TimeBasedReady { after_seconds } - Auto-ready after uptime
    • EndpointBasedReady - Ready when endpoints are registered
    • Custom { auto_ready_after_seconds, require_endpoints_ready } - Flexible logic
  • Added health_transition_policy field to RuntimeConfig struct
  • Added environment variable mapping for:
    • DYN_SYSTEM_HEALTH_TRANSITION_POLICY
    • DYN_SYSTEM_AUTO_READY_AFTER_SECONDS
  • Default policy: TimeBasedReady { after_seconds: 30 } (much better than Manual!)
  2. lib/runtime/src/lib.rs
  • Updated SystemHealth struct to include health_transition_policy field
  • Added check_and_update_health_status() method with transition logic for all policies
  • Added get_health_status_with_transition_check() method for the health handler to use
  • Updated constructor to accept health_transition_policy parameter
  3. lib/runtime/src/system_status_server.rs
  • Updated health_handler() to use mutable lock and call the transition check method
  • Fixed the core issue: Health status now automatically transitions from "notready" to "ready"
  4. lib/runtime/src/distributed.rs
  • Updated SystemHealth::new() call to include the health_transition_policy parameter
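Sketched in Rust, the policy surface described above could look like the following. This is a reconstruction from the bullet list, not the actual code in lib/runtime/src/config.rs; variant and field names are taken from the description and details may differ.

```rust
// Sketch of the HealthTransitionPolicy surface described in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum HealthTransitionPolicy {
    /// No auto-transition (original behavior).
    Manual,
    /// Auto-ready once uptime exceeds `after_seconds`.
    TimeBasedReady { after_seconds: u64 },
    /// Ready when registered endpoints report ready.
    EndpointBasedReady,
    /// Flexible combination of time- and endpoint-based rules.
    Custom {
        auto_ready_after_seconds: Option<u64>,
        require_endpoints_ready: bool,
    },
}

impl Default for HealthTransitionPolicy {
    fn default() -> Self {
        // The PR's stated default: auto-ready after 30 seconds.
        Self::TimeBasedReady { after_seconds: 30 }
    }
}

fn main() {
    let policy = HealthTransitionPolicy::default();
    assert_eq!(
        policy,
        HealthTransitionPolicy::TimeBasedReady { after_seconds: 30 }
    );
    println!("default policy: {:?}", policy);
}
```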

Where should the reviewer start?

  1. Start at health_handler in system_status_server.rs, where the state transition check is invoked.
  2. Then review how the policy is configured in dynamo/lib/runtime/src/config.rs and how the transitions are implemented in dynamo/lib/runtime/src/lib.rs.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added a configurable health transition policy to control when the service becomes “ready” (supports manual, time-based, endpoint-based, or custom rules).
    • Health checks now automatically evaluate and transition readiness according to the configured policy.
    • Policy can be set via environment/config, with a sensible default that becomes ready after a short delay.
  • Chores

    • Updated container dependencies by replacing a legacy GPU monitoring library with its maintained alternative.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
@copy-pr-bot

copy-pr-bot bot commented Sep 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Sep 8, 2025

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Sep 8, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Replaces pynvml with nvidia-ml-py in container dependencies. Introduces HealthTransitionPolicy in runtime config and SystemHealth, adds policy-driven health status transitions, updates initialization to pass the policy, and modifies the health HTTP handler to perform transition checks before reporting status.

Changes

Cohort / File(s) Summary
Dependencies
container/deps/requirements.txt
Replace pynvml with nvidia-ml-py; no other dependency changes.
Runtime Config: HealthTransitionPolicy
lib/runtime/src/config.rs
Add public enum HealthTransitionPolicy (with default TimeBasedReady 30s). Add health_transition_policy field to RuntimeConfig, defaulting via builder and Default. Load from env/config; apply override from DYN_SYSTEM_AUTO_READY_AFTER_SECONDS.
SystemHealth core logic
lib/runtime/src/lib.rs
Extend SystemHealth with health_transition_policy. Update SystemHealth::new(...) signature to accept policy. Add check_and_update_health_status() and get_health_status_with_transition_check() implementing policy-driven transitions (Manual, TimeBasedReady, EndpointBasedReady, Custom).
Runtime wiring
lib/runtime/src/distributed.rs
Pass config.health_transition_policy.clone() into SystemHealth::new(...).
HTTP handler
lib/runtime/src/system_status_server.rs
Use mutable lock and call get_health_status_with_transition_check() before reporting health; uptime retrieval unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Server as system_status_server
  participant SH as SystemHealth
  participant Cfg as RuntimeConfig

  Note over Cfg: Startup
  Cfg->>Server: health_transition_policy
  Server->>SH: SystemHealth::new(..., policy, ...)

  Note over Client,Server: Health endpoint request
  Client->>Server: GET /health
  Server->>SH: get_health_status_with_transition_check()
  activate SH
  SH->>SH: check_and_update_health_status()
  alt Policy: Manual
    Note right of SH: No auto transition
  else Policy: TimeBasedReady
    Note right of SH: Ready after N seconds
  else Policy: EndpointBasedReady
    Note right of SH: Ready when endpoints report ready
  else Policy: Custom
    Note right of SH: Optional time + endpoint readiness
  end
  SH-->>Server: (healthy, endpoints)
  deactivate SH
  Server-->>Client: JSON health status
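The transition step in the diagram can be sketched as follows. This is a condensed, self-contained approximation of check_and_update_health_status() covering only the Manual and TimeBasedReady branches; the real method also handles EndpointBasedReady and Custom, and the real struct carries more state.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum HealthStatus { NotReady, Ready }

#[allow(dead_code)]
enum HealthTransitionPolicy {
    Manual,
    TimeBasedReady { after_seconds: u64 },
}

struct SystemHealth {
    start: Instant,
    status: HealthStatus,
    policy: HealthTransitionPolicy,
}

impl SystemHealth {
    fn check_and_update_health_status(&mut self) {
        if self.status == HealthStatus::Ready {
            return; // this check only transitions forward
        }
        match &self.policy {
            // No auto transition: an operator or API call must flip it.
            HealthTransitionPolicy::Manual => {}
            // Become ready once uptime crosses the threshold.
            HealthTransitionPolicy::TimeBasedReady { after_seconds } => {
                if self.start.elapsed() >= Duration::from_secs(*after_seconds) {
                    self.status = HealthStatus::Ready;
                }
            }
        }
    }
}

fn main() {
    let mut sh = SystemHealth {
        start: Instant::now(),
        status: HealthStatus::NotReady,
        policy: HealthTransitionPolicy::TimeBasedReady { after_seconds: 0 },
    };
    sh.check_and_update_health_status();
    assert_eq!(sh.status, HealthStatus::Ready);
}
```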

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps the status tree,
“From manual hush to ready glee!”
Policies bloom—time, endpoints, custom—
Health hops forward, brisk and winsome.
Dependencies nudge, NV springs by—
I twitch my nose, declare: “All’s spry!” 🐇✨



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/lib.rs (1)

221-229: Policy is ignored when computing “healthy” — readiness may stay false despite time-based transition.

get_health_status always gates on use_endpoint_health_status when non-empty, even if the policy (TimeBasedReady or Custom with require_endpoints_ready=false) intends readiness to be time-driven. This can negate the PR’s goal and keep probes failing.

Incorporate HealthTransitionPolicy into the healthy computation:

-        let healthy = if !self.use_endpoint_health_status.is_empty() {
-            self.use_endpoint_health_status.iter().all(|endpoint| {
-                self.endpoint_health
-                    .get(endpoint)
-                    .is_some_and(|status| *status == HealthStatus::Ready)
-            })
-        } else {
-            self.system_health == HealthStatus::Ready
-        };
+        let healthy = match &self.health_transition_policy {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // No endpoints configured — fall back to system health.
+                    self.system_health == HealthStatus::Ready
+                }
+            }
+            HealthTransitionPolicy::Custom { require_endpoints_ready, .. }
+                if *require_endpoints_ready =>
+            {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // Explicitly required endpoints but none configured — treat as not ready
+                    // to avoid false positives and surface misconfiguration.
+                    false
+                }
+            }
+            _ => {
+                // Manual, TimeBasedReady, or Custom without endpoint requirement
+                self.system_health == HealthStatus::Ready
+            }
+        };

Follow-up: consider logging a warning when require_endpoints_ready=true but no endpoints are configured.

🧹 Nitpick comments (5)
lib/runtime/src/config.rs (3)

62-85: Enum design looks good; clarify env encoding and add Display for better logs.

Nice policy surface. Two small improvements:

  • Document how to set complex enum variants via env (e.g., JSON for time_based_ready/custom).
  • Add fmt::Display to produce readable logs (e.g., TimeBasedReady(30s)).

Add a Display impl:

@@
 impl Default for HealthTransitionPolicy {
     fn default() -> Self {
         // Better default: auto-ready after 30 seconds for simple services
         Self::TimeBasedReady { after_seconds: 30 }
     }
 }
+
+impl fmt::Display for HealthTransitionPolicy {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self {
+            HealthTransitionPolicy::Manual => write!(f, "manual"),
+            HealthTransitionPolicy::TimeBasedReady { after_seconds } => {
+                write!(f, "time_based_ready(after_seconds={})", after_seconds)
+            }
+            HealthTransitionPolicy::EndpointBasedReady => write!(f, "endpoint_based_ready"),
+            HealthTransitionPolicy::Custom { auto_ready_after_seconds, require_endpoints_ready } => {
+                match auto_ready_after_seconds {
+                    Some(s) => write!(f, "custom(auto_ready_after_seconds={}, require_endpoints_ready={})", s, require_endpoints_ready),
+                    None => write!(f, "custom(auto_ready_after_seconds=None, require_endpoints_ready={})", require_endpoints_ready),
+                }
+            }
+        }
+    }
+}

Optionally derive Copy since all fields are Copy, if convenient.


141-147: Expose policy in config Display output for observability.

Including health_transition_policy in fmt::Display helps trace effective settings.

Minimal change using Debug until Display is added:

@@
         write!(f, ", system_live_path={}", self.system_live_path)?;
+        write!(f, ", health_transition_policy={:?}", self.health_transition_policy)?;

239-251: Clarify precedence and log overrides when AUTO_READY_AFTER_SECONDS is set.

AUTO_READY_AFTER_SECONDS unconditionally overrides any value from HEALTH_TRANSITION_POLICY. Make this explicit in logs to avoid confusion; also trim whitespace before parse.

-        if let Ok(seconds_str) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
-            if !seconds_str.is_empty() {
-                if let Ok(seconds) = seconds_str.parse::<u64>() {
-                    tracing::info!("Using DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={} for health transition policy", seconds);
-                    config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+        if let Ok(raw) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
+            let seconds_str = raw.trim();
+            if !seconds_str.is_empty() {
+                if let Ok(seconds) = seconds_str.parse::<u64>() {
+                    let prev = config.health_transition_policy.clone();
+                    config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+                    tracing::info!(
+                        "Overriding health_transition_policy {:?} → TimeBasedReady(after_seconds={}) due to DYN_SYSTEM_AUTO_READY_AFTER_SECONDS",
+                        prev, seconds
+                    );
                 } else {
                     tracing::warn!("Invalid value for DYN_SYSTEM_AUTO_READY_AFTER_SECONDS: '{}', expected a number", seconds_str);
                 }
             }
         }

Consider adding tests to pin behavior:

  • HEALTH_TRANSITION_POLICY=manual → Manual.
  • HEALTH_TRANSITION_POLICY=manual + AUTO_READY_AFTER_SECONDS=5 → TimeBasedReady(5).
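The suggested precedence rule (a valid override value wins after trimming, an empty or unparsable one is ignored) can be pinned down in a small sketch. The helper name apply_auto_ready_override is hypothetical; the real code applies this logic inline while loading RuntimeConfig. Taking the raw env value as a parameter keeps the rule unit-testable without mutating process environment.

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum HealthTransitionPolicy {
    Manual,
    TimeBasedReady { after_seconds: u64 },
}

/// A valid override value (after trimming) replaces the configured policy;
/// an absent, empty, or unparsable value leaves it untouched.
fn apply_auto_ready_override(
    current: HealthTransitionPolicy,
    raw: Option<&str>,
) -> HealthTransitionPolicy {
    match raw.map(str::trim).and_then(|s| s.parse::<u64>().ok()) {
        Some(seconds) => HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds },
        None => current,
    }
}

fn main() {
    let raw = env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS").ok();
    let policy = apply_auto_ready_override(HealthTransitionPolicy::Manual, raw.as_deref());
    println!("effective policy: {:?}", policy);

    // manual + override=5 (with whitespace) → TimeBasedReady(5)
    assert_eq!(
        apply_auto_ready_override(HealthTransitionPolicy::Manual, Some(" 5 ")),
        HealthTransitionPolicy::TimeBasedReady { after_seconds: 5 }
    );
    // manual + no override → Manual
    assert_eq!(
        apply_auto_ready_override(HealthTransitionPolicy::Manual, None),
        HealthTransitionPolicy::Manual
    );
}
```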
lib/runtime/src/system_status_server.rs (1)

174-176: Hold the health lock for the minimum time.

Scope the mutex guard to just the reads/updates to avoid holding it while building the response.

-    let mut system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status_with_transition_check();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime) = {
+        let mut sh = state.drt().system_health.lock().unwrap();
+        let (h, eps) = sh.get_health_status_with_transition_check();
+        let up = sh.uptime();
+        (h, eps, up)
+    };
+    let uptime = Some(uptime);

If readiness needs to progress without probe traffic, consider a background interval that calls the transition check to advance state even when /health isn’t polled.
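The background-progression idea can be sketched with a plain thread; the real runtime would presumably use a tokio interval task instead, and the closure body here is only a stand-in for check_and_update_health_status(). The spawn_health_ticker helper name is made up for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawn a ticker that periodically runs a (stand-in) transition check,
/// so readiness can advance even when /health is never polled.
/// `ready` stands in for the shared SystemHealth; true means "ready".
fn spawn_health_ticker(ready: Arc<Mutex<bool>>, ticks: u32) -> JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..ticks {
            thread::sleep(Duration::from_millis(10));
            // Placeholder for check_and_update_health_status():
            *ready.lock().unwrap() = true;
        }
    })
}

fn main() {
    let health = Arc::new(Mutex::new(false));
    let handle = spawn_health_ticker(Arc::clone(&health), 3);
    handle.join().unwrap();
    assert!(*health.lock().unwrap());
    println!("ready = {}", health.lock().unwrap());
}
```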

lib/runtime/src/distributed.rs (1)

78-85: Optional: make HealthTransitionPolicy Copy to avoid clones.

If the enum is small and non-owning, deriving Copy (and Clone) on HealthTransitionPolicy lets you pass it by value here and elsewhere without allocations.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and a4acd2c.

📒 Files selected for processing (5)
  • container/deps/requirements.txt (1 hunks)
  • lib/runtime/src/config.rs (6 hunks)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/lib.rs (5 hunks)
  • lib/runtime/src/system_status_server.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/runtime/src/system_status_server.rs (1)
components/metrics/src/lib.rs (1)
  • endpoints (563-563)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/lib.rs (2)
  • new (105-126)
  • new (283-288)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (9)
lib/runtime/src/config.rs (3)

218-221: Env→field mapping LGTM.

Mapping DYN_SYSTEM_HEALTH_TRANSITION_POLICY → health_transition_policy is correct and consistent with existing keys.


271-271: Default for single_threaded includes policy — LGTM.


301-301: Default policy set to TimeBasedReady(30s) — sensible default.

lib/runtime/src/distributed.rs (1)

78-87: Plumbing policy into SystemHealth::new is correct.

The config field is passed through cleanly and the call-site matches the new constructor signature.

lib/runtime/src/lib.rs (5)

68-68: Import looks good.

Bringing HealthTransitionPolicy into scope here is expected.


97-102: New field on SystemHealth is fine.

No ordering/ownership issues; fits existing struct layout.


135-165: EndpointBasedReady semantics: code requires ALL endpoints ready; comment implies “at least one.”

The comment “Ready when service has registered at least one endpoint” conflicts with the implementation that requires all endpoints to be Ready. Please clarify intended semantics and align code/comment.

  • If “any endpoint ready” is intended, change all(...) to any(...).
  • If “all endpoints ready” is intended, update the comment accordingly.

Option A (any endpoint ready):

-            HealthTransitionPolicy::EndpointBasedReady => {
-                // Ready when service has registered at least one endpoint
-                if !self.endpoint_health.is_empty() {
-                    let all_endpoints_ready = self.endpoint_health.values()
-                        .all(|status| *status == HealthStatus::Ready);
-
-                    if all_endpoints_ready {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                // Ready when at least one endpoint reports Ready
+                if !self.endpoint_health.is_empty() {
+                    let any_endpoint_ready = self.endpoint_health.values()
+                        .any(|status| *status == HealthStatus::Ready);
+
+                    if any_endpoint_ready {
-                        tracing::info!("Auto-transitioning to ready - all {} endpoints are ready (policy: endpoint_based_ready)",
-                                     self.endpoint_health.len());
+                        tracing::info!("Auto-transitioning to ready - at least one of {} endpoints is ready (policy: endpoint_based_ready)",
+                                     self.endpoint_health.len());
                         self.system_health = HealthStatus::Ready;
                     }
                 }
             },
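The gap between the two semantics is easy to demonstrate with `Iterator::all` vs `Iterator::any` over the endpoint map; the endpoint names below are made up for illustration.

```rust
use std::collections::HashMap;

#[derive(PartialEq)]
enum HealthStatus { NotReady, Ready }

fn main() {
    let mut endpoint_health: HashMap<&str, HealthStatus> = HashMap::new();
    endpoint_health.insert("generate", HealthStatus::Ready);
    endpoint_health.insert("embed", HealthStatus::NotReady);

    // "all endpoints ready" — what the implementation currently requires.
    let all_ready = endpoint_health.values().all(|s| *s == HealthStatus::Ready);
    // "at least one endpoint ready" — what the comment implies.
    let any_ready = endpoint_health.values().any(|s| *s == HealthStatus::Ready);

    assert!(!all_ready);
    assert!(any_ready);
    println!("all_ready={all_ready}, any_ready={any_ready}");
}
```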

197-205: Wrapper that runs transition check before reporting health: good.

This ensures readiness can progress without external nudges.


105-126: Constructor signature change verified: no remaining old-arity calls.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: GavinZhu-GMI <gavin.z@gmicloud.ai>
@nnshah1
Contributor

nnshah1 commented Sep 10, 2025

@tzulingk for visibility as it relates to health checks.

@GavinZhu-GMI for simple services, instead of a time-based check, I think either providing an API to set the health explicitly or defaulting to ready would be better.

A time-based check does not really indicate the service is 'ready', so it may be misleading, whether set too long or too short.

Do you have a scenario in mind for when you would set it, if we provided an explicit API?

@grahamking
Contributor

Can you explain the problem? How could we reproduce what you are seeing that motivated this PR?

@GavinZhu-GMI
Contributor Author

Can you explain the problem? How could we reproduce what you are seeing that motivated this PR?

Sure. I am trying to use dynamo to do PD disaggregated serving on two H200 nodes. The prefill and decode pods just kept restarting without ever becoming healthy. After reading the code, I hard-coded DYN_SYSTEM_STARTING_HEALTH_STATUS to the ready state in the env, and the service then became healthy once all the endpoints were online.
Hence I think some policy might be helpful if we need the service to become healthy in a dynamic way.

@GavinZhu-GMI
Contributor Author

@tzulingk for visibility as it relates to health checks.

@GavinZhu-GMI for simple services, instead of a time-based check, I think either providing an API to set the health explicitly or defaulting to ready would be better.

A time-based check does not really indicate the service is 'ready', so it may be misleading, whether set too long or too short.

Do you have a scenario in mind for when you would set it, if we provided an explicit API?

Sure, a time-based check is indeed misleading.
I believe this is now implemented in PR #2903; I will jump to it and give it a go after it gets merged.


Labels

external-contribution Pull request is from an external contributor fix size/L
