
Conversation

@GavinZhu-GMI
Contributor

@GavinZhu-GMI GavinZhu-GMI commented Sep 8, 2025

Overview:

This fix addresses deployments where health probes were failing because services never transitioned to the ready state.

Details:

Files Modified:

  1. lib/runtime/src/config.rs
  • Added HealthTransitionPolicy enum with 4 options:
    • Manual - No auto-transition (original behavior)
    • TimeBasedReady { after_seconds } - Auto-ready after uptime
    • EndpointBasedReady - Ready when endpoints are registered
    • Custom { auto_ready_after_seconds, require_endpoints_ready } - Flexible logic
  • Added health_transition_policy field to RuntimeConfig struct
  • Added environment variable mapping for:
    • DYN_SYSTEM_HEALTH_TRANSITION_POLICY
    • DYN_SYSTEM_AUTO_READY_AFTER_SECONDS
  • Default policy: TimeBasedReady { after_seconds: 30 } (much better than Manual!)
  2. lib/runtime/src/lib.rs
  • Updated SystemHealth struct to include health_transition_policy field
  • Added check_and_update_health_status() method with transition logic for all policies
  • Added get_health_status_with_transition_check() method for the health handler to use
  • Updated constructor to accept health_transition_policy parameter
  3. lib/runtime/src/system_status_server.rs
  • Updated health_handler() to use mutable lock and call the transition check method
  • Fixed the core issue: Health status now automatically transitions from "notready" to "ready"
  4. lib/runtime/src/distributed.rs
  • Updated SystemHealth::new() call to include the health_transition_policy parameter
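Sketched in Rust, the policy surface described above could look like the following. This is a reconstruction from the bullet list, not the actual code in lib/runtime/src/config.rs; variant and field names are taken from the description and details may differ.

```rust
// Sketch of the HealthTransitionPolicy surface described in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum HealthTransitionPolicy {
    /// No auto-transition (original behavior).
    Manual,
    /// Auto-ready once uptime exceeds `after_seconds`.
    TimeBasedReady { after_seconds: u64 },
    /// Ready when registered endpoints report ready.
    EndpointBasedReady,
    /// Flexible combination of time- and endpoint-based rules.
    Custom {
        auto_ready_after_seconds: Option<u64>,
        require_endpoints_ready: bool,
    },
}

impl Default for HealthTransitionPolicy {
    fn default() -> Self {
        // The PR's stated default: auto-ready after 30 seconds.
        Self::TimeBasedReady { after_seconds: 30 }
    }
}

fn main() {
    let policy = HealthTransitionPolicy::default();
    assert_eq!(
        policy,
        HealthTransitionPolicy::TimeBasedReady { after_seconds: 30 }
    );
    println!("default policy: {:?}", policy);
}
```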

Where should the reviewer start?

  1. Start at health_handler in system_status_server.rs, where the state transition check is invoked.
  2. Then review how the policy is configured in dynamo/lib/runtime/src/config.rs and how the transitions are implemented in dynamo/lib/runtime/src/lib.rs.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added a configurable health transition policy to control when the service becomes “ready” (supports manual, time-based, endpoint-based, or custom rules).
    • Health checks now automatically evaluate and transition readiness according to the configured policy.
    • Policy can be set via environment/config, with a sensible default that becomes ready after a short delay.
  • Chores

    • Updated container dependencies by replacing a legacy GPU monitoring library with its maintained alternative.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
@copy-pr-bot

copy-pr-bot bot commented Sep 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Sep 8, 2025

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Sep 8, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Replaces pynvml with nvidia-ml-py in container dependencies. Introduces HealthTransitionPolicy in runtime config and SystemHealth, adds policy-driven health status transitions, updates initialization to pass the policy, and modifies the health HTTP handler to perform transition checks before reporting status.

Changes

Cohort / File(s) Summary
Dependencies
container/deps/requirements.txt
Replace pynvml with nvidia-ml-py; no other dependency changes.
Runtime Config: HealthTransitionPolicy
lib/runtime/src/config.rs
Add public enum HealthTransitionPolicy (with default TimeBasedReady 30s). Add health_transition_policy field to RuntimeConfig, defaulting via builder and Default. Load from env/config; apply override from DYN_SYSTEM_AUTO_READY_AFTER_SECONDS.
SystemHealth core logic
lib/runtime/src/lib.rs
Extend SystemHealth with health_transition_policy. Update SystemHealth::new(...) signature to accept policy. Add check_and_update_health_status() and get_health_status_with_transition_check() implementing policy-driven transitions (Manual, TimeBasedReady, EndpointBasedReady, Custom).
Runtime wiring
lib/runtime/src/distributed.rs
Pass config.health_transition_policy.clone() into SystemHealth::new(...).
HTTP handler
lib/runtime/src/system_status_server.rs
Use mutable lock and call get_health_status_with_transition_check() before reporting health; uptime retrieval unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Server as system_status_server
  participant SH as SystemHealth
  participant Cfg as RuntimeConfig

  Note over Cfg: Startup
  Cfg->>Server: health_transition_policy
  Server->>SH: SystemHealth::new(..., policy, ...)

  Note over Client,Server: Health endpoint request
  Client->>Server: GET /health
  Server->>SH: get_health_status_with_transition_check()
  activate SH
  SH->>SH: check_and_update_health_status()
  alt Policy: Manual
    Note right of SH: No auto transition
  else Policy: TimeBasedReady
    Note right of SH: Ready after N seconds
  else Policy: EndpointBasedReady
    Note right of SH: Ready when endpoints report ready
  else Policy: Custom
    Note right of SH: Optional time + endpoint readiness
  end
  SH-->>Server: (healthy, endpoints)
  deactivate SH
  Server-->>Client: JSON health status
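The transition step in the diagram can be sketched as follows. This is a condensed, self-contained approximation of check_and_update_health_status() covering only the Manual and TimeBasedReady branches; the real method also handles EndpointBasedReady and Custom, and the real struct carries more state.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum HealthStatus { NotReady, Ready }

#[allow(dead_code)]
enum HealthTransitionPolicy {
    Manual,
    TimeBasedReady { after_seconds: u64 },
}

struct SystemHealth {
    start: Instant,
    status: HealthStatus,
    policy: HealthTransitionPolicy,
}

impl SystemHealth {
    fn check_and_update_health_status(&mut self) {
        if self.status == HealthStatus::Ready {
            return; // this check only transitions forward
        }
        match &self.policy {
            // No auto transition: an operator or API call must flip it.
            HealthTransitionPolicy::Manual => {}
            // Become ready once uptime crosses the threshold.
            HealthTransitionPolicy::TimeBasedReady { after_seconds } => {
                if self.start.elapsed() >= Duration::from_secs(*after_seconds) {
                    self.status = HealthStatus::Ready;
                }
            }
        }
    }
}

fn main() {
    let mut sh = SystemHealth {
        start: Instant::now(),
        status: HealthStatus::NotReady,
        policy: HealthTransitionPolicy::TimeBasedReady { after_seconds: 0 },
    };
    sh.check_and_update_health_status();
    assert_eq!(sh.status, HealthStatus::Ready);
}
```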

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps the status tree,
“From manual hush to ready glee!”
Policies bloom—time, endpoints, custom—
Health hops forward, brisk and winsome.
Dependencies nudge, NV springs by—
I twitch my nose, declare: “All’s spry!” 🐇✨



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/lib.rs (1)

221-229: Policy is ignored when computing “healthy” — readiness may stay false despite time-based transition.

get_health_status always gates on use_endpoint_health_status when non-empty, even if the policy (TimeBasedReady or Custom with require_endpoints_ready=false) intends readiness to be time-driven. This can negate the PR’s goal and keep probes failing.

Incorporate HealthTransitionPolicy into the healthy computation:

-        let healthy = if !self.use_endpoint_health_status.is_empty() {
-            self.use_endpoint_health_status.iter().all(|endpoint| {
-                self.endpoint_health
-                    .get(endpoint)
-                    .is_some_and(|status| *status == HealthStatus::Ready)
-            })
-        } else {
-            self.system_health == HealthStatus::Ready
-        };
+        let healthy = match &self.health_transition_policy {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // No endpoints configured — fall back to system health.
+                    self.system_health == HealthStatus::Ready
+                }
+            }
+            HealthTransitionPolicy::Custom { require_endpoints_ready, .. }
+                if *require_endpoints_ready =>
+            {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // Explicitly required endpoints but none configured — treat as not ready
+                    // to avoid false positives and surface misconfiguration.
+                    false
+                }
+            }
+            _ => {
+                // Manual, TimeBasedReady, or Custom without endpoint requirement
+                self.system_health == HealthStatus::Ready
+            }
+        };

Follow-up: consider logging a warning when require_endpoints_ready=true but no endpoints are configured.

🧹 Nitpick comments (5)
lib/runtime/src/config.rs (3)

62-85: Enum design looks good; clarify env encoding and add Display for better logs.

Nice policy surface. Two small improvements:

  • Document how to set complex enum variants via env (e.g., JSON for time_based_ready/custom).
  • Add fmt::Display to produce readable logs (e.g., TimeBasedReady(30s)).

Add a Display impl:

@@
 impl Default for HealthTransitionPolicy {
     fn default() -> Self {
         // Better default: auto-ready after 30 seconds for simple services
         Self::TimeBasedReady { after_seconds: 30 }
     }
 }
+
+impl fmt::Display for HealthTransitionPolicy {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self {
+            HealthTransitionPolicy::Manual => write!(f, "manual"),
+            HealthTransitionPolicy::TimeBasedReady { after_seconds } => {
+                write!(f, "time_based_ready(after_seconds={})", after_seconds)
+            }
+            HealthTransitionPolicy::EndpointBasedReady => write!(f, "endpoint_based_ready"),
+            HealthTransitionPolicy::Custom { auto_ready_after_seconds, require_endpoints_ready } => {
+                match auto_ready_after_seconds {
+                    Some(s) => write!(f, "custom(auto_ready_after_seconds={}, require_endpoints_ready={})", s, require_endpoints_ready),
+                    None => write!(f, "custom(auto_ready_after_seconds=None, require_endpoints_ready={})", require_endpoints_ready),
+                }
+            }
+        }
+    }
+}

Optionally derive Copy since all fields are Copy, if convenient.


141-147: Expose policy in config Display output for observability.

Including health_transition_policy in fmt::Display helps trace effective settings.

Minimal change using Debug until Display is added:

@@
         write!(f, ", system_live_path={}", self.system_live_path)?;
+        write!(f, ", health_transition_policy={:?}", self.health_transition_policy)?;

239-251: Clarify precedence and log overrides when AUTO_READY_AFTER_SECONDS is set.

AUTO_READY_AFTER_SECONDS unconditionally overrides any value from HEALTH_TRANSITION_POLICY. Make this explicit in logs to avoid confusion; also trim whitespace before parse.

-        if let Ok(seconds_str) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
-            if !seconds_str.is_empty() {
-                if let Ok(seconds) = seconds_str.parse::<u64>() {
-                    tracing::info!("Using DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={} for health transition policy", seconds);
-                    config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+        if let Ok(raw) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
+            let seconds_str = raw.trim();
+            if !seconds_str.is_empty() {
+                if let Ok(seconds) = seconds_str.parse::<u64>() {
+                    let prev = config.health_transition_policy.clone();
+                    config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+                    tracing::info!(
+                        "Overriding health_transition_policy {:?} → TimeBasedReady(after_seconds={}) due to DYN_SYSTEM_AUTO_READY_AFTER_SECONDS",
+                        prev, seconds
+                    );
                 } else {
                     tracing::warn!("Invalid value for DYN_SYSTEM_AUTO_READY_AFTER_SECONDS: '{}', expected a number", seconds_str);
                 }
             }
         }

Consider adding tests to pin behavior:

  • HEALTH_TRANSITION_POLICY=manual → Manual.
  • HEALTH_TRANSITION_POLICY=manual + AUTO_READY_AFTER_SECONDS=5 → TimeBasedReady(5).
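The suggested precedence rule (a valid override value wins after trimming, an empty or unparsable one is ignored) can be pinned down in a small sketch. The helper name apply_auto_ready_override is hypothetical; the real code applies this logic inline while loading RuntimeConfig. Taking the raw env value as a parameter keeps the rule unit-testable without mutating process environment.

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum HealthTransitionPolicy {
    Manual,
    TimeBasedReady { after_seconds: u64 },
}

/// A valid override value (after trimming) replaces the configured policy;
/// an absent, empty, or unparsable value leaves it untouched.
fn apply_auto_ready_override(
    current: HealthTransitionPolicy,
    raw: Option<&str>,
) -> HealthTransitionPolicy {
    match raw.map(str::trim).and_then(|s| s.parse::<u64>().ok()) {
        Some(seconds) => HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds },
        None => current,
    }
}

fn main() {
    let raw = env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS").ok();
    let policy = apply_auto_ready_override(HealthTransitionPolicy::Manual, raw.as_deref());
    println!("effective policy: {:?}", policy);

    // manual + override=5 (with whitespace) → TimeBasedReady(5)
    assert_eq!(
        apply_auto_ready_override(HealthTransitionPolicy::Manual, Some(" 5 ")),
        HealthTransitionPolicy::TimeBasedReady { after_seconds: 5 }
    );
    // manual + no override → Manual
    assert_eq!(
        apply_auto_ready_override(HealthTransitionPolicy::Manual, None),
        HealthTransitionPolicy::Manual
    );
}
```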
lib/runtime/src/system_status_server.rs (1)

174-176: Hold the health lock for the minimum time.

Scope the mutex guard to just the reads/updates to avoid holding it while building the response.

-    let mut system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status_with_transition_check();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime) = {
+        let mut sh = state.drt().system_health.lock().unwrap();
+        let (h, eps) = sh.get_health_status_with_transition_check();
+        let up = sh.uptime();
+        (h, eps, up)
+    };
+    let uptime = Some(uptime);

If readiness needs to progress without probe traffic, consider a background interval that calls the transition check to advance state even when /health isn’t polled.
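The background-progression idea can be sketched with a plain thread; the real runtime would presumably use a tokio interval task instead, and the closure body here is only a stand-in for check_and_update_health_status(). The spawn_health_ticker helper name is made up for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawn a ticker that periodically runs a (stand-in) transition check,
/// so readiness can advance even when /health is never polled.
/// `ready` stands in for the shared SystemHealth; true means "ready".
fn spawn_health_ticker(ready: Arc<Mutex<bool>>, ticks: u32) -> JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..ticks {
            thread::sleep(Duration::from_millis(10));
            // Placeholder for check_and_update_health_status():
            *ready.lock().unwrap() = true;
        }
    })
}

fn main() {
    let health = Arc::new(Mutex::new(false));
    let handle = spawn_health_ticker(Arc::clone(&health), 3);
    handle.join().unwrap();
    assert!(*health.lock().unwrap());
    println!("ready = {}", health.lock().unwrap());
}
```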

lib/runtime/src/distributed.rs (1)

78-85: Optional: make HealthTransitionPolicy Copy to avoid clones.

If the enum is small and non-owning, deriving Copy (and Clone) on HealthTransitionPolicy lets you pass it by value here and elsewhere without allocations.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and a4acd2c.

📒 Files selected for processing (5)
  • container/deps/requirements.txt (1 hunks)
  • lib/runtime/src/config.rs (6 hunks)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/lib.rs (5 hunks)
  • lib/runtime/src/system_status_server.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/runtime/src/system_status_server.rs (1)
components/metrics/src/lib.rs (1)
  • endpoints (563-563)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/lib.rs (2)
  • new (105-126)
  • new (283-288)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (9)
lib/runtime/src/config.rs (3)

218-221: Env→field mapping LGTM.

Mapping DYN_SYSTEM_HEALTH_TRANSITION_POLICY → health_transition_policy is correct and consistent with existing keys.


271-271: Default for single_threaded includes policy — LGTM.


301-301: Default policy set to TimeBasedReady(30s) — sensible default.

lib/runtime/src/distributed.rs (1)

78-87: Plumbing policy into SystemHealth::new is correct.

The config field is passed through cleanly and the call-site matches the new constructor signature.

lib/runtime/src/lib.rs (5)

68-68: Import looks good.

Bringing HealthTransitionPolicy into scope here is expected.


97-102: New field on SystemHealth is fine.

No ordering/ownership issues; fits existing struct layout.


135-165: EndpointBasedReady semantics: code requires ALL endpoints ready; comment implies “at least one.”

The comment “Ready when service has registered at least one endpoint” conflicts with the implementation that requires all endpoints to be Ready. Please clarify intended semantics and align code/comment.

  • If “any endpoint ready” is intended, change all(...) to any(...).
  • If “all endpoints ready” is intended, update the comment accordingly.

Option A (any endpoint ready):

-            HealthTransitionPolicy::EndpointBasedReady => {
-                // Ready when service has registered at least one endpoint
-                if !self.endpoint_health.is_empty() {
-                    let all_endpoints_ready = self.endpoint_health.values()
-                        .all(|status| *status == HealthStatus::Ready);
-
-                    if all_endpoints_ready {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                // Ready when at least one endpoint reports Ready
+                if !self.endpoint_health.is_empty() {
+                    let any_endpoint_ready = self.endpoint_health.values()
+                        .any(|status| *status == HealthStatus::Ready);
+
+                    if any_endpoint_ready {
-                        tracing::info!("Auto-transitioning to ready - all {} endpoints are ready (policy: endpoint_based_ready)",
-                                     self.endpoint_health.len());
+                        tracing::info!("Auto-transitioning to ready - at least one of {} endpoints is ready (policy: endpoint_based_ready)",
+                                     self.endpoint_health.len());
                         self.system_health = HealthStatus::Ready;
                     }
                 }
             },
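The gap between the two semantics is easy to demonstrate with `Iterator::all` vs `Iterator::any` over the endpoint map; the endpoint names below are made up for illustration.

```rust
use std::collections::HashMap;

#[derive(PartialEq)]
enum HealthStatus { NotReady, Ready }

fn main() {
    let mut endpoint_health: HashMap<&str, HealthStatus> = HashMap::new();
    endpoint_health.insert("generate", HealthStatus::Ready);
    endpoint_health.insert("embed", HealthStatus::NotReady);

    // "all endpoints ready" — what the implementation currently requires.
    let all_ready = endpoint_health.values().all(|s| *s == HealthStatus::Ready);
    // "at least one endpoint ready" — what the comment implies.
    let any_ready = endpoint_health.values().any(|s| *s == HealthStatus::Ready);

    assert!(!all_ready);
    assert!(any_ready);
    println!("all_ready={all_ready}, any_ready={any_ready}");
}
```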

197-205: Wrapper that runs transition check before reporting health: good.

This ensures readiness can progress without external nudges.


105-126: Constructor signature change verified: no remaining old-arity calls.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: GavinZhu-GMI <gavin.z@gmicloud.ai>
@nnshah1
Contributor

nnshah1 commented Sep 10, 2025

@tzulingk for visibility as it relates to health checks.

@GavinZhu-GMI for simple services, instead of a time-based check, I think either providing an API to set the health explicitly or defaulting to ready would be better.

A time-based check does not really indicate the service is 'ready', so it may be misleading, whether set too long or too short.

Do you have a scenario in mind for when you would set it, if we provided an explicit API?

@grahamking
Contributor

Can you explain the problem? How could we reproduce what you are seeing that motivated this PR?

@GavinZhu-GMI
Contributor Author

Can you explain the problem? How could we reproduce what you are seeing that motivated this PR?

Sure. I am trying to use dynamo to do PD disaggregated serving on two H200 nodes. The prefill and decode pods just kept restarting without ever becoming healthy. After reading the code, I hard-coded DYN_SYSTEM_STARTING_HEALTH_STATUS to the ready state in the env, and the service then became healthy once all the endpoints were online.
Hence I think some policy might be helpful if we need the service to become healthy in a dynamic way.

@GavinZhu-GMI
Contributor Author

@tzulingk for visibility as it relates to health checks.

@GavinZhu-GMI for simple services, instead of a time-based check, I think either providing an API to set the health explicitly or defaulting to ready would be better.

A time-based check does not really indicate the service is 'ready', so it may be misleading, whether set too long or too short.

Do you have a scenario in mind for when you would set it, if we provided an explicit API?

Sure, a time-based check is indeed misleading.
I believe this is now implemented in PR #2903; I will jump to it and give it a go after it gets merged.


Labels

external-contribution Pull request is from an external contributor fix size/L
