fix: fixed failing health probes to enable state transition #2925
Conversation
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Walkthrough

Replaces pynvml with nvidia-ml-py in container dependencies. Introduces HealthTransitionPolicy in runtime config and SystemHealth, adds policy-driven health status transitions, updates initialization to pass the policy, and modifies the health HTTP handler to perform transition checks before reporting status.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant Server as system_status_server
    participant SH as SystemHealth
    participant Cfg as RuntimeConfig
    Note over Cfg: Startup
    Cfg->>Server: health_transition_policy
    Server->>SH: SystemHealth::new(..., policy, ...)
    Note over Client,Server: Health endpoint request
    Client->>Server: GET /health
    Server->>SH: get_health_status_with_transition_check()
    activate SH
    SH->>SH: check_and_update_health_status()
    alt Policy: Manual
        Note right of SH: No auto transition
    else Policy: TimeBasedReady
        Note right of SH: Ready after N seconds
    else Policy: EndpointBasedReady
        Note right of SH: Ready when endpoints report ready
    else Policy: Custom
        Note right of SH: Optional time + endpoint readiness
    end
    SH-->>Server: (healthy, endpoints)
    deactivate SH
    Server-->>Client: JSON health status
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/runtime/src/lib.rs (1)
221-229: Policy is ignored when computing “healthy” — readiness may stay false despite time-based transition.

get_health_status always gates on use_endpoint_health_status when non-empty, even if the policy (TimeBasedReady or Custom with require_endpoints_ready=false) intends readiness to be time-driven. This can negate the PR’s goal and keep probes failing.
Incorporate HealthTransitionPolicy into the healthy computation:
```diff
-        let healthy = if !self.use_endpoint_health_status.is_empty() {
-            self.use_endpoint_health_status.iter().all(|endpoint| {
-                self.endpoint_health
-                    .get(endpoint)
-                    .is_some_and(|status| *status == HealthStatus::Ready)
-            })
-        } else {
-            self.system_health == HealthStatus::Ready
-        };
+        let healthy = match &self.health_transition_policy {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // No endpoints configured — fall back to system health.
+                    self.system_health == HealthStatus::Ready
+                }
+            }
+            HealthTransitionPolicy::Custom { require_endpoints_ready, .. }
+                if *require_endpoints_ready =>
+            {
+                if !self.use_endpoint_health_status.is_empty() {
+                    self.use_endpoint_health_status.iter().all(|endpoint| {
+                        self.endpoint_health
+                            .get(endpoint)
+                            .is_some_and(|status| *status == HealthStatus::Ready)
+                    })
+                } else {
+                    // Explicitly required endpoints but none configured — treat as not ready
+                    // to avoid false positives and surface misconfiguration.
+                    false
+                }
+            }
+            _ => {
+                // Manual, TimeBasedReady, or Custom without endpoint requirement
+                self.system_health == HealthStatus::Ready
+            }
+        };
```

Follow-up: consider logging a warning when require_endpoints_ready=true but no endpoints are configured.
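That warning could be a small guard next to the `Custom` arm above; a minimal sketch, assuming the field and policy names used in this review's diff (not verified against the PR):

```rust
// Hypothetical guard for the Custom { require_endpoints_ready: true } arm;
// `use_endpoint_health_status` and `tracing` usage follow the diff above.
if *require_endpoints_ready && self.use_endpoint_health_status.is_empty() {
    tracing::warn!(
        "require_endpoints_ready is set but use_endpoint_health_status is empty; \
         health will report not-ready until endpoints are configured"
    );
}
```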
🧹 Nitpick comments (5)
lib/runtime/src/config.rs (3)
62-85: Enum design looks good; clarify env encoding and add Display for better logs.

Nice policy surface. Two small improvements:
- Document how to set complex enum variants via env (e.g., JSON for time_based_ready/custom).
- Add fmt::Display to produce readable logs (e.g., TimeBasedReady(30s)).
Add a Display impl:
```diff
@@
 impl Default for HealthTransitionPolicy {
     fn default() -> Self {
         // Better default: auto-ready after 30 seconds for simple services
         Self::TimeBasedReady { after_seconds: 30 }
     }
 }
+
+impl fmt::Display for HealthTransitionPolicy {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self {
+            HealthTransitionPolicy::Manual => write!(f, "manual"),
+            HealthTransitionPolicy::TimeBasedReady { after_seconds } => {
+                write!(f, "time_based_ready(after_seconds={})", after_seconds)
+            }
+            HealthTransitionPolicy::EndpointBasedReady => write!(f, "endpoint_based_ready"),
+            HealthTransitionPolicy::Custom { auto_ready_after_seconds, require_endpoints_ready } => {
+                match auto_ready_after_seconds {
+                    Some(s) => write!(f, "custom(auto_ready_after_seconds={}, require_endpoints_ready={})", s, require_endpoints_ready),
+                    None => write!(f, "custom(auto_ready_after_seconds=None, require_endpoints_ready={})", require_endpoints_ready),
+                }
+            }
+        }
+    }
+}
```

Optionally derive Copy since all fields are Copy, if convenient.
141-147: Expose policy in config Display output for observability.

Including health_transition_policy in fmt::Display helps trace effective settings.
Minimal change using Debug until Display is added:
```diff
@@
         write!(f, ", system_live_path={}", self.system_live_path)?;
+        write!(f, ", health_transition_policy={:?}", self.health_transition_policy)?;
```
239-251: Clarify precedence and log overrides when AUTO_READY_AFTER_SECONDS is set.

AUTO_READY_AFTER_SECONDS unconditionally overrides any value from HEALTH_TRANSITION_POLICY. Make this precedence explicit in logs to avoid confusion; also trim whitespace before parsing.
```diff
-    if let Ok(seconds_str) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
-        if !seconds_str.is_empty() {
-            if let Ok(seconds) = seconds_str.parse::<u64>() {
-                tracing::info!("Using DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={} for health transition policy", seconds);
-                config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+    if let Ok(raw) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
+        let seconds_str = raw.trim();
+        if !seconds_str.is_empty() {
+            if let Ok(seconds) = seconds_str.parse::<u64>() {
+                let prev = config.health_transition_policy.clone();
+                config.health_transition_policy = HealthTransitionPolicy::TimeBasedReady { after_seconds: seconds };
+                tracing::info!(
+                    "Overriding health_transition_policy {:?} → TimeBasedReady(after_seconds={}) due to DYN_SYSTEM_AUTO_READY_AFTER_SECONDS",
+                    prev, seconds
+                );
             } else {
                 tracing::warn!("Invalid value for DYN_SYSTEM_AUTO_READY_AFTER_SECONDS: '{}', expected a number", seconds_str);
             }
         }
     }
```

Consider adding tests to pin behavior (a sketch follows the list below):
- HEALTH_TRANSITION_POLICY=manual → Manual.
- HEALTH_TRANSITION_POLICY=manual + AUTO_READY_AFTER_SECONDS=5 → TimeBasedReady(5).
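A hedged sketch of such tests; `RuntimeConfig::from_settings`, the exact env key names, and a `PartialEq` derive on the policy enum are assumptions, not verified against the PR:

```rust
// Hypothetical tests pinning env-var precedence; constructor and key names
// are guesses based on the keys discussed in this review.
#[test]
fn manual_policy_from_env() {
    std::env::set_var("DYN_SYSTEM_HEALTH_TRANSITION_POLICY", "manual");
    std::env::remove_var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS");
    let config = RuntimeConfig::from_settings().unwrap();
    assert_eq!(config.health_transition_policy, HealthTransitionPolicy::Manual);
}

#[test]
fn auto_ready_seconds_overrides_manual_policy() {
    std::env::set_var("DYN_SYSTEM_HEALTH_TRANSITION_POLICY", "manual");
    std::env::set_var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS", "5");
    let config = RuntimeConfig::from_settings().unwrap();
    assert_eq!(
        config.health_transition_policy,
        HealthTransitionPolicy::TimeBasedReady { after_seconds: 5 }
    );
}
```

Since both tests mutate process-global env vars, they should run serially (e.g., via the serial_test crate) to avoid cross-test races.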
lib/runtime/src/system_status_server.rs (1)
174-176: Hold the health lock for the minimum time.

Scope the mutex guard to just the reads/updates to avoid holding it while building the response.
```diff
-    let mut system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status_with_transition_check();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime) = {
+        let mut sh = state.drt().system_health.lock().unwrap();
+        let (h, eps) = sh.get_health_status_with_transition_check();
+        let up = sh.uptime();
+        (h, eps, up)
+    };
+    let uptime = Some(uptime);
```

If readiness needs to progress without probe traffic, consider a background interval that calls the transition check to advance state even when /health isn’t polled.
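A minimal sketch of that background interval, assuming `system_health` is the `Arc<Mutex<SystemHealth>>` handle used elsewhere in this review and that the transition-check method is cheap; the 1-second period is an arbitrary choice:

```rust
// Hypothetical background task; names follow this review's snippets.
let system_health = state.drt().system_health.clone();
tokio::spawn(async move {
    let mut ticker = tokio::time::interval(std::time::Duration::from_secs(1));
    loop {
        ticker.tick().await;
        // Drive time-based transitions even when /health is not being polled.
        // The guard is dropped before the next await, so no lock is held
        // across suspension points.
        let _ = system_health
            .lock()
            .unwrap()
            .get_health_status_with_transition_check();
    }
});
```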
lib/runtime/src/distributed.rs (1)
78-85: Optional: make HealthTransitionPolicy Copy to avoid clones.

If the enum is small and non-owning, deriving Copy (and Clone) on HealthTransitionPolicy lets you pass it by value here and elsewhere without allocations, as sketched below.
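For reference, a sketch of the enum with the suggested derives, reconstructed from the variants shown in this review's Display impl (any serde derives on the real type are omitted here):

```rust
// Variant shapes are taken from the suggested Display impl above; all fields
// (u64, bool, Option<u64>) are Copy, so the whole enum can derive Copy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthTransitionPolicy {
    Manual,
    TimeBasedReady { after_seconds: u64 },
    EndpointBasedReady,
    Custom {
        auto_ready_after_seconds: Option<u64>,
        require_endpoints_ready: bool,
    },
}
```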
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- container/deps/requirements.txt (1 hunks)
- lib/runtime/src/config.rs (6 hunks)
- lib/runtime/src/distributed.rs (1 hunks)
- lib/runtime/src/lib.rs (5 hunks)
- lib/runtime/src/system_status_server.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/runtime/src/system_status_server.rs (1)
components/metrics/src/lib.rs (1)
endpoints (563-563)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/lib.rs (2)
new (105-126)
new (283-288)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (9)
lib/runtime/src/config.rs (3)
218-221: Env→field mapping LGTM.

Mapping DYN_SYSTEM_HEALTH_TRANSITION_POLICY → health_transition_policy is correct and consistent with existing keys.
271-271: Default for single_threaded includes policy — LGTM.
301-301: Default policy set to TimeBasedReady(30s) — sensible default.

lib/runtime/src/distributed.rs (1)
78-87: Plumbing policy into SystemHealth::new is correct.

The config field is passed through cleanly and the call-site matches the new constructor signature.
lib/runtime/src/lib.rs (5)
68-68: Import looks good.

Bringing HealthTransitionPolicy into scope here is expected.
97-102: New field on SystemHealth is fine.

No ordering/ownership issues; fits existing struct layout.
135-165: EndpointBasedReady semantics: code requires ALL endpoints ready; comment implies “at least one.”

The comment “Ready when service has registered at least one endpoint” conflicts with the implementation that requires all endpoints to be Ready. Please clarify intended semantics and align code/comment.
- If “any endpoint ready” is intended, change all(...) to any(...).
- If “all endpoints ready” is intended, update the comment accordingly.
Option A (any endpoint ready):
```diff
-            HealthTransitionPolicy::EndpointBasedReady => {
-                // Ready when service has registered at least one endpoint
-                if !self.endpoint_health.is_empty() {
-                    let all_endpoints_ready = self.endpoint_health.values()
-                        .all(|status| *status == HealthStatus::Ready);
-
-                    if all_endpoints_ready {
+            HealthTransitionPolicy::EndpointBasedReady => {
+                // Ready when at least one endpoint reports Ready
+                if !self.endpoint_health.is_empty() {
+                    let any_endpoint_ready = self.endpoint_health.values()
+                        .any(|status| *status == HealthStatus::Ready);
+
+                    if any_endpoint_ready {
                         tracing::info!("Auto-transitioning to ready - all {} endpoints are ready (policy: endpoint_based_ready)", self.endpoint_health.len());
                         self.system_health = HealthStatus::Ready;
                     }
                 }
             },
```
197-205: Wrapper that runs transition check before reporting health: good.

This ensures readiness can progress without external nudges.
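For readers following along, the wrapper presumably has roughly the shape below; the method names come from this review and the return type is a guess based on the sequence diagram's (healthy, endpoints) tuple:

```rust
// Hypothetical shape; the real method lives in lib.rs (lines 197-205).
// The endpoint map type is assumed — the diagram only shows a tuple.
pub fn get_health_status_with_transition_check(
    &mut self,
) -> (bool, std::collections::HashMap<String, HealthStatus>) {
    // Apply the configured HealthTransitionPolicy first, so the status
    // reported below reflects any auto-transition that just occurred.
    self.check_and_update_health_status();
    self.get_health_status()
}
```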
105-126: Constructor signature change verified: no remaining old-arity calls.
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Force-pushed from 57c7e3f to 64652c3
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: GavinZhu-GMI <gavin.z@gmicloud.ai>
@tzulingk for visibility as it relates to health checks. @GavinZhu-GMI, for simple services, instead of a time-based check I think either providing an API to set the health explicitly, or defaulting to ready, would be better. Time-based doesn't really indicate the service is 'ready', so it may be misleading, whether too long or too short. Do you have a scenario in mind for when you would set it if we provided an explicit API?
Can you explain the problem? How could we reproduce what you are seeing that motivated this PR?
Sure. I am trying to use dynamo to do PD disagg serving on two H200 nodes. The prefill and decode pods just kept restarting without becoming healthy. After reading the code, I hard-coded DYN_SYSTEM_STARTING_HEALTH_STATUS in the env to the ready state, and then the service could become healthy once all the endpoints were online.
Sure, a time-based check is indeed misleading.
Overview:
This fix addresses deployments where health probes were failing because services never transitioned to the ready state.
Details:
Files Modified:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores