feat: health check changes based on endpoint served #1996
Conversation
Can we collect all the Health logic into a struct with methods on it?
I'd like to be able to:

- call Health::mark_as_ready() manually,
- have a default health state - could be NotReady or Ready
  - a default of Ready makes mark_as_ready() a no-op
- set a list of (component.endpoint) that need to be live, which could auto-trigger readiness.
Sounds good - so thinking of an explicit set instead of implicit - @grahamking - reasonable. I can start changes to make more of a trait.

Here's what I'm thinking:
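(The snippet originally posted here isn't preserved in this capture. Below is a minimal sketch consistent with the requests above and with the `SystemHealth` struct this PR describes - the exact names and signatures are assumptions, not the code as merged.)

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HealthStatus {
    Ready,
    NotReady,
}

pub struct SystemHealth {
    system_status: HealthStatus,
    endpoint_health: HashMap<String, HealthStatus>,
    /// Endpoints that must all be Ready for the system to report healthy.
    use_endpoint_health_status: Vec<String>,
}

impl SystemHealth {
    pub fn new(starting: HealthStatus, use_endpoint_health_status: Vec<String>) -> Self {
        Self {
            system_status: starting,
            endpoint_health: HashMap::new(),
            use_endpoint_health_status,
        }
    }

    /// A no-op when the configured starting status is already Ready.
    pub fn mark_as_ready(&mut self) {
        self.system_status = HealthStatus::Ready;
    }

    pub fn set_endpoint_health_status(&mut self, endpoint: String, status: HealthStatus) {
        self.endpoint_health.insert(endpoint, status);
    }

    /// Overall health is gated on the endpoint list when one is configured
    /// (the explicit set discussed above); otherwise it falls back to the
    /// system-wide status, so mark_as_ready() still matters.
    pub fn get_health_status(&self) -> (bool, HashMap<String, HealthStatus>) {
        let healthy = if self.use_endpoint_health_status.is_empty() {
            self.system_status == HealthStatus::Ready
        } else {
            self.use_endpoint_health_status
                .iter()
                .all(|e| self.endpoint_health.get(e) == Some(&HealthStatus::Ready))
        };
        (healthy, self.endpoint_health.clone())
    }
}
```

The explicit list keeps the gating opt-in: a process that serves no endpoints still controls its own readiness manually.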
Walkthrough

The changes introduce a comprehensive health status tracking system to the runtime. A new SystemHealth struct tracks overall and per-endpoint health, initialized from the starting_health_status and use_endpoint_health_status configuration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as RuntimeConfig
    participant Dist as DistributedRuntime
    participant Health as SystemHealth
    participant Http as HTTP Server
    participant Endpoint as PushEndpoint

    Config->>Dist: Provide starting_health_status, use_endpoint_health_status
    Dist->>Health: Initialize SystemHealth with config
    Dist->>Http: Share Arc<Mutex<SystemHealth>>
    Endpoint->>Health: Set endpoint health to Ready at start
    Http->>Health: Query get_health_status()
    Health-->>Http: Return overall and endpoint health
    Endpoint->>Health: Set endpoint health to NotReady on shutdown
```
Estimated code review effort: 3 (~45 minutes)
Actionable comments posted: 0
🧹 Nitpick comments (5)
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)
55-58: LGTM! Correct health status lifecycle management.

The health status updates at startup (Ready) and shutdown (NotReady) are properly placed and follow the expected endpoint lifecycle pattern.

Minor optimization suggestion: Consider using `&endpoint_name` instead of cloning in both health update calls to avoid unnecessary string allocations.

```diff
 system_health
     .lock()
     .await
-    .set_endpoint_health_status(endpoint_name.clone(), HealthStatus::Ready);
+    .set_endpoint_health_status(&endpoint_name, HealthStatus::Ready);
```

And similarly for the shutdown call (assuming the `SystemHealth` method signature accepts `&str`).

Also applies to: 114-117
lib/runtime/src/config.rs (1)
127-136: Improve formatting consistency in Display implementation.

The new fields are added to the display output, but the formatting is inconsistent with existing fields. Consider adding proper separators for better readability.

Apply this diff to improve formatting consistency:

```diff
-        write!(f, "system_enabled={}", self.system_enabled)?;
-        write!(
-            f,
-            "use_endpoint_health_status={:?}",
-            self.use_endpoint_health_status
-        )?;
-        write!(
-            f,
-            "starting_health_status={:?}",
-            self.starting_health_status
-        )?;
+        write!(f, "system_enabled={}, ", self.system_enabled)?;
+        write!(f, "starting_health_status={:?}, ", self.starting_health_status)?;
+        write!(
+            f,
+            "use_endpoint_health_status={:?}",
+            self.use_endpoint_health_status
+        )?;
```

lib/runtime/src/lib.rs (1)
79-140: Well-designed health tracking system with minor improvements needed.

The `SystemHealth` implementation provides a solid foundation for tracking system and endpoint health. The logic correctly handles both system-wide and per-endpoint health tracking.

Consider these improvements for better consistency and robustness:

```diff
 endpoint.clone(),
 if *ready == HealthStatus::Ready {
     "ready".to_string()
 } else {
-    "notready".to_string()
+    "not_ready".to_string()
 },
```

Also consider adding error handling for cases where an endpoint in `use_endpoint_health_status` doesn't exist in `endpoint_health`:

```diff
 if !self.use_endpoint_health_status.is_empty() {
     healthy = self.use_endpoint_health_status.iter().all(|endpoint| {
         self.endpoint_health
             .get(endpoint)
-            .map_or(false, |status| *status == HealthStatus::Ready)
+            .map_or_else(|| {
+                tracing::warn!("Endpoint {} not found in health map", endpoint);
+                false
+            }, |status| *status == HealthStatus::Ready)
     });
```

lib/runtime/src/http_server.rs (2)
151-173: Excellent enhancement to health endpoint functionality.

The refactored health handler provides much more useful information with:

- Structured JSON response including status, uptime, and per-endpoint health
- Appropriate HTTP status codes (200 for healthy, 503 for unhealthy)
- Proper tracing instrumentation for debugging

Consider using debug level instead of trace for the response logging to make it more accessible during normal debugging:

```diff
-    tracing::trace!("Response {}", response.to_string());
+    tracing::debug!("Health response: {}", response);
```
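For readers skimming the review, a minimal sketch of the response shape being described - assuming an axum-style handler; the field names and hard-coded values here are illustrative, not the PR's actual code:

```rust
use axum::{http::StatusCode, response::IntoResponse, Json};
use serde_json::json;

// Illustrative only: the real handler would pull these values from
// SystemHealth::get_health_status() behind the shared Arc<Mutex<_>>.
async fn health_handler() -> impl IntoResponse {
    let healthy = true; // assumed: overall status from SystemHealth
    let body = json!({
        "status": if healthy { "healthy" } else { "unhealthy" },
        "uptime": 42,                         // seconds since start (assumed field name)
        "endpoints": { "generate": "ready" }, // per-endpoint health map
    });
    let code = if healthy {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    (code, Json(body))
}
```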
288-364: Comprehensive test coverage with opportunities for improvement.

The parameterized test effectively covers both healthy and unhealthy scenarios across multiple endpoints. The use of `rstest` for parameterization and `temp_env::async_with_vars` for environment isolation is excellent.

Consider these improvements for better maintainability:

- Replace println! with proper test logging:

```diff
-    println!("[test] Waiting for server to start...");
+    tracing::info!("Waiting for server to start...");
```

- Reduce sleep duration for faster tests:

```diff
-    sleep(std::time::Duration::from_millis(1000)).await;
+    sleep(std::time::Duration::from_millis(100)).await;
```

- Consider extracting server setup into a test helper function to reduce code duplication and improve readability (see the sketch below).
The test logic is solid and provides excellent coverage of the health endpoint functionality.
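As a reference for that suggestion, a self-contained sketch of the `rstest` + `temp_env::async_with_vars` pattern the test relies on - the cases and assertion are illustrative, not the PR's actual test:

```rust
use rstest::rstest;

#[rstest]
#[case::healthy("ready", true)]
#[case::unhealthy("notready", false)]
#[tokio::test]
async fn starting_health_status_env(#[case] raw: &str, #[case] expect_ready: bool) {
    // temp_env sets the variable for the duration of the async block and
    // restores it afterwards, keeping parallel tests from interfering.
    temp_env::async_with_vars(
        [("DYN_SYSTEM_STARTING_HEALTH_STATUS", Some(raw))],
        async move {
            let v = std::env::var("DYN_SYSTEM_STARTING_HEALTH_STATUS").unwrap();
            assert_eq!(v == "ready", expect_ready);
        },
    )
    .await;
}
```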
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)
- lib/runtime/Cargo.toml (1 hunks)
- lib/runtime/src/component.rs (1 hunks)
- lib/runtime/src/component/endpoint.rs (1 hunks)
- lib/runtime/src/config.rs (8 hunks)
- lib/runtime/src/distributed.rs (3 hunks)
- lib/runtime/src/http_server.rs (4 hunks)
- lib/runtime/src/lib.rs (3 hunks)
- lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
lib/runtime/Cargo.toml (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/src/component/endpoint.rs (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.
lib/runtime/src/lib.rs (2)
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/runtime/src/distributed.rs (4)
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
lib/runtime/src/http_server.rs (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
lib/runtime/src/config.rs (1)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (6)
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (13)
lib/runtime/Cargo.toml (1)
72-72: LGTM! Dev dependency feature addition is appropriate.

The addition of the `async_closure` feature to the `temp-env` crate aligns with the health status testing enhancements mentioned in the AI summary.

lib/runtime/src/component.rs (1)
32-34: LGTM! Clean import addition for health status support.

The refactoring to multi-line imports improves readability and the addition of the `HealthStatus` import supports the new health tracking functionality.

lib/runtime/src/component/endpoint.rs (1)
113-117: LGTM! Proper integration of health tracking into endpoint startup.

The addition of endpoint name and system health parameters to the `push_endpoint.start()` call correctly integrates the new health tracking system with endpoint lifecycle management.

lib/runtime/src/distributed.rs (1)
68-86: LGTM! Clean SystemHealth initialization and integration.

The SystemHealth initialization properly uses configuration values and is correctly wrapped in Arc<Mutex<_>> for thread-safe sharing across the distributed runtime. The early config retrieval avoids redundant calls later in the function.
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)
44-49: LGTM! Proper health status lifecycle integration.

The method signature update correctly accepts the necessary parameters for health tracking integration.
lib/runtime/src/config.rs (6)
51-56: LGTM - Well-designed enum for health status.

The `HealthStatus` enum is properly implemented with appropriate derives and serde configuration for lowercase serialization, making it user-friendly in configuration files.
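For illustration, the described enum would look roughly like this - the derive list is an assumption; the lowercase rename is what makes "notready" the config spelling:

```rust
use serde::{Deserialize, Serialize};

// rename_all = "lowercase" maps NotReady -> "notready", matching the
// values used in env vars and the health response.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum HealthStatus {
    Ready,
    NotReady,
}
```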
98-112: LGTM - Well-structured configuration fields.

The new health-related configuration fields are properly defined with:
- Clear documentation explaining their purpose
- Sensible defaults (NotReady for starting status, empty vector for endpoints)
- Correct builder attributes for serialization control
- Consistent naming with existing fields
170-171: LGTM - Correct environment variable mapping.

The new environment variable mappings follow the established pattern and correctly map to the corresponding struct fields.
208-209: LGTM - Consistent default values.

The `single_threaded()` method correctly initializes the new health-related fields with appropriate defaults.
235-236: LGTM - Consistent default values.

The `Default` implementation correctly initializes the new health-related fields with the same defaults as other constructors.
413-433: Confirm JSON array parsing for Vec<String> in environment variables.

I didn't find any existing examples of loading a JSON-formatted array from an environment variable into a `Vec<String>` in our config code. Please verify that figment's Env provider (used by `RuntimeConfig::from_settings()`) indeed supports deserializing a JSON array string (e.g. `"[\"ready\"]"`) into `Vec<String>`. If it does not, consider one of the following:

- Switch to a comma-delimited string (e.g. `"ready,other"`) and split it manually or via a custom caster, as sketched below.
- Add custom parsing logic to handle JSON arrays for `use_endpoint_health_status`.

Tagging for manual verification to ensure this test will actually pass in CI.
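If the comma-delimited route is taken, the parsing is trivial. A hypothetical helper (`parse_endpoint_list` does not exist in the codebase; the endpoint names are made up):

```rust
// Hypothetical fallback parser for DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS,
// for the case where JSON-array deserialization from an env var is unsupported.
fn parse_endpoint_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

fn main() {
    assert_eq!(
        parse_endpoint_list("generate, embed"),
        vec!["generate".to_string(), "embed".to_string()]
    );
}
```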
lib/runtime/src/lib.rs (1)
170-171: LGTM - Appropriate thread-safe integration.

The `system_health` field is correctly added to `DistributedRuntime` with `Arc<Mutex<SystemHealth>>` for safe concurrent access across async contexts.

lib/runtime/src/http_server.rs (1)
16-16: LGTM - Necessary imports for enhanced health functionality.

The new imports support the structured JSON health response and health status tracking functionality.
Also applies to: 21-22
Just a question regarding what looks to be a duplicate endpoint_name, or a redundant endpoint_name being passed to the PushEndpoint.
Overview:
Adds DYN_SYSTEM_STARTING_HEALTH_STATUS to set the starting health status for a dynamo process
Adds DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS to set a list of endpoints that must be healthy to indicate overall health.
Adds a struct with the ability to set the overall health status, set per-endpoint health status, and get the current health status.
Endpoints are automatically set to Ready when served and to the starting health status when initialized.
Example:
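(The original command isn't preserved in this capture. A hypothetical invocation - the two health variables are from this PR, while `DYN_SYSTEM_ENABLED`, the endpoint name, and the worker command are assumptions:)

```sh
# Start NotReady; report ready only once the "generate" endpoint is served.
DYN_SYSTEM_ENABLED=true \
DYN_SYSTEM_STARTING_HEALTH_STATUS=notready \
DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS='["generate"]' \
  <worker command>
```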
In another shell
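(Also hypothetical - the system server's port and route depend on configuration:)

```sh
# Expect 503 with "notready" before the endpoint is up, 200 with "ready" after.
curl -i http://localhost:8081/health
```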
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)