@nnshah1
Contributor

@nnshah1 nnshah1 commented Jul 17, 2025

Overview:

Adds DYN_SYSTEM_STARTING_HEALTH_STATUS, which sets the starting health status for a dynamo process.

Adds DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS, which sets the list of endpoints that must be healthy for the process to report overall health.

Adds a struct with methods to set the overall health status, set an individual endpoint's health status, and get the current health status.

Endpoints are set to the starting health status when initialized and are automatically set to Ready when served.
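
The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the type names and the `use_endpoint_health_status` / `endpoint_health` field names are borrowed from snippets later in this thread, while `init_endpoint`, `serve_endpoint`, and `is_healthy` are hypothetical method names chosen for the example.

```rust
use std::collections::HashMap;

/// Two-state health flag, mirroring the "ready" / "notready" values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HealthStatus {
    Ready,
    NotReady,
}

/// Illustrative tracker; method names below are assumptions.
struct SystemHealth {
    system_health: HealthStatus,
    starting_health_status: HealthStatus,
    use_endpoint_health_status: Vec<String>,
    endpoint_health: HashMap<String, HealthStatus>,
}

impl SystemHealth {
    fn new(starting: HealthStatus, use_endpoint_health_status: Vec<String>) -> Self {
        SystemHealth {
            system_health: starting,
            starting_health_status: starting,
            use_endpoint_health_status,
            endpoint_health: HashMap::new(),
        }
    }

    /// An initialized endpoint inherits the starting status.
    fn init_endpoint(&mut self, name: &str) {
        self.endpoint_health
            .insert(name.to_string(), self.starting_health_status);
    }

    /// Serving an endpoint flips it to Ready.
    fn serve_endpoint(&mut self, name: &str) {
        self.endpoint_health
            .insert(name.to_string(), HealthStatus::Ready);
    }

    /// With no listed endpoints the system status decides; otherwise
    /// every listed endpoint must be Ready.
    fn is_healthy(&self) -> bool {
        if self.use_endpoint_health_status.is_empty() {
            return self.system_health == HealthStatus::Ready;
        }
        self.use_endpoint_health_status
            .iter()
            .all(|name| self.endpoint_health.get(name) == Some(&HealthStatus::Ready))
    }
}
```

With a starting status of NotReady and `["generate"]` as the endpoint list, `is_healthy()` in this sketch stays false until `serve_endpoint("generate")` is called, which matches the 503-then-200 sequence in the example below.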

Example:

export DYN_SYSTEM_ENABLED="true"
export DYN_SYSTEM_STARTING_HEALTH_STATUS="notready"
export DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="[\"generate\"]"
python3 examples/vllm/components/main.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-8

In another shell

curl localhost:9090/health -v
* Host localhost:9090 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:9090...
* connect to ::1 port 9090 from ::1 port 51428 failed: Connection refused
*   Trying 127.0.0.1:9090...
* Connected to localhost (127.0.0.1) port 9090
> GET /health HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-type: text/plain; charset=utf-8
< content-length: 92
< date: Mon, 21 Jul 2025 14:59:44 GMT
<
* Connection #0 to host localhost left intact
...
 
curl localhost:9090/health -v
* Host localhost:9090 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:9090...
* connect to ::1 port 9090 from ::1 port 34510 failed: Connection refused
*   Trying 127.0.0.1:9090...
* Connected to localhost (127.0.0.1) port 9090
> GET /health HTTP/1.1
> Host: localhost:9090
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: text/plain; charset=utf-8
< content-length: 112
< date: Mon, 21 Jul 2025 14:59:59 GMT
<
* Connection #0 to host localhost left intact

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@copy-pr-bot

copy-pr-bot bot commented Jul 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nnshah1 nnshah1 changed the title from "draft: initial commit of health check changes" to "feat: initial commit of health check changes" Jul 17, 2025
@github-actions github-actions bot added the feat label Jul 17, 2025
Contributor

@ryanolson ryanolson left a comment

Can we collect all the Health logic into a struct with methods on it?

I'd like to be able to:

  • call Health::mark_as_ready() manually,
  • have a default health state - could be NotReady or Ready
    • a default of Ready makes mark_as_ready() a no-op
  • set a list of (component.endpoint) that need to be live, which could auto trigger.
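
One way to read this request is sketched below. It is only an illustration: `Health::mark_as_ready()` is the name proposed in the comment, while `HealthState`, `endpoint_live`, `is_ready`, the `pending` set, and the example endpoint names are all hypothetical.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HealthState {
    Ready,
    NotReady,
}

struct Health {
    state: HealthState,
    /// Required (component.endpoint) names that are not live yet.
    pending: HashSet<String>,
}

impl Health {
    fn new(default_state: HealthState, required: &[&str]) -> Self {
        Health {
            state: default_state,
            pending: required.iter().map(|name| name.to_string()).collect(),
        }
    }

    /// Manual override. Returns true only if the call changed the state,
    /// so with a default of Ready it is a no-op that returns false.
    fn mark_as_ready(&mut self) -> bool {
        let changed = self.state != HealthState::Ready;
        self.state = HealthState::Ready;
        changed
    }

    /// Records that a required endpoint is live. When the last pending
    /// endpoint comes up, readiness is triggered automatically.
    fn endpoint_live(&mut self, endpoint: &str) {
        if self.pending.remove(endpoint) && self.pending.is_empty() {
            self.mark_as_ready();
        }
    }

    fn is_ready(&self) -> bool {
        self.state == HealthState::Ready
    }
}
```

In this sketch an endpoint that is not in the required list never triggers readiness on its own, so manual `mark_as_ready()` remains the only path when the list is empty and the default is NotReady.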

@nnshah1
Contributor Author

nnshah1 commented Jul 18, 2025

Can we collect all the Health logic into a struct with methods on it?

I'd like to be able to:

  • call Health::mark_as_ready() manually,
  • have a default health state - could be NotReady or Ready
    • a default of Ready makes mark_as_ready() a no-op
  • set a list of (component.endpoint) that need to be live, which could auto trigger.

sounds good - so thinking of an explicit set instead of implicit - @grahamking - reasonable - I can start changes to make more of a trait

@nnshah1
Contributor Author

nnshah1 commented Jul 18, 2025

Here's what I'm thinking:

  1. add a default_health_state and enum
  2. add a struct on runtime that can be used to alter health state
  3. add a list on the struct to auto set when served

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jul 18, 2025
@nnshah1 nnshah1 requested a review from ryanolson July 21, 2025 15:08
@nnshah1 nnshah1 marked this pull request as ready for review July 21, 2025 15:08
@nnshah1 nnshah1 requested a review from a team as a code owner July 21, 2025 15:08
@coderabbitai
Contributor

coderabbitai bot commented Jul 21, 2025

Walkthrough

The changes introduce a comprehensive health status tracking system to the runtime. A new SystemHealth struct and HealthStatus enum are added, with integration across configuration, runtime initialization, HTTP server health endpoints, and endpoint lifecycle management. Health status is now tracked per endpoint and system-wide, with corresponding updates to configuration, tests, and method signatures.

Changes

| File(s) | Change Summary |
|---|---|
| lib/runtime/Cargo.toml | Added async_closure feature to temp-env dev-dependency. |
| lib/runtime/src/component.rs | Expanded import block to include config::HealthStatus. |
| lib/runtime/src/component/endpoint.rs | Updated push_endpoint.start call to pass endpoint name and system health, reflecting updated method signature. |
| lib/runtime/src/config.rs | Introduced HealthStatus enum and two new fields in RuntimeConfig for health status, with serialization, env var mapping, builder pattern, and tests for env var loading. |
| lib/runtime/src/distributed.rs | Modified DistributedRuntime::new to initialize and store a SystemHealth instance using config values, updating the struct and constructor logic. |
| lib/runtime/src/http_server.rs | Updated health handler to return JSON with health details and status code reflecting health; added parameterized tests for health endpoints; added imports and tracing. |
| lib/runtime/src/lib.rs | Added SystemHealth struct with methods for tracking and updating health; extended DistributedRuntime with a system_health field. |
| lib/runtime/src/pipeline/network/ingress/push_endpoint.rs | Modified PushEndpoint::start to accept endpoint name and system health; updates health status before and after main loop. |

Sequence Diagram(s)

sequenceDiagram
    participant Config as RuntimeConfig
    participant Dist as DistributedRuntime
    participant Health as SystemHealth
    participant Http as HTTP Server
    participant Endpoint as PushEndpoint

    Config->>Dist: Provide starting_health_status, use_endpoint_health_status
    Dist->>Health: Initialize SystemHealth with config
    Dist->>Http: Share Arc<Mutex<SystemHealth>>
    Endpoint->>Health: Set endpoint health to Ready at start
    Http->>Health: Query get_health_status()
    Health-->>Http: Return overall and endpoint health
    Endpoint->>Health: Set endpoint health to NotReady on shutdown
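
The flow in the diagram can be approximated with standard-library primitives. This sketch is an assumption-laden stand-in: it uses `std::sync::Mutex` and OS threads rather than the async mutex and tasks the runtime uses, reduces the health state to a `bool` per endpoint, and the names `SharedHealth`, `run_endpoint`, and `query` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Shared health map: endpoint name -> is ready.
type SharedHealth = Arc<Mutex<HashMap<String, bool>>>;

/// Stand-in for an endpoint task: Ready at start, NotReady on shutdown.
/// `started` reports that the Ready write happened; `shutdown` is the
/// stop request.
fn run_endpoint(
    name: String,
    health: SharedHealth,
    started: mpsc::Sender<()>,
    shutdown: mpsc::Receiver<()>,
) {
    health.lock().unwrap().insert(name.clone(), true);
    started.send(()).unwrap();
    // Stand-in for the main serving loop: block until asked to stop.
    let _ = shutdown.recv();
    health.lock().unwrap().insert(name, false);
}

/// Stand-in for the HTTP handler's read of the shared state.
/// Unknown endpoints count as not ready.
fn query(health: &SharedHealth, name: &str) -> bool {
    health.lock().unwrap().get(name).copied().unwrap_or(false)
}

/// Spawns the endpoint on its own thread and returns a stop handle.
fn spawn_endpoint(
    name: &str,
    health: &SharedHealth,
) -> (thread::JoinHandle<()>, mpsc::Sender<()>) {
    let (started_tx, started_rx) = mpsc::channel();
    let (stop_tx, stop_rx) = mpsc::channel();
    let worker_health = Arc::clone(health);
    let name = name.to_string();
    let handle = thread::spawn(move || run_endpoint(name, worker_health, started_tx, stop_rx));
    // Wait until the endpoint has marked itself Ready.
    started_rx.recv().unwrap();
    (handle, stop_tx)
}
```

Each lock is held only for a single map operation, so the reader never waits on the endpoint's serving loop.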

Estimated code review effort

3 (~45 minutes)

Poem

In the warren, health we track,
With endpoints ready, never slack.
System’s pulse now easy to see—
JSON sings our status, healthy as can be!
Each bunny hops with peace of mind,
For health and uptime, well-defined.
🐇💚


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)

55-58: LGTM! Correct health status lifecycle management.

The health status updates at startup (Ready) and shutdown (NotReady) are properly placed and follow the expected endpoint lifecycle pattern.

Minor optimization suggestion: Consider using &endpoint_name instead of cloning in both health update calls to avoid unnecessary string allocations.

 system_health
     .lock()
     .await
-    .set_endpoint_health_status(endpoint_name.clone(), HealthStatus::Ready);
+    .set_endpoint_health_status(&endpoint_name, HealthStatus::Ready);

And similarly for the shutdown call (assuming the SystemHealth method signature accepts &str).

Also applies to: 114-117
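
A sketch of what a `&str`-accepting setter could look like, under the assumption that the map is keyed by `String` (the actual signature in this PR is not shown here). Borrowing removes the `.clone()` at the call site, but an owned key is still needed the first time a name is inserted, so the saving applies to repeated updates.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HealthStatus {
    Ready,
    NotReady,
}

#[derive(Default)]
struct SystemHealth {
    endpoint_health: HashMap<String, HealthStatus>,
}

impl SystemHealth {
    /// Takes the endpoint name by reference. An owned key is allocated
    /// only when the name is not in the map yet; later updates reuse
    /// the existing entry.
    fn set_endpoint_health_status(&mut self, endpoint: &str, status: HealthStatus) {
        match self.endpoint_health.get_mut(endpoint) {
            Some(current) => *current = status,
            None => {
                self.endpoint_health.insert(endpoint.to_owned(), status);
            }
        }
    }
}
```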

lib/runtime/src/config.rs (1)

127-136: Improve formatting consistency in Display implementation.

The new fields are added to the display output, but the formatting is inconsistent with existing fields. Consider adding proper separators for better readability.

Apply this diff to improve formatting consistency:

-        write!(f, "system_enabled={}", self.system_enabled)?;
-        write!(
-            f,
-            "use_endpoint_health_status={:?}",
-            self.use_endpoint_health_status
-        )?;
-        write!(
-            f,
-            "starting_health_status={:?}",
-            self.starting_health_status
-        )?;
+        write!(f, "system_enabled={}, ", self.system_enabled)?;
+        write!(f, "starting_health_status={:?}, ", self.starting_health_status)?;
+        write!(
+            f,
+            "use_endpoint_health_status={:?}",
+            self.use_endpoint_health_status
+        )?;
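
For reference, the suggested formatting would render as shown in this reduced sketch. The struct here is a hypothetical three-field stand-in for `RuntimeConfig`, which has more fields in the actual code.

```rust
use std::fmt;

#[allow(dead_code)]
#[derive(Debug)]
enum HealthStatus {
    Ready,
    NotReady,
}

/// Reduced stand-in holding only the fields touched by the suggestion.
struct RuntimeConfig {
    system_enabled: bool,
    starting_health_status: HealthStatus,
    use_endpoint_health_status: Vec<String>,
}

impl fmt::Display for RuntimeConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Every field except the last ends with ", " so the output
        // reads as a comma-separated list of key=value pairs.
        write!(f, "system_enabled={}, ", self.system_enabled)?;
        write!(f, "starting_health_status={:?}, ", self.starting_health_status)?;
        write!(
            f,
            "use_endpoint_health_status={:?}",
            self.use_endpoint_health_status
        )
    }
}
```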
lib/runtime/src/lib.rs (1)

79-140: Well-designed health tracking system with minor improvements needed.

The SystemHealth implementation provides a solid foundation for tracking system and endpoint health. The logic correctly handles both system-wide and per-endpoint health tracking.

Consider these improvements for better consistency and robustness:

                 endpoint.clone(),
                 if *ready == HealthStatus::Ready {
                     "ready".to_string()
                 } else {
-                    "notready".to_string()
+                    "not_ready".to_string()
                 },

Also consider adding error handling for cases where an endpoint in use_endpoint_health_status doesn't exist in endpoint_health:

         if !self.use_endpoint_health_status.is_empty() {
             healthy = self.use_endpoint_health_status.iter().all(|endpoint| {
                 self.endpoint_health
                     .get(endpoint)
-                    .map_or(false, |status| *status == HealthStatus::Ready)
+                    .map_or_else(|| {
+                        tracing::warn!("Endpoint {} not found in health map", endpoint);
+                        false
+                    }, |status| *status == HealthStatus::Ready)
             });
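
A variation on this suggestion, sketched under assumptions: instead of short-circuiting with `all`, the hypothetical helper `check_required` below walks every required endpoint so that each missing one is reported. It uses `eprintln!` as a stand-in for `tracing::warn!` to stay dependency-free.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HealthStatus {
    Ready,
    NotReady,
}

/// Returns overall readiness plus the required endpoints that are
/// absent from the health map. The caller is assumed to handle the
/// empty-list case (falling back to the system status) separately.
fn check_required(
    required: &[String],
    endpoint_health: &HashMap<String, HealthStatus>,
) -> (bool, Vec<String>) {
    let mut healthy = true;
    let mut missing = Vec::new();
    for endpoint in required {
        match endpoint_health.get(endpoint) {
            Some(HealthStatus::Ready) => {}
            Some(HealthStatus::NotReady) => healthy = false,
            None => {
                // Stand-in for the suggested tracing::warn! call.
                eprintln!("Endpoint {endpoint} not found in health map");
                missing.push(endpoint.clone());
                healthy = false;
            }
        }
    }
    (healthy, missing)
}
```

Returning the missing names also lets a health handler include them in its response, if that is ever wanted.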
lib/runtime/src/http_server.rs (2)

151-173: Excellent enhancement to health endpoint functionality.

The refactored health handler provides much more useful information with:

  • Structured JSON response including status, uptime, and per-endpoint health
  • Appropriate HTTP status codes (200 for healthy, 503 for unhealthy)
  • Proper tracing instrumentation for debugging

Consider using debug level instead of trace for the response logging to make it more accessible during normal debugging:

-    tracing::trace!("Response {}", response.to_string());
+    tracing::debug!("Health response: {}", response);
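
The status-code mapping described above can be illustrated with a dependency-free sketch. The JSON field names (`status`, `uptime`, `endpoints`) are assumptions based on the summary in this comment, not the handler's confirmed output, and the hand-rolled formatting does no string escaping, so it is only suitable for simple endpoint names.

```rust
/// Builds an (HTTP status code, JSON body) pair for a health probe.
/// `endpoints` pairs each endpoint name with whether it is ready.
fn health_response(healthy: bool, uptime_secs: u64, endpoints: &[(&str, bool)]) -> (u16, String) {
    let label = |ok: bool| if ok { "ready" } else { "notready" };
    let endpoint_fields: Vec<String> = endpoints
        .iter()
        .map(|(name, ok)| format!("\"{}\":\"{}\"", name, label(*ok)))
        .collect();
    let body = format!(
        "{{\"status\":\"{}\",\"uptime\":{},\"endpoints\":{{{}}}}}",
        label(healthy),
        uptime_secs,
        endpoint_fields.join(",")
    );
    // 200 tells a probe the process is ready; 503 tells it to keep waiting.
    let code = if healthy { 200 } else { 503 };
    (code, body)
}
```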

288-364: Comprehensive test coverage with opportunities for improvement.

The parameterized test effectively covers both healthy and unhealthy scenarios across multiple endpoints. The use of rstest for parameterization and temp_env::async_with_vars for environment isolation is excellent.

Consider these improvements for better maintainability:

  1. Replace println! with proper test logging:
-                println!("[test] Waiting for server to start...");
+                tracing::info!("Waiting for server to start...");
  2. Reduce sleep duration for faster tests:
-                sleep(std::time::Duration::from_millis(1000)).await;
+                sleep(std::time::Duration::from_millis(100)).await;
  3. Consider extracting server setup into a test helper function to reduce code duplication and improve readability.

The test logic is solid and provides excellent coverage of the health endpoint functionality.
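
As an alternative to tuning a fixed sleep (this is a different technique from the one suggested above, and not something in the PR), a test could poll until the server accepts connections. The sketch below is blocking and uses only the standard library, whereas the PR's tests are async; `wait_for_server` is a hypothetical helper name.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls until something accepts TCP connections on `addr` or until
/// `timeout` elapses. Returns true if a connection succeeded.
fn wait_for_server(addr: SocketAddr, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if TcpStream::connect_timeout(&addr, Duration::from_millis(50)).is_ok() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Short pause between attempts to avoid a busy loop.
        sleep(Duration::from_millis(10));
    }
}
```

Polling returns as soon as the port is open, so the common case is faster than any fixed delay while slow machines still get the full timeout.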

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb6de94 and 2465345.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • lib/runtime/Cargo.toml (1 hunks)
  • lib/runtime/src/component.rs (1 hunks)
  • lib/runtime/src/component/endpoint.rs (1 hunks)
  • lib/runtime/src/config.rs (8 hunks)
  • lib/runtime/src/distributed.rs (3 hunks)
  • lib/runtime/src/http_server.rs (4 hunks)
  • lib/runtime/src/lib.rs (3 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
lib/runtime/Cargo.toml (2)

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

lib/runtime/src/component/endpoint.rs (2)

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

lib/runtime/src/lib.rs (2)

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

lib/runtime/src/distributed.rs (4)

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

lib/runtime/src/http_server.rs (2)

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

lib/runtime/src/config.rs (1)

Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (6)

Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (13)
lib/runtime/Cargo.toml (1)

72-72: LGTM! Dev dependency feature addition is appropriate.

The addition of the async_closure feature to the temp-env crate aligns with the health status testing enhancements mentioned in the AI summary.

lib/runtime/src/component.rs (1)

32-34: LGTM! Clean import addition for health status support.

The refactoring to multi-line imports improves readability and the addition of HealthStatus import supports the new health tracking functionality.

lib/runtime/src/component/endpoint.rs (1)

113-117: LGTM! Proper integration of health tracking into endpoint startup.

The addition of endpoint name and system health parameters to the push_endpoint.start() call correctly integrates the new health tracking system with endpoint lifecycle management.

lib/runtime/src/distributed.rs (1)

68-86: LGTM! Clean SystemHealth initialization and integration.

The SystemHealth initialization properly uses configuration values and is correctly wrapped in Arc<Mutex<_>> for thread-safe sharing across the distributed runtime. The early config retrieval avoids redundant calls later in the function.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)

44-49: LGTM! Proper health status lifecycle integration.

The method signature update correctly accepts the necessary parameters for health tracking integration.

lib/runtime/src/config.rs (6)

51-56: LGTM - Well-designed enum for health status.

The HealthStatus enum is properly implemented with appropriate derives and serde configuration for lowercase serialization, making it user-friendly in configuration files.
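
To illustrate what lowercase serialization means for the values used in this PR ("ready", "notready"), here is a hand-written equivalent using `FromStr` and `Display`. It is a stand-in for the serde attributes, not the derive the PR uses.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HealthStatus {
    Ready,
    NotReady,
}

impl FromStr for HealthStatus {
    type Err = String;

    /// Accepts only the all-lowercase spellings of the variant names.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ready" => Ok(HealthStatus::Ready),
            "notready" => Ok(HealthStatus::NotReady),
            other => Err(format!("unknown health status: {other}")),
        }
    }
}

impl fmt::Display for HealthStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let text = match self {
            HealthStatus::Ready => "ready",
            HealthStatus::NotReady => "notready",
        };
        f.write_str(text)
    }
}
```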


98-112: LGTM - Well-structured configuration fields.

The new health-related configuration fields are properly defined with:

  • Clear documentation explaining their purpose
  • Sensible defaults (NotReady for starting status, empty vector for endpoints)
  • Correct builder attributes for serialization control
  • Consistent naming with existing fields

170-171: LGTM - Correct environment variable mapping.

The new environment variable mappings follow the established pattern and correctly map to the corresponding struct fields.


208-209: LGTM - Consistent default values.

The single_threaded() method correctly initializes the new health-related fields with appropriate defaults.


235-236: LGTM - Consistent default values.

The Default implementation correctly initializes the new health-related fields with the same defaults as other constructors.


413-433: Confirm JSON array parsing for Vec<String> in environment variables

I didn’t find any existing examples of loading a JSON-formatted array from an environment variable into a Vec<String> in our config code. Please verify that figment’s Env provider (used by RuntimeConfig::from_settings()) indeed supports deserializing a JSON array string (e.g. "[\"ready\"]") into Vec<String>. If it does not, consider one of the following:

  • Switch to a comma-delimited string (e.g. "ready,other") and split it manually or via a custom caster.
  • Add custom parsing logic to handle JSON arrays for use_endpoint_health_status.

Tagging for manual verification to ensure this test will actually pass in CI.
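
If the comma-delimited fallback were chosen, the manual split could look like this sketch. `parse_endpoint_list` is a hypothetical helper, and how it would be wired into the config loader is left open.

```rust
/// Splits a comma-delimited value such as "generate,other" into
/// endpoint names, trimming whitespace and dropping empty entries.
fn parse_endpoint_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(str::to_owned)
        .collect()
}
```

Dropping empty entries means an unset or empty variable yields an empty list, which would keep the system-wide status in charge.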

lib/runtime/src/lib.rs (1)

170-171: LGTM - Appropriate thread-safe integration.

The system_health field is correctly added to DistributedRuntime with Arc<Mutex<SystemHealth>> for safe concurrent access across async contexts.

lib/runtime/src/http_server.rs (1)

16-16: LGTM - Necessary imports for enhanced health functionality.

The new imports support the structured JSON health response and health status tracking functionality.

Also applies to: 21-22

@nnshah1 nnshah1 changed the title from "feat: initial commit of health check changes" to "feat: health check changes based on endpoint served" Jul 21, 2025
Contributor

@ryanolson ryanolson left a comment

Just a question regarding what looks to be a duplicate or redundant endpoint_name being passed to the PushEndpoint.

@nnshah1 nnshah1 requested a review from ryanolson July 22, 2025 23:23