feat: Failure Detection while Responses are returning #1671
ryanolson
left a comment
If we are trying to create a restartable request where we have already streamed some responses back to the user, then we should define a RestartableStatefulAsyncEngine.
Ultimately, to restart an autoregressive request at some future state, you need to accumulate the responses, update the request, then reissue it with the accumulated state.
However, this is particularly problematic with prompt templating: it requires that the prompt template can properly render partial assistant messages, and most cannot.
We have the capability to do this in our prompt rendering engine.
Notice how we conditionally return a bool for should_add_generation_prompt based on whether the last "Message" object is a User (True) or not a User (False).
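The conditional described above can be sketched as follows. This is a minimal illustration, not the rendering engine's actual API; `Role`, `Message`, and the free function `should_add_generation_prompt` are hypothetical names:

```rust
// Hypothetical sketch: only append a generation prompt when the last
// message is from the user. A trailing (partial) assistant message
// must be continued, not restarted with a fresh generation prompt.
#[derive(PartialEq)]
enum Role {
    User,
    Assistant,
}

struct Message {
    role: Role,
    content: String,
}

fn should_add_generation_prompt(messages: &[Message]) -> bool {
    matches!(messages.last(), Some(m) if m.role == Role::User)
}
```

With this shape, a conversation ending in a partial assistant message renders without a new generation prompt, so generation continues from the partial text.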
Essentially, if the prompt template doesn't support this, we can't reproduce the proper token sequence via the prompt template / tokenization path.
We could instead take the sequence of tokens and reissue them without needing to re-render the request.
Regardless, we still need to capture state and produce a new request for the restart.
Before doing this generally, let's do it for LLM requests, as that's the core nature of our framework. Then generalize to a trait that our objects need to implement to be "restartable".
I would suggest doing this for an engine that takes a common LLM request and common LLM response object.
We'll probably want to use a scan combinator to collect state.
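The scan-combinator idea can be sketched on std's `Iterator::scan`; `futures::StreamExt::scan` has the same shape for async response streams. The names here are illustrative, not the crate's actual types:

```rust
// Sketch of accumulating restart state with a scan combinator.
// Each emitted item carries the full accumulated text so far, which
// is the state needed to rebuild a restart request if the stream
// dies mid-generation.
fn accumulate_responses(chunks: impl Iterator<Item = String>) -> Vec<String> {
    chunks
        .scan(String::new(), |acc, chunk| {
            // `acc` is the running state threaded through the stream.
            acc.push_str(&chunk);
            Some(acc.clone())
        })
        .collect()
}
```

The last element of the output is the accumulated response to splice back into a reissued request.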
I think your PR defines this EOS (end of stream) sentinel; however, that should be conditioned upon the response stream itself or the backend producing it.
For LLMs, we have a stop_condition, which should always be sent on the last message in the stream.
Seeing a stream terminate without seeing a stop_condition would be a possible trigger for a "restart".
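That trigger can be sketched as a check on the final received item. `Chunk` and `needs_restart` are hypothetical names for illustration, not the codebase's types:

```rust
// Sketch: a terminated stream whose final item carries no stop
// condition is treated as a restart trigger.
struct Chunk {
    text: String,
    // e.g. Some("eos") or Some("length") on the last message of a
    // properly finished generation.
    stop_condition: Option<String>,
}

fn needs_restart(received: &[Chunk]) -> bool {
    match received.last() {
        // Stream ended without producing any output: restart.
        None => true,
        // Restart iff the final chunk lacks a stop condition.
        Some(last) => last.stop_condition.is_none(),
    }
}
```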
Walkthrough: The changes introduce a standardized error handling trait (IsError).
Changes
Sequence Diagram(s):
sequenceDiagram
participant Client
participant PushRouter
participant AddressedPushRouter
participant Stream
participant Receiver
Client->>PushRouter: Send request
PushRouter->>AddressedPushRouter: Route and generate stream
AddressedPushRouter->>Stream: Wrap responses in StreamItemWrapper
Stream->>Receiver: Send StreamItemWrapper<U> (data or completion)
Receiver->>Client: Receive annotated data or error
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/llm/src/protocols/common/llm_backend.rs (1)
138-150: LGTM: Correct IsError implementation with minor suggestion. The implementation properly integrates with the existing LLMEngineOutput structure:
- from_err leverages the existing error() constructor
- err correctly maps FinishReason::Error to boxed errors
- Returns None for non-error finish reasons
Consider using err.to_string() instead of format!("{:?}", err) for more user-friendly error messages:
- LLMEngineOutput::error(format!("{:?}", err))
+ LLMEngineOutput::error(err.to_string())
lib/runtime/src/pipeline/network/egress/push_router.rs (1)
184-224: Fault detection correctly identifies unresponsive instances. The implementation properly handles NoResponders errors using the correct async-nats error handling pattern. However, the TODO and commented code suggest that stream-level fault detection is still pending. Would you like me to investigate alternative approaches for stream-level fault detection that might avoid the compiler crash?
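The detection pattern can be sketched with std's error downcasting. `RequestError` and `RequestErrorKind` here are local stand-ins for the async-nats types, defined so the snippet is self-contained; the real code would downcast to `async_nats::client::RequestError`:

```rust
use std::error::Error;
use std::fmt;

// Local stand-in for async_nats::client::RequestError and its kind,
// used to illustrate the downcast_ref pattern without the crate.
#[derive(Debug, PartialEq)]
enum RequestErrorKind {
    NoResponders,
    Other,
}

#[derive(Debug)]
struct RequestError {
    kind: RequestErrorKind,
}

impl fmt::Display for RequestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "request error: {:?}", self.kind)
    }
}

impl Error for RequestError {}

// Returns true when the error is a "no responders" failure, i.e. the
// target instance is unresponsive and should be inhibited.
fn is_no_responders(err: &(dyn Error + 'static)) -> bool {
    err.downcast_ref::<RequestError>()
        .map(|e| e.kind == RequestErrorKind::NoResponders)
        .unwrap_or(false)
}
```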
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- lib/bindings/python/rust/lib.rs (3 hunks)
- lib/llm/src/protocols/common/llm_backend.rs (3 hunks)
- lib/runtime/src/pipeline/network.rs (1 hunk)
- lib/runtime/src/pipeline/network/egress/addressed_router.rs (3 hunks)
- lib/runtime/src/pipeline/network/egress/push_router.rs (5 hunks)
- lib/runtime/src/pipeline/network/ingress/push_handler.rs (1 hunk)
- lib/runtime/src/protocols.rs (1 hunk)
- lib/runtime/src/protocols/annotated.rs (4 hunks)
- lib/runtime/src/protocols/is_error.rs (1 hunk)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
📓 Additional learnings (deduplicated; the async-nats 0.40 learning above also applies to the protocols, push_handler, is_error, Python bindings, llm_backend, addressed_router, and push_router files)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
🧬 Code Graph Analysis (2)
lib/runtime/src/protocols.rs (1)
lib/runtime/src/protocols/annotated.rs (1)
is_error(132-134)
lib/runtime/src/pipeline/network/egress/addressed_router.rs (2)
lib/runtime/src/protocols/annotated.rs (3)
is_error (132-134), from_err (154-156), err (158-169)
lib/runtime/src/protocols/is_error.rs (4)
from_err (20-20), from_err (44-48), err (23-23), err (49-51)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (16)
lib/runtime/src/protocols.rs (1)
22-22: LGTM: Clean module declaration. The new is_error module is properly declared and exposed publicly, following the established pattern in the protocols module.
lib/runtime/src/pipeline/network.rs (1)
327-333: LGTM: Well-designed wrapper struct for stream completion signaling. The StreamItemWrapper<U> struct effectively addresses the immediate need for explicit stream completion detection. The design is sound:
- Optional data field allows for completion-only messages
- The skip_serializing_if attribute optimizes serialization
- Clear naming and documentation
The TODO comment appropriately indicates this is a temporary solution pending SSE implementation.
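The wrapper's shape, as described in this comment, can be sketched as follows. This is a simplified stand-in (serde attributes such as `skip_serializing_if` omitted, and the `describe` helper is purely illustrative); the actual definition lives in lib/runtime/src/pipeline/network.rs:

```rust
// Simplified sketch of the stream-item wrapper.
struct StreamItemWrapper<U> {
    /// The payload; None for a completion-only message.
    data: Option<U>,
    /// True on the sentinel item marking a properly closed stream.
    complete_final: bool,
}

// A receiver can distinguish three cases: data, clean completion,
// and (by the stream ending with no complete_final item) a fault.
fn describe<U>(item: &StreamItemWrapper<U>) -> &'static str {
    match (&item.data, item.complete_final) {
        (Some(_), false) => "data",
        (None, true) => "complete",
        _ => "unexpected",
    }
}
```

A stream that ends without ever yielding a `complete_final: true` item is what the router reports as a fault.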
lib/runtime/src/pipeline/network/ingress/push_handler.rs (2)
100-116: LGTM: Proper stream item wrapping with error handling. The implementation correctly wraps each response in StreamItemWrapper with appropriate completion flags. The error handling is well-designed:
- Sets send_complete_final = false on send failure
- Calls context.stop_generating() to halt upstream processing
- Breaks the loop to prevent further processing
This ensures proper cleanup when stream transmission fails.
117-130: LGTM: Explicit stream completion signaling. The completion signaling logic is correctly implemented:
- Only sends the completion signal if no errors occurred during streaming
- Uses data: None, complete_final: true to clearly indicate stream end
- Logs errors but doesn't propagate them (appropriate for the cleanup phase)
This provides the explicit end-of-stream detection needed for robust streaming.
lib/runtime/src/protocols/is_error.rs (2)
18-34: LGTM: Excellent trait design for standardized error handling. The IsError trait provides a clean, consistent interface for error handling across the codebase:
- from_err and err methods enable bidirectional error conversion
- Default implementations of is_ok and is_err reduce boilerplate
- Uses Box<dyn Error> for maximum flexibility
- Follows Rust conventions similar to Result<T, E>
This design will enable consistent error handling across different response types.
36-61: LGTM: Comprehensive test coverage validates trait behavior. The test implementation demonstrates proper usage of the trait:
- Tests both error and success states
- Verifies error message preservation through conversions
- Validates default implementations of is_ok and is_err
- Uses anyhow::Error for realistic error handling scenarios
The test coverage provides confidence in the trait's correctness.
lib/llm/src/protocols/common/llm_backend.rs (2)
21-21: LGTM: Clean import of IsError trait. The import is properly scoped and enables the trait implementation for LLMEngineOutput.
163-184: LGTM: Thorough test coverage validates the IsError implementation. The test comprehensively validates the IsError trait implementation:
- Tests success states (stop finish reason)
- Tests error states with message preservation
- Tests from_err constructor integration
- Validates all trait methods (is_ok, is_err, err)
This provides confidence that the implementation behaves correctly across all scenarios.
lib/runtime/src/protocols/annotated.rs (2)
150-170: LGTM! Clean implementation of the IsError trait. The implementation correctly handles error conversion and extraction, with appropriate fallback to "unknown error" when comments are missing.
193-215: Good test coverage for the IsError implementation. The tests comprehensively verify all paths: normal data, error from string, and error from boxed error.
lib/bindings/python/rust/lib.rs (2)
217-217: Type update aligns with the new error-aware streaming protocol. The change to use RsAnnotated<serde_json::Value> as the output type is consistent with the IsError trait requirement in PushRouter. Also applies to: 488-493
759-765: Simplified stream processing by removing redundant deserialization. Good optimization - the stream now provides annotated values directly, eliminating the need for deserialization.
lib/runtime/src/pipeline/network/egress/addressed_router.rs (2)
84-84: Trait bound addition is consistent with the error handling requirements. The IsError trait bound on U enables proper error construction in the stream processing logic.
164-208: Comprehensive stream processing with proper error handling. The implementation correctly handles all edge cases:
- Protocol violations (data after completion)
- Deserialization errors
- Empty responses
- Premature stream closure
The stateful tracking of is_complete_final works correctly within the filter_map closure.
98-98: Trait bound addition enables error handling in the routing layer.The
IsErrortrait bound onUis required by theAddressedPushRouterand enables proper error construction.Also applies to: 231-231
113-131: Clean refactoring that eliminates code duplication.The routing methods now have clear separation of concerns:
- Instance selection logic specific to each routing strategy
- Common fault detection delegated to
generate_with_fault_detectionAlso applies to: 135-152, 160-174
ryanolson
left a comment
I pushed a branch that has a Python wrapper with a protocol for the return type. Using this, we can evaluate the Python return object and decide to conditionally wrap it.
Let's set up a time to discuss
…rapper and IsError to MaybeError
Overview:
The router will detect if a stream is closed before all responses are received. If so, an error will be propagated to the stream consumer.
With this change, the Python script is informed when not all responses are received.
Additionally, if the error originates from networking, the instance is inhibited from receiving future requests until ETCD can update.
Details:
Added StreamItemWrapper for the client to determine whether all responses were received - this will be replaced with Server-Sent Events (SSE) for detecting proper stream ending in a future PR.
Testing setup with examples/fault_tolerance:
The client.py sends a request to processor.py, which relays the request to worker.py. The worker.py yields 9 responses for each request received, and the yielded responses are transmitted back to processor.py and then client.py, in order. The client.py only sends the next request after all 9 responses of the current request are received.
While worker.py was yielding responses, it was stopped prematurely. The processor.py is expected to be notified of the incomplete stream via a Python exception.
Before change:
The processor.py ended at response 4, believing it was the last response, but in reality the stream ended prematurely.
After change:
The processor received an exception stating "Stream ended before generation completed."
Note: the examples/fault_tolerance example can be found in the commit history.
Where should the reviewer start?
Start with the new is_error.rs containing the new IsError trait, and then move on to the Annotated<...> struct. Then, move on to streaming fault detection in the router implementation. Finally, check out the minor change on LLMEngineOutput.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to ai-dynamo/enhancements#15
Summary by CodeRabbit
New Features
Refactor
Tests