-
Notifications
You must be signed in to change notification settings - Fork 676
fix: fix endpoint run to return error DIS-325 #2156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix endpoint run to return error DIS-325 #2156
Conversation
WalkthroughThis update introduces an "integration" feature flag to the LLM library, enhances error handling and adds integration tests for the endpoint service, refines event publishing/subscribing tests in both component and namespace modules, renames HTTP server metric namespaces, adjusts metrics test expectations, and significantly expands the soak test with request counting, configurable delays, and improved logging. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Integration Test
participant Runtime as DistributedRuntime
participant Endpoint as Endpoint Service
Test->>Runtime: Create runtime & engine config
Test->>Endpoint: Start run(endpoint, ...)
alt Valid endpoint
Endpoint-->>Test: Service runs
Test->>Endpoint: Connect client
Endpoint-->>Test: Connection successful
Test->>Endpoint: Abort service
else Invalid endpoint
Endpoint-->>Test: Return error (invalid endpoint)
end
sequenceDiagram
participant Client
participant PushRouter
participant Backend as RequestHandler
loop For each request
Client->>PushRouter: Send request
PushRouter->>Backend: Route request
alt Normal processing
Backend-->>PushRouter: Immediate response
else Delayed processing
Backend-->>PushRouter: Response after delay
end
PushRouter-->>Client: Deliver response
end
Backend-->>Client: Report backend_counter at shutdown
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/llm/src/entrypoint/input/endpoint.rs (1)
127-240: Well-structured integration tests!The tests effectively validate the error handling improvements:
- Valid endpoint test confirms successful service startup and client connectivity
- Invalid endpoint test verifies error propagation with appropriate validation keywords
The
#[ignore]attribute on the invalid endpoint test is appropriately documented. Consider creating a tracking issue for the DistributedRuntime drop problem to ensure it's addressed in the future.Would you like me to create a GitHub issue to track the DistributedRuntime drop issue mentioned in the test?
lib/runtime/tests/soak.rs (1)
146-193: Consider aligning the stream timeout with the request timeout.The changes improve test reliability:
- Reduced batch load is more reasonable
- PushRouter integration aligns with PR objectives
- Shorter request timeout (5s) helps detect issues faster
However, the stream timeout on line 177 remains at 30s while the request timeout is 5s. Consider reducing it for consistency.
- while let Some(_resp) = - tokio::time::timeout(Duration::from_secs(30), stream.next()) - .await - .context("stream timed out")? - {} + while let Some(_resp) = + tokio::time::timeout(Duration::from_secs(5), stream.next()) + .await + .context("stream timed out")? + {}
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
lib/llm/Cargo.toml(1 hunks)lib/llm/src/entrypoint/input/endpoint.rs(2 hunks)lib/runtime/src/component/component.rs(1 hunks)lib/runtime/src/component/namespace.rs(1 hunks)lib/runtime/src/http_server.rs(3 hunks)lib/runtime/src/metrics.rs(4 hunks)lib/runtime/tests/soak.rs(7 hunks)
🧰 Additional context used
🧠 Learnings (5)
lib/runtime/src/component/namespace.rs (4)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/src/component/component.rs (5)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/src/metrics.rs (1)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/llm/src/entrypoint/input/endpoint.rs (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/tests/soak.rs (9)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (17)
lib/runtime/src/component/namespace.rs (3)
102-103: Excellent improvement to test clarity and isolation.The change from generic "test" names to descriptive "test_namespace_publish" and "test_event" significantly improves test readability and prevents potential namespace collisions when integration tests run concurrently.
111-116: Consistent naming pattern applied across subscribe test.Good alignment with the publish test by using "test_namespace_subscribe" and "test_event" naming. This ensures each test has its own isolated namespace while maintaining consistency in event naming.
119-119: Event name consistency maintained in publish call.The event name "test_event" matches the subscription event name, ensuring the test correctly validates the publish-subscribe functionality.
lib/runtime/src/component/component.rs (3)
92-94: Well-structured component-level event testing.The changes correctly establish a component-level test with descriptive naming ("test_component_publish", "test_component", "test_event") and properly exercise the component's publish functionality rather than namespace-level publishing.
102-108: Proper test isolation and component-level subscription.Excellent improvements:
- Distinct namespace name prevents test collisions
- Component creation and subscription correctly targets component-level events
- Clear comment explaining the component-level subscription intent
110-111: Consistent component-level event flow.The publish operation correctly uses the component instance and matches the subscribed event name, ensuring the test validates the complete component-level publish-subscribe cycle.
lib/runtime/src/http_server.rs (3)
78-78: Namespace change aligns with standardization effort.The change from
"http_server"to"dynamo_system_server"is consistent with the broader metric naming standardization described in the PR objectives, specifically addressing namespace collision issues.
101-101: Metric name simplification is appropriate.Changing from
"dynamo_uptime_seconds"to"uptime_seconds"removes redundant prefixing since the comment on lines 98-99 indicates the namespace will provide the appropriate prefix automatically.
335-337: Test expectations correctly updated.The test assertions properly reflect the new namespace
"dynamo_system_server"and the resulting metric name"dynamo_system_server_uptime_seconds", maintaining consistency with the implementation changes.lib/runtime/src/metrics.rs (4)
800-800: Improved invalid namespace test case.Changing from
"test-namespace"to"@@123"provides a clearer example of an invalid namespace that gets completely stripped to an empty string, making the test case more explicit about the sanitization behavior.
813-822: Updated comments and assertions reflect the new test case.The comments and error messages correctly describe the behavior with the
"@@123"namespace, which gets sanitized to an empty string and causes metric creation to fail appropriately.
929-937: Consistent metric naming in namespace test expectations.The test now correctly expects only the properly namespaced
testnamespace_testintcountermetric, removing inconsistencies in the previous test expectations. This aligns with the standardized naming conventions.
1031-1033: DRT-level test expectations maintain consistency.The DRT output test correctly shows the namespaced metric
testnamespace_testintcounter, ensuring consistency across all hierarchical levels of the metrics system.lib/llm/Cargo.toml (1)
38-38: LGTM!The addition of the
integrationfeature flag follows Rust best practices for conditionally compiling integration tests.lib/llm/src/entrypoint/input/endpoint.rs (1)
87-112: Excellent error handling improvements!The enhanced error propagation properly captures and returns the actual errors from the endpoint service, which directly addresses the PR objective. The detailed logging for different scenarios (success, error, cancellation) will significantly improve debugging capabilities.
lib/runtime/tests/soak.rs (2)
69-108: Good enhancements to RequestHandler!The atomic counter provides thread-safe request tracking, and the configurable processing mode (normal vs delayed) enables testing different load scenarios. The 100ms delay in the async stream effectively simulates slow backend processing for queue saturation testing.
112-136: Clean implementation of configurable backend!Returning the handler enables post-test metrics inspection. The environment variable parsing with a default value provides good flexibility for different test scenarios.
WalkthroughA new Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Integration Test
participant Entrypoint as run()
participant Runtime as DistributedRuntime
Test->>Entrypoint: Call run(endpoint, ...)
Entrypoint->>Runtime: Start runtime future (rt_fut)
Entrypoint-->>Test: Await rt_fut or cancellation
alt rt_fut completes successfully
Entrypoint-->>Test: Return error ("unexpected completion")
else rt_fut returns error
Entrypoint-->>Test: Propagate error
else Cancelled
Entrypoint-->>Test: Return Ok(())
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~18 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
lib/llm/src/entrypoint/input/endpoint.rs (1)
208-240: Consider tracking the DistributedRuntime drop issue.The test logic correctly validates error propagation for invalid endpoints. However, the ignored test indicates a potential resource cleanup issue that should be tracked.
Would you like me to create an issue to track the DistributedRuntime drop problem mentioned in the ignore attribute?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
lib/llm/Cargo.toml(1 hunks)lib/llm/src/entrypoint/input/endpoint.rs(2 hunks)lib/runtime/src/component/component.rs(1 hunks)lib/runtime/src/component/namespace.rs(1 hunks)lib/runtime/src/http_server.rs(3 hunks)lib/runtime/src/metrics.rs(4 hunks)lib/runtime/tests/soak.rs(7 hunks)
🧰 Additional context used
🧠 Learnings (5)
lib/runtime/src/component/namespace.rs (4)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
lib/runtime/src/component/component.rs (5)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: alec-flowers
PR: #1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions create_stored_blocks and create_stored_block_from_parts are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/src/metrics.rs (1)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/llm/src/entrypoint/input/endpoint.rs (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/runtime/tests/soak.rs (9)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: jthomson04
PR: #1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the wait_for_key_count function correctly uses exact equality (==) instead of greater-than-or-equal (>=) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (18)
lib/llm/Cargo.toml (1)
38-38: LGTM! Clean feature flag addition.The integration feature is properly defined as an empty feature flag, which is appropriate for conditionally compiling integration tests without additional dependencies.
lib/runtime/src/metrics.rs (3)
800-822: Improved test coverage with stricter validation.The test now uses a more extreme invalid namespace case (
"@@123") that gets completely sanitized to an empty string, properly validating that metric creation fails as expected. This strengthens the test coverage for edge cases in namespace validation.
929-939: Test expectations properly updated.The expected Prometheus output correctly reflects the new namespace handling, ensuring consistency between the test setup and validation.
1031-1035: Consistent test output formatting.The DRT output expectations are properly aligned with the namespace changes, maintaining test integrity.
lib/runtime/src/component/namespace.rs (2)
102-103: Improved test naming for clarity.The namespace and event names are now more descriptive and specific, which helps avoid potential namespace collisions and improves test readability.
111-120: Consistent naming across subscribe test.The subscribe test uses matching descriptive naming that aligns with the publish test improvements, maintaining consistency across the test suite.
lib/runtime/src/component/component.rs (2)
92-95: Better test naming and scoping.The component test now uses more descriptive names that avoid namespace collisions and clearly indicate the test scope.
102-112: Improved test scoping and consistency.The subscribe test now correctly creates the subscriber on the component rather than namespace, making the test more focused and accurate. The naming is also consistent with the improved conventions.
lib/runtime/src/http_server.rs (3)
78-78: Namespace renaming resolves collision issues.The change from "http_server" to "dynamo_system_server" aligns with the PR objective of resolving namespace collision problems and follows a more consistent naming convention.
101-101: Metric name simplified with proper namespace prefix.The metric name is now simplified to "uptime_seconds" since the namespace prefix will automatically provide "dynamo_system_server_", resulting in the full name "dynamo_system_server_uptime_seconds".
335-339: Test expectations updated correctly.The expected Prometheus output properly reflects the new namespace and metric naming, ensuring test consistency with the implementation changes.
lib/runtime/tests/soak.rs (4)
83-110: LGTM!The AsyncEngine implementation correctly tracks request counts and provides configurable processing modes for testing different load scenarios.
112-137: LGTM!The backend function correctly initializes the handler with configurable processing mode and returns it for post-shutdown inspection.
148-202: LGTM!The client implementation correctly uses PushRouter for better load distribution and provides helpful progress output during the soak test.
50-66: LGTM!The app function correctly captures and reports the final request count, providing valuable metrics for the soak test.
lib/llm/src/entrypoint/input/endpoint.rs (3)
87-112: LGTM!The error handling correctly propagates errors from the runtime future, providing clear error messages and proper logging. The comments effectively explain the rationale for this approach.
133-154: LGTM!The test helper function properly sets up the test environment with appropriate error handling and context.
156-206: LGTM!The valid endpoint test correctly verifies that the service starts and accepts connections. The error handling and cleanup approach is appropriate for integration testing.
d04d5ee to
47de232
Compare
Overview:
Fix endpoint error handling to return proper errors instead of hiding them.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Fixes DIS-325
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Refactor