-
Notifications
You must be signed in to change notification settings - Fork 690
feat: migrate requests when planner shutdown decode engine (vllm) #2280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…/planner-migrate-shutdown
…ynamo/dynamo into hzhou/planner-migrate-shutdown
WalkthroughThis change introduces configurable graceful shutdown behavior for endpoints in the vLLM backend, allowing selective waiting for in-flight requests during shutdown. It adds a Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Endpoint
participant Worker
Client->>Endpoint: Send request
Endpoint->>Worker: Dispatch request
Worker-->>Endpoint: Stream responses (async)
Endpoint-->>Client: Forward responses
Note over Endpoint: During shutdown:
alt graceful_shutdown = True
Endpoint->>Worker: Wait for in-flight requests to finish
Worker-->>Endpoint: Complete all responses
Endpoint-->>Client: All responses delivered before shutdown
else graceful_shutdown = False
Endpoint-->>Client: Immediately stop accepting new requests
Worker--x Endpoint: In-flight requests may be migrated or terminated
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/backends/vllm/src/dynamo/vllm/main.py (1)
149-151: Replace debug print with proper logging.The debug print statements with excessive exclamation marks appear temporary and are not suitable for production code. Consider using the existing logger instead.
- print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") - print(f"Migration limit: {config.migration_limit}") - print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") + logger.info(f"Migration limit: {config.migration_limit}")
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
components/backends/vllm/deploy/disagg_planner.yaml(2 hunks)components/backends/vllm/src/dynamo/vllm/handlers.py(2 hunks)components/backends/vllm/src/dynamo/vllm/main.py(4 hunks)lib/bindings/python/rust/engine.rs(3 hunks)lib/bindings/python/rust/lib.rs(1 hunks)lib/bindings/python/src/dynamo/_core.pyi(1 hunks)lib/runtime/src/component/endpoint.rs(3 hunks)lib/runtime/src/pipeline/network/ingress/push_endpoint.rs(2 hunks)lib/runtime/src/pipeline/network/ingress/push_handler.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: Kubernetes v1.33 introduced the stopSignal field as part of the official container lifecycle specification, allowing customization of termination signals without rebuilding container images. This field is legitimately placed under lifecycle and is autogenerated correctly by controller-gen when upgrading from older Kubernetes API versions.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
📚 Learning: the `create_endpoint` method in `workermetricspublisher` has backward compatibility maintained throu...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Applied to files:
lib/runtime/src/component/endpoint.rslib/bindings/python/src/dynamo/_core.pyilib/bindings/python/rust/lib.rs
📚 Learning: the sglang `async_encode` method does not support streaming options, so collecting all embeddings be...
Learnt from: t-ob
PR: ai-dynamo/dynamo#1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang `async_encode` method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.
Applied to files:
components/backends/vllm/src/dynamo/vllm/handlers.pycomponents/backends/vllm/src/dynamo/vllm/main.py
📚 Learning: the asyncenginecontextprovider trait in lib/runtime/src/engine.rs was intentionally changed from `se...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rslib/bindings/python/rust/engine.rslib/bindings/python/rust/lib.rs
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::client::requesterrorkind::nor...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: in lib/llm/src/kv_router/scoring.rs, peabrane prefers panic-based early failure over result-based er...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: the codebase uses async-nats version 0.40, not the older nats crate. error handling should use async...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rslib/bindings/python/rust/engine.rs
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::error::requesterrorkind::nore...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: in lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating lo...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: the `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
Applied to files:
components/backends/vllm/src/dynamo/vllm/main.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (21)
components/backends/vllm/deploy/disagg_planner.yaml (2)
193-193: LGTM! Migration limit configuration added correctly.The
--migration-limit=3parameter is properly added to the VllmDecodeWorker command line arguments, aligning with the PR objective to introduce configurable graceful shutdown behavior.
243-243: LGTM! Consistent migration limit configuration.The
--migration-limit=3parameter is properly added to the VllmPrefillWorker, maintaining consistency with the VllmDecodeWorker configuration.lib/runtime/src/component/endpoint.rs (3)
44-46: LGTM! Well-structured graceful shutdown configuration.The
graceful_shutdownfield is properly added with clear documentation and a sensible default value oftrue. The builder pattern integration follows the existing conventions.
62-62: LGTM! Proper field extraction following established patterns.The
graceful_shutdownfield is correctly extracted using the dissolve pattern, maintaining consistency with the existing codebase structure.
116-116: LGTM! Proper integration with PushEndpoint builder.The
graceful_shutdownflag is correctly passed to the PushEndpoint builder, ensuring the configuration flows through to the endpoint implementation.lib/bindings/python/src/dynamo/_core.pyi (1)
219-227: LGTM! Type stub properly updated with backward compatibility.The
serve_endpointmethod signature is correctly updated with the optionalgraceful_shutdown: bool = Trueparameter. The documentation is clear and maintains consistency with the existing style. The default value ensures backward compatibility.lib/bindings/python/rust/lib.rs (2)
478-491: LGTM! Proper pyo3 parameter handling with sensible defaults.The
graceful_shutdownparameter is correctly implemented using pyo3 conventions:
- Proper signature annotation with default value
- Option type for optional parameter handling
- Appropriate default value fallback using
unwrap_or(true)This maintains backward compatibility while enabling explicit control over graceful shutdown behavior.
493-493: LGTM! Proper builder chain integration.The
graceful_shutdownflag is correctly integrated into the builder chain, ensuring the configuration flows through to the underlying endpoint implementation.lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)
34-35: LGTM! Consistent struct field addition.The
graceful_shutdownfield is properly added with appropriate builder default value and follows the established codebase patterns.
121-133: LGTM! Robust conditional shutdown implementation.The graceful shutdown logic is well-implemented with:
- Thread-safe atomic operations for tracking inflight requests
- Proper coordination using notify/wait pattern
- Clear logging for both graceful and immediate shutdown paths
- Sensible conditional logic that respects the
graceful_shutdownflagThis provides the desired flexibility in shutdown behavior while maintaining system reliability.
components/backends/vllm/src/dynamo/vllm/handlers.py (2)
53-80: LGTM! Proper cancellation handling for graceful shutdown.The try-catch block correctly handles
asyncio.CancelledErrorduring token generation and converts it to aGeneratorExitwith a descriptive message. This aligns with the broader graceful shutdown mechanism and will be properly propagated through the Rust error handling layer.
182-199: LGTM! Consistent cancellation handling for prefill workers.The implementation mirrors the decode worker pattern with appropriate prefill-specific messaging. The comment explaining that prefill requests cannot be migrated provides valuable context for the error handling behavior.
lib/bindings/python/rust/engine.rs (3)
137-138: LGTM! New error variant for Python generator exit.The
PyGeneratorExit(String)variant is properly added to theResponseProcessingErrorenum, maintaining consistency with existing error handling patterns.
231-233: LGTM! Consistent error message for downstream detection.The hardcoded message "Stream ended before generation completed" provides a consistent way for downstream components to detect generator exit conditions, as referenced in the push handler logic.
285-294: LGTM! Proper Python exception type detection.The implementation correctly uses PyO3's
is_instance_ofto distinguish betweenGeneratorExitand other Python exceptions, ensuring proper error categorization for downstream handling.lib/runtime/src/pipeline/network/ingress/push_handler.rs (3)
17-17: LGTM! Import for error inspection capability.The
MaybeErrortrait import enables error checking on response items in the stream processing logic.
109-109: LGTM! Trait bound for error checking.Adding the
MaybeErrortrait bound to generic typeUenables inspection of response errors in the stream processing loop.
224-231: LGTM! Proper stream termination handling.The logic correctly detects the "Stream ended before generation completed" error and appropriately suppresses the final completion message. The warning log provides good visibility into this shutdown scenario.
components/backends/vllm/src/dynamo/vllm/main.py (3)
33-36: LGTM! Clear documentation of shutdown behavior.The updated docstring clearly explains how the
graceful_shutdownflag affects endpoint behavior during shutdown, improving code maintainability and understanding.
116-120: LGTM! Appropriate graceful shutdown for prefill workers.The configuration correctly sets
graceful_shutdown=Truefor prefill endpoints with clear justification: prefill requests cannot be re-routed and should complete quickly due to their nature.
198-200: LGTM! Appropriate non-graceful shutdown for decode workers.The configuration correctly sets
graceful_shutdown=Falsefor decode endpoints with clear justification: decode requests support migration and can be long-running, making immediate shutdown with request transfer the preferred approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct GeneratorExit exception is raised when the request needs to be migrated to another instance.
Note: The migration requires #2270 to work.
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com> Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
…i-dynamo/dynamo into hzhou/planner-migrate-shutdown
…/planner-migrate-shutdown
Summary by CodeRabbit
New Features
Bug Fixes
Documentation