@tedzhouhk tedzhouhk commented Aug 4, 2025

Summary by CodeRabbit

  • New Features

    • Added configurable graceful shutdown for endpoints, allowing control over whether in-flight requests are completed during shutdown.
    • Introduced a migration limit setting for worker components to manage workload transitions.
  • Bug Fixes

    • Improved handling of generator termination events to provide clearer error messaging and prevent unintended completion signals when streams end prematurely.
  • Documentation

    • Updated method and function docstrings to clarify shutdown behaviors and new parameters.

copy-pr-bot bot commented Aug 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Aug 4, 2025

Walkthrough

This change introduces configurable graceful shutdown behavior for endpoints in the vLLM backend, allowing selective waiting for in-flight requests during shutdown. It adds a --migration-limit argument to the deployment YAML, updates endpoint serving logic and signatures to support the new shutdown flag, and improves error handling for Python generator exits across the Python and Rust layers.

Changes

  • Deployment Configuration (components/backends/vllm/deploy/disagg_planner.yaml):
    Added --migration-limit=3 argument to both VllmDecodeWorker and VllmPrefillWorker command invocations.
  • Python Async Generator Error Handling (components/backends/vllm/src/dynamo/vllm/handlers.py):
    Added explicit handling for asyncio.CancelledError in async generators, raising GeneratorExit with context-specific messages.
  • Endpoint Shutdown Control, Python (components/backends/vllm/src/dynamo/vllm/main.py):
    Clarified shutdown docstrings; set graceful_shutdown=True for prefill and False for decode endpoints; added debug print for migration_limit.
  • Python–Rust FFI & API Surface (lib/bindings/python/rust/engine.rs, lib/bindings/python/rust/lib.rs, lib/bindings/python/src/dynamo/_core.pyi):
    Added PyGeneratorExit error variant; updated serve_endpoint to accept a graceful_shutdown parameter (default True) in both the Rust bindings and the Python stubs.
  • Endpoint and Pipeline Shutdown, Rust (lib/runtime/src/component/endpoint.rs, lib/runtime/src/pipeline/network/ingress/push_endpoint.rs):
    Added graceful_shutdown field to the endpoint config and push endpoint; shutdown logic now conditionally waits for in-flight requests based on this flag.
  • Stream Error Propagation, Rust (lib/runtime/src/pipeline/network/ingress/push_handler.rs):
    Enhanced error handling: detects generator-exit errors and prevents sending final completion messages when a stream ends prematurely.
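The CancelledError-to-GeneratorExit conversion described for handlers.py can be sketched as a self-contained Python example. This is illustrative only: token_stream stands in for the vLLM engine's generator, and the function names and message strings are assumptions, not the actual handler code.

```python
import asyncio

async def token_stream(fail_after: int):
    # Hypothetical stand-in for the engine's token generator; it is cancelled
    # mid-stream to simulate a shutdown interrupting generation.
    for i in range(10):
        if i == fail_after:
            raise asyncio.CancelledError()
        yield f"token-{i}"

async def generate(request_id: str, fail_after: int = 2):
    # Mirrors the pattern described above: convert the cancellation into a
    # GeneratorExit with a context-specific message so the caller can tell
    # that the stream ended before generation completed.
    try:
        async for token in token_stream(fail_after):
            yield token
    except asyncio.CancelledError:
        raise GeneratorExit(
            f"request {request_id} cancelled during generation"
        ) from None

async def consume():
    received, error = [], ""
    try:
        async for token in generate("req-1"):
            received.append(token)
    except GeneratorExit as exc:
        error = str(exc)
    return received, error

received, error = asyncio.run(consume())
```

Because GeneratorExit escapes the async generator rather than a plain CancelledError, a consumer (here, the Rust binding layer in the real code) can distinguish a prematurely ended stream from an ordinary completion.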

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint
    participant Worker

    Client->>Endpoint: Send request
    Endpoint->>Worker: Dispatch request
    Worker-->>Endpoint: Stream responses (async)
    Endpoint-->>Client: Forward responses

    Note over Endpoint: During shutdown:
    alt graceful_shutdown = True
        Endpoint->>Worker: Wait for in-flight requests to finish
        Worker-->>Endpoint: Complete all responses
        Endpoint-->>Client: All responses delivered before shutdown
    else graceful_shutdown = False
        Endpoint-->>Client: Immediately stop accepting new requests
        Worker--x Endpoint: In-flight requests may be migrated or terminated
    end
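The two alt branches in the diagram can be pictured with a toy asyncio analogue. The class and method names below are invented for illustration; the real coordination is implemented in Rust in push_endpoint.rs.

```python
import asyncio

class ShutdownSketch:
    # Toy analogue of the push endpoint's shutdown coordination: an in-flight
    # counter plus an event that is set whenever no requests are outstanding.
    def __init__(self, graceful_shutdown: bool):
        self.graceful_shutdown = graceful_shutdown
        self.inflight = 0
        self.drained = asyncio.Event()
        self.drained.set()  # nothing in flight yet

    def request_started(self) -> None:
        self.inflight += 1
        self.drained.clear()

    def request_finished(self) -> None:
        self.inflight -= 1
        if self.inflight == 0:
            self.drained.set()

    async def shutdown(self) -> str:
        if self.graceful_shutdown:
            # Graceful: block until every in-flight request has completed.
            await self.drained.wait()
            return "graceful: all in-flight requests completed"
        # Immediate: stop now; remaining requests may be migrated or terminated.
        return f"immediate: {self.inflight} in-flight request(s) abandoned"

async def demo():
    prefill = ShutdownSketch(graceful_shutdown=True)
    prefill.request_started()
    # Finish the request shortly after shutdown starts waiting.
    asyncio.get_running_loop().call_later(0.01, prefill.request_finished)
    graceful_msg = await prefill.shutdown()

    decode = ShutdownSketch(graceful_shutdown=False)
    decode.request_started()
    immediate_msg = await decode.shutdown()
    return graceful_msg, immediate_msg

graceful_msg, immediate_msg = asyncio.run(demo())
```

The same flag-driven branch is what the PR threads through from serve_endpoint down to the push endpoint: prefill waits for the drain, decode returns immediately and leaves in-flight work to migration.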

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • feat: add graceful shutdown in vllm_1 #1562: Implements graceful shutdown in the vllm_v1 worker, handling signals and runtime shutdown. Both PRs address shutdown behavior, but in different components of the vLLM system.

Poem

A rabbit hops with gentle might,
Tweaking shutdowns left and right—
With endpoints now both swift and kind,
Some wait, some leave requests behind.
Migration limits set with care,
Async streams now well aware—
The warren’s code runs smooth and bright! 🐇✨

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (1)
components/backends/vllm/src/dynamo/vllm/main.py (1)

149-151: Replace debug print with proper logging.

The debug print statements with excessive exclamation marks appear temporary and are not suitable for production code. Consider using the existing logger instead.

-        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-        print(f"Migration limit: {config.migration_limit}")
-        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
+        logger.info(f"Migration limit: {config.migration_limit}")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26dc628 and c29a591.

📒 Files selected for processing (9)
  • components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/vllm/src/dynamo/vllm/handlers.py (2 hunks)
  • components/backends/vllm/src/dynamo/vllm/main.py (4 hunks)
  • lib/bindings/python/rust/engine.rs (3 hunks)
  • lib/bindings/python/rust/lib.rs (1 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
  • lib/runtime/src/component/endpoint.rs (3 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_handler.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: Kubernetes v1.33 introduced the stopSignal field as part of the official container lifecycle specification, allowing customization of termination signals without rebuilding container images. This field is legitimately placed under lifecycle and is autogenerated correctly by controller-gen when upgrading from older Kubernetes API versions.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
📚 Learning: the `create_endpoint` method in `workermetricspublisher` has backward compatibility maintained throu...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

  • lib/runtime/src/component/endpoint.rs
  • lib/bindings/python/src/dynamo/_core.pyi
  • lib/bindings/python/rust/lib.rs
📚 Learning: the sglang `async_encode` method does not support streaming options, so collecting all embeddings be...
Learnt from: t-ob
PR: ai-dynamo/dynamo#1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang `async_encode` method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.

Applied to files:

  • components/backends/vllm/src/dynamo/vllm/handlers.py
  • components/backends/vllm/src/dynamo/vllm/main.py
📚 Learning: the asyncenginecontextprovider trait in lib/runtime/src/engine.rs was intentionally changed from `se...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
  • lib/bindings/python/rust/engine.rs
  • lib/bindings/python/rust/lib.rs
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::client::requesterrorkind::nor...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: in lib/llm/src/kv_router/scoring.rs, peabrane prefers panic-based early failure over result-based er...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: the codebase uses async-nats version 0.40, not the older nats crate. error handling should use async...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
  • lib/bindings/python/rust/engine.rs
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::error::requesterrorkind::nore...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: in lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating lo...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
📚 Learning: the `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.

Applied to files:

  • components/backends/vllm/src/dynamo/vllm/main.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (21)
components/backends/vllm/deploy/disagg_planner.yaml (2)

193-193: LGTM! Migration limit configuration added correctly.

The --migration-limit=3 parameter is properly added to the VllmDecodeWorker command line arguments, aligning with the PR objective to introduce configurable graceful shutdown behavior.


243-243: LGTM! Consistent migration limit configuration.

The --migration-limit=3 parameter is properly added to the VllmPrefillWorker, maintaining consistency with the VllmDecodeWorker configuration.

lib/runtime/src/component/endpoint.rs (3)

44-46: LGTM! Well-structured graceful shutdown configuration.

The graceful_shutdown field is properly added with clear documentation and a sensible default value of true. The builder pattern integration follows the existing conventions.


62-62: LGTM! Proper field extraction following established patterns.

The graceful_shutdown field is correctly extracted using the dissolve pattern, maintaining consistency with the existing codebase structure.


116-116: LGTM! Proper integration with PushEndpoint builder.

The graceful_shutdown flag is correctly passed to the PushEndpoint builder, ensuring the configuration flows through to the endpoint implementation.

lib/bindings/python/src/dynamo/_core.pyi (1)

219-227: LGTM! Type stub properly updated with backward compatibility.

The serve_endpoint method signature is correctly updated with the optional graceful_shutdown: bool = True parameter. The documentation is clear and maintains consistency with the existing style. The default value ensures backward compatibility.
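The backward-compatible default can be illustrated with a toy stub. This mirrors only the shape of the updated signature in _core.pyi; the real serve_endpoint is implemented in Rust and exposed through PyO3, so the class below is an assumption, not the actual dynamo API.

```python
import asyncio

class EndpointStub:
    # Minimal stand-in mirroring the updated stub signature; purely illustrative.
    async def serve_endpoint(self, handler, graceful_shutdown: bool = True) -> str:
        # Mirrors the Rust side's unwrap_or(true): callers that omit the flag
        # keep the original graceful behavior, preserving backward compatibility.
        mode = "graceful" if graceful_shutdown else "immediate"
        return await handler(mode)

async def handler(mode: str) -> str:
    return f"serving with {mode} shutdown"

default_mode = asyncio.run(EndpointStub().serve_endpoint(handler))
explicit_mode = asyncio.run(
    EndpointStub().serve_endpoint(handler, graceful_shutdown=False)
)
```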

lib/bindings/python/rust/lib.rs (2)

478-491: LGTM! Proper pyo3 parameter handling with sensible defaults.

The graceful_shutdown parameter is correctly implemented using pyo3 conventions:

  • Proper signature annotation with default value
  • Option type for optional parameter handling
  • Appropriate default value fallback using unwrap_or(true)

This maintains backward compatibility while enabling explicit control over graceful shutdown behavior.


493-493: LGTM! Proper builder chain integration.

The graceful_shutdown flag is correctly integrated into the builder chain, ensuring the configuration flows through to the underlying endpoint implementation.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)

34-35: LGTM! Consistent struct field addition.

The graceful_shutdown field is properly added with appropriate builder default value and follows the established codebase patterns.


121-133: LGTM! Robust conditional shutdown implementation.

The graceful shutdown logic is well-implemented with:

  • Thread-safe atomic operations for tracking inflight requests
  • Proper coordination using notify/wait pattern
  • Clear logging for both graceful and immediate shutdown paths
  • Sensible conditional logic that respects the graceful_shutdown flag

This provides the desired flexibility in shutdown behavior while maintaining system reliability.

components/backends/vllm/src/dynamo/vllm/handlers.py (2)

53-80: LGTM! Proper cancellation handling for graceful shutdown.

The try-catch block correctly handles asyncio.CancelledError during token generation and converts it to a GeneratorExit with a descriptive message. This aligns with the broader graceful shutdown mechanism and will be properly propagated through the Rust error handling layer.


182-199: LGTM! Consistent cancellation handling for prefill workers.

The implementation mirrors the decode worker pattern with appropriate prefill-specific messaging. The comment explaining that prefill requests cannot be migrated provides valuable context for the error handling behavior.

lib/bindings/python/rust/engine.rs (3)

137-138: LGTM! New error variant for Python generator exit.

The PyGeneratorExit(String) variant is properly added to the ResponseProcessingError enum, maintaining consistency with existing error handling patterns.


231-233: LGTM! Consistent error message for downstream detection.

The hardcoded message "Stream ended before generation completed" provides a consistent way for downstream components to detect generator exit conditions, as referenced in the push handler logic.


285-294: LGTM! Proper Python exception type detection.

The implementation correctly uses PyO3's is_instance_of to distinguish between GeneratorExit and other Python exceptions, ensuring proper error categorization for downstream handling.

lib/runtime/src/pipeline/network/ingress/push_handler.rs (3)

17-17: LGTM! Import for error inspection capability.

The MaybeError trait import enables error checking on response items in the stream processing logic.


109-109: LGTM! Trait bound for error checking.

Adding the MaybeError trait bound to generic type U enables inspection of response errors in the stream processing loop.


224-231: LGTM! Proper stream termination handling.

The logic correctly detects the "Stream ended before generation completed" error and appropriately suppresses the final completion message. The warning log provides good visibility into this shutdown scenario.
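The suppression check amounts to a small predicate over the error message. The Python sketch below is illustrative: the function name is invented, and the real matching lives in the Rust push handler.

```python
from typing import Optional

# Sentinel emitted by the engine bindings when a Python generator exits early
# (the hardcoded message referenced in the review above).
GENERATOR_EXIT_SENTINEL = "Stream ended before generation completed"

def should_send_final_message(error_message: Optional[str]) -> bool:
    # Suppress the final completion message when the stream ended because the
    # generator exited prematurely; otherwise send it as usual.
    if error_message is not None and GENERATOR_EXIT_SENTINEL in error_message:
        return False
    return True

normal_case = should_send_final_message(None)
premature_exit = should_send_final_message(
    "Stream ended before generation completed"
)
```

Matching on a fixed sentinel string keeps the Python and Rust layers decoupled, at the cost of coupling both sides to the exact message text.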

components/backends/vllm/src/dynamo/vllm/main.py (3)

33-36: LGTM! Clear documentation of shutdown behavior.

The updated docstring clearly explains how the graceful_shutdown flag affects endpoint behavior during shutdown, improving code maintainability and understanding.


116-120: LGTM! Appropriate graceful shutdown for prefill workers.

The configuration correctly sets graceful_shutdown=True for prefill endpoints with clear justification: prefill requests cannot be re-routed and should complete quickly due to their nature.


198-200: LGTM! Appropriate non-graceful shutdown for decode workers.

The configuration correctly sets graceful_shutdown=False for decode endpoints with clear justification: decode requests support migration and can be long-running, making immediate shutdown with request transfer the preferred approach.

@kthui kthui left a comment
The correct GeneratorExit exception is raised when the request needs to be migrated to another instance.

Note: The migration requires #2270 to work.

tedzhouhk and others added 4 commits August 4, 2025 21:27
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
@tedzhouhk tedzhouhk merged commit 36c4ef5 into main Aug 5, 2025
10 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/planner-migrate-shutdown branch August 5, 2025 19:24
jain-ria pushed a commit that referenced this pull request Aug 7, 2025

Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>