
Conversation

@tedzhouhk
Contributor

@tedzhouhk tedzhouhk commented Aug 28, 2025

  1. Create a better dataset and SLAs that highlight the SLA planner's improvements.
  2. Bug fix: workers need to run directly in k8s (not behind a shell wrapper) so they properly receive the SIGTERM signal.
  3. Bug fix: add a graceful shutdown mode in DRT that 1) disables the endpoint, 2) waits for in-flight requests to finish, and 3) shuts down the NATS/etcd infra, so that decode workers can still call the prefill worker's generate function during this grace period (see the mermaid figures below for details). Previously, the shutdown order was 1) disable the endpoint and shut down the NATS/etcd infra immediately, 2) wait for in-flight requests to finish via the TCP stream.
  4. Bug fix: disable request migration in the planner, because a decode worker's request may be calling a prefill worker's generate endpoint at shutdown; if the request fails during this window, the prefill worker's endpoint errors out and gets stuck.
  5. Bug fix: the router was shutting down the chat/completion endpoints when workers scaled from x to x-1 (x >= 2), even though instances remained.
  6. Bug fix: use an integer timestamp in sin_synth.
  7. Nit bug fix: wrong image pull secret in the NIXL benchmark.
  8. Improvement: endpoints are served with graceful_shutdown selected automatically based on the request migration limit.
  9. Improvement: the planner load predictor now ignores the idle period at the beginning, leading to more accurate predictions.
  10. Improvement: the SLA planner prints more logs showing how each scaling decision is made.
  11. TODO: add SLA planner benchmark results (still running).
graph TD
    A["Main CancellationToken<br/>(controls NATS/ETCD)"] --> B["endpoint_shutdown_token<br/>(child of main)"]
    
    B --> C["Endpoint Token 1<br/>(no lease)"]
    B --> D["Endpoint Token 2<br/>(no lease)"]
    B --> E["Runtime Shutdown Token<br/>(for endpoint with lease)"]
    
    F["ETCD Lease Token"] --> G["Combined Token<br/>(cancelled if either parent cancels)"]
    E --> G
    
    style A fill:#ff9999
    style B fill:#ffcc99
    style G fill:#99ccff
    
    A -.->|"Phase 3: Cancels to<br/>shutdown NATS/ETCD"| A
    B -.->|"Phase 1: Cancels to stop<br/>accepting new requests"| B
    F -.->|"Cancels on<br/>lease expiry"| F
graph TD
    A[Main Token] --> B[endpoint_shutdown_token]
    A --> C[Lease Token]
    
    B --> D[runtime_shutdown_token]
    C --> E[lease_token]
    
    D --> F{Has Lease?}
    E --> F
    
    F -->|Yes| G[Combined Token<br/>Monitors both]
    F -->|No| H[Just runtime token]
    
    G --> I[Endpoint with lease]
    H --> J[Endpoint without lease]
    
    K[runtime.shutdown] -->|cancels| B
    L[Lease expires] -->|cancels| C
    
    B -->|triggers| G
    C -->|triggers| G
    
    style A fill:#f96,stroke:#333,stroke-width:4px
    style B fill:#69f,stroke:#333,stroke-width:4px
    style C fill:#9f6,stroke:#333,stroke-width:4px
    style G fill:#fcf,stroke:#333,stroke-width:3px
sequenceDiagram
    participant Signal
    participant Python Worker
    participant Runtime
    participant Tracker
    participant Endpoints
    participant NATS/ETCD
    
    Signal->>Python Worker: SIGTERM/SIGINT
    Python Worker->>Runtime: runtime.shutdown()
    
    Note over Runtime: Spawn shutdown coordination task
    
    rect rgb(255, 235, 205)
        Note over Runtime: Phase 1: Stop accepting new requests
        Runtime->>Endpoints: Cancel endpoint_shutdown_token
        Endpoints->>Endpoints: Stop accepting new requests<br/>(but keep processing in-flight)
    end
    
    rect rgb(205, 235, 255)
        Note over Runtime: Phase 2: Wait for graceful endpoints
        Runtime->>Tracker: Check active endpoints count
        
        loop While active_endpoints > 0
            Note over NATS/ETCD: Still alive for D→P calls
            Endpoints->>Endpoints: Complete in-flight requests
            Endpoints->>Tracker: unregister_endpoint()
            Tracker->>Runtime: notify_waiters()
            Runtime->>Tracker: Recheck count
        end
        
        Tracker->>Runtime: All endpoints completed (count = 0)
    end
    
    rect rgb(255, 205, 205)
        Note over Runtime: Phase 3: Shutdown infrastructure
        Runtime->>NATS/ETCD: Cancel main cancellation_token
        NATS/ETCD->>NATS/ETCD: Shutdown connections
    end
    
    Python Worker->>Signal: Process exits cleanly
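
For readers who prefer code to diagrams, the pattern condenses to a tracker plus a three-phase coordination task. The sketch below is illustrative only; the real implementation lives in lib/runtime/src/runtime.rs and lib/runtime/src/utils/graceful_shutdown.rs and differs in details (memory orderings, logging, how the task is spawned).

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;
use tokio_util::sync::CancellationToken;

// Counts endpoints that opted into graceful draining.
pub struct GracefulShutdownTracker {
    active_endpoints: AtomicUsize,
    notify: Notify,
}

impl GracefulShutdownTracker {
    pub fn new() -> Self {
        Self { active_endpoints: AtomicUsize::new(0), notify: Notify::new() }
    }
    pub fn register_endpoint(&self) {
        self.active_endpoints.fetch_add(1, Ordering::SeqCst);
    }
    pub fn unregister_endpoint(&self) {
        if self.active_endpoints.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.notify.notify_waiters();
        }
    }
    pub async fn wait_for_completion(&self) {
        loop {
            // Create the future before checking the count to avoid a lost wakeup.
            let notified = self.notify.notified();
            if self.active_endpoints.load(Ordering::SeqCst) == 0 {
                return;
            }
            notified.await;
        }
    }
}

// Three-phase shutdown, matching the sequence diagram above.
pub async fn coordinate_shutdown(
    endpoint_shutdown_token: CancellationToken, // child of the main token
    main_token: CancellationToken,              // controls NATS/etcd connections
    tracker: Arc<GracefulShutdownTracker>,
) {
    // Phase 1: stop accepting new requests.
    endpoint_shutdown_token.cancel();
    // Phase 2: wait until every registered endpoint has drained and unregistered.
    tracker.wait_for_completion().await;
    // Phase 3: tear down NATS/etcd by cancelling the main token.
    main_token.cancel();
}

Endpoints call register_endpoint() when they start serving and unregister_endpoint() after completing their in-flight requests, which is what lets Phase 2 block until the count reaches zero.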

Summary by CodeRabbit

  • New Features

    • Coordinated graceful shutdown with endpoint-level cancellation and phased runtime shutdown; endpoints register/unregister for graceful completion.
  • Refactor

    • Switched worker startup to explicit Python args; added detailed worker lifecycle logs; introduced throughput-based planning calculations and logging.
  • Bug Fixes

    • Standardized timestamps to integer milliseconds.
    • Avoided removing models when active instances still exist.
  • Tests

    • Added performance test deployments (aggregate, disaggregated, tensor-parallel, planner) and an image-cache DaemonSet.
  • Documentation

    • Updated planner README with new workloads, TTFT, and E2E perf test guidance.
  • Chores

    • Updated image pull secret naming in a benchmark manifest.

@copy-pr-bot

copy-pr-bot bot commented Aug 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@hhzhang16 hhzhang16 left a comment


Approved regarding planner changes only

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Changes span deployments, planner logic, benchmarking utilities, runtime shutdown/cancellation plumbing (Rust), and backend workers (VLLM/SGLang). Highlights: new graceful-shutdown tracker and cancellation token layering in runtime/endpoint, updated YAML invocations, timestamp cast to int, throughput-based planner calculations, adjusted model deletion behavior, and added perf test manifests/docs.

Changes

Changes by cohort / file(s):

  • Benchmark configs (benchmarks/nixl/nixl-benchmark-deployment.yaml): Rename imagePullSecrets from nvcrimagepullsecret to nvcr-imagepullsecret.
  • Benchmark generator (benchmarks/sin_load_generator/sin_synth.py): Output JSON now uses an integer millisecond timestamp (int(t_req * 1000)); comment updated.
  • SGLang deploy args refactor (components/backends/sglang/deploy/disagg_planner.yaml): Switch from shell command to python3 with explicit args; adds disaggregation mode flags; no image/replica changes.
  • SGLang serve call change (components/backends/sglang/src/dynamo/sglang/main.py): serve_endpoint now receives graceful_shutdown=config.migration_limit <= 0 instead of graceful_shutdown=False; potential undefined symbol flagged.
  • vLLM deploy args refactor (components/backends/vllm/deploy/disagg_planner.yaml): Switch to python3 with arg list; remove --migration-limit=3; keep worker roles and probes.
  • vLLM worker runtime (components/backends/vllm/src/dynamo/vllm/main.py): Added detailed lifecycle logs; decode worker now uses graceful_shutdown = (config.migration_limit <= 0) for serve_endpoint.
  • Planner predictors (components/planner/src/dynamo/planner/utils/load_predictor.py): Coerce NaN to 0; skip initial zero to avoid the idle period; Prophet buffer clarified and structured as dicts.
  • Planner scaling core (components/planner/src/dynamo/planner/utils/planner_core.py): Introduces explicit throughput variables for prefill/decode; logs intermediate throughput and replica counts; same return shape.
  • Model discovery delete behavior (lib/llm/src/discovery/watcher.rs): If active instances exist, early-return without removing per-type models or emitting updates; unchanged path when none remain.
  • Endpoint cancellation plumbing (lib/runtime/src/component/endpoint.rs, lib/runtime/src/pipeline/network/ingress/push_endpoint.rs): Add CancellationToken support to PushEndpoint builder and endpoint task; combined lease/runtime cancellation; register/unregister with graceful-shutdown tracker; refined logs.
  • Runtime shutdown layering (lib/runtime/src/runtime.rs, lib/runtime/src/lib.rs, lib/runtime/src/distributed.rs): Add endpoint_shutdown_token and GracefulShutdownTracker to Runtime; three-phase shutdown (stop endpoints, await graceful completion, cancel main token); expose tracker accessors; minor Mutex type fq-path change.
  • Graceful shutdown utilities (lib/runtime/src/utils.rs, lib/runtime/src/utils/graceful_shutdown.rs): New GracefulShutdownTracker (counter + Notify) with register/unregister/wait; re-export from utils.
  • Planner perf tests and docs (tests/planner/README.md, tests/planner/perf_test_configs/*): New perf deployment manifests (agg/disagg variants, TP2, image cache DaemonSet); updated Quick Start, datasets, TTFT, commands, and probes/images.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin as Admin/Orchestrator
  participant RT as Runtime
  participant GST as GracefulShutdownTracker
  participant EP as Endpoint Task
  participant PE as PushEndpoint

  Note over RT,GST: Startup
  Admin->>RT: start()
  RT->>RT: create endpoint_shutdown_token (child of main token)
  RT->>GST: new()
  RT->>EP: spawn with cancellation_token (lease ∨ runtime)
  EP->>GST: register_endpoint()
  EP->>PE: PushEndpoint::builder().cancellation_token(token).start()

  Note over RT,EP: Shutdown (3 phases)
  Admin->>RT: shutdown()
  RT->>RT: cancel endpoint_shutdown_token (stop new work)
  RT->>GST: wait_for_completion() (Phase 2)
  par Endpoint completes
    PE-->>EP: cancellation signal
    EP->>PE: stop().await
    EP->>GST: unregister_endpoint()
  and Possibly others
    EP-->>GST: unregister_endpoint()
  end
  GST-->>RT: completion when count==0
  RT->>RT: cancel main token (NATS/ETCD) (Phase 3)
  RT-->>Admin: shutdown complete
sequenceDiagram
  autonumber
  participant V as vLLM Decode Worker
  participant SE as serve_endpoint
  participant C as Config

  V->>C: read migration_limit
  alt migration_limit <= 0
    V->>SE: serve_endpoint(graceful_shutdown=true)
  else
    V->>SE: serve_endpoint(graceful_shutdown=false)
  end
  SE-->>V: runs until cancelled
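
At the call site, this decision is a single boolean passed to serve_endpoint. A minimal sketch (the wrapper function and argument names are illustrative, not the exact code in components/backends/vllm/src/dynamo/vllm/main.py):

async def serve_decode(endpoint, handler, config):
    # migration_limit <= 0: requests cannot migrate, so drain them in place
    # before shutting down; otherwise shut down immediately and rely on migration.
    graceful = config.migration_limit <= 0
    await endpoint.serve_endpoint(handler.generate, graceful_shutdown=graceful)
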
sequenceDiagram
  autonumber
  participant W as ModelWatcher
  participant Store as Model Registry
  participant Bus as Update Stream

  W->>Store: on delete(model)
  alt active instances exist
    W-->>Store: return Ok(None)
    Note right of W: No per-type removals or updates emitted
  else
    W->>Store: remove per-type models
    W->>Bus: emit ModelUpdate::Removed
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes


Poem

In hush of nodes, I twitch an ear—
Tokens ripple, shutdown’s near.
Trackers count, then softly wait,
Endpoints bow to cancel’s gate.
Prefill, decode, they take their cue—
I thump: “All green!” and hop on through. 🐇✨



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
components/planner/src/dynamo/planner/utils/load_predictor.py (2)

121-127: Bug: step_size is ignored; timestamps advance by 1s regardless.

Prophet inputs should reflect the configured cadence. Use step_size when building ds.

Apply:

@@
-        self.step_size = step_size
+        self.step_size = step_size
@@
-        timestamp = self.start_date + timedelta(seconds=self.curr_step)
+        timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
@@
-        next_timestamp = self.start_date + timedelta(seconds=self.curr_step)
+        next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)

Also applies to: 131-132, 165-167


158-170: Harden Prophet path: guard fit/predict and fallback to last value.

Prophet can fail on edge windows; ARIMA already has a safe fallback—mirror that here.

Apply:

-        model = Prophet()
-
-        # Fit the model
-        model.fit(df)
-
-        # Create future dataframe for next timestamp
-        next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
-        future_df = pd.DataFrame({"ds": [next_timestamp]})
-
-        # Make prediction
-        forecast = model.predict(future_df)
-        return forecast["yhat"].iloc[0]
+        model = Prophet()
+        try:
+            model.fit(df)
+            next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
+            future_df = pd.DataFrame({"ds": [next_timestamp]})
+            forecast = model.predict(future_df)
+            return forecast["yhat"].iloc[0]
+        except Exception as e:
+            logger.warning(f"Prophet prediction failed: {e}, using last value")
+            return self.get_last_value()
tests/planner/perf_test_configs/disagg_8b_tp2.yaml (1)

1-148: Remove remaining shell-wrapped commands
Tests and Helm values still invoke containers via /bin/sh -c. Update them to call the binaries directly—for example, in

  • tests/planner/perf_test_configs/disagg_8b_tp2.yaml (lines 25–26, 44–45, 94–95, 144–145)
  • deploy/inference-gateway/helm/dynamo-gaie/values.yaml (line 80)
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)

59-75: Prioritize cancellation in biased select.

With biased;, earlier branches win. Place cancelled() before endpoint.next() to avoid accepting new requests when cancel fires.

-            let req = tokio::select! {
-                biased;
-
-                // await on service request
-                req = endpoint.next() => {
-                    req
-                }
-
-                // process shutdown
-                _ = self.cancellation_token.cancelled() => {
+            let req = tokio::select! {
+                biased;
+                // process shutdown
+                _ = self.cancellation_token.cancelled() => {
                     tracing::info!("PushEndpoint received cancellation signal, shutting down service");
                     if let Err(e) = endpoint.stop().await {
                         tracing::warn!("Failed to stop NATS service: {:?}", e);
                     }
                     break;
-                }
+                }
+                // await on service request
+                req = endpoint.next() => {
+                    req
+                }
             };
lib/runtime/src/component/endpoint.rs (1)

213-226: Ensure cleanup on etcd registration failure (cancel and await the task)

Currently you cancel but return early without awaiting the spawned endpoint task, which can leave resources alive longer and delay tracker decrement. Await the task after cancel for deterministic cleanup.

-        if let Some(etcd_client) = &etcd_client
-            && let Err(e) = etcd_client
-                .kv_create(
-                    &etcd_path,
-                    info,
-                    Some(lease_id),
-                )
-                .await
-        {
-            tracing::error!("Failed to register discoverable service: {:?}", e);
-            cancel_token.cancel();
-            return Err(error!("Failed to register discoverable service"));
-        }
-        task.await??;
+        // etcd registration: if it fails, cancel and ensure the endpoint task exits before returning
+        if let Some(etcd_client) = &etcd_client {
+            if let Err(e) = etcd_client
+                .kv_create(&etcd_path, info, Some(lease_id))
+                .await
+            {
+                tracing::error!("Failed to register discoverable service at {}: {:?}", etcd_path, e);
+                cancel_token.cancel();
+                // Ensure graceful cleanup and tracker unregistration
+                let _ = task.await;
+                return Err(error!("Failed to register discoverable service"));
+            }
+        }
+        task.await??;
♻️ Duplicate comments (2)
lib/runtime/src/lib.rs (1)

67-68: Tracker moved out of lib.rs — import looks right

Importing GracefulShutdownTracker from utils resolves earlier feedback about file placement. No concerns.

lib/runtime/src/component/endpoint.rs (1)

118-151: Lease/runtime combined cancellation is correct and lightweight

Using tokio::select! on the secondary runtime matches prior feedback and avoids blocking primary workers.

🧹 Nitpick comments (40)
tests/planner/README.md (7)

48-58: Fix TTFT inconsistency in example output.

Command now passes --ttft 0.2, but the “output” header still shows TTFT=0.1s. Update for consistency.

- TTFT=0.1s, ITL=0.01s
+ TTFT=0.2s, ITL=0.01s

70-88: Align narrative with new 5–45 rps dataset.

Text says “varying between 12 to 36 request/s” but the commands and description use 5–45. Update the sentence to avoid confusing readers.

- To test planner's performance for different request rates, we can generate a load dataset with request rate varying between 12 to 36 request/s.
+ To test planner's performance for different request rates, we generate a load dataset with request rate varying between 5 and 45 request/s.

Also confirm that the intended scale target (“planner should scale between 1P1D and 3P3D”) remains valid with the wider 5–45 rps range.


108-120: Minor wording: prefer “dry-run” as an adjective.

“SLA planner” used as a noun; hyphenate “dry-run” for clarity.

- For example, to dry run SLA planner for the previous FP8 8B on H200 using the generated `rr-5-45_i3000o300.jsonl` dataset,
+ For example, to dry-run the SLA planner for the previous FP8 8B on H200 using the generated `rr-5-45_i3000o300.jsonl` dataset,

142-143: Capitalize “SLA planner”.

Maintain consistent capitalization across the doc.

- 3. **End-to-End Perf Tests** (see instructions below) - Compare performance (goodput and goodput/GPU) on deployments with and without sla planner
+ 3. **End-to-End Perf Tests** (see instructions below) - Compare performance (goodput and goodput/GPU) on deployments with and without SLA planner

194-206: Spelling/wording nits (“dryrun” → “dry run”).

Minor grammar/polish for professionalism.

- In this test, we compare performance (goodput and goodput/GPU) on deployments on the following four deployments using the aforementioned 8b FP8 model on H200 and the dataset used in dryrun:
+ In this test, we compare performance (goodput and goodput/GPU) across the following four deployments using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:

211-216: Capitalize “SLA planner” and remove hyphen.

Use the same form throughout the doc.

- When running deployment with sla-planner, to reduce the image pulling time, deploy a `DaemonSet` to cache the image in advance:
+ When running a deployment with the SLA planner, to reduce image pull time, deploy a `DaemonSet` to cache the image in advance:

232-243: Context for result reproducibility.

Add a short note stating test date, genai-perf version, model build/commit, and cluster specs (GPU type/count, driver, CUDA, NCCL) used to produce the figure/table so others can reproduce.

Happy to draft a “Repro Environment” subsection if you share the exact versions.

components/planner/src/dynamo/planner/utils/planner_core.py (1)

299-311: Cache decode engine cap and guard against zero/invalid capacity from interpolator

Minor: reuse computed capacity in logging. Optional: defensive guard if interpolator ever returns 0/NaN to avoid div-by-zero.

Apply:

-        pred_decode_throughput = next_num_req * next_osl / self.args.adjustment_interval
-        next_num_d = math.ceil(
-            pred_decode_throughput
-            / pred_decode_thpt_per_gpu
-            / self.args.decode_engine_num_gpu
-        )
-
-        logger.info(
-            f"Decode calculation: {pred_decode_throughput:.2f}(d_thpt) / "
-            f"{pred_decode_thpt_per_gpu * self.args.decode_engine_num_gpu:.2f}(d_engine_cap) = "
-            f"{next_num_d}(num_d)"
-        )
+        pred_decode_throughput = next_num_req * next_osl / self.args.adjustment_interval
+        d_engine_cap = pred_decode_thpt_per_gpu * self.args.decode_engine_num_gpu
+        if not math.isfinite(d_engine_cap) or d_engine_cap <= 0:
+            logger.warning("Decode engine capacity <= 0; defaulting to min endpoint to avoid div-by-zero")
+            next_num_d = self.args.min_endpoint
+        else:
+            next_num_d = math.ceil(pred_decode_throughput / d_engine_cap)
+
+        logger.info(
+            f"Decode calculation: {pred_decode_throughput:.2f}(d_thpt) / "
+            f"{d_engine_cap:.2f}(d_engine_cap) = {next_num_d}(num_d)"
+        )
components/planner/src/dynamo/planner/utils/load_predictor.py (2)

45-47: Use pd.isna for broader NaN/None handling.

math.isnan misses None and some numpy dtypes. pd.isna handles NaN/None robustly.

Apply:

-        if math.isnan(value):
+        if pd.isna(value):
             value = 0

133-136: De-duplicate idle-skip logic between Base and Prophet.

Prophet.add_data_point reimplements NaN→0 and initial-zero skip. Factor this into a Base helper to keep one source of truth.

Example:

# In BasePredictor
def _normalize_or_skip(self, value):
    v = 0 if pd.isna(value) else value
    return None if len(self.data_buffer) == 0 and v == 0 else v

Then:

# BasePredictor.add_data_point
-        if pd.isna(value):
-            value = 0
-        if len(self.data_buffer) == 0 and value == 0:
-            return
-        else:
-            self.data_buffer.append(value)
+        v = self._normalize_or_skip(value)
+        if v is None:
+            return
+        self.data_buffer.append(v)
# ProphetPredictor.add_data_point
-        value = 0 if math.isnan(value) else value
-        if len(self.data_buffer) == 0 and value == 0:
-            return
-        self.data_buffer.append({"ds": timestamp, "y": value})
+        v = self._normalize_or_skip(value)
+        if v is None:
+            return
+        self.data_buffer.append({"ds": timestamp, "y": v})
benchmarks/sin_load_generator/sin_synth.py (2)

101-106: Fix typo in comment: “poison” → “Poisson”

Minor doc correctness nit.

-    # from a poison distribution with the following parameters:
+    # from a Poisson distribution with the following parameters:

54-54: Make ms timestamp truncation explicit

At benchmarks/sin_load_generator/sin_synth.py:54, either use math.floor(t_req * 1000) with a comment like # floored to ms or, if you keep int(), update the comment to # truncates to ms to clarify intent.
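
For the non-negative request times produced here the two options agree; they only diverge on negative inputs, e.g.:

import math

t_req = 1.2345                      # request time in seconds (arbitrary example)
print(int(t_req * 1000))            # 1234 -- truncates toward zero
print(math.floor(t_req * 1000))     # 1234 -- rounds toward negative infinity
print(int(-0.5), math.floor(-0.5))  # 0 -1 -- they differ only for negative values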

components/backends/vllm/deploy/disagg_planner.yaml (1)

124-130: Direct python exec LGTM; add readiness probe for steady-state gating

StartupProbe is present, but no readinessProbe. Add an HTTP readiness probe on /health so the Service only routes after the worker is ready.

         mainContainer:
           startupProbe:
             httpGet:
               path: /health
               port: 9090
             periodSeconds: 10
             failureThreshold: 60
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            timeoutSeconds: 2
tests/planner/perf_test_configs/image_cache_daemonset.yaml (1)

4-41: Harden the DaemonSet: run as non-root and drop privileges

Address CKV_K8S_20/23 by adding pod/container security contexts; this is low-risk for a sleep container and improves cluster posture.

 spec:
   selector:
     matchLabels:
       app: vllm-runtime-image-cache
   template:
     metadata:
       labels:
         app: vllm-runtime-image-cache
     spec:
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 65532
+        runAsGroup: 65532
+        fsGroup: 65532
       imagePullSecrets:
       - name: nvcr-imagepullsecret
       containers:
       - name: image-cache
         image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0902-01
+        securityContext:
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
+          capabilities:
+            drop: ["ALL"]
         command:
         - /bin/sh
         - -c
         - "sleep infinity"
         resources:
           requests:
             cpu: "10m"
             memory: "64Mi"
           limits:
             cpu: "100m"
             memory: "128Mi"
components/backends/sglang/src/dynamo/sglang/main.py (1)

130-135: Typo in log message

“succesfully” → “successfully”.

-            logging.info("Metrics task succesfully cancelled")
+            logging.info("Metrics task successfully cancelled")
components/backends/sglang/deploy/disagg_planner.yaml (4)

126-127: Reassess --trust-remote-code usage

This flag is often required for Qwen, but it’s a security footgun in shared clusters. Consider gating it behind an env toggle or documenting why it’s safe in this image/namespace.

If Qwen3 requires it in your image, add a YAML comment noting the requirement to prevent later “hardening” PRs from breaking boot.

Also applies to: 157-158


56-66: Planner still uses /bin/sh -c; switch to exec form for consistency

Not critical (it’s not the long-running model worker), but aligning everything to exec form reduces shutdown variability and avoids zombie shells.

Apply:

-          command:
-            - /bin/sh
-            - -c
-          args:
-            - >-
-              python3 -m planner_sla
-              --environment=kubernetes
-              --backend=sglang
-              --adjustment-interval=60
-              --profile-results-dir=/workspace/profiling_results
+          command: ["python3"]
+          args:
+            - -m
+            - planner_sla
+            - --environment=kubernetes
+            - --backend=sglang
+            - --adjustment-interval=60
+            - --profile-results-dir=/workspace/profiling_results

96-101: Prometheus sidecar: avoid shell wrapper

Shell wrapping here is unnecessary. Use exec form.

-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.planner.prometheus"
+          command: ["python3"]
+          args: ["-m", "dynamo.planner.prometheus"]

29-47: Liveness/readiness are unconditional exit 0

These probes always pass and can mask real startup issues.

Consider a lightweight HTTP health endpoint (or a short Python check) instead of exit 0.

Also applies to: 73-91

tests/planner/perf_test_configs/disagg_8b_2p2d.yaml (2)

93-101: Frontend still uses shell wrapper; switch to exec form

Not strictly blocking, but aligns with the PR’s shutdown goals and avoids a PID 1 shell.

-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000 --kv-cache-block-size 128"
+          command: ["python3"]
+          args: ["-m", "dynamo.frontend", "--http-port", "8000", "--kv-cache-block-size", "128"]

53-67: Probe thresholds are generous; ensure terminationGracePeriodSeconds ≥ worst-case drain time

During scale-down, decode workers may take time to migrate/drain. Consider setting terminationGracePeriodSeconds to match your max OSL drain to avoid K8s SIGKILL.

Also applies to: 103-117

tests/planner/perf_test_configs/agg_8b.yaml (1)

40-48: Frontend can also use exec form

Optional consistency tweak.

-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000"
+          command: ["python3"]
+          args: ["-m", "dynamo.frontend", "--http-port", "8000"]
tests/planner/perf_test_configs/disagg_8b_3p1d.yaml (1)

40-48: Frontend: optional exec-form cleanup

For consistency across components.

-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000 --kv-cache-block-size 128"
+          command: ["python3"]
+          args: ["-m", "dynamo.frontend", "--http-port", "8000", "--kv-cache-block-size", "128"]
tests/planner/perf_test_configs/disagg_8b_planner.yaml (2)

29-38: Avoid exec/curl+jq in readiness; prefer httpGet probe.

Exec depends on curl/jq presence and is heavier. If /health returns non-200 when unready, switch to httpGet.

-      readinessProbe:
-        exec:
-          command:
-            - /bin/sh
-            - -c
-            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
+      readinessProbe:
+        httpGet:
+          path: /health
+          port: 8000
         initialDelaySeconds: 60
         periodSeconds: 60
         timeoutSeconds: 30
         failureThreshold: 10

131-140: Expose Prometheus container port explicitly.

Not required but improves introspection and service mapping.

       extraPodSpec:
         mainContainer:
           image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0902-01
           workingDir: /workspace/components/backends/vllm
+          ports:
+            - name: metrics
+              containerPort: 8000
           command:
             - /bin/sh
             - -c
           args:
             - "python3 -m dynamo.planner.prometheus"
tests/planner/perf_test_configs/disagg_8b_tp2.yaml (3)

83-91: Add generous terminationGracePeriodSeconds for graceful drain (consistent with other config).

       extraPodSpec:
+        terminationGracePeriodSeconds: 600
         mainContainer:
           startupProbe:

Also applies to: 133-141, 176-196, 231-252


54-60: Liveness is too aggressive (failureThreshold: 1).

Single transient failure can kill pods under load spikes. Use 3.

-        failureThreshold: 1
+        failureThreshold: 3

Also applies to: 104-110


23-31: Frontend readiness: prefer httpGet over curl+jq.

Reduces image deps and probe overhead.

-      readinessProbe:
-        exec:
-          command:
-            - /bin/sh
-            - -c
-            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
+      readinessProbe:
+        httpGet:
+          path: /health
+          port: 8000
         initialDelaySeconds: 60
         periodSeconds: 60
         timeoutSeconds: 30
         failureThreshold: 10
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)

69-69: Enrich cancellation log with context.

Include component/endpoint to aid triage.

-                    tracing::info!("PushEndpoint received cancellation signal, shutting down service");
+                    tracing::info!(
+                        component = %component_name_local,
+                        endpoint = %endpoint_name_local,
+                        "PushEndpoint received cancellation signal, shutting down service"
+                    );

80-82: Minor wording fix in warn log.

"suspect the requester has shut down" reads better.

-                        "Failed to respond to request; this may indicate the request has shutdown: {:?}",
+                        "Failed to respond to request; requester may have shut down: {:?}",
lib/runtime/src/lib.rs (1)

87-89: Document the new shutdown fields

Add brief doc-comments to clarify roles of the two tokens and the tracker.

Apply this diff:

   cancellation_token: CancellationToken,
-  endpoint_shutdown_token: CancellationToken,
-  graceful_shutdown_tracker: Arc<GracefulShutdownTracker>,
+  /// Token used to quiesce endpoints (stop accepting new work) during graceful shutdown.
+  endpoint_shutdown_token: CancellationToken,
+  /// Tracks in-flight endpoint work; awaited between endpoint quiesce and full runtime cancel.
+  graceful_shutdown_tracker: Arc<GracefulShutdownTracker>,
lib/runtime/src/runtime.rs (2)

34-36: Clean up unused imports

Ordering, OnceCell, signal, JoinHandle, and Mutex aren’t used in this file. Trim to avoid warnings.

Apply this diff:

-use once_cell::sync::OnceCell;
-use std::sync::{Arc, atomic::Ordering};
-use tokio::{signal, task::JoinHandle, sync::Mutex};
+use std::sync::Arc;

126-156: Tighten Phase 2 wait and fix misleading comment

  • Comment says “BEFORE cancelling tokens” but cancel happens inside the spawned task.
  • Call wait_for_completion unconditionally; it’s cheap when count==0 and removes a race where count hits zero just before the check.

Apply this diff:

-        // Spawn the shutdown coordination task BEFORE cancelling tokens
+        // Spawn the shutdown coordination task; tokens are cancelled within the task to enforce phase ordering
         let tracker = self.graceful_shutdown_tracker.clone();
         let main_token = self.cancellation_token.clone();
         let endpoint_token = self.endpoint_shutdown_token.clone();

         // Use the runtime handle to spawn the task
         let handle = self.primary();
         handle.spawn(async move {
             // Phase 1: Cancel endpoint shutdown token to stop accepting new requests
             tracing::info!("Phase 1: Cancelling endpoint shutdown token");
             endpoint_token.cancel();

             // Phase 2: Wait for all graceful endpoints to complete
             tracing::info!("Phase 2: Waiting for graceful endpoints to complete");
-
-            let count = tracker.get_count();
-            tracing::info!("Active graceful endpoints: {}", count);
-
-            if count != 0 {
-                tracker.wait_for_completion().await;
-            }
+            tracing::info!("Active graceful endpoints (at start of wait): {}", tracker.get_count());
+            tracker.wait_for_completion().await;

             // Phase 3: Now shutdown NATS/ETCD by cancelling the main token
             tracing::info!("Phase 3: All graceful endpoints completed, shutting down NATS/ETCD connections");
             main_token.cancel();
         });
lib/runtime/src/utils/graceful_shutdown.rs (4)

41-45: Use AcqRel for counter updates; reduce log verbosity

SeqCst is stronger than needed. AcqRel is sufficient here; consider trace-level logs for hot paths.

Apply this diff:

-    pub(crate) fn register_endpoint(&self) {
-        let count = self.active_endpoints.fetch_add(1, Ordering::SeqCst);
-        tracing::debug!("Endpoint registered, total active: {} -> {}", count, count + 1);
-    }
+    pub(crate) fn register_endpoint(&self) {
+        let count = self.active_endpoints.fetch_add(1, Ordering::AcqRel);
+        tracing::trace!("Endpoint registered, total active: {} -> {}", count, count + 1);
+    }

61-79: Lighter memory ordering in wait loop

Acquire loads are sufficient when paired with AcqRel updates; keeps semantics while reducing constraints.

Apply this diff:

-            let count = self.active_endpoints.load(Ordering::SeqCst);
+            let count = self.active_endpoints.load(Ordering::Acquire);

81-82: Remove stray comment

Leftover comment is misleading.

Apply this diff:

-    // This method is no longer needed since we can access the tracker directly
 }

41-55: Optional: RAII guard to ensure balanced register/unregister

Returning a guard from registration prevents leaks on early returns/panics; Drop handles the unregister.

// Add in this module:
pub(crate) struct EndpointGuard(Arc<GracefulShutdownTracker>);

impl Drop for EndpointGuard {
    fn drop(&mut self) {
        self.0.unregister_endpoint();
    }
}

impl GracefulShutdownTracker {
    pub(crate) fn register_guard(self: &Arc<Self>) -> EndpointGuard {
        self.register_endpoint();
        EndpointGuard(self.clone())
    }
}

Usage in endpoint code:

let guard = tracker.register_guard();
// do work; guard drop will unregister
lib/runtime/src/component/endpoint.rs (3)

118-151: Optional: factor token linking into a small helper

To reduce repetition and make intent obvious, consider a helper like link_cancellation(lease_token, runtime_shutdown_token) -> CancellationToken.
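
A minimal sketch of such a helper, combining the two parents with tokio::select! in a small watcher task (signature as suggested above; not existing code in the crate):

use tokio_util::sync::CancellationToken;

// Returns a token that is cancelled as soon as either parent token is cancelled.
fn link_cancellation(
    lease_token: CancellationToken,
    runtime_shutdown_token: CancellationToken,
) -> CancellationToken {
    let combined = CancellationToken::new();
    let child = combined.clone();
    tokio::spawn(async move {
        tokio::select! {
            _ = lease_token.cancelled() => {},
            _ = runtime_shutdown_token.cancelled() => {},
        }
        child.cancel();
    });
    combined
}

Each endpoint task could then select on the single combined token instead of wiring both parents everywhere.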


181-199: Spawn on the intended runtime, or update the comment

Comment says “launch in primary runtime” but uses tokio::spawn. If available, prefer the explicit primary().spawn to match intent.

-        let task = tokio::spawn(async move {
+        let task = endpoint.drt().runtime().primary().spawn(async move {

If primary().spawn isn’t available, change the comment to “launch on current runtime”.


181-199: Optional: ensure tracker unregisters even if the task panics

Consider an RAII guard (e.g., scopeguard) inside the spawned task so unregister executes on all exit paths, including panics.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 87a721a and 1e4b5c3.

⛔ Files ignored due to path filters (2)
  • tests/planner/figures/dryrun_plot.png is excluded by !**/*.png
  • tests/planner/figures/sla_planner_perf.png is excluded by !**/*.png
📒 Files selected for processing (23)
  • benchmarks/nixl/nixl-benchmark-deployment.yaml (1 hunks)
  • benchmarks/sin_load_generator/sin_synth.py (1 hunks)
  • components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/sglang/src/dynamo/sglang/main.py (1 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/vllm/src/dynamo/vllm/main.py (4 hunks)
  • components/planner/src/dynamo/planner/utils/load_predictor.py (2 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (2 hunks)
  • lib/llm/src/discovery/watcher.rs (0 hunks)
  • lib/runtime/src/component/endpoint.rs (3 hunks)
  • lib/runtime/src/distributed.rs (2 hunks)
  • lib/runtime/src/lib.rs (4 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1 hunks)
  • lib/runtime/src/runtime.rs (4 hunks)
  • lib/runtime/src/utils.rs (1 hunks)
  • lib/runtime/src/utils/graceful_shutdown.rs (1 hunks)
  • tests/planner/README.md (7 hunks)
  • tests/planner/perf_test_configs/agg_8b.yaml (1 hunks)
  • tests/planner/perf_test_configs/disagg_8b_2p2d.yaml (1 hunks)
  • tests/planner/perf_test_configs/disagg_8b_3p1d.yaml (1 hunks)
  • tests/planner/perf_test_configs/disagg_8b_planner.yaml (1 hunks)
  • tests/planner/perf_test_configs/disagg_8b_tp2.yaml (1 hunks)
  • tests/planner/perf_test_configs/image_cache_daemonset.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • lib/llm/src/discovery/watcher.rs
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-06-03T15:26:55.732Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

Applied to files:

  • benchmarks/nixl/nixl-benchmark-deployment.yaml
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.

Applied to files:

  • lib/runtime/src/component/endpoint.rs
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

  • lib/runtime/src/lib.rs
📚 Learning: 2025-06-17T00:50:44.845Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Applied to files:

  • lib/runtime/src/lib.rs
🧬 Code graph analysis (7)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/runtime.rs (1)
  • graceful_shutdown_tracker (122-124)
components/planner/src/dynamo/planner/utils/planner_core.py (1)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (2)
  • interpolate_thpt_per_gpu (52-54)
  • interpolate_thpt_per_gpu (141-145)
components/backends/vllm/src/dynamo/vllm/main.py (5)
components/backends/trtllm/src/dynamo/trtllm/main.py (2)
  • init (105-325)
  • graceful_shutdown (48-51)
examples/multimodal/components/worker.py (6)
  • init (434-466)
  • generate (181-184)
  • generate (199-228)
  • generate (265-395)
  • graceful_shutdown (398-407)
  • clear_kv_blocks (186-191)
lib/bindings/python/rust/lib.rs (2)
  • serve_endpoint (522-548)
  • generate (706-718)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • serve_endpoint (220-230)
  • generate (1222-1245)
components/backends/vllm/src/dynamo/vllm/handlers.py (4)
  • generate (40-41)
  • generate (133-209)
  • generate (216-255)
  • clear_kv_blocks (43-48)
components/backends/sglang/src/dynamo/sglang/main.py (6)
components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (1)
  • generate (56-94)
components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • generate (32-33)
components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1)
  • generate (49-70)
components/backends/vllm/src/dynamo/vllm/main.py (1)
  • graceful_shutdown (54-63)
components/backends/trtllm/src/dynamo/trtllm/main.py (1)
  • graceful_shutdown (48-51)
lib/llm/src/local_model.rs (1)
  • migration_limit (150-153)
lib/runtime/src/utils/graceful_shutdown.rs (1)
lib/runtime/src/runtime.rs (1)
  • new (40-67)
lib/runtime/src/component/endpoint.rs (5)
lib/runtime/src/component.rs (9)
  • endpoint (229-236)
  • subject (492-494)
  • etcd_path (212-215)
  • etcd_path (450-457)
  • etcd_path (593-595)
  • new (566-572)
  • name (221-223)
  • name (424-426)
  • name (597-602)
lib/runtime/src/distributed.rs (2)
  • etcd_client (256-258)
  • new (54-171)
lib/runtime/src/runtime.rs (1)
  • new (40-67)
lib/runtime/src/utils/graceful_shutdown.rs (1)
  • new (34-39)
lib/runtime/src/lib.rs (2)
  • new (107-126)
  • new (211-216)
lib/runtime/src/runtime.rs (2)
lib/runtime/src/utils/graceful_shutdown.rs (1)
  • new (34-39)
lib/runtime/src/distributed.rs (4)
  • new (54-171)
  • child_token (260-262)
  • graceful_shutdown_tracker (264-266)
  • shutdown (198-200)
🪛 Checkov (3.2.334)
tests/planner/perf_test_configs/image_cache_daemonset.yaml

[MEDIUM] 4-41: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 4-41: Minimize the admission of root containers

(CKV_K8S_23)

🪛 LanguageTool
tests/planner/README.md

[grammar] ~108-~108: There might be a mistake here.
Context: ...o_output_plot> ``` For example, to dry run SLA planner for the previous FP8 8B on ...

(QB_NEW_EN)


[grammar] ~196-~196: Ensure spelling is correct
Context: ...8 model on H200 and the dataset used in dryrun: - Config 1 with inefficient P/D ratio: 3xT...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~238-~238: There might be a mistake here.
Context: ... Improvement | Goodput/GPU Improvement | |---------------|-----------------|-----...

(QB_NEW_EN)


[grammar] ~239-~239: There might be a mistake here.
Context: ...-------------|-------------------------| | Inefficient P/D ratio | 725% | 600% | ...

(QB_NEW_EN)


[grammar] ~240-~240: There might be a mistake here.
Context: ... | Inefficient P/D ratio | 725% | 600% | | Inefficient parallelization mapping | ...

(QB_NEW_EN)


[grammar] ~241-~241: There might be a mistake here.
Context: ... parallelization mapping | 311% | 249% | | Best static deployment | 52% | 29% | ...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (25)
tests/planner/README.md (1)

220-230: Fix bash multiline comments and ensure URL includes scheme

components/planner/src/dynamo/planner/utils/load_predictor.py (1)

48-53: Idle-period skip in Base looks good.

This cleanly trims initial zeros to ignore cold-start noise.

benchmarks/nixl/nixl-benchmark-deployment.yaml (1)

18-18: Secret rename looks good; please verify namespace presence

Confirm that Secret "nvcr-imagepullsecret" exists in the target namespace used for this Deployment to avoid ImagePullBackOff at rollout time.

components/backends/vllm/deploy/disagg_planner.yaml (1)

126-130: Confirm migration-limit default is disabled when flag removed

Planner-side request migration is being disabled in this PR; verify backend default for migration_limit is effectively 0 once the flag is removed. If not, set it explicitly to 0.

Optional explicit pin (only if needed):

             - --model
             - Qwen/Qwen3-0.6B
+            - --migration-limit
+            - "0"

Also applies to: 151-156

components/backends/sglang/deploy/disagg_planner.yaml (2)

114-131: Good move: switch workers to exec form for proper SIGTERM delivery

Using command: ["python3"] with explicit args eliminates the shell wrapper and ensures the Python PID receives SIGTERM directly.


145-162: Prefill worker uses exec form as well — consistent and correct

Same benefit for graceful shutdown and signal handling here. Looks good.

components/backends/vllm/src/dynamo/vllm/main.py (5)

75-83: Signal handlers + runtime shutdown path look solid

Registering SIGTERM/SIGINT and delegating to graceful_shutdown matches the new shutdown design.


85-93: Helpful lifecycle logging

The added debug logs around init paths will make phased shutdowns and stuck tasks easier to diagnose.


153-169: Prefill endpoint graceful=True is consistent with non-migratable prefill

Matches the stated behavior that prefill should drain quickly; logging around start/finish is useful.


220-238: KV event publisher wiring is conditioned correctly

Only constructed when prefix caching is enabled; IDs/ports derived from endpoint lease and DP rank look consistent with existing patterns.


269-272: No None guard needed for migration_limit
The dataclass in components/backends/vllm/src/dynamo/vllm/args.py line 49 sets migration_limit: int = 0, and it’s always assigned from args.migration_limit, so config.migration_limit can’t be None.

tests/planner/perf_test_configs/disagg_8b_planner.yaml (3)

176-196: Good: direct python entrypoint (no shell wrapper).

This ensures SIGTERM reaches the worker process and matches the PR’s stated fix.


231-252: Good: prefill worker uses direct python entrypoint and long TGP.

Consistent with graceful shutdown goals.


11-16: Add Planner scrape config
Prometheus only scrapes localhost:8000 and the frontend—append
{"job_name":"planner","static_configs":[{"targets":["vllm-disagg-planner-planner:9085"]}]}
to the DYNAMO_SERVICE_CONFIG JSON so it pulls planner metrics. Confirm your controller’s service DNS if it differs.

lib/runtime/src/utils.rs (1)

18-18: LGTM: clean module add + re-export.

Public surface is minimal and discoverable via utils root.

Also applies to: 27-28

lib/runtime/src/distributed.rs (1)

28-29: LGTM: expose GracefulShutdownTracker via DRT.

Keeps tracker centralized while allowing distributed components to coordinate shutdown.

Also applies to: 264-266

lib/runtime/src/lib.rs (1)

279-279: Good switch to tokio::sync::Mutex for async path

Using Tokio’s Mutex here avoids accidental blocking in async contexts. LGTM.

lib/runtime/src/runtime.rs (3)

29-30: Import of GracefulShutdownTracker is correct

Matches the new Runtime field and accessors.


47-49: Endpoint token derived from main token — good

This enables phased cancel: endpoints first, infra later. No issues.


116-120: Verify downstream cancellation timing

child_token() now derives from endpoint_shutdown_token, so all child tokens will cancel on endpoint quiesce (Phase 1) rather than only at full shutdown. Confirm that each consumer (e.g. system_status_server.rs, worker.rs, worker_monitor.rs, transports/zmq.rs, distributed.rs, etc.) is intended to stop at quiesce.

lib/runtime/src/utils/graceful_shutdown.rs (1)

19-23: Solid, minimal tracker structure

Public type with private fields is appropriate; interior mutability via AtomicUsize/Notify removes the need for external locks.

lib/runtime/src/component/endpoint.rs (4)

17-17: Good addition: explicit CancellationToken import

This aligns with the new coordinated shutdown flow.


121-129: Pre-extracting values to avoid borrow/move issues is solid

Reduces capture complexity and avoids accidental borrows inside spawned tasks.


162-168: Passing the cancellation token into PushEndpoint is the right integration point

Keeps cancellation concerns centralized.


204-209: Using precomputed identity/transport fields for discovery is clean

Avoids re-deriving these inside the async task and keeps etcd payload stable.

@tedzhouhk tedzhouhk merged commit 7da510c into main Sep 3, 2025
11 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/sla-test branch September 3, 2025 00:05
tedzhouhk added a commit that referenced this pull request Sep 3, 2025
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025