chore: many bug fixes and improvements when testing planner #2776
Conversation
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
force-pushed from 177de7f to da108c0
hhzhang16 left a comment:
Approved regarding planner changes only
Walkthrough

Changes span deployments, planner logic, benchmarking utilities, runtime shutdown/cancellation plumbing (Rust), and backend workers (vLLM/SGLang). Highlights: a new graceful-shutdown tracker and cancellation-token layering in runtime/endpoint, updated YAML invocations, a timestamp cast to int, throughput-based planner calculations, adjusted model deletion behavior, and added perf test manifests/docs.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Admin as Admin/Orchestrator
participant RT as Runtime
participant GST as GracefulShutdownTracker
participant EP as Endpoint Task
participant PE as PushEndpoint
Note over RT,GST: Startup
Admin->>RT: start()
RT->>RT: create endpoint_shutdown_token (child of main token)
RT->>GST: new()
RT->>EP: spawn with cancellation_token (lease ∨ runtime)
EP->>GST: register_endpoint()
EP->>PE: PushEndpoint::builder().cancellation_token(token).start()
Note over RT,EP: Shutdown (3 phases)
Admin->>RT: shutdown()
RT->>RT: cancel endpoint_shutdown_token (stop new work)
RT->>GST: wait_for_completion() (Phase 2)
par Endpoint completes
PE-->>EP: cancellation signal
EP->>PE: stop().await
EP->>GST: unregister_endpoint()
and Possibly others
EP-->>GST: unregister_endpoint()
end
GST-->>RT: completion when count==0
RT->>RT: cancel main token (NATS/ETCD) (Phase 3)
RT-->>Admin: shutdown complete
sequenceDiagram
autonumber
participant V as vLLM Decode Worker
participant SE as serve_endpoint
participant C as Config
V->>C: read migration_limit
alt migration_limit <= 0
V->>SE: serve_endpoint(graceful_shutdown=true)
else
V->>SE: serve_endpoint(graceful_shutdown=false)
end
SE-->>V: runs until cancelled
sequenceDiagram
autonumber
participant W as ModelWatcher
participant Store as Model Registry
participant Bus as Update Stream
W->>Store: on delete(model)
alt active instances exist
W-->>Store: return Ok(None)
Note right of W: No per-type removals or updates emitted
else
W->>Store: remove per-type models
W->>Bus: emit ModelUpdate::Removed
end
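To make the three-phase flow above concrete, here is a minimal, self-contained sketch of the tracker pattern the diagrams describe (an atomic counter plus tokio::sync::Notify). Names and structure are illustrative only, not the actual dynamo implementation:

```rust
use std::pin::pin;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

struct ShutdownTracker {
    active: AtomicUsize,
    notify: Notify,
}

impl ShutdownTracker {
    fn register(&self) {
        self.active.fetch_add(1, Ordering::SeqCst);
    }

    fn unregister(&self) {
        // Wake the waiter when the last endpoint finishes.
        if self.active.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.notify.notify_waiters();
        }
    }

    async fn wait_for_completion(&self) {
        loop {
            // Register as a waiter before re-checking the count so a concurrent
            // notify_waiters() cannot be missed.
            let mut notified = pin!(self.notify.notified());
            notified.as_mut().enable();
            if self.active.load(Ordering::SeqCst) == 0 {
                return;
            }
            notified.await;
        }
    }
}

#[tokio::main]
async fn main() {
    let tracker = Arc::new(ShutdownTracker {
        active: AtomicUsize::new(0),
        notify: Notify::new(),
    });

    // Phase 1 (not shown): cancel the endpoint token so no new work is accepted.

    // Phase 2: wait for in-flight endpoint work to drain.
    let t = tracker.clone();
    t.register();
    tokio::spawn(async move {
        tokio::time::sleep(std::time::Duration::from_millis(50)).await; // in-flight request
        t.unregister();
    });
    tracker.wait_for_completion().await;

    // Phase 3 (not shown): cancel the main token to tear down NATS/ETCD connections.
    println!("all endpoints drained");
}
```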
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
components/planner/src/dynamo/planner/utils/load_predictor.py (2)
121-127: Bug: step_size is ignored; timestamps advance by 1s regardless.

Prophet inputs should reflect the configured cadence. Use step_size when building ds.
Apply:
@@
-        self.step_size = step_size
+        self.step_size = step_size
@@
-        timestamp = self.start_date + timedelta(seconds=self.curr_step)
+        timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
@@
-        next_timestamp = self.start_date + timedelta(seconds=self.curr_step)
+        next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)

Also applies to: 131-132, 165-167
158-170: Harden Prophet path: guard fit/predict and fall back to the last value.

Prophet can fail on edge windows; ARIMA already has a safe fallback, so mirror that here.
Apply:
-        model = Prophet()
-
-        # Fit the model
-        model.fit(df)
-
-        # Create future dataframe for next timestamp
-        next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
-        future_df = pd.DataFrame({"ds": [next_timestamp]})
-
-        # Make prediction
-        forecast = model.predict(future_df)
-        return forecast["yhat"].iloc[0]
+        model = Prophet()
+        try:
+            model.fit(df)
+            next_timestamp = self.start_date + timedelta(seconds=self.curr_step * self.step_size)
+            future_df = pd.DataFrame({"ds": [next_timestamp]})
+            forecast = model.predict(future_df)
+            return forecast["yhat"].iloc[0]
+        except Exception as e:
+            logger.warning(f"Prophet prediction failed: {e}, using last value")
+            return self.get_last_value()

tests/planner/perf_test_configs/disagg_8b_tp2.yaml (1)
1-148: Remove remaining shell-wrapped commands
Tests and Helm values still invoke containers via /bin/sh -c. Update them to call the binaries directly, for example in:
- tests/planner/perf_test_configs/disagg_8b_tp2.yaml (lines 25–26, 44–45, 94–95, 144–145)
- deploy/inference-gateway/helm/dynamo-gaie/values.yaml (line 80)

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)
59-75: Prioritize cancellation in biased select.

With biased;, earlier branches win. Place cancelled() before endpoint.next() to avoid accepting new requests when cancel fires.

-        let req = tokio::select! {
-            biased;
-
-            // await on service request
-            req = endpoint.next() => {
-                req
-            }
-
-            // process shutdown
-            _ = self.cancellation_token.cancelled() => {
+        let req = tokio::select! {
+            biased;
+            // process shutdown
+            _ = self.cancellation_token.cancelled() => {
                 tracing::info!("PushEndpoint received cancellation signal, shutting down service");
                 if let Err(e) = endpoint.stop().await {
                     tracing::warn!("Failed to stop NATS service: {:?}", e);
                 }
                 break;
-            }
+            }
+            // await on service request
+            req = endpoint.next() => {
+                req
+            }
         };
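If it helps to see the `biased` semantics in isolation, here is a tiny self-contained demo (illustrative only, not dynamo code): with `biased;`, branches are polled top to bottom, so when both are ready the branch listed first wins.

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    token.cancel(); // shutdown already requested

    let (tx, mut rx) = tokio::sync::mpsc::channel::<u32>(1);
    tx.send(42).await.unwrap(); // a request is also ready

    // Both branches are ready; `biased` polls them in listed order,
    // so putting cancellation first guarantees no new request is accepted.
    tokio::select! {
        biased;
        _ = token.cancelled() => println!("shutdown wins"),
        Some(req) = rx.recv() => println!("request {req} accepted"),
    }
}
```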
lib/runtime/src/component/endpoint.rs (1)

213-226: Ensure cleanup on etcd registration failure (cancel and await the task)

Currently you cancel but return early without awaiting the spawned endpoint task, which can leave resources alive longer and delay the tracker decrement. Await the task after cancel for deterministic cleanup.

-    if let Some(etcd_client) = &etcd_client
-        && let Err(e) = etcd_client
-            .kv_create(
-                &etcd_path,
-                info,
-                Some(lease_id),
-            )
-            .await
-    {
-        tracing::error!("Failed to register discoverable service: {:?}", e);
-        cancel_token.cancel();
-        return Err(error!("Failed to register discoverable service"));
-    }
-    task.await??;
+    // etcd registration: if it fails, cancel and ensure the endpoint task exits before returning
+    if let Some(etcd_client) = &etcd_client {
+        if let Err(e) = etcd_client
+            .kv_create(&etcd_path, info, Some(lease_id))
+            .await
+        {
+            tracing::error!("Failed to register discoverable service at {}: {:?}", etcd_path, e);
+            cancel_token.cancel();
+            // Ensure graceful cleanup and tracker unregistration
+            let _ = task.await;
+            return Err(error!("Failed to register discoverable service"));
+        }
+    }
+    task.await??;
♻️ Duplicate comments (2)
lib/runtime/src/lib.rs (1)
67-68: Tracker moved out of lib.rs; import looks right

Importing GracefulShutdownTracker from utils resolves earlier feedback about file placement. No concerns.
lib/runtime/src/component/endpoint.rs (1)
118-151: Lease/runtime combined cancellation is correct and lightweight

Using tokio::select! on the secondary runtime matches prior feedback and avoids blocking primary workers.
🧹 Nitpick comments (40)
tests/planner/README.md (7)
48-58: Fix TTFT inconsistency in example output.

The command now passes --ttft 0.2, but the "output" header still shows TTFT=0.1s. Update for consistency.

- TTFT=0.1s, ITL=0.01s
+ TTFT=0.2s, ITL=0.01s
70-88: Align narrative with new 5–45 rps dataset.

The text says "varying between 12 to 36 request/s" but the commands and description use 5–45. Update the sentence to avoid confusing readers.

- To test planner's performance for different request rates, we can generate a load dataset with request rate varying between 12 to 36 request/s.
+ To test planner's performance for different request rates, we generate a load dataset with request rate varying between 5 and 45 request/s.

Also confirm that the intended scale target ("planner should scale between 1P1D and 3P3D") remains valid with the wider 5–45 rps range.
108-120: Minor wording: prefer "dry-run" as an adjective.

"SLA planner" is used as a noun; hyphenate "dry-run" for clarity.

- For example, to dry run SLA planner for the previous FP8 8B on H200 using the generated `rr-5-45_i3000o300.jsonl` dataset,
+ For example, to dry-run the SLA planner for the previous FP8 8B on H200 using the generated `rr-5-45_i3000o300.jsonl` dataset,
142-143: Capitalize "SLA planner".

Maintain consistent capitalization across the doc.

- 3. **End-to-End Perf Tests** (see instructions below) - Compare performance (goodput and goodput/GPU) on deployments with and without sla planner
+ 3. **End-to-End Perf Tests** (see instructions below) - Compare performance (goodput and goodput/GPU) on deployments with and without SLA planner
194-206: Spelling/wording nits ("dryrun" → "dry run").

Minor grammar/polish for professionalism.

- In this test, we compare performance (goodput and goodput/GPU) on deployments on the following four deployments using the aforementioned 8b FP8 model on H200 and the dataset used in dryrun:
+ In this test, we compare performance (goodput and goodput/GPU) across the following four deployments using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
211-216: Capitalize "SLA planner" and remove the hyphen.

Use the same form throughout the doc.

- When running deployment with sla-planner, to reduce the image pulling time, deploy a `DaemonSet` to cache the image in advance:
+ When running a deployment with the SLA planner, to reduce image pull time, deploy a `DaemonSet` to cache the image in advance:
232-243: Context for result reproducibility.

Add a short note stating the test date, genai-perf version, model build/commit, and cluster specs (GPU type/count, driver, CUDA, NCCL) used to produce the figure/table so others can reproduce the results.
Happy to draft a “Repro Environment” subsection if you share the exact versions.
components/planner/src/dynamo/planner/utils/planner_core.py (1)
299-311: Cache decode engine cap and guard against zero/invalid capacity from the interpolator

Minor: reuse the computed capacity in logging. Optional: add a defensive guard in case the interpolator ever returns 0/NaN, to avoid division by zero.
Apply:
-        pred_decode_throughput = next_num_req * next_osl / self.args.adjustment_interval
-        next_num_d = math.ceil(
-            pred_decode_throughput
-            / pred_decode_thpt_per_gpu
-            / self.args.decode_engine_num_gpu
-        )
-
-        logger.info(
-            f"Decode calculation: {pred_decode_throughput:.2f}(d_thpt) / "
-            f"{pred_decode_thpt_per_gpu * self.args.decode_engine_num_gpu:.2f}(d_engine_cap) = "
-            f"{next_num_d}(num_d)"
-        )
+        pred_decode_throughput = next_num_req * next_osl / self.args.adjustment_interval
+        d_engine_cap = pred_decode_thpt_per_gpu * self.args.decode_engine_num_gpu
+        if not math.isfinite(d_engine_cap) or d_engine_cap <= 0:
+            logger.warning("Decode engine capacity <= 0; defaulting to min endpoint to avoid div-by-zero")
+            next_num_d = self.args.min_endpoint
+        else:
+            next_num_d = math.ceil(pred_decode_throughput / d_engine_cap)
+
+        logger.info(
+            f"Decode calculation: {pred_decode_throughput:.2f}(d_thpt) / "
+            f"{d_engine_cap:.2f}(d_engine_cap) = {next_num_d}(num_d)"
+        )

components/planner/src/dynamo/planner/utils/load_predictor.py (2)
45-47: Use pd.isna for broader NaN/None handling.

math.isnan misses None and some numpy dtypes. pd.isna handles NaN/None robustly.
Apply:
-        if math.isnan(value):
+        if pd.isna(value):
             value = 0
133-136: De-duplicate idle-skip logic between Base and Prophet.

Prophet.add_data_point reimplements NaN→0 and the initial-zero skip. Factor this into a Base helper to keep one source of truth.
Example:
# In BasePredictor
    def _normalize_or_skip(self, value):
        v = 0 if pd.isna(value) else value
        return None if len(self.data_buffer) == 0 and v == 0 else v

Then:

# BasePredictor.add_data_point
-        if pd.isna(value):
-            value = 0
-        if len(self.data_buffer) == 0 and value == 0:
-            return
-        else:
-            self.data_buffer.append(value)
+        v = self._normalize_or_skip(value)
+        if v is None:
+            return
+        self.data_buffer.append(v)

# ProphetPredictor.add_data_point
-        value = 0 if math.isnan(value) else value
-        if len(self.data_buffer) == 0 and value == 0:
-            return
-        self.data_buffer.append({"ds": timestamp, "y": value})
+        v = self._normalize_or_skip(value)
+        if v is None:
+            return
+        self.data_buffer.append({"ds": timestamp, "y": v})

benchmarks/sin_load_generator/sin_synth.py (2)
101-106: Fix typo in comment: "poison" → "Poisson"

Minor doc correctness nit.

- # from a poison distribution with the following parameters:
+ # from a Poisson distribution with the following parameters:
54-54: Make ms timestamp truncation explicit

At benchmarks/sin_load_generator/sin_synth.py:54, either use math.floor(t_req * 1000) with a comment like # floored to ms or, if you keep int(), update the comment to # truncates to ms to clarify intent.

components/backends/vllm/deploy/disagg_planner.yaml (1)
124-130: Direct python exec LGTM; add readiness probe for steady-state gating

StartupProbe is present, but no readinessProbe. Add an HTTP readiness probe on /health so the Service only routes after the worker is ready.

   mainContainer:
     startupProbe:
       httpGet:
         path: /health
         port: 9090
       periodSeconds: 10
       failureThreshold: 60
+    readinessProbe:
+      httpGet:
+        path: /health
+        port: 9090
+      periodSeconds: 10
+      timeoutSeconds: 2

tests/planner/perf_test_configs/image_cache_daemonset.yaml (1)
4-41: Harden the DaemonSet: run as non-root and drop privileges

Address CKV_K8S_20/23 by adding pod/container security contexts; this is low-risk for a sleep container and improves cluster posture.

   spec:
     selector:
       matchLabels:
         app: vllm-runtime-image-cache
     template:
       metadata:
         labels:
           app: vllm-runtime-image-cache
       spec:
+        securityContext:
+          runAsNonRoot: true
+          runAsUser: 65532
+          runAsGroup: 65532
+          fsGroup: 65532
         imagePullSecrets:
         - name: nvcr-imagepullsecret
         containers:
         - name: image-cache
           image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0902-01
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
           command:
           - /bin/sh
           - -c
           - "sleep infinity"
           resources:
             requests:
               cpu: "10m"
               memory: "64Mi"
             limits:
               cpu: "100m"
               memory: "128Mi"

components/backends/sglang/src/dynamo/sglang/main.py (1)
130-135: Typo in log message

"succesfully" → "successfully".

- logging.info("Metrics task succesfully cancelled")
+ logging.info("Metrics task successfully cancelled")

components/backends/sglang/deploy/disagg_planner.yaml (4)
126-127: Reassess --trust-remote-code usage

This flag is often required for Qwen, but it's a security footgun in shared clusters. Consider gating it behind an env toggle or documenting why it's safe in this image/namespace.
If Qwen3 requires it in your image, add a YAML comment noting the requirement to prevent later “hardening” PRs from breaking boot.
Also applies to: 157-158
56-66: Planner still uses /bin/sh -c; switch to exec form for consistency

Not critical (it's not the long-running model worker), but aligning everything to exec form reduces shutdown variability and avoids zombie shells.
Apply:
- command:
-   - /bin/sh
-   - -c
- args:
-   - >-
-     python3 -m planner_sla
-     --environment=kubernetes
-     --backend=sglang
-     --adjustment-interval=60
-     --profile-results-dir=/workspace/profiling_results
+ command: ["python3"]
+ args:
+   - -m
+   - planner_sla
+   - --environment=kubernetes
+   - --backend=sglang
+   - --adjustment-interval=60
+   - --profile-results-dir=/workspace/profiling_results
96-101: Prometheus sidecar: avoid the shell wrapper

Shell wrapping here is unnecessary. Use exec form.

- command:
-   - /bin/sh
-   - -c
- args:
-   - "python3 -m dynamo.planner.prometheus"
+ command: ["python3"]
+ args: ["-m", "dynamo.planner.prometheus"]
29-47: Liveness/readiness are an unconditional exit 0

These probes always pass and can mask real startup issues. Consider a lightweight HTTP health endpoint (or a short Python check) instead of exit 0.

Also applies to: 73-91
tests/planner/perf_test_configs/disagg_8b_2p2d.yaml (2)
93-101: Frontend still uses a shell wrapper; switch to exec form

Not strictly blocking, but aligns with the PR's shutdown goals and avoids a PID 1 shell.

- command:
-   - /bin/sh
-   - -c
- args:
-   - "python3 -m dynamo.frontend --http-port 8000 --kv-cache-block-size 128"
+ command: ["python3"]
+ args: ["-m", "dynamo.frontend", "--http-port", "8000", "--kv-cache-block-size", "128"]
53-67: Probe thresholds are generous; ensure terminationGracePeriodSeconds ≥ worst-case drain time

During scale-down, decode workers may take time to migrate/drain. Consider setting terminationGracePeriodSeconds to match your max OSL drain to avoid a K8s SIGKILL.

Also applies to: 103-117
tests/planner/perf_test_configs/agg_8b.yaml (1)
40-48: Frontend can also use exec form

Optional consistency tweak.

- command:
-   - /bin/sh
-   - -c
- args:
-   - "python3 -m dynamo.frontend --http-port 8000"
+ command: ["python3"]
+ args: ["-m", "dynamo.frontend", "--http-port", "8000"]

tests/planner/perf_test_configs/disagg_8b_3p1d.yaml (1)
40-48: Frontend: optional exec-form cleanup

For consistency across components.

- command:
-   - /bin/sh
-   - -c
- args:
-   - "python3 -m dynamo.frontend --http-port 8000 --kv-cache-block-size 128"
+ command: ["python3"]
+ args: ["-m", "dynamo.frontend", "--http-port", "8000", "--kv-cache-block-size", "128"]

tests/planner/perf_test_configs/disagg_8b_planner.yaml (2)
29-38: Avoid exec/curl+jq in readiness; prefer an httpGet probe.

Exec depends on curl/jq presence and is heavier. If /health returns non-200 when unready, switch to httpGet.

- readinessProbe:
-   exec:
-     command:
-       - /bin/sh
-       - -c
-       - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
+ readinessProbe:
+   httpGet:
+     path: /health
+     port: 8000
   initialDelaySeconds: 60
   periodSeconds: 60
   timeoutSeconds: 30
   failureThreshold: 10
131-140: Expose the Prometheus container port explicitly.

Not required, but improves introspection and service mapping.

  extraPodSpec:
    mainContainer:
      image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0902-01
      workingDir: /workspace/components/backends/vllm
+     ports:
+       - name: metrics
+         containerPort: 8000
      command:
        - /bin/sh
        - -c
      args:
        - "python3 -m dynamo.planner.prometheus"

tests/planner/perf_test_configs/disagg_8b_tp2.yaml (3)
83-91: Add a generous terminationGracePeriodSeconds for graceful drain (consistent with the other config).

  extraPodSpec:
+   terminationGracePeriodSeconds: 600
    mainContainer:
      startupProbe:

Also applies to: 133-141, 176-196, 231-252
54-60: Liveness is too aggressive (failureThreshold: 1).

A single transient failure can kill pods under load spikes. Use 3.

- failureThreshold: 1
+ failureThreshold: 3

Also applies to: 104-110
23-31: Frontend readiness: prefer httpGet over curl+jq.

Reduces image deps and probe overhead.

- readinessProbe:
-   exec:
-     command:
-       - /bin/sh
-       - -c
-       - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
+ readinessProbe:
+   httpGet:
+     path: /health
+     port: 8000
   initialDelaySeconds: 60
   periodSeconds: 60
   timeoutSeconds: 30
   failureThreshold: 10

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)
69-69: Enrich the cancellation log with context.

Include component/endpoint to aid triage.

- tracing::info!("PushEndpoint received cancellation signal, shutting down service");
+ tracing::info!(
+     component = %component_name_local,
+     endpoint = %endpoint_name_local,
+     "PushEndpoint received cancellation signal, shutting down service"
+ );
80-82: Minor wording fix in the warn log.

"suspect the requester has shut down" reads better.

- "Failed to respond to request; this may indicate the request has shutdown: {:?}",
+ "Failed to respond to request; requester may have shut down: {:?}",

lib/runtime/src/lib.rs (1)
87-89: Document the new shutdown fields

Add brief doc-comments to clarify the roles of the two tokens and the tracker.
Apply this diff:
  cancellation_token: CancellationToken,
- endpoint_shutdown_token: CancellationToken,
- graceful_shutdown_tracker: Arc<GracefulShutdownTracker>,
+ /// Token used to quiesce endpoints (stop accepting new work) during graceful shutdown.
+ endpoint_shutdown_token: CancellationToken,
+ /// Tracks in-flight endpoint work; awaited between endpoint quiesce and full runtime cancel.
+ graceful_shutdown_tracker: Arc<GracefulShutdownTracker>,

lib/runtime/src/runtime.rs (2)
34-36: Clean up unused imports

Ordering, OnceCell, signal, JoinHandle, and Mutex aren't used in this file. Trim to avoid warnings.
Apply this diff:
-use once_cell::sync::OnceCell;
-use std::sync::{Arc, atomic::Ordering};
-use tokio::{signal, task::JoinHandle, sync::Mutex};
+use std::sync::Arc;
126-156: Tighten the Phase 2 wait and fix a misleading comment

- The comment says "BEFORE cancelling tokens" but the cancel happens inside the spawned task.
- Call wait_for_completion unconditionally; it's cheap when count==0 and removes a race where the count hits zero just before the check.
Apply this diff:
-        // Spawn the shutdown coordination task BEFORE cancelling tokens
+        // Spawn the shutdown coordination task; tokens are cancelled within the task to enforce phase ordering
         let tracker = self.graceful_shutdown_tracker.clone();
         let main_token = self.cancellation_token.clone();
         let endpoint_token = self.endpoint_shutdown_token.clone();

         // Use the runtime handle to spawn the task
         let handle = self.primary();
         handle.spawn(async move {
             // Phase 1: Cancel endpoint shutdown token to stop accepting new requests
             tracing::info!("Phase 1: Cancelling endpoint shutdown token");
             endpoint_token.cancel();

             // Phase 2: Wait for all graceful endpoints to complete
             tracing::info!("Phase 2: Waiting for graceful endpoints to complete");
-
-            let count = tracker.get_count();
-            tracing::info!("Active graceful endpoints: {}", count);
-
-            if count != 0 {
-                tracker.wait_for_completion().await;
-            }
+            tracing::info!("Active graceful endpoints (at start of wait): {}", tracker.get_count());
+            tracker.wait_for_completion().await;

             // Phase 3: Now shutdown NATS/ETCD by cancelling the main token
             tracing::info!("Phase 3: All graceful endpoints completed, shutting down NATS/ETCD connections");
             main_token.cancel();
         });

lib/runtime/src/utils/graceful_shutdown.rs (4)
41-45: Use AcqRel for counter updates; reduce log verbosity

SeqCst is stronger than needed. AcqRel is sufficient here; consider trace-level logs for hot paths.
Apply this diff:
- pub(crate) fn register_endpoint(&self) {
-     let count = self.active_endpoints.fetch_add(1, Ordering::SeqCst);
-     tracing::debug!("Endpoint registered, total active: {} -> {}", count, count + 1);
- }
+ pub(crate) fn register_endpoint(&self) {
+     let count = self.active_endpoints.fetch_add(1, Ordering::AcqRel);
+     tracing::trace!("Endpoint registered, total active: {} -> {}", count, count + 1);
+ }
61-79: Lighter memory ordering in the wait loop

Acquire loads are sufficient when paired with AcqRel updates; this keeps the semantics while reducing constraints.
Apply this diff:
- let count = self.active_endpoints.load(Ordering::SeqCst);
+ let count = self.active_endpoints.load(Ordering::Acquire);
81-82: Remove a stray comment

The leftover comment is misleading.
Apply this diff:
- // This method is no longer needed since we can access the tracker directly
  }
41-55: Optional: RAII guard to ensure balanced register/unregister

Returning a guard from registration prevents leaks on early returns/panics; Drop handles the unregister.
// Add in this module:
pub(crate) struct EndpointGuard(Arc<GracefulShutdownTracker>);

impl Drop for EndpointGuard {
    fn drop(&mut self) {
        self.0.unregister_endpoint();
    }
}

impl GracefulShutdownTracker {
    pub(crate) fn register_guard(self: &Arc<Self>) -> EndpointGuard {
        self.register_endpoint();
        EndpointGuard(self.clone())
    }
}

Usage in endpoint code:

let guard = tracker.register_guard();
// do work; the guard's drop will unregister

lib/runtime/src/component/endpoint.rs (3)
118-151: Optional: factor token linking into a small helper

To reduce repetition and make the intent obvious, consider a helper like link_cancellation(lease_token, runtime_shutdown_token) -> CancellationToken.
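A possible shape for that helper (the name comes from the suggestion above and is hypothetical; the sketch assumes it is called from within a Tokio runtime):

```rust
use tokio_util::sync::CancellationToken;

/// Returns a token that fires when either the lease token or the
/// runtime shutdown token is cancelled. Sketch only.
fn link_cancellation(
    lease_token: CancellationToken,
    runtime_shutdown_token: CancellationToken,
) -> CancellationToken {
    let combined = CancellationToken::new();
    let out = combined.clone();
    tokio::spawn(async move {
        tokio::select! {
            _ = lease_token.cancelled() => {}
            _ = runtime_shutdown_token.cancelled() => {}
        }
        combined.cancel();
    });
    out
}
```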
181-199: Spawn on the intended runtime, or update the comment

The comment says "launch in primary runtime" but the code uses tokio::spawn. If available, prefer the explicit primary().spawn to match the intent.

- let task = tokio::spawn(async move {
+ let task = endpoint.drt().runtime().primary().spawn(async move {

If primary().spawn isn't available, change the comment to "launch on current runtime".
181-199: Optional: ensure the tracker unregisters even if the task panics

Consider an RAII guard (e.g., scopeguard) inside the spawned task so unregister executes on all exit paths, including panics.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
- tests/planner/figures/dryrun_plot.png is excluded by !**/*.png
- tests/planner/figures/sla_planner_perf.png is excluded by !**/*.png
📒 Files selected for processing (23)
- benchmarks/nixl/nixl-benchmark-deployment.yaml (1 hunks)
- benchmarks/sin_load_generator/sin_synth.py (1 hunks)
- components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
- components/backends/sglang/src/dynamo/sglang/main.py (1 hunks)
- components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
- components/backends/vllm/src/dynamo/vllm/main.py (4 hunks)
- components/planner/src/dynamo/planner/utils/load_predictor.py (2 hunks)
- components/planner/src/dynamo/planner/utils/planner_core.py (2 hunks)
- lib/llm/src/discovery/watcher.rs (0 hunks)
- lib/runtime/src/component/endpoint.rs (3 hunks)
- lib/runtime/src/distributed.rs (2 hunks)
- lib/runtime/src/lib.rs (4 hunks)
- lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1 hunks)
- lib/runtime/src/runtime.rs (4 hunks)
- lib/runtime/src/utils.rs (1 hunks)
- lib/runtime/src/utils/graceful_shutdown.rs (1 hunks)
- tests/planner/README.md (7 hunks)
- tests/planner/perf_test_configs/agg_8b.yaml (1 hunks)
- tests/planner/perf_test_configs/disagg_8b_2p2d.yaml (1 hunks)
- tests/planner/perf_test_configs/disagg_8b_3p1d.yaml (1 hunks)
- tests/planner/perf_test_configs/disagg_8b_planner.yaml (1 hunks)
- tests/planner/perf_test_configs/disagg_8b_tp2.yaml (1 hunks)
- tests/planner/perf_test_configs/image_cache_daemonset.yaml (1 hunks)
💤 Files with no reviewable changes (1)
- lib/llm/src/discovery/watcher.rs
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-06-03T15:26:55.732Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
Applied to files:
benchmarks/nixl/nixl-benchmark-deployment.yaml
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.
Applied to files:
lib/runtime/src/component/endpoint.rs
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
lib/runtime/src/lib.rs
📚 Learning: 2025-06-17T00:50:44.845Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Applied to files:
lib/runtime/src/lib.rs
🧬 Code graph analysis (7)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/runtime.rs (1)
graceful_shutdown_tracker(122-124)
components/planner/src/dynamo/planner/utils/planner_core.py (1)
components/planner/src/dynamo/planner/utils/perf_interpolation.py (2)
interpolate_thpt_per_gpu (52-54), interpolate_thpt_per_gpu (141-145)
components/backends/vllm/src/dynamo/vllm/main.py (5)
components/backends/trtllm/src/dynamo/trtllm/main.py (2)
init (105-325), graceful_shutdown (48-51)

examples/multimodal/components/worker.py (6)
init (434-466), generate (181-184), generate (199-228), generate (265-395), graceful_shutdown (398-407), clear_kv_blocks (186-191)

lib/bindings/python/rust/lib.rs (2)
serve_endpoint (522-548), generate (706-718)

lib/bindings/python/src/dynamo/_core.pyi (2)
serve_endpoint (220-230), generate (1222-1245)

components/backends/vllm/src/dynamo/vllm/handlers.py (4)
generate (40-41), generate (133-209), generate (216-255), clear_kv_blocks (43-48)
components/backends/sglang/src/dynamo/sglang/main.py (6)
components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (1)
generate (56-94)

components/backends/sglang/src/dynamo/sglang/request_handlers/handler_base.py (1)
generate (32-33)

components/backends/sglang/src/dynamo/sglang/request_handlers/prefill_handler.py (1)
generate (49-70)

components/backends/vllm/src/dynamo/vllm/main.py (1)
graceful_shutdown (54-63)

components/backends/trtllm/src/dynamo/trtllm/main.py (1)
graceful_shutdown (48-51)

lib/llm/src/local_model.rs (1)
migration_limit (150-153)
lib/runtime/src/utils/graceful_shutdown.rs (1)
lib/runtime/src/runtime.rs (1)
new(40-67)
lib/runtime/src/component/endpoint.rs (5)
lib/runtime/src/component.rs (9)
endpoint (229-236), subject (492-494), etcd_path (212-215), etcd_path (450-457), etcd_path (593-595), new (566-572), name (221-223), name (424-426), name (597-602)

lib/runtime/src/distributed.rs (2)
etcd_client (256-258), new (54-171)

lib/runtime/src/runtime.rs (1)
new (40-67)

lib/runtime/src/utils/graceful_shutdown.rs (1)
new (34-39)

lib/runtime/src/lib.rs (2)
new (107-126), new (211-216)
lib/runtime/src/runtime.rs (2)
lib/runtime/src/utils/graceful_shutdown.rs (1)
new (34-39)

lib/runtime/src/distributed.rs (4)
new (54-171), child_token (260-262), graceful_shutdown_tracker (264-266), shutdown (198-200)
🪛 Checkov (3.2.334)
tests/planner/perf_test_configs/image_cache_daemonset.yaml
[MEDIUM] 4-41: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[MEDIUM] 4-41: Minimize the admission of root containers
(CKV_K8S_23)
🪛 LanguageTool
tests/planner/README.md
[grammar] ~108-~108: There might be a mistake here.
Context: ...o_output_plot> ``` For example, to dry run SLA planner for the previous FP8 8B on ...
(QB_NEW_EN)
[grammar] ~196-~196: Ensure spelling is correct
Context: ...8 model on H200 and the dataset used in dryrun: - Config 1 with inefficient P/D ratio: 3xT...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~238-~238: There might be a mistake here.
Context: ... Improvement | Goodput/GPU Improvement | |---------------|-----------------|-----...
(QB_NEW_EN)
[grammar] ~239-~239: There might be a mistake here.
Context: ...-------------|-------------------------| | Inefficient P/D ratio | 725% | 600% | ...
(QB_NEW_EN)
[grammar] ~240-~240: There might be a mistake here.
Context: ... | Inefficient P/D ratio | 725% | 600% | | Inefficient parallelization mapping | ...
(QB_NEW_EN)
[grammar] ~241-~241: There might be a mistake here.
Context: ... parallelization mapping | 311% | 249% | | Best static deployment | 52% | 29% | ...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (25)
tests/planner/README.md (1)
220-230: Fix bash multiline comments and ensure URL includes scheme
- Move inline comments after backslashes to their own lines so continuations work as intended (stackoverflow.com, shellcheck.net)
- Prepend a protocol (e.g. http://localhost:8000) to --url if genai-perf requires an explicit scheme; please confirm the expected format (scrapfly.io, requests-furo.readthedocs.io)
48-53: Idle-period skip in Base looks good.

This cleanly trims initial zeros to ignore cold-start noise.
benchmarks/nixl/nixl-benchmark-deployment.yaml (1)
18-18: Secret rename looks good; please verify namespace presence

Confirm that the Secret "nvcr-imagepullsecret" exists in the target namespace used for this Deployment to avoid ImagePullBackOff at rollout time.
components/backends/vllm/deploy/disagg_planner.yaml (1)
126-130: Confirm the migration-limit default is disabled when the flag is removed

Planner-side request migration is being disabled in this PR; verify that the backend default for migration_limit is effectively 0 once the flag is removed. If not, set it explicitly to 0.
Optional explicit pin (only if needed):
   - --model
   - Qwen/Qwen3-0.6B
+  - --migration-limit
+  - "0"

Also applies to: 151-156
components/backends/sglang/deploy/disagg_planner.yaml (2)
114-131: Good move: switch workers to exec form for proper SIGTERM delivery

Using command: ["python3"] with explicit args eliminates the shell wrapper and ensures the Python PID receives SIGTERM directly.
145-162: Prefill worker uses exec form as well; consistent and correct

Same benefit for graceful shutdown and signal handling here. Looks good.
components/backends/vllm/src/dynamo/vllm/main.py (5)
75-83: Signal handlers + runtime shutdown path look solid

Registering SIGTERM/SIGINT and delegating to graceful_shutdown matches the new shutdown design.
85-93: Helpful lifecycle logging

The added debug logs around init paths will make phased shutdowns and stuck tasks easier to diagnose.
153-169: Prefill endpoint graceful=True is consistent with non-migratable prefill

Matches the stated behavior that prefill should drain quickly; the logging around start/finish is useful.
220-238: KV event publisher wiring is conditioned correctly

Only constructed when prefix caching is enabled; IDs/ports derived from the endpoint lease and DP rank look consistent with existing patterns.
269-272: No None guard needed for migration_limit

The dataclass in components/backends/vllm/src/dynamo/vllm/args.py line 49 sets migration_limit: int = 0, and it's always assigned from args.migration_limit, so config.migration_limit can't be None.

tests/planner/perf_test_configs/disagg_8b_planner.yaml (3)
176-196: Good: direct python entrypoint (no shell wrapper).

This ensures SIGTERM reaches the worker process and matches the PR's stated fix.
231-252: Good: prefill worker uses a direct python entrypoint and a long TGP.

Consistent with graceful shutdown goals.
11-16: Add a Planner scrape config

Prometheus only scrapes localhost:8000 and the frontend; append

{"job_name":"planner","static_configs":[{"targets":["vllm-disagg-planner-planner:9085"]}]}

to the DYNAMO_SERVICE_CONFIG JSON so it pulls planner metrics. Confirm your controller's service DNS if it differs.

lib/runtime/src/utils.rs (1)
18-18: LGTM: clean module add + re-export.

The public surface is minimal and discoverable via the utils root.
Also applies to: 27-28
lib/runtime/src/distributed.rs (1)
28-29: LGTM: expose GracefulShutdownTracker via DRT.

Keeps the tracker centralized while allowing distributed components to coordinate shutdown.
Also applies to: 264-266
lib/runtime/src/lib.rs (1)
279-279: Good switch to tokio::sync::Mutex for the async path

Using Tokio's Mutex here avoids accidental blocking in async contexts. LGTM.
lib/runtime/src/runtime.rs (3)
29-30: Import of GracefulShutdownTracker is correct

Matches the new Runtime field and accessors.
47-49: Endpoint token derived from the main token; good

This enables phased cancel: endpoints first, infra later. No issues.
116-120: Verify downstream cancellation timing

child_token() now derives from endpoint_shutdown_token, so all child tokens will cancel on endpoint quiesce (Phase 1) rather than only at full shutdown. Confirm that each consumer (e.g. system_status_server.rs, worker.rs, worker_monitor.rs, transports/zmq.rs, distributed.rs, etc.) is intended to stop at quiesce.
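For reference, cancellation in tokio-util propagates parent to child but not child to parent, which is what makes the phased quiesce possible. A tiny illustrative check (not dynamo code; token names mirror the PR's terminology):

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let main_token = CancellationToken::new();           // Phase 3: NATS/ETCD teardown
    let endpoint_shutdown = main_token.child_token();    // Phase 1: quiesce endpoints
    let per_endpoint = endpoint_shutdown.child_token();  // handed to individual consumers

    // Cancelling the endpoint token propagates down to its children...
    endpoint_shutdown.cancel();
    assert!(per_endpoint.is_cancelled());
    // ...but not up to the main token, so NATS/ETCD stay alive for Phase 2.
    assert!(!main_token.is_cancelled());
}
```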
lib/runtime/src/utils/graceful_shutdown.rs (1)
19-23: Solid, minimal tracker structure

A public type with private fields is appropriate; interior mutability via AtomicUsize/Notify removes the need for external locks.
lib/runtime/src/component/endpoint.rs (4)
17-17: Good addition: explicit CancellationToken import

This aligns with the new coordinated shutdown flow.
121-129: Pre-extracting values to avoid borrow/move issues is solid

Reduces capture complexity and avoids accidental borrows inside spawned tasks.
162-168: Passing the cancellation token into PushEndpoint is the right integration point

Keeps cancellation concerns centralized.
204-209: Using precomputed identity/transport fields for discovery is clean

Avoids re-deriving these inside the async task and keeps the etcd payload stable.
graph TD
    A["Main CancellationToken<br/>(controls NATS/ETCD)"] --> B["endpoint_shutdown_token<br/>(child of main)"]
    B --> C["Endpoint Token 1<br/>(no lease)"]
    B --> D["Endpoint Token 2<br/>(no lease)"]
    B --> E["Runtime Shutdown Token<br/>(for endpoint with lease)"]
    F["ETCD Lease Token"] --> G["Combined Token<br/>(cancelled if either parent cancels)"]
    E --> G
    style A fill:#ff9999
    style B fill:#ffcc99
    style G fill:#99ccff
    A -.->|"Phase 3: Cancels to<br/>shutdown NATS/ETCD"| A
    B -.->|"Phase 1: Cancels to stop<br/>accepting new requests"| B
    F -.->|"Cancels on<br/>lease expiry"| F

graph TD
    A[Main Token] --> B[endpoint_shutdown_token]
    A --> C[Lease Token]
    B --> D[runtime_shutdown_token]
    C --> E[lease_token]
    D --> F{Has Lease?}
    E --> F
    F -->|Yes| G[Combined Token<br/>Monitors both]
    F -->|No| H[Just runtime token]
    G --> I[Endpoint with lease]
    H --> J[Endpoint without lease]
    K[runtime.shutdown] -->|cancels| B
    L[Lease expires] -->|cancels| C
    B -->|triggers| G
    C -->|triggers| G
    style A fill:#f96,stroke:#333,stroke-width:4px
    style B fill:#69f,stroke:#333,stroke-width:4px
    style C fill:#9f6,stroke:#333,stroke-width:4px
    style G fill:#fcf,stroke:#333,stroke-width:3px

sequenceDiagram
    participant Signal
    participant Python Worker
    participant Runtime
    participant Tracker
    participant Endpoints
    participant NATS/ETCD
    Signal->>Python Worker: SIGTERM/SIGINT
    Python Worker->>Runtime: runtime.shutdown()
    Note over Runtime: Spawn shutdown coordination task
    rect rgb(255, 235, 205)
        Note over Runtime: Phase 1: Stop accepting new requests
        Runtime->>Endpoints: Cancel endpoint_shutdown_token
        Endpoints->>Endpoints: Stop accepting new requests<br/>(but keep processing in-flight)
    end
    rect rgb(205, 235, 255)
        Note over Runtime: Phase 2: Wait for graceful endpoints
        Runtime->>Tracker: Check active endpoints count
        loop While active_endpoints > 0
            Note over NATS/ETCD: Still alive for D→P calls
            Endpoints->>Endpoints: Complete in-flight requests
            Endpoints->>Tracker: unregister_endpoint()
            Tracker->>Runtime: notify_waiters()
            Runtime->>Tracker: Recheck count
        end
        Tracker->>Runtime: All endpoints completed (count = 0)
    end
    rect rgb(255, 205, 205)
        Note over Runtime: Phase 3: Shutdown infrastructure
        Runtime->>NATS/ETCD: Cancel main cancellation_token
        NATS/ETCD->>NATS/ETCD: Shutdown connections
    end
    Python Worker->>Signal: Process exits cleanly

Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Tests
Documentation
Chores