Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Sep 22, 2025

Overview:

revamp kubernetes doc

Summary by CodeRabbit

  • Documentation
    • Revamped Kubernetes docs: updated all links to Kubernetes-specific guides (API Reference, Operator, Create Deployment).
    • Expanded backend deployment tables to include Disaggregated Multi-node for vLLM, TensorRT-LLM, and SGLang.
    • Installation guide: added Helm customization via values.yaml, clarified multinode requirements (enable Grove and Kai Scheduler; not installed by default), and corrected helm command to use install.
    • Renamed terminology from PodGangSet to PodCliqueSet across Grove docs.
    • Expanded Additional Resources under /docs/kubernetes/.
    • Minor formatting improvements.

@pull-request-size pull-request-size bot added size/M and removed size/L labels Sep 22, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 22, 2025

Walkthrough

Documentation updates across Kubernetes guides: retargeted links to /docs/kubernetes, added disaggregated multi-node entries for several backends, introduced Helm customization guidance, clarified multinode prerequisites (Grove and Kai Scheduler), adjusted an install command, and standardized Grove terminology from PodGangSet to PodCliqueSet.

Changes

Cohort / File(s) Summary of Changes
Kubernetes docs link retargeting
docs/kubernetes/README.md, docs/kubernetes/dynamo_operator.md, docs/kubernetes/fluxcd.md
Updated internal and cross-document links from generic guides to Kubernetes-specific paths; adjusted navigation (API Reference, Operator, Create Deployment) to /docs/kubernetes/...; aligned FluxCD prerequisites/steps with Kubernetes installation guide.
Backend deployment matrix updates
docs/kubernetes/README.md
Expanded deployment tables to include “Disaggregated Multi-node” for vLLM, TensorRT-LLM, and SGLang.
Grove terminology standardization
docs/kubernetes/grove.md
Renamed occurrences of PodGangSet to PodCliqueSet across headers and descriptions, including scheduling and scaling references.
Installation guide enhancements
docs/kubernetes/installation_guide.md
Added Helm customization via values.yaml (-f/--set); clarified multinode requirements (enable Grove and Kai Scheduler; not installed by default); noted install options; changed platform command from helm upgrade --install to helm install; formatting adjustments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Poem

In clusters where the pods align,
I thump my paws in tidy time—
Links now hop to K8s lanes,
Grove’s cliques replace old gang names.
Helm flags flutter, values set—
Multi-node dreams? We’re ready, bet!
— A cheerful ops-savvy rabbit 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description only contains a one-line Overview ("revamp kubernetes doc") and does not follow the repository's required template: it is missing the Details section, "Where should the reviewer start?", and Related Issues entries. Given the changes touch multiple docs (installation guide, README, API/operator links, terminology rename, etc.), the current description is insufficient for reviewers to understand scope, rationale, and review hotspots. This makes the description largely incomplete relative to the template. Please expand the PR description to follow the repository template by adding a Details section summarizing concrete file-level changes and rationale, a "Where should the reviewer start?" section that lists key files to review (e.g., docs/kubernetes/installation_guide.md, README.md, grove.md), and any related issue references or action keywords; also call out notable changes such as the PodGangSet→PodCliqueSet rename and the Helm command change so reviewers can validate migration impacts. After updating the description the check can be re-evaluated.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat: revamp kubernetes doc" is concise and accurately reflects the primary change in the changeset, which updates and reorganizes multiple files under docs/kubernetes per the PR summary. It follows conventional-commit style and clearly signals the main intent for reviewers scanning history. For these reasons the title meets the repository's expectations for a short, focused PR title.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/kubernetes/installation_guide.md (1)

68-82: Standardize NAMESPACE across Kubernetes docs (choose one canonical value or a single placeholder)

Multiple docs export different NAMESPACE values — consolidate to a single namespace or consistent placeholder to avoid copy/paste "not found" errors. Occurrences to update:

  • docs/kubernetes/installation_guide.md — line 72: export NAMESPACE=dynamo-kubernetes; line 111: export NAMESPACE=dynamo-cloud
  • docs/kubernetes/README.md — line 26: export NAMESPACE=dynamo-kubernetes; line 55: export NAMESPACE=dynamo-cloud
  • docs/kubernetes/metrics.md — line 56: export NAMESPACE=dynamo
  • docs/kubernetes/fluxcd.md — line 70: export NAMESPACE=
  • docs/kubernetes/sla_planner_deployment.md — line 32: export NAMESPACE=your-namespace
  • docs/kubernetes/gke_setup.md — line 15: export NAMESPACE=your-k8s-namespace

Choose a canonical value or a single clear placeholder and update the listed files (and top-level README) accordingly.

🧹 Nitpick comments (11)
docs/kubernetes/grove.md (2)

42-46: Minor style nit: tighten phrasing.

“within a PodCliqueSet” is implied; consider trimming for brevity.

- PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
+ PodCliques and PodCliqueScalingGroups allow flexible gang-scheduling at multiple levels in a PodCliqueSet to prevent deadlocks and ensure all components start together.

90-95: Standardize "KAI" vs "Kai" capitalization across docs

Mixed usage found: docs/kubernetes/grove.md uses "KAI Scheduler" (lines 90, 92, 100) while docs/kubernetes/installation_guide.md uses "Kai Scheduler" (lines 85–88). Pick one form and make all occurrences consistent — recommend "KAI Scheduler" to match the linked NVIDIA repo.

docs/kubernetes/fluxcd.md (3)

21-22: Fix grammar in Step 1 line.

-First, follow to [See Install Dynamo Cloud](/docs/kubernetes/installation_guide.md).
+First, follow the [Dynamo Kubernetes Platform Installation Guide](/docs/kubernetes/installation_guide.md).

3-3: Align example link path with backend docs layout.

Other docs reference backend deploy READMEs under deploy/. Update for consistency.

-This section describes how to use FluxCD ... We'll use the [aggregated vLLM example](../../../components/backends/vllm/README.md) to demonstrate the workflow.
+This section describes how to use FluxCD ... We'll use the [aggregated vLLM example](../../../components/backends/vllm/deploy/README.md) to demonstrate the workflow.

73-74: Use canonical kubectl resource name (lowercase plural).

Kubernetes CLI commonly uses plural, lowercase resource names for CRDs.

-kubectl get dynamographdeployment llm-agg -n $NAMESPACE
+kubectl get dynamographdeployments llm-agg -n $NAMESPACE
docs/kubernetes/dynamo_operator.md (1)

30-31: Likely broken anchor (#overview).

installation_guide.md doesn’t define an “Overview” header. Link to the guide root or an existing section.

-[See installation steps](/docs/kubernetes/installation_guide.md#overview)
+[See installation steps](/docs/kubernetes/installation_guide.md)
docs/kubernetes/installation_guide.md (3)

35-47: Helm customization note: clarify multi-file and type handling.

Small clarity gains: mention multiple -f files and --set-string for literal strings.

-All helm install commands could be overridden by either setting the values.yaml file or by passing in your own values.yaml:
+You can customize any helm install by supplying one or more values files or inline flags:
@@
-helm install ...
-  -f your-values.yaml
+helm install ... \
+  -f your-values.yaml \   # can be repeated multiple times
@@
-  --set "your-value=your-value"
+  --set "your-value=your-value" \
+  --set-string "someString=00123"   # preserve leading zeros, etc.

84-92: Nice: explicit flags to enable Grove and Kai. Add a full command example.

Including the flags within a full helm command reduces user error.

+Example:
+```bash
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace ${NAMESPACE} --create-namespace \
+  --set "grove.enabled=true" \
+  --set "kai-scheduler.enabled=true"
+```

200-211: Tiny wording/case nit in troubleshooting.

-just add the following to the helm install command:
+Just add the following to the helm install command:
docs/kubernetes/README.md (2)

55-56: Namespace mismatch with earlier section.

Earlier you export NAMESPACE=dynamo-kubernetes; here it’s dynamo-cloud. Pick one.

-export NAMESPACE=dynamo-cloud
+export NAMESPACE=dynamo-kubernetes

61-66: Use canonical kubectl resource name (lowercase plural).

Mirror the suggestion made in fluxcd.md for consistency.

-kubectl get dynamoGraphDeployment -n ${NAMESPACE}
+kubectl get dynamographdeployments -n ${NAMESPACE}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51b4cd7 and b82f998.

📒 Files selected for processing (5)
  • docs/kubernetes/README.md (3 hunks)
  • docs/kubernetes/dynamo_operator.md (1 hunks)
  • docs/kubernetes/fluxcd.md (2 hunks)
  • docs/kubernetes/grove.md (2 hunks)
  • docs/kubernetes/installation_guide.md (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Build and Test - sglang
🔇 Additional comments (8)
docs/kubernetes/grove.md (1)

22-27: Terminology update looks consistent.

PodCliqueSet description reads well and aligns with the rest of the doc.

docs/kubernetes/fluxcd.md (1)

7-7: Good: prerequisite now points to the Kubernetes install guide.

docs/kubernetes/dynamo_operator.md (1)

26-27: API reference link update: LGTM.

docs/kubernetes/installation_guide.md (1)

145-149: Switch to helm install: LGTM.

Install path reads well and matches earlier steps.

docs/kubernetes/README.md (4)

39-39: Good cross-link to the Installation Guide.


47-50: Backend table additions: LGTM.

The multi‑node options look consistent across backends.


76-85: Kubernetes docs links: LGTM.


169-176: Verify Helm charts path; both deploy/helm/README.md and deploy/cloud/helm/platform/README.md exist

Both files are present in the repo; installation_guide refers to the cloud path — update docs/kubernetes/README.md to point to /deploy/cloud/helm/platform/README.md for consistency, or confirm the canonical location.

- **[Helm Charts](/deploy/helm/README.md)** - For advanced users
+ **[Helm Charts](/deploy/cloud/helm/platform/README.md)** - For advanced users

@athreesh
Copy link
Contributor

I think CI caught some broken links. also, can you grep for docs/guides/dynamo_deploy across the codebase to see if there's any lingering references? also left a comment re: bitnami

otherwise, awesome changes & LGTM! i think this will help us quite a bit :)

Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
@julienmancuso julienmancuso merged commit 4a71802 into main Sep 23, 2025
13 of 14 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-406 branch September 23, 2025 20:09
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
kylehh pushed a commit that referenced this pull request Sep 25, 2025
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Signed-off-by: Kyle H <kylhuang@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants