feat: add grove multinode support #2269

julienmancuso · 2025-08-04T15:51:03Z

Overview:

Add Grove multinode support.

Summary by CodeRabbit

New Features
- Added support for a new backend framework option, "trtllm", for deployments.
- Introduced multinode deployment capabilities with explicit leader and worker roles for supported backends.
- Added ability to specify node allocation via a new "nodes" resource field in deployment specifications.
- Enhanced backend framework detection and modular backend customization for SGLang, VLLM, and TRTLLM.
- Improved resource and pod specification handling for distributed workloads, including startup dependencies.
Bug Fixes
- Improved merging of user-provided metadata and resource specifications in deployment generation.
Tests
- Added comprehensive unit tests for multinode backend logic and resource specification handling.
Chores
- Updated dependency versions and chart metadata.

coderabbitai · 2025-08-04T16:02:49Z

Walkthrough

This change introduces multinode deployment support for three backend frameworks (SGLang, VLLM, and TRTLLM) across the Dynamo operator, CRDs, and Helm charts. It adds multinode role abstractions, explicit startup dependencies, modular backend handling, and new resource fields (e.g., nodes). The update also includes backend-specific logic, refactored pod spec generation, and new unit tests for each backend.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| CRD Schema Updates `deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml`, `deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml`, `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml`, `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml` | Added `backendFramework` enum with `"trtllm"`, and new optional `nodes` string field to `resources.limits` and `resources.requests` in both component and graph deployment CRDs. |
| CRD Go Types `deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go`, `deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go` | Added `BackendFramework` field to deployment spec structs with enum validation; removed `GetImage()` from component deployment type. |
| Common API Types `deploy/cloud/operator/api/dynamo/common/common.go` | Added `Nodes` string field to `ResourceItem` struct. |
| Helm Chart Metadata `deploy/cloud/helm/crds/Chart.yaml` | Bumped chart version from 0.4.0 to 0.4.1. |
| Go Module `deploy/cloud/operator/go.mod` | Updated `github.com/NVIDIA/grove/operator/api` dependency version. |
| Operator Constants `deploy/cloud/operator/internal/consts/consts.go` | Added new constants for component types, shared memory, multinode deployment types, and Grove role suffixes. |
| Controller Logic `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go` | Refactored pod template spec generation to use a base pod spec generator, modularized label/annotation merging, and removed manual resource config helper. |
| Controller Tests `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go` | Updated test to cover new backend framework, resource fields, shared memory volume, and metadata merging. |
| Backend Framework Abstraction `deploy/cloud/operator/internal/dynamo/graph.go` | Major refactor: introduced backend abstraction, multinode roles, explicit pod startup dependencies, modular pod spec generation, backend detection, and resource handling. |
| SGLang Backend Implementation `deploy/cloud/operator/internal/dynamo/backend_sglang.go`, `deploy/cloud/operator/internal/dynamo/backend_sglang_test.go` | Added SGLang backend logic for multinode flag injection, probe removal, and unit tests for argument/probe handling. |
| VLLM Backend Implementation `deploy/cloud/operator/internal/dynamo/backend_vllm.go`, `deploy/cloud/operator/internal/dynamo/backend_vllm_test.go` | Added VLLM backend logic for Ray multinode orchestration, probe removal, and unit tests for command/probe handling. |
| TRTLLM Backend Implementation `deploy/cloud/operator/internal/dynamo/backend_trtllm.go`, `deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go` | Added TRTLLM backend logic for SSH/MPI multinode orchestration, SSH volume handling, environment propagation, and comprehensive unit tests. |
| Backend Common Utilities `deploy/cloud/operator/internal/dynamo/backend_common.go` | Added utility for generating Grove leader hostname. |

Sequence Diagram(s)

sequenceDiagram
    participant Operator
    participant BackendFactory
    participant Backend
    participant K8sAPI

    Operator->>BackendFactory: Select backend (SGLang/VLLM/TRTLLM/Noop)
    BackendFactory->>Backend: Return backend instance
    Operator->>Backend: UpdateContainer/UpdatePodSpec (per role: leader/worker/main)
    Backend-->>Operator: Modified container/pod spec
    Operator->>K8sAPI: Create/Update Pod/PodGangSet with multinode roles and dependencies
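
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Backend` method set, the `Container` stand-in (used here in place of `corev1.Container`), the role string values, and the `--nnodes` injection rule are all assumptions made for the example.

```go
package main

import "fmt"

// Role mirrors the leader/worker/main roles named in the diagram.
// The string values are assumptions.
type Role string

const (
	RoleLeader Role = "leader"
	RoleWorker Role = "worker"
	RoleMain   Role = "main"
)

// Container is a stand-in for corev1.Container so the sketch stays self-contained.
type Container struct {
	Command []string
	Args    []string
}

// Backend is a hypothetical interface; the real signatures may differ.
type Backend interface {
	UpdateContainer(c *Container, role Role, numberOfNodes int)
}

// noopBackend leaves the container untouched.
type noopBackend struct{}

func (noopBackend) UpdateContainer(*Container, Role, int) {}

// sglangBackend appends a node-count flag when more than one node is requested.
type sglangBackend struct{}

func (sglangBackend) UpdateContainer(c *Container, role Role, numberOfNodes int) {
	if numberOfNodes <= 1 {
		return // single-node deployments need no extra flags
	}
	c.Args = append(c.Args, "--nnodes", fmt.Sprint(numberOfNodes))
}

// BackendFactory selects a backend by framework name and falls back to the no-op backend.
func BackendFactory(framework string) Backend {
	switch framework {
	case "sglang":
		return sglangBackend{}
	default:
		return noopBackend{}
	}
}

func main() {
	c := &Container{Args: []string{"--model", "m"}}
	BackendFactory("sglang").UpdateContainer(c, RoleLeader, 3)
	fmt.Println(c.Args)
}
```

Keeping the no-op backend as the default means an unrecognized framework leaves the pod spec unchanged instead of failing.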

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs

feat: kubernetes overrides for the entrypoint and cmd #1396: Implements main container command/args overrides in the operator and CRDs, closely related to this PR’s backend-specific handling of container commands and arguments.
feat: simplify k8s deployment #1708: Removes the DynamoComponent resource and image builder, directly related to this PR’s refactor and extension of CRD and deployment management logic.

Poem

🐇✨
Three backends now can play,
SGLang, VLLM, and TRTLLM lead the way!
Multinode roles and pods align,
With SSH, Ray, or Python fine.
Pods start in order, tests abound—
In this meadow, new frameworks are found!
🌱🚀

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbb4caa and 3743598.

⛔ Files ignored due to path filters (1)

deploy/cloud/operator/go.sum is excluded by !**/*.sum

📒 Files selected for processing (20)

deploy/cloud/helm/crds/Chart.yaml (1 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (3 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (3 hunks)
deploy/cloud/operator/api/dynamo/common/common.go (1 hunks)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1 hunks)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (3 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (3 hunks)
deploy/cloud/operator/go.mod (1 hunks)
deploy/cloud/operator/internal/consts/consts.go (1 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (6 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (5 hunks)
deploy/cloud/operator/internal/dynamo/backend_common.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_trtllm.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/graph.go (5 hunks)

coderabbitai

Actionable comments posted: 11

🧹 Nitpick comments (9)

deploy/cloud/operator/go.mod (1)

9-9: LGTM! Dependency update supports new features.

The grove operator API update aligns with the multinode support implementation. However, consider using tagged releases instead of pseudo-versions for better dependency management and reproducibility in production environments.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
407-412: Confirm field was generated via kubebuilder markers, not hand-edited

The new backendFramework property looks syntactically correct (default matches one of the enum values).
Because this CRD YAML is meant to be fully auto-generated, please double-check that the corresponding Go type includes the proper kubebuilder tags, e.g.
// +kubebuilder:validation:Enum=sglang;vllm
// +kubebuilder:default=vllm
BackendFramework string `json:"backendFramework,omitempty"`
and that the file was regenerated with make manifests (or equivalent). Otherwise the change will be lost on the next generation run.
Consider adding an additionalPrinterColumns entry so users can see the chosen backend directly via kubectl get.
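
As a rough sketch of the printer-column suggestion, controller-gen emits an `additionalPrinterColumns` entry from a `+kubebuilder:printcolumn` marker on the root type. The type names `exampleSpec` and `ExampleDeployment` below are hypothetical stand-ins for the real API types, and the column name "Backend" is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exampleSpec is a hypothetical, trimmed-down spec holding only the field discussed above.
type exampleSpec struct {
	// +kubebuilder:validation:Enum=sglang;vllm
	// +kubebuilder:default=vllm
	BackendFramework string `json:"backendFramework,omitempty"`
}

// ExampleDeployment stands in for the CRD root type. The printcolumn marker asks
// controller-gen to add a column that kubectl get shows for each object.
// +kubebuilder:printcolumn:name="Backend",type=string,JSONPath=`.spec.backendFramework`
type ExampleDeployment struct {
	Spec exampleSpec `json:"spec"`
}

// render marshals the object so the effect of omitempty is visible.
func render(d ExampleDeployment) string {
	b, err := json.Marshal(d)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(render(ExampleDeployment{Spec: exampleSpec{BackendFramework: "sglang"}}))
	fmt.Println(render(ExampleDeployment{}))
}
```

The markers are plain comments to the Go compiler, so the default is applied by the API server from the generated schema, not by this code.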
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

474-493: Consider adding basic value validation for the new dynamoConfig sub-fields

dataParallelSize, numberOfNodes, and tensorParallelSize logically cannot be negative or zero, yet the schema currently allows any int32. Adding minimum: 1 (via // +kubebuilder:validation:Minimum=1 on the Go struct) will prevent invalid specs from reaching the reconciler and avoid runtime errors.

Likewise, for extraArgs you may want x-kubernetes-list-type: atomic for patch-friendly behaviour, mirroring other string arrays in the CRD.

These tweaks belong in the Go API types, not this generated YAML.
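
A minimal sketch of how those markers might look on the Go type. The field names follow the schema names quoted above; the pointer types, the JSON tags, and the `validate` helper are assumptions added for illustration, and the real `DynamoConfig` has more fields.

```go
package main

import "fmt"

// DynamoConfig is a trimmed, hypothetical version of the API type.
type DynamoConfig struct {
	// +kubebuilder:validation:Minimum=1
	DataParallelSize *int32 `json:"dataParallelSize,omitempty"`
	// +kubebuilder:validation:Minimum=1
	NumberOfNodes *int32 `json:"numberOfNodes,omitempty"`
	// +kubebuilder:validation:Minimum=1
	TensorParallelSize *int32 `json:"tensorParallelSize,omitempty"`
	// +listType=atomic
	ExtraArgs []string `json:"extraArgs,omitempty"`
}

// validate mirrors the schema constraint in code, which can help in unit tests
// that build specs without going through the API server.
func (c *DynamoConfig) validate() error {
	fields := []struct {
		name  string
		value *int32
	}{
		{"dataParallelSize", c.DataParallelSize},
		{"numberOfNodes", c.NumberOfNodes},
		{"tensorParallelSize", c.TensorParallelSize},
	}
	for _, f := range fields {
		// Unset (nil) fields are allowed; set fields must be at least 1.
		if f.value != nil && *f.value < 1 {
			return fmt.Errorf("%s must be >= 1, got %d", f.name, *f.value)
		}
	}
	return nil
}

func main() {
	zero := int32(0)
	cfg := &DynamoConfig{NumberOfNodes: &zero}
	fmt.Println(cfg.validate())
}
```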
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)
474-493: Consider tightening validation on numeric sizes in dynamoConfig

dataParallelSize, tensorParallelSize, and numberOfNodes accept any 32-bit integer, including zero or negative values, which are unlikely to be meaningful.
If the operator requires these to be positive (or at least ≥1) you can add a minimum: 1 constraint via a +kubebuilder:validation:Minimum=1 marker on the corresponding Go struct fields before regenerating the CRD.
+                          dataParallelSize:
+                            minimum: 1
                           ...
+                          numberOfNodes:
+                            minimum: 1
Not blocking, but worth verifying against controller logic.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1232-1264: Consider extracting probe configuration to constants or a helper function

The default probe configurations are hardcoded with specific values. Consider extracting these to constants or a helper function for better maintainability and reusability across the codebase.

Example refactor:
// getDefaultLivenessProbe returns the default liveness probe:
// an HTTP GET on /healthz against the named Dynamo health port.
func getDefaultLivenessProbe() *corev1.Probe {
    return &corev1.Probe{
        InitialDelaySeconds: 60,
        PeriodSeconds:       60,
        TimeoutSeconds:      5,
        FailureThreshold:    10,
        SuccessThreshold:    1,
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{
                Path: "/healthz",
                Port: intstr.FromString(commonconsts.DynamoHealthPortName),
            },
        },
    }
}

// getDefaultReadinessProbe returns the default readiness probe:
// an HTTP GET on /readyz against the named Dynamo health port.
func getDefaultReadinessProbe() *corev1.Probe {
    return &corev1.Probe{
        InitialDelaySeconds: 60,
        PeriodSeconds:       60,
        TimeoutSeconds:      5,
        FailureThreshold:    10,
        SuccessThreshold:    1,
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{
                Path: "/readyz",
                Port: intstr.FromString(commonconsts.DynamoHealthPortName),
            },
        },
    }
}
deploy/cloud/operator/internal/dynamo/graph.go (1)

556-659: Consider breaking down GenerateBasePodSpec into smaller functions

This function handles many responsibilities including container setup, environment variables, volumes, secrets, and pod spec merging. Consider extracting some logic into helper functions for better maintainability.

Suggested breakdown:

- setupContainer: Initialize base container with ports and probes
- applyExtraPodSpec: Handle extraPodSpec merging
- setupVolumes: Handle PVC and volume mounts
- setupImagePullSecrets: Handle secret retrieval and configuration

This would make the main function more readable and each piece more testable.
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
8-9: Consolidate duplicate consts imports.

The same package is imported twice with different aliases. Consider consolidating to use a single import.
-	"github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"
-	commonconsts "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"
+	"github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"
Then update usage from commonconsts.ComponentTypeMain to consts.ComponentTypeMain throughout the file.
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (2)
14-201: Well-structured comprehensive tests.

The table-driven test approach is excellent and covers diverse scenarios including edge cases. The use of expectContains provides flexible assertion capabilities.

Minor suggestions for improvement:

Consider adding test cases for:

- LWS multinode deployment type (currently only Grove is tested)
- Error scenarios or invalid configurations
- Boundary conditions (e.g., numberOfNodes = 0)
+		{
+			name:                    "worker component multinode worker LWS",
+			componentType:           commonconsts.ComponentTypeWorker,
+			numberOfNodes:           3,
+			role:                    RoleWorker,
+			multinodeDeploymentType: consts.MultinodeDeploymentTypeLWS,
+			component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{
+				DynamoComponentDeploymentSharedSpec: v1alpha1.DynamoComponentDeploymentSharedSpec{
+					DynamoConfig: &v1alpha1.DynamoConfig{},
+				},
+			},
+			expectedCmd:    []string{"/bin/sh", "-c"},
+			expectContains: []string{"python3 -m dynamo.sglang.worker", "dist-init-addr", "LWS_LEADER_ADDRESS"},
+		},
203-271: Good test coverage for MergeArgs functionality.

The test cases appropriately cover different scenarios and roles.

Suggestion for maintainability:

The complex expected result on line 257 could be fragile due to argument ordering. Consider using expectContains pattern here too:
-			expectedResult: []string{"user args --custom-flag custom-value --dist-init-addr ${GROVE_HEADLESS_SERVICE}:29500 --dp-size 3 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --tp-size 2 --extra arg"},
+			expectContains: []string{"user args", "--custom-flag custom-value", "--dist-init-addr", "${GROVE_HEADLESS_SERVICE}:29500", "--dp-size 3", "--nnodes 3", "--extra arg"},
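
One way the expectContains check could work, sketched with only the standard library. The helper name `missingFragments` is hypothetical and is not taken from the test file; the sample strings are the ones from the suggestion above.

```go
package main

import (
	"fmt"
	"strings"
)

// missingFragments returns every expected fragment that does not occur in got.
// An empty result means the assertion passes regardless of argument order.
func missingFragments(got string, fragments []string) []string {
	var missing []string
	for _, f := range fragments {
		if !strings.Contains(got, f) {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	got := "user args --custom-flag custom-value --dist-init-addr ${GROVE_HEADLESS_SERVICE}:29500 --dp-size 3 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --tp-size 2 --extra arg"
	want := []string{"user args", "--custom-flag custom-value", "--dist-init-addr", "${GROVE_HEADLESS_SERVICE}:29500", "--dp-size 3", "--nnodes 3", "--extra arg"}
	fmt.Println(len(missingFragments(got, want)))
}
```

The trade-off is that a substring check no longer catches ordering regressions, so it fits best where ordering is not part of the contract.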

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbb4caa and 2f380aa.

⛔ Files ignored due to path filters (1)

deploy/cloud/operator/go.sum is excluded by !**/*.sum

📒 Files selected for processing (19)

deploy/cloud/helm/crds/Chart.yaml (1 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3 hunks)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1 hunks)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2 hunks)
deploy/cloud/operator/go.mod (1 hunks)
deploy/cloud/operator/internal/consts/consts.go (1 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (7 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (6 hunks)
deploy/cloud/operator/internal/dynamo/backend_common.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_common_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1 hunks)
deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (1 hunks)
deploy/cloud/operator/internal/dynamo/graph.go (3 hunks)

🧰 Additional context used

🧠 Learnings (9)

📓 Common learnings

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.

📚 Learning: the stopsignal field under lifecycle in dynamocomponentdeployment crds is autogenerated due to kuber...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
deploy/cloud/helm/crds/Chart.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/dynamo/graph.go

📚 Learning: the `stopsignal` field in kubernetes crds like dynamographdeployment and dynamocomponentdeployment i...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
deploy/cloud/helm/crds/Chart.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/dynamo/graph.go

📚 Learning: the `dyn_deployment_config` environment variable (commonconsts.dynamodeploymentconfigenvvar) in the ...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/dynamo/graph.go

📚 Learning: the image-builder serviceaccount in deploy/cloud/helm/platform/components/operator/templates/image-b...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/cloud/operator/internal/dynamo/graph.go

📚 Learning: crd schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from kubernetes...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

deploy/cloud/helm/crds/Chart.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml

📚 Learning: in the dynamo operator, the project’s preferred security posture is to set a pod-level `podsecurityc...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go

📚 Learning: in vllm worker deployments, startup probes (with longer periods and higher failure thresholds like p...

Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go

📚 Learning: in the dynamo codebase, componenttypeplanner constants with different cases ("planner" vs "planner")...

Learnt from: biswapanda
PR: ai-dynamo/dynamo#1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

Applied to files:

deploy/cloud/operator/internal/consts/consts.go

🧬 Code Graph Analysis (3)

deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)

deploy/cloud/operator/internal/dynamo/graph.go (1)

BackendFramework (505-505)

deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)

DynamoConfig (103-123)

deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (4)

deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)

VLLMBackend (13-13)

deploy/cloud/operator/internal/dynamo/graph.go (5)

Role (475-475)

RoleMain (480-480)

DynamoConfig (45-50)

RoleLeader (478-478)

RoleWorker (479-479)

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3)

DynamoComponentDeploymentOverridesSpec (53-55)

DynamoComponentDeploymentSharedSpec (57-95)

DynamoConfig (103-123)

deploy/cloud/operator/internal/consts/consts.go (6)

MultinodeDeploymentType (51-51)

ComponentTypeMain (38-38)

MultinodeDeploymentTypeGrove (54-54)

ComponentTypeWorker (42-42)

ComponentTypePrefillWorker (43-43)

ComponentTypeDecodeWorker (44-44)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Mirror Repository to GitLab
GitHub Check: Build and Test - vllm

🔇 Additional comments (33)

deploy/cloud/helm/crds/Chart.yaml (1)

19-19: LGTM! Appropriate version increment.

The chart version bump from 0.4.0 to 0.4.1 correctly reflects the additive CRD schema changes for grove multinode support without breaking existing functionality.

deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)

43-46: LGTM! Well-structured backend framework field.

The BackendFramework field is properly implemented with:

Appropriate kubebuilder validation enum constraining values to supported backends

Sensible default value ("vllm") for backward compatibility

Optional field design following Kubernetes API conventions

deploy/cloud/operator/internal/dynamo/backend_common.go (1)

8-37: LGTM! Well-designed utility function.

The applyFlagOverridesAndExtraArgs function is well-implemented with:

Correct handling of flag overrides and removals (nil values)

Deterministic output through sorting

Proper command-line flag formatting

Clear separation of flag processing and extra args appending
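
The behaviour described in this comment can be sketched as below. This is an illustration under assumptions, not the function from the PR: the name `applyFlagOverridesSketch`, its signature, and the convention that map keys omit the leading dashes are all made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// applyFlagOverridesSketch merges base flags with overrides, where a nil override
// removes the flag, then emits "--flag value" entries in sorted key order and
// appends extraArgs unchanged at the end.
func applyFlagOverridesSketch(base map[string]string, overrides map[string]*string, extraArgs []string) []string {
	merged := make(map[string]string, len(base))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range overrides {
		if v == nil {
			delete(merged, k) // nil means "remove this flag"
			continue
		}
		merged[k] = *v
	}

	// Sort keys so the output does not depend on map iteration order.
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	args := make([]string, 0, len(keys)+len(extraArgs))
	for _, k := range keys {
		args = append(args, fmt.Sprintf("--%s %s", k, merged[k]))
	}
	return append(args, extraArgs...)
}

func main() {
	four := "4"
	out := applyFlagOverridesSketch(
		map[string]string{"tp-size": "2", "dp-size": "3"},
		map[string]*string{"tp-size": &four, "dp-size": nil},
		[]string{"--extra"},
	)
	fmt.Println(out)
}
```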

deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)

253-257: LGTM! Correct deepcopy integration.

The deepcopy logic for the new DynamoConfig field in DynamoComponentDeploymentSharedSpec is correctly implemented with proper nil checking.

345-394: LGTM! Auto-generated deepcopy methods are correct.

The auto-generated deepcopy methods for DynamoConfig correctly handle:

Pointer fields with proper nil checking and allocation

Map with pointer-to-string values including nil value handling

String slice with proper copying

As this is auto-generated code, avoid manual modifications.

deploy/cloud/operator/internal/dynamo/backend_common_test.go (6)

3-8: LGTM!

The imports are appropriate and necessary for the test functionality. Good use of Gomega for handling non-deterministic map iteration and the ptr utility for creating test data.

10-17: LGTM!

Excellent use of table-driven tests with a clear, well-structured test case format. The field names are descriptive and the approach enables comprehensive scenario coverage.

18-84: Excellent test case coverage!

The test cases comprehensively cover all important scenarios:

Basic functionality without modifications

Flag overriding and addition

Flag removal via nil pointers

Extra arguments handling

Complex combinations

Edge cases with empty inputs

The expected results are correctly formatted as command-line flags, and the test data is realistic and well-organized.

86-95: Excellent handling of non-deterministic behavior!

The test execution properly uses ConsistOf matcher to handle non-deterministic map iteration order, with a clear comment explaining the rationale. The subtest structure with t.Run and proper Gomega setup follows best practices.

20-83: Test data values are accurate and comprehensive.

The test data correctly represents various scenarios:

Proper flag formatting with "--flag value" pattern

Correct use of pointers with nil for flag removal

Realistic combinations of flag overrides and extra arguments

Edge cases with empty inputs

All expected results align with the intended functionality and command-line flag conventions.

1-96: High-quality test implementation supporting the multinode backend framework.

This test file exemplifies excellent Go testing practices:

Comprehensive table-driven tests covering all scenarios

Proper handling of non-deterministic map iteration

Clear structure and documentation

Good integration with testing libraries

The test ensures correctness of a foundational utility that's critical for the new backend abstraction framework's command generation logic.

deploy/cloud/operator/internal/consts/consts.go (2)

41-44: LGTM! New component types support disaggregated serving architecture.

The new component type constants are well-documented and follow existing naming conventions. They properly support the multinode deployment functionality with clear distinctions between aggregated and disaggregated worker roles.

51-56: Excellent type-safe approach for multinode deployment types.

The new MultinodeDeploymentType provides type safety and prevents string literal errors. The "grove" constant directly supports the PR objective of adding grove multinode support, and the dedicated type enables better compile-time checking.
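A minimal sketch of the typed-constant pattern, assuming a string-backed type. `MultinodeDeploymentTypeGrove` and its "grove" value come from this PR; the LWS constant name and value, and the `describe` helper, are hypothetical.

```go
package main

import "fmt"

// MultinodeDeploymentType is a distinct string type for orchestrator kinds.
type MultinodeDeploymentType string

const (
	MultinodeDeploymentTypeGrove MultinodeDeploymentType = "grove"
	// The LWS constant name and value are assumptions for illustration.
	MultinodeDeploymentTypeLWS MultinodeDeploymentType = "lws"
)

// describe is a hypothetical helper that only accepts the typed value.
func describe(t MultinodeDeploymentType) string {
	switch t {
	case MultinodeDeploymentTypeGrove:
		return "grove multinode deployment"
	case MultinodeDeploymentTypeLWS:
		return "lws multinode deployment"
	default:
		return "unknown multinode deployment type"
	}
}

func main() {
	fmt.Println(describe(MultinodeDeploymentTypeGrove))

	raw := "grove"
	// describe(raw) would not compile: a string variable needs an explicit conversion.
	fmt.Println(describe(MultinodeDeploymentType(raw)))
}
```

Note that an untyped string literal can still be passed directly, so the compile-time benefit applies mainly to `string` variables and to call sites that use the named constants.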

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

407-412: backendFramework schema addition LGTM

Enum + default are consistent (vllm / sglang) and match the operator refactor.
No issues spotted.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

47-52: backendFramework property LGTM – just verify operator defaulting

The enum + default definition is syntactically correct and aligns with the operator’s two supported back-ends. Confirm that the Go struct defining BackendFramework has the // +kubebuilder:default=vllm (or equivalent) tag so that both the CRD and webhook defaulting behave consistently.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

47-52: Addition aligns with new backend abstraction – looks good

The backendFramework field is correctly added at the same level as other root-spec properties, its default (vllm) matches one of the enumerated values, and the enum is restrictive (sglang / vllm) which prevents typos.
No issues detected here.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)

825-857: Test updates align well with the new backend framework and multinode support.

The test correctly validates:

New BackendFramework field propagation

ExtraPodMetadata merging into pod templates

Resource requests alongside limits

Proper structure for multinode deployments

deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)

101-127: Excellent design for flexible backend configuration.

The DynamoConfig struct provides a clean, extensible approach that:

Unifies configuration across different backends

Allows fine-grained control via FlagOverrides (including flag removal with nil values)

Supports both single-node and multinode deployments declaratively

Simplifies the API by replacing complex backend-specific configs

This is a well-thought-out abstraction.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (3)

25-25: LGTM! Appropriate imports and constant addition

The new imports and DeploymentTypeMultinodeGrove constant are properly integrated and follow the existing patterns.

Also applies to: 39-40, 80-80

513-513: Good use of maps.Copy for label merging

Using maps.Copy is more idiomatic and less error-prone than manual map copying. It treats a nil source map as empty and copies every entry; note that the destination map must be non-nil whenever the source has entries.

Also applies to: 555-555
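For reference, a small sketch of label merging with `maps.Copy` (standard library, Go 1.21+). `mergeLabels` is a hypothetical helper, not a function from this PR; it guards the destination because copying into a nil map panics when the source has entries.

```go
package main

import (
	"fmt"
	"maps"
)

// mergeLabels is a hypothetical helper: it copies src into dst and returns dst.
// Entries in src overwrite entries with the same key in dst.
func mergeLabels(dst, src map[string]string) map[string]string {
	if dst == nil {
		// Writing into a nil map panics, so allocate the destination first.
		dst = make(map[string]string, len(src))
	}
	maps.Copy(dst, src) // a nil src is simply a no-op
	return dst
}

func main() {
	labels := mergeLabels(nil, map[string]string{"app": "dynamo"})
	labels = mergeLabels(labels, map[string]string{"app": "override", "tier": "worker"})
	fmt.Println(labels)
}
```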

1215-1227: Excellent refactoring to centralize pod spec generation

The refactoring to use dynamo.GenerateBasePodSpecForController significantly simplifies the code and improves maintainability by:

Removing duplicate logic for environment variables, volumes, and security contexts

Centralizing backend-specific command and argument generation

Maintaining consistency across different deployment types

The use of maps.Copy for merging annotations and labels is also a good improvement.

Also applies to: 1304-1311

deploy/cloud/operator/internal/dynamo/graph.go (4)

150-150: Well-structured helper functions and framework propagation

Good additions:

Proper propagation of BackendFramework from graph to component deployments

Safe getNumberOfNodes helper with nil checks and sensible default

Clean mergeContainerCommand helper following the user-override pattern

Also applies to: 315-321, 464-470
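A nil-safe node-count helper along these lines might look like the sketch below. The `Resources` and `ResourceItem` types are hypothetical stand-ins; the precedence of requests over limits and the default of 1 are assumptions, since the review only says the helper has nil checks and a sensible default.

```go
package main

import (
	"fmt"
	"strconv"
)

// ResourceItem and Resources are hypothetical stand-ins for the CRD types.
// The review notes that "nodes" is a string field under requests and limits.
type ResourceItem struct{ Nodes string }

type Resources struct {
	Requests *ResourceItem
	Limits   *ResourceItem
}

// getNumberOfNodes returns the first valid positive node count it finds,
// checking requests before limits, and falls back to 1 (single node).
func getNumberOfNodes(r *Resources) int32 {
	if r == nil {
		return 1
	}
	for _, item := range []*ResourceItem{r.Requests, r.Limits} {
		if item == nil || item.Nodes == "" {
			continue
		}
		n, err := strconv.ParseInt(item.Nodes, 10, 32)
		if err == nil && n > 0 {
			return int32(n)
		}
	}
	return 1
}

func main() {
	fmt.Println(getNumberOfNodes(nil))
	fmt.Println(getNumberOfNodes(&Resources{Limits: &ResourceItem{Nodes: "4"}}))
}
```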

491-501: Clean role expansion logic

The expandRolesForService function properly handles both single-node and multinode deployments with clear role assignments and replica counts.
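To make the role expansion concrete, here is one way it could work. This is a sketch under assumptions: the `ServiceRole` type, the name suffixes, the `RoleMain` constant, and the "one leader plus nodes-1 workers" split are illustrative, not confirmed by the PR text shown here.

```go
package main

import "fmt"

type Role string

const (
	RoleMain   Role = "main" // assumed name for the single-node role
	RoleLeader Role = "leader"
	RoleWorker Role = "worker"
)

// ServiceRole is a hypothetical description of one pod group to generate.
type ServiceRole struct {
	Name     string
	Role     Role
	Replicas int32
}

// expandRoles returns one role for single-node services, or a leader plus
// workers when the service spans several nodes.
func expandRoles(serviceName string, replicas, numberOfNodes int32) []ServiceRole {
	if numberOfNodes <= 1 {
		return []ServiceRole{{Name: serviceName, Role: RoleMain, Replicas: replicas}}
	}
	return []ServiceRole{
		{Name: serviceName + "-leader", Role: RoleLeader, Replicas: 1},
		{Name: serviceName + "-worker", Role: RoleWorker, Replicas: numberOfNodes - 1},
	}
}

func main() {
	for _, r := range expandRoles("decode", 1, 4) {
		fmt.Printf("%s role=%s replicas=%d\n", r.Name, r.Role, r.Replicas)
	}
}
```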

682-748: Well-structured Grove deployment generation

The GenerateGrovePodGangSet function excellently handles:

Role-based pod clique generation for multinode deployments

Proper scaling group configuration

Metadata merging from extraPodMetadata

Clean separation between single-node and multinode logic

757-795: Good adapter pattern for controller compatibility

The controller-specific functions (ConvertDynamoComponentDeploymentToSpec and GenerateBasePodSpecForController) provide a clean adapter pattern that allows the controller to leverage the centralized backend logic while maintaining its specific requirements.

deploy/cloud/operator/internal/dynamo/backend_vllm.go (4)

13-13: LGTM!

Empty struct is appropriate for stateless backend implementation that satisfies the Backend interface.

86-106: LGTM!

The function correctly handles nil config, properly converts integer values to string flags, and uses the common helper function for applying overrides and extra arguments. Good separation of concerns.

1-13: LGTM!

Package declaration, imports, and struct definition are well-organized and follow Go conventions.

86-106: LGTM!

The helper function is well-structured with proper nil handling and follows good practices for building argument lists from configuration.

deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (4)

3-12: LGTM!

All imports are appropriate for the test functionality and are used within the file.

14-201: Excellent test coverage!

The test function provides comprehensive coverage of different scenarios including component types, multinode configurations, roles, and various DynamoConfig options. The table-driven approach with clear test names makes it easy to understand what each test validates.

203-355: Comprehensive test coverage for argument handling!

Both TestSGLangBackend_MergeArgs and TestBuildSGLangArgs provide excellent coverage:

Argument merging logic with default/user args scenarios

Multinode behavior validation

Configuration option testing with both positive and negative assertions

Proper handling of nil configurations

The table-driven approach with clear expectations makes the tests maintainable and easy to understand.

273-355: Excellent test coverage with both positive and negative assertions.

The test function demonstrates thorough testing practices by validating both what should be present (expectContains) and what should be absent (expectNotContains). This approach helps catch regression issues effectively.

nvrohanv

LGTM for the initial PR. In follow-ups I think Grove unit tests similar to the LWS ones would be good. I also think we should discuss some way to decouple the backend (sglang, vllm, trtllm) code a little more from the orchestrator (LWS, Grove, etc.) so that it can be extended or modified in a more modular way.

nvrohanv · 2025-08-10T05:20:06Z

deploy/cloud/operator/internal/dynamo/backend_sglang.go

+	}
+
+	// Remove probes for multinode leader and worker
+	if role == RoleLeader || role == RoleWorker {


Why no probes? Is this only until PublishNotReadyAddresses defaults to true in Grove?

It's a miss; probes should only be removed for the workers.
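The fix described in the reply could look roughly like this. It is a sketch, not the follow-up commit: `Container` and `Probe` are simplified hypothetical stand-ins for the Kubernetes types, and `stripWorkerProbes` is an illustrative name.

```go
package main

import "fmt"

type Role string

const (
	RoleLeader Role = "leader"
	RoleWorker Role = "worker"
)

// Probe and Container are simplified stand-ins for the Kubernetes API types.
type Probe struct{ Path string }

type Container struct {
	LivenessProbe  *Probe
	ReadinessProbe *Probe
	StartupProbe   *Probe
}

// stripWorkerProbes clears probes only for multinode workers.
// Leaders (and any other role) keep the probes they were given.
func stripWorkerProbes(c *Container, role Role) {
	if c == nil || role != RoleWorker {
		return
	}
	c.LivenessProbe = nil
	c.ReadinessProbe = nil
	c.StartupProbe = nil
}

func main() {
	leader := &Container{ReadinessProbe: &Probe{Path: "/health"}}
	worker := &Container{ReadinessProbe: &Probe{Path: "/health"}}
	stripWorkerProbes(leader, RoleLeader)
	stripWorkerProbes(worker, RoleWorker)
	fmt.Println(leader.ReadinessProbe != nil, worker.ReadinessProbe != nil)
}
```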

nvrohanv · 2025-08-10T05:21:08Z

deploy/cloud/operator/internal/dynamo/backend_sglang.go

+	}
+
+	// Generate the flags to add
+	flags := b.getMultinodeFlags(numberOfNodes, role, multinodeDeploymentType, serviceName)


Users shouldn't provide it (and I'm not sure how they'd even figure out how to provide it), but I'm wondering if we should auto-detect whether the multinode flags were provided and either error out with a message or just override them. What do you think?

Not sure what the best solution would be...
I guess implementing a user override might make sense?
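One possible shape for the detection step discussed here, as a sketch only. The list of operator-managed flags is an assumption (the PR text shown here does not enumerate them), and `conflictingFlags` is a hypothetical helper; what to do with the result (error out or override) is left to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// managedFlags lists flags the operator is assumed to inject for multinode
// SGLang runs. The exact set is an assumption for illustration.
var managedFlags = []string{"--dist-init-addr", "--nnodes", "--node-rank"}

// conflictingFlags returns the managed flags that also appear in userArgs.
// It accepts both "--flag value" and "--flag=value" forms, and args that
// pack several tokens into one string.
func conflictingFlags(userArgs []string) []string {
	managed := make(map[string]bool, len(managedFlags))
	for _, f := range managedFlags {
		managed[f] = true
	}
	var found []string
	for _, arg := range userArgs {
		for _, token := range strings.Fields(arg) {
			name, _, _ := strings.Cut(token, "=")
			if managed[name] {
				found = append(found, name)
			}
		}
	}
	return found
}

func main() {
	conflicts := conflictingFlags([]string{"--model-path m --nnodes=2", "--node-rank", "1"})
	if len(conflicts) > 0 {
		fmt.Println("user args set operator-managed flags:", conflicts)
	}
}
```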

nvrohanv · 2025-08-10T05:24:53Z

deploy/cloud/operator/internal/dynamo/backend_sglang.go

+	if role == RoleLeader {
+		nodeRank = "0"
+	} else {
+		if multinodeDeploymentType == commonconsts.MultinodeDeploymentTypeGrove {


While I think it's fine for now, at some point we probably want to decouple this a little so that alternatives to Grove or LWS can be added without having to change backend_sglang. We'll probably have to think up some interfaces for adding a "super-pod" creator like Grove or LWS.

Yes, good idea. We can define an interface for deploymentType that the backend can also consume.

nvrohanv · 2025-08-11T21:16:50Z

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go

This can go in a separate PR since this one is large, but will we have Grove-specific tests at some point?

graph_test.go extensively covers Grove multinode PodGangSet generation.

julienmancuso requested review from biswapanda, hhzhang16, hutm, ishandhanani, mohammedabdulwahhab and nnshah1 as code owners August 4, 2025 15:51

pull-request-size bot added the size/XXL label Aug 4, 2025

github-actions bot added the feat label Aug 4, 2025

coderabbitai bot reviewed Aug 4, 2025

julienmancuso force-pushed the jsm/dep-243 branch from fb0571b to c3f0d37 Compare August 11, 2025 14:40

mohammedabdulwahhab approved these changes Aug 11, 2025

feat: add grove multinode support

9ebeefa

julienmancuso force-pushed the jsm/dep-243 branch from c3f0d37 to 9ebeefa Compare August 11, 2025 19:01

feat: add grove multinode support

30ca640

julienmancuso merged commit dabd226 into main Aug 11, 2025
8 of 10 checks passed

julienmancuso deleted the jsm/dep-243 branch August 11, 2025 20:31

nvrohanv reviewed Aug 11, 2025

coderabbitai bot mentioned this pull request Aug 11, 2025

feat: extract deploymentType as interface #2405

Merged

krishung5 pushed a commit that referenced this pull request Aug 12, 2025

feat: add grove multinode support (#2269)

602eccf

This was referenced Aug 20, 2025

fix: revisit grove and LWS selection #2564

Merged

fix: increase shm default size and make it configurable #2616

Merged

harryskim mentioned this pull request Aug 22, 2025

[Roadmap]: 0.4.1 - 0.5.0 roadmap and key dates #2649

Open

This was referenced Aug 22, 2025

fix: do not fail if backendFramework cannot be detected #2655

Merged

feat: Auto-inject kai-scheduler annotations and label #2748

Merged

coderabbitai bot mentioned this pull request Sep 19, 2025

fix: improve sglang multinode handling in operator #3151

Merged
