Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Aug 4, 2025

Overview:

add grove multinode support

Summary by CodeRabbit

  • New Features

    • Added support for a new backend framework option, "trtllm", for deployments.
    • Introduced multinode deployment capabilities with explicit leader and worker roles for supported backends.
    • Added ability to specify node allocation via a new "nodes" resource field in deployment specifications.
    • Enhanced backend framework detection and modular backend customization for SGLang, VLLM, and TRTLLM.
    • Improved resource and pod specification handling for distributed workloads, including startup dependencies.
  • Bug Fixes

    • Improved merging of user-provided metadata and resource specifications in deployment generation.
  • Tests

    • Added comprehensive unit tests for multinode backend logic and resource specification handling.
  • Chores

    • Updated dependency versions and chart metadata.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 4, 2025

Walkthrough

This change introduces multinode deployment support for new backend frameworks—SGLang, VLLM, and TRTLLM—across the Dynamo operator, CRDs, and Helm charts. It adds multinode role abstractions, explicit startup dependencies, modular backend handling, and new resource fields (e.g., nodes). The update also includes comprehensive backend-specific logic, refactored pod spec generation, and extensive new unit tests.

Changes

Cohort / File(s) Change Summary
CRD Schema Updates
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
Added backendFramework enum with "trtllm", and new optional nodes string field to resources.limits and resources.requests in both component and graph deployment CRDs.
CRD Go Types
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go, deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
Added BackendFramework field to deployment spec structs with enum validation; removed GetImage() from component deployment type.
Common API Types
deploy/cloud/operator/api/dynamo/common/common.go
Added Nodes string field to ResourceItem struct.
Helm Chart Metadata
deploy/cloud/helm/crds/Chart.yaml
Bumped chart version from 0.4.0 to 0.4.1.
Go Module
deploy/cloud/operator/go.mod
Updated github.com/NVIDIA/grove/operator/api dependency version.
Operator Constants
deploy/cloud/operator/internal/consts/consts.go
Added new constants for component types, shared memory, multinode deployment types, and Grove role suffixes.
Controller Logic
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
Refactored pod template spec generation to use a base pod spec generator, modularized label/annotation merging, and removed manual resource config helper.
Controller Tests
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
Updated test to cover new backend framework, resource fields, shared memory volume, and metadata merging.
Backend Framework Abstraction
deploy/cloud/operator/internal/dynamo/graph.go
Major refactor: introduced backend abstraction, multinode roles, explicit pod startup dependencies, modular pod spec generation, backend detection, and resource handling.
SGLang Backend Implementation
deploy/cloud/operator/internal/dynamo/backend_sglang.go, deploy/cloud/operator/internal/dynamo/backend_sglang_test.go
Added SGLang backend logic for multinode flag injection, probe removal, and unit tests for argument/probe handling.
VLLM Backend Implementation
deploy/cloud/operator/internal/dynamo/backend_vllm.go, deploy/cloud/operator/internal/dynamo/backend_vllm_test.go
Added VLLM backend logic for Ray multinode orchestration, probe removal, and unit tests for command/probe handling.
TRTLLM Backend Implementation
deploy/cloud/operator/internal/dynamo/backend_trtllm.go, deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go
Added TRTLLM backend logic for SSH/MPI multinode orchestration, SSH volume handling, environment propagation, and comprehensive unit tests.
Backend Common Utilities
deploy/cloud/operator/internal/dynamo/backend_common.go
Added utility for generating Grove leader hostname.

Sequence Diagram(s)

sequenceDiagram
    participant Operator
    participant BackendFactory
    participant Backend
    participant K8sAPI

    Operator->>BackendFactory: Select backend (SGLang/VLLM/TRTLLM/Noop)
    BackendFactory->>Backend: Return backend instance
    Operator->>Backend: UpdateContainer/UpdatePodSpec (per role: leader/worker/main)
    Backend-->>Operator: Modified container/pod spec
    Operator->>K8sAPI: Create/Update Pod/PodGangSet with multinode roles and dependencies
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs

Poem

🐇✨
Three backends now can play,
SGLang, VLLM, and TRTLLM lead the way!
Multinode roles and pods align,
With SSH, Ray, or Python fine.
Pods start in order, tests abound—
In this meadow, new frameworks are found!
🌱🚀

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

Note

🔌 MCP (Model Context Protocol) integration is now available in Early Access!

Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbb4caa and 3743598.

⛔ Files ignored due to path filters (1)
  • deploy/cloud/operator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (20)
  • deploy/cloud/helm/crds/Chart.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (3 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (3 hunks)
  • deploy/cloud/operator/api/dynamo/common/common.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (3 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (3 hunks)
  • deploy/cloud/operator/go.mod (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (6 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (5 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_common.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_trtllm.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_vllm.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (5 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (9)
deploy/cloud/operator/go.mod (1)

9-9: LGTM! Dependency update supports new features.

The grove operator API update aligns with the multinode support implementation. However, consider using tagged releases instead of pseudo-versions for better dependency management and reproducibility in production environments.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)

407-412: Confirm field was generated via kubebuilder markers, not hand-edited

The new backendFramework property looks syntactically correct (default matches one of the enum values).
Because this CRD YAML is meant to be fully auto-generated, please double-check that the corresponding Go type includes the proper kubebuilder tags, e.g.

// +kubebuilder:validation:Enum=sglang;vllm
// +kubebuilder:default=vllm
BackendFramework string `json:"backendFramework,omitempty"`

and that the file was regenerated with make manifests (or equivalent). Otherwise the change will be lost on the next generation run.
Consider adding an additionalPrinterColumns entry so users can see the chosen backend directly via kubectl get.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

474-493: Consider adding basic value validation for the new dynamoConfig sub-fields

dataParallelSize, numberOfNodes, and tensorParallelSize logically cannot be negative or zero, yet the schema currently allows any int32. Adding minimum: 1 (via // +kubebuilder:validation:Minimum=1 on the Go struct) will prevent invalid specs from reaching the reconciler and avoid runtime errors.

Likewise, for extraArgs you may want x-kubernetes-list-type: atomic for patch-friendly behaviour, mirroring other string arrays in the CRD.

These tweaks belong in the Go API types, not this generated YAML.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

474-493: Consider tightening validation on numeric sizes in dynamoConfig

dataParallelSize, tensorParallelSize, and numberOfNodes accept any 32-bit integer, including zero or negative values, which are unlikely to be meaningful.
If the operator requires these to be positive (or at least ≥1) you can add a minimum: 1 constraint via a +kubebuilder:validation:Minimum=1 marker on the corresponding Go struct fields before regenerating the CRD.

+                          dataParallelSize:
+                            minimum: 1
                           ...
+                          numberOfNodes:
+                            minimum: 1

Not blocking, but worth verifying against controller logic.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

1232-1264: Consider extracting probe configuration to constants or a helper function

The default probe configurations are hardcoded with specific values. Consider extracting these to constants or a helper function for better maintainability and reusability across the codebase.

Example refactor:

func getDefaultLivenessProbe() *corev1.Probe {
    return &corev1.Probe{
        InitialDelaySeconds: 60,
        PeriodSeconds:       60,
        TimeoutSeconds:      5,
        FailureThreshold:    10,
        SuccessThreshold:    1,
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{
                Path: "/healthz",
                Port: intstr.FromString(commonconsts.DynamoHealthPortName),
            },
        },
    }
}

func getDefaultReadinessProbe() *corev1.Probe {
    return &corev1.Probe{
        InitialDelaySeconds: 60,
        PeriodSeconds:       60,
        TimeoutSeconds:      5,
        FailureThreshold:    10,
        SuccessThreshold:    1,
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{
                Path: "/readyz",
                Port: intstr.FromString(commonconsts.DynamoHealthPortName),
            },
        },
    }
}
deploy/cloud/operator/internal/dynamo/graph.go (1)

556-659: Consider breaking down GenerateBasePodSpec into smaller functions

This function handles many responsibilities including container setup, environment variables, volumes, secrets, and pod spec merging. Consider extracting some logic into helper functions for better maintainability.

Suggested breakdown:

  • setupContainer: Initialize base container with ports and probes
  • applyExtraPodSpec: Handle extraPodSpec merging
  • setupVolumes: Handle PVC and volume mounts
  • setupImagePullSecrets: Handle secret retrieval and configuration

This would make the main function more readable and each piece more testable.

deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)

8-9: Consolidate duplicate consts imports.

The same package is imported twice with different aliases. Consider consolidating to use a single import.

-	"github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"
-	commonconsts "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"
+	"github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"

Then update usage from commonconsts.ComponentTypeMain to consts.ComponentTypeMain throughout the file.

deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (2)

14-201: Well-structured comprehensive tests.

The table-driven test approach is excellent and covers diverse scenarios including edge cases. The use of expectContains provides flexible assertion capabilities.

Minor suggestions for improvement:

Consider adding test cases for:

  • LWS multinode deployment type (currently only Grove is tested)
  • Error scenarios or invalid configurations
  • Boundary conditions (e.g., numberOfNodes = 0)
+		{
+			name:                    "worker component multinode worker LWS",
+			componentType:           commonconsts.ComponentTypeWorker,
+			numberOfNodes:           3,
+			role:                    RoleWorker,
+			multinodeDeploymentType: consts.MultinodeDeploymentTypeLWS,
+			component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{
+				DynamoComponentDeploymentSharedSpec: v1alpha1.DynamoComponentDeploymentSharedSpec{
+					DynamoConfig: &v1alpha1.DynamoConfig{},
+				},
+			},
+			expectedCmd:    []string{"/bin/sh", "-c"},
+			expectContains: []string{"python3 -m dynamo.sglang.worker", "dist-init-addr", "LWS_LEADER_ADDRESS"},
+		},

203-271: Good test coverage for MergeArgs functionality.

The test cases appropriately cover different scenarios and roles.

Suggestion for maintainability:

The complex expected result on line 257 could be fragile due to argument ordering. Consider using expectContains pattern here too:

-			expectedResult: []string{"user args --custom-flag custom-value --dist-init-addr ${GROVE_HEADLESS_SERVICE}:29500 --dp-size 3 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --tp-size 2 --extra arg"},
+			expectContains: []string{"user args", "--custom-flag custom-value", "--dist-init-addr", "${GROVE_HEADLESS_SERVICE}:29500", "--dp-size 3", "--nnodes 3", "--extra arg"},
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbb4caa and 2f380aa.

⛔ Files ignored due to path filters (1)
  • deploy/cloud/operator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (19)
  • deploy/cloud/helm/crds/Chart.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2 hunks)
  • deploy/cloud/operator/go.mod (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (7 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (6 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_common.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_common_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_vllm.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (3 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: the stopsignal field under lifecycle in dynamocomponentdeployment crds is autogenerated due to kuber...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
  • deploy/cloud/helm/crds/Chart.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the `stopsignal` field in kubernetes crds like dynamographdeployment and dynamocomponentdeployment i...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
  • deploy/cloud/helm/crds/Chart.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the `dyn_deployment_config` environment variable (commonconsts.dynamodeploymentconfigenvvar) in the ...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the image-builder serviceaccount in deploy/cloud/helm/platform/components/operator/templates/image-b...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: crd schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from kubernetes...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • deploy/cloud/helm/crds/Chart.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
📚 Learning: in the dynamo operator, the project’s preferred security posture is to set a pod-level `podsecurityc...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
📚 Learning: in vllm worker deployments, startup probes (with longer periods and higher failure thresholds like p...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
📚 Learning: in the dynamo codebase, componenttypeplanner constants with different cases ("planner" vs "planner")...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

Applied to files:

  • deploy/cloud/operator/internal/consts/consts.go
🧬 Code Graph Analysis (3)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
deploy/cloud/operator/internal/dynamo/graph.go (1)
  • BackendFramework (505-505)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
  • DynamoConfig (103-123)
deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (4)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
  • VLLMBackend (13-13)
deploy/cloud/operator/internal/dynamo/graph.go (5)
  • Role (475-475)
  • RoleMain (480-480)
  • DynamoConfig (45-50)
  • RoleLeader (478-478)
  • RoleWorker (479-479)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3)
  • DynamoComponentDeploymentOverridesSpec (53-55)
  • DynamoComponentDeploymentSharedSpec (57-95)
  • DynamoConfig (103-123)
deploy/cloud/operator/internal/consts/consts.go (6)
  • MultinodeDeploymentType (51-51)
  • ComponentTypeMain (38-38)
  • MultinodeDeploymentTypeGrove (54-54)
  • ComponentTypeWorker (42-42)
  • ComponentTypePrefillWorker (43-43)
  • ComponentTypeDecodeWorker (44-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (33)
deploy/cloud/helm/crds/Chart.yaml (1)

19-19: LGTM! Appropriate version increment.

The chart version bump from 0.4.0 to 0.4.1 correctly reflects the additive CRD schema changes for grove multinode support without breaking existing functionality.

deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)

43-46: LGTM! Well-structured backend framework field.

The BackendFramework field is properly implemented with:

  • Appropriate kubebuilder validation enum constraining values to supported backends
  • Sensible default value ("vllm") for backward compatibility
  • Optional field design following Kubernetes API conventions
deploy/cloud/operator/internal/dynamo/backend_common.go (1)

8-37: LGTM! Well-designed utility function.

The applyFlagOverridesAndExtraArgs function is well-implemented with:

  • Correct handling of flag overrides and removals (nil values)
  • Deterministic output through sorting
  • Proper command-line flag formatting
  • Clear separation of flag processing and extra args appending
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)

253-257: LGTM! Correct deepcopy integration.

The deepcopy logic for the new DynamoConfig field in DynamoComponentDeploymentSharedSpec is correctly implemented with proper nil checking.


345-394: LGTM! Auto-generated deepcopy methods are correct.

The auto-generated deepcopy methods for DynamoConfig correctly handle:

  • Pointer fields with proper nil checking and allocation
  • Map with pointer-to-string values including nil value handling
  • String slice with proper copying

As this is auto-generated code, avoid manual modifications.

deploy/cloud/operator/internal/dynamo/backend_common_test.go (6)

3-8: LGTM!

The imports are appropriate and necessary for the test functionality. Good use of Gomega for handling non-deterministic map iteration and the ptr utility for creating test data.


10-17: LGTM!

Excellent use of table-driven tests with a clear, well-structured test case format. The field names are descriptive and the approach enables comprehensive scenario coverage.


18-84: Excellent test case coverage!

The test cases comprehensively cover all important scenarios:

  • Basic functionality without modifications
  • Flag overriding and addition
  • Flag removal via nil pointers
  • Extra arguments handling
  • Complex combinations
  • Edge cases with empty inputs

The expected results are correctly formatted as command-line flags, and the test data is realistic and well-organized.


86-95: Excellent handling of non-deterministic behavior!

The test execution properly uses ConsistOf matcher to handle non-deterministic map iteration order, with a clear comment explaining the rationale. The subtest structure with t.Run and proper Gomega setup follows best practices.


20-83: Test data values are accurate and comprehensive.

The test data correctly represents various scenarios:

  • Proper flag formatting with "--flag value" pattern
  • Correct use of pointers with nil for flag removal
  • Realistic combinations of flag overrides and extra arguments
  • Edge cases with empty inputs

All expected results align with the intended functionality and command-line flag conventions.


1-96: High-quality test implementation supporting the multinode backend framework.

This test file exemplifies excellent Go testing practices:

  • Comprehensive table-driven tests covering all scenarios
  • Proper handling of non-deterministic map iteration
  • Clear structure and documentation
  • Good integration with testing libraries

The test ensures correctness of a foundational utility that's critical for the new backend abstraction framework's command generation logic.

deploy/cloud/operator/internal/consts/consts.go (2)

41-44: LGTM! New component types support disaggregated serving architecture.

The new component type constants are well-documented and follow existing naming conventions. They properly support the multinode deployment functionality with clear distinctions between aggregated and disaggregated worker roles.


51-56: Excellent type-safe approach for multinode deployment types.

The new MultinodeDeploymentType provides type safety and prevents string literal errors. The "grove" constant directly supports the PR objective of adding grove multinode support, and the dedicated type enables better compile-time checking.

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

407-412: backendFramework schema addition LGTM

Enum + default are consistent (vllm / sglang) and match the operator refactor.
No issues spotted.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

47-52: backendFramework property LGTM – just verify operator defaulting

The enum + default definition is syntactically correct and aligns with the operator’s two supported back-ends. Confirm that the Go struct defining BackendFramework has the // +kubebuilder:default=vllm (or equivalent) tag so that both the CRD and webhook defaulting behave consistently.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

47-52: Addition aligns with new backend abstraction – looks good

The backendFramework field is correctly added at the same level as other root-spec properties, its default (vllm) matches one of the enumerated values, and the enum is restrictive (sglang / vllm) which prevents typos.
No issues detected here.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)

825-857: Test updates align well with the new backend framework and multinode support.

The test correctly validates:

  • New BackendFramework field propagation
  • ExtraPodMetadata merging into pod templates
  • Resource requests alongside limits
  • Proper structure for multinode deployments
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)

101-127: Excellent design for flexible backend configuration.

The DynamoConfig struct provides a clean, extensible approach that:

  • Unifies configuration across different backends
  • Allows fine-grained control via FlagOverrides (including flag removal with nil values)
  • Supports both single-node and multinode deployments declaratively
  • Simplifies the API by replacing complex backend-specific configs

This is a well-thought-out abstraction.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (3)

25-25: LGTM! Appropriate imports and constant addition

The new imports and DeploymentTypeMultinodeGrove constant are properly integrated and follow the existing patterns.

Also applies to: 39-40, 80-80


513-513: Good use of maps.Copy for label merging

Using maps.Copy is more idiomatic and safer than manual map copying. It properly handles nil maps and ensures all entries are copied.

Also applies to: 555-555


1215-1227: Excellent refactoring to centralize pod spec generation

The refactoring to use dynamo.GenerateBasePodSpecForController significantly simplifies the code and improves maintainability by:

  • Removing duplicate logic for environment variables, volumes, and security contexts
  • Centralizing backend-specific command and argument generation
  • Maintaining consistency across different deployment types

The use of maps.Copy for merging annotations and labels is also a good improvement.

Also applies to: 1304-1311

deploy/cloud/operator/internal/dynamo/graph.go (4)

150-150: Well-structured helper functions and framework propagation

Good additions:

  • Proper propagation of BackendFramework from graph to component deployments
  • Safe getNumberOfNodes helper with nil checks and sensible default
  • Clean mergeContainerCommand helper following the user-override pattern

Also applies to: 315-321, 464-470


491-501: Clean role expansion logic

The expandRolesForService function properly handles both single-node and multinode deployments with clear role assignments and replica counts.


682-748: Well-structured Grove deployment generation

The GenerateGrovePodGangSet function excellently handles:

  • Role-based pod clique generation for multinode deployments
  • Proper scaling group configuration
  • Metadata merging from extraPodMetadata
  • Clean separation between single-node and multinode logic

757-795: Good adapter pattern for controller compatibility

The controller-specific functions (ConvertDynamoComponentDeploymentToSpec and GenerateBasePodSpecForController) provide a clean adapter pattern that allows the controller to leverage the centralized backend logic while maintaining its specific requirements.

deploy/cloud/operator/internal/dynamo/backend_vllm.go (4)

13-13: LGTM!

Empty struct is appropriate for stateless backend implementation that satisfies the Backend interface.


86-106: LGTM!

The function correctly handles nil config, properly converts integer values to string flags, and uses the common helper function for applying overrides and extra arguments. Good separation of concerns.


1-13: LGTM!

Package declaration, imports, and struct definition are well-organized and follow Go conventions.


86-106: LGTM!

The helper function is well-structured with proper nil handling and follows good practices for building argument lists from configuration.

deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (4)

3-12: LGTM!

All imports are appropriate for the test functionality and are used within the file.


14-201: Excellent test coverage!

The test function provides comprehensive coverage of different scenarios including component types, multinode configurations, roles, and various DynamoConfig options. The table-driven approach with clear test names makes it easy to understand what each test validates.


203-355: Comprehensive test coverage for argument handling!

Both TestSGLangBackend_MergeArgs and TestBuildSGLangArgs provide excellent coverage:

  • Argument merging logic with default/user args scenarios
  • Multinode behavior validation
  • Configuration option testing with both positive and negative assertions
  • Proper handling of nil configurations

The table-driven approach with clear expectations makes the tests maintainable and easy to understand.


273-355: Excellent test coverage with both positive and negative assertions.

The test function demonstrates thorough testing practices by validating both what should be present (expectContains) and what should be absent (expectNotContains). This approach helps catch regression issues effectively.

@julienmancuso julienmancuso merged commit dabd226 into main Aug 11, 2025
8 of 10 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-243 branch August 11, 2025 20:31
Copy link
Contributor

@nvrohanv nvrohanv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for initial PR. In follow ups I think Grove Unit tests similar to LWS would be good. I also think we should discuss some way to decouple the backend (sglang, vllm, trtllm) code a little more from the orchestrator (LWS, Grove) etc so that it can be extended or modified in a more modular way

}

// Remove probes for multinode leader and worker
if role == RoleLeader || role == RoleWorker {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no probes? Is this only until the PublishNotReadyAddresses is default true in grove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a miss, probes should only be removed for the workers

}

// Generate the flags to add
flags := b.getMultinodeFlags(numberOfNodes, role, multinodeDeploymentType, serviceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users shouldnt provide it (and I'm not sure how theyd even figure out how to provide it) but I'm wondering if we should autodetect if the multinode flags were provided and either error out with a message or just override it. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what would be the best solution ...
I guess maybe implements a user override might make sense ?

if role == RoleLeader {
nodeRank = "0"
} else {
if multinodeDeploymentType == commonconsts.MultinodeDeploymentTypeGrove {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think its fine for now, at some point we probably want to decouple this a little bit so that adding in alternatives to Grove or LWS can be done without having to change the backend_sglang. We'll probably have to think up some interfaces for adding in a "super-pod" creator like Grove or LWS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes good idea, we can define an interface for deploymentType that the backend can also consume

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be put in a separate pr since this one is large, but will we have grove specific tests at some point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graph_test.go extensively covers grove multinode podgagnset generation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants