-
Notifications
You must be signed in to change notification settings - Fork 690
feat: add grove multinode support #2269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change introduces multinode deployment support for new backend frameworks—SGLang, VLLM, and TRTLLM—across the Dynamo operator, CRDs, and Helm charts. It adds multinode role abstractions, explicit startup dependencies, modular backend handling, and new resource fields (e.g., nodes). The update also includes comprehensive backend-specific logic, refactored pod spec generation, and extensive new unit tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Operator
participant BackendFactory
participant Backend
participant K8sAPI
Operator->>BackendFactory: Select backend (SGLang/VLLM/TRTLLM/Noop)
BackendFactory->>Backend: Return backend instance
Operator->>Backend: UpdateContainer/UpdatePodSpec (per role: leader/worker/main)
Backend-->>Operator: Modified container/pod spec
Operator->>K8sAPI: Create/Update Pod/PodGangSet with multinode roles and dependencies
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90 minutes Possibly related PRs
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Note 🔌 MCP (Model Context Protocol) integration is now available in Early Access!Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context. 📜 Recent review detailsConfiguration used: .coderabbit.yaml ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (20)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
🧹 Nitpick comments (9)
deploy/cloud/operator/go.mod (1)
9-9: LGTM! Dependency update supports new features.The grove operator API update aligns with the multinode support implementation. However, consider using tagged releases instead of pseudo-versions for better dependency management and reproducibility in production environments.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
407-412: Confirm field was generated via kubebuilder markers, not hand-editedThe new
backendFrameworkproperty looks syntactically correct (default matches one of the enum values).
Because this CRD YAML is meant to be fully auto-generated, please double-check that the corresponding Go type includes the proper kubebuilder tags, e.g.// +kubebuilder:validation:Enum=sglang;vllm // +kubebuilder:default=vllm BackendFramework string `json:"backendFramework,omitempty"`and that the file was regenerated with
make manifests(or equivalent). Otherwise the change will be lost on the next generation run.
Consider adding anadditionalPrinterColumnsentry so users can see the chosen backend directly viakubectl get.deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
474-493: Consider adding basic value validation for the newdynamoConfigsub-fields
dataParallelSize,numberOfNodes, andtensorParallelSizelogically cannot be negative or zero, yet the schema currently allows any int32. Addingminimum: 1(via// +kubebuilder:validation:Minimum=1on the Go struct) will prevent invalid specs from reaching the reconciler and avoid runtime errors.Likewise, for
extraArgsyou may wantx-kubernetes-list-type: atomicfor patch-friendly behaviour, mirroring other string arrays in the CRD.These tweaks belong in the Go API types, not this generated YAML.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)
474-493: Consider tightening validation on numeric sizes indynamoConfig
dataParallelSize,tensorParallelSize, andnumberOfNodesaccept any 32-bit integer, including zero or negative values, which are unlikely to be meaningful.
If the operator requires these to be positive (or at least ≥1) you can add aminimum: 1constraint via a+kubebuilder:validation:Minimum=1marker on the corresponding Go struct fields before regenerating the CRD.+ dataParallelSize: + minimum: 1 ... + numberOfNodes: + minimum: 1Not blocking, but worth verifying against controller logic.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1232-1264: Consider extracting probe configuration to constants or a helper functionThe default probe configurations are hardcoded with specific values. Consider extracting these to constants or a helper function for better maintainability and reusability across the codebase.
Example refactor:
func getDefaultLivenessProbe() *corev1.Probe { return &corev1.Probe{ InitialDelaySeconds: 60, PeriodSeconds: 60, TimeoutSeconds: 5, FailureThreshold: 10, SuccessThreshold: 1, ProbeHandler: corev1.ProbeHandler{ HTTPGet: &corev1.HTTPGetAction{ Path: "/healthz", Port: intstr.FromString(commonconsts.DynamoHealthPortName), }, }, } } func getDefaultReadinessProbe() *corev1.Probe { return &corev1.Probe{ InitialDelaySeconds: 60, PeriodSeconds: 60, TimeoutSeconds: 5, FailureThreshold: 10, SuccessThreshold: 1, ProbeHandler: corev1.ProbeHandler{ HTTPGet: &corev1.HTTPGetAction{ Path: "/readyz", Port: intstr.FromString(commonconsts.DynamoHealthPortName), }, }, } }deploy/cloud/operator/internal/dynamo/graph.go (1)
556-659: Consider breaking down GenerateBasePodSpec into smaller functionsThis function handles many responsibilities including container setup, environment variables, volumes, secrets, and pod spec merging. Consider extracting some logic into helper functions for better maintainability.
Suggested breakdown:
setupContainer: Initialize base container with ports and probesapplyExtraPodSpec: Handle extraPodSpec mergingsetupVolumes: Handle PVC and volume mountssetupImagePullSecrets: Handle secret retrieval and configurationThis would make the main function more readable and each piece more testable.
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
8-9: Consolidate duplicate consts imports.The same package is imported twice with different aliases. Consider consolidating to use a single import.
- "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts" - commonconsts "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts" + "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts"Then update usage from
commonconsts.ComponentTypeMaintoconsts.ComponentTypeMainthroughout the file.deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (2)
14-201: Well-structured comprehensive tests.The table-driven test approach is excellent and covers diverse scenarios including edge cases. The use of
expectContainsprovides flexible assertion capabilities.Minor suggestions for improvement:
Consider adding test cases for:
- LWS multinode deployment type (currently only Grove is tested)
- Error scenarios or invalid configurations
- Boundary conditions (e.g., numberOfNodes = 0)
+ { + name: "worker component multinode worker LWS", + componentType: commonconsts.ComponentTypeWorker, + numberOfNodes: 3, + role: RoleWorker, + multinodeDeploymentType: consts.MultinodeDeploymentTypeLWS, + component: &v1alpha1.DynamoComponentDeploymentOverridesSpec{ + DynamoComponentDeploymentSharedSpec: v1alpha1.DynamoComponentDeploymentSharedSpec{ + DynamoConfig: &v1alpha1.DynamoConfig{}, + }, + }, + expectedCmd: []string{"/bin/sh", "-c"}, + expectContains: []string{"python3 -m dynamo.sglang.worker", "dist-init-addr", "LWS_LEADER_ADDRESS"}, + },
203-271: Good test coverage for MergeArgs functionality.The test cases appropriately cover different scenarios and roles.
Suggestion for maintainability:
The complex expected result on line 257 could be fragile due to argument ordering. Consider using
expectContainspattern here too:- expectedResult: []string{"user args --custom-flag custom-value --dist-init-addr ${GROVE_HEADLESS_SERVICE}:29500 --dp-size 3 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --tp-size 2 --extra arg"}, + expectContains: []string{"user args", "--custom-flag custom-value", "--dist-init-addr", "${GROVE_HEADLESS_SERVICE}:29500", "--dp-size 3", "--nnodes 3", "--extra arg"},
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
deploy/cloud/operator/go.sumis excluded by!**/*.sum
📒 Files selected for processing (19)
deploy/cloud/helm/crds/Chart.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml(2 hunks)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go(3 hunks)deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go(1 hunks)deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go(2 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml(1 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml(2 hunks)deploy/cloud/operator/go.mod(1 hunks)deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(7 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go(6 hunks)deploy/cloud/operator/internal/dynamo/backend_common.go(1 hunks)deploy/cloud/operator/internal/dynamo/backend_common_test.go(1 hunks)deploy/cloud/operator/internal/dynamo/backend_sglang.go(1 hunks)deploy/cloud/operator/internal/dynamo/backend_sglang_test.go(1 hunks)deploy/cloud/operator/internal/dynamo/backend_vllm.go(1 hunks)deploy/cloud/operator/internal/dynamo/backend_vllm_test.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph.go(3 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: the stopsignal field under lifecycle in dynamocomponentdeployment crds is autogenerated due to kuber...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.godeploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.godeploy/cloud/helm/crds/Chart.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yamldeploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/operator/internal/consts/consts.godeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the `stopsignal` field in kubernetes crds like dynamographdeployment and dynamocomponentdeployment i...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.godeploy/cloud/helm/crds/Chart.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yamldeploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the `dyn_deployment_config` environment variable (commonconsts.dynamodeploymentconfigenvvar) in the ...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.godeploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yamldeploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.godeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/operator/internal/consts/consts.godeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: the image-builder serviceaccount in deploy/cloud/helm/platform/components/operator/templates/image-b...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.godeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/operator/internal/dynamo/graph.go
📚 Learning: crd schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from kubernetes...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
deploy/cloud/helm/crds/Chart.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yamldeploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yamldeploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yamldeploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.godeploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
📚 Learning: in the dynamo operator, the project’s preferred security posture is to set a pod-level `podsecurityc...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
📚 Learning: in vllm worker deployments, startup probes (with longer periods and higher failure thresholds like p...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Applied to files:
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
📚 Learning: in the dynamo codebase, componenttypeplanner constants with different cases ("planner" vs "planner")...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.
Applied to files:
deploy/cloud/operator/internal/consts/consts.go
🧬 Code Graph Analysis (3)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
deploy/cloud/operator/internal/dynamo/graph.go (1)
BackendFramework(505-505)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
DynamoConfig(103-123)
deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (4)
deploy/cloud/operator/internal/dynamo/backend_vllm.go (1)
VLLMBackend(13-13)deploy/cloud/operator/internal/dynamo/graph.go (5)
Role(475-475)RoleMain(480-480)DynamoConfig(45-50)RoleLeader(478-478)RoleWorker(479-479)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3)
DynamoComponentDeploymentOverridesSpec(53-55)DynamoComponentDeploymentSharedSpec(57-95)DynamoConfig(103-123)deploy/cloud/operator/internal/consts/consts.go (6)
MultinodeDeploymentType(51-51)ComponentTypeMain(38-38)MultinodeDeploymentTypeGrove(54-54)ComponentTypeWorker(42-42)ComponentTypePrefillWorker(43-43)ComponentTypeDecodeWorker(44-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (33)
deploy/cloud/helm/crds/Chart.yaml (1)
19-19: LGTM! Appropriate version increment.The chart version bump from 0.4.0 to 0.4.1 correctly reflects the additive CRD schema changes for grove multinode support without breaking existing functionality.
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
43-46: LGTM! Well-structured backend framework field.The
BackendFrameworkfield is properly implemented with:
- Appropriate kubebuilder validation enum constraining values to supported backends
- Sensible default value ("vllm") for backward compatibility
- Optional field design following Kubernetes API conventions
deploy/cloud/operator/internal/dynamo/backend_common.go (1)
8-37: LGTM! Well-designed utility function.The
applyFlagOverridesAndExtraArgsfunction is well-implemented with:
- Correct handling of flag overrides and removals (nil values)
- Deterministic output through sorting
- Proper command-line flag formatting
- Clear separation of flag processing and extra args appending
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2)
253-257: LGTM! Correct deepcopy integration.The deepcopy logic for the new
DynamoConfigfield inDynamoComponentDeploymentSharedSpecis correctly implemented with proper nil checking.
345-394: LGTM! Auto-generated deepcopy methods are correct.The auto-generated deepcopy methods for
DynamoConfigcorrectly handle:
- Pointer fields with proper nil checking and allocation
- Map with pointer-to-string values including nil value handling
- String slice with proper copying
As this is auto-generated code, avoid manual modifications.
deploy/cloud/operator/internal/dynamo/backend_common_test.go (6)
3-8: LGTM!The imports are appropriate and necessary for the test functionality. Good use of Gomega for handling non-deterministic map iteration and the ptr utility for creating test data.
10-17: LGTM!Excellent use of table-driven tests with a clear, well-structured test case format. The field names are descriptive and the approach enables comprehensive scenario coverage.
18-84: Excellent test case coverage!The test cases comprehensively cover all important scenarios:
- Basic functionality without modifications
- Flag overriding and addition
- Flag removal via nil pointers
- Extra arguments handling
- Complex combinations
- Edge cases with empty inputs
The expected results are correctly formatted as command-line flags, and the test data is realistic and well-organized.
86-95: Excellent handling of non-deterministic behavior!The test execution properly uses
ConsistOfmatcher to handle non-deterministic map iteration order, with a clear comment explaining the rationale. The subtest structure witht.Runand proper Gomega setup follows best practices.
20-83: Test data values are accurate and comprehensive.The test data correctly represents various scenarios:
- Proper flag formatting with "--flag value" pattern
- Correct use of pointers with nil for flag removal
- Realistic combinations of flag overrides and extra arguments
- Edge cases with empty inputs
All expected results align with the intended functionality and command-line flag conventions.
1-96: High-quality test implementation supporting the multinode backend framework.This test file exemplifies excellent Go testing practices:
- Comprehensive table-driven tests covering all scenarios
- Proper handling of non-deterministic map iteration
- Clear structure and documentation
- Good integration with testing libraries
The test ensures correctness of a foundational utility that's critical for the new backend abstraction framework's command generation logic.
deploy/cloud/operator/internal/consts/consts.go (2)
41-44: LGTM! New component types support disaggregated serving architecture.The new component type constants are well-documented and follow existing naming conventions. They properly support the multinode deployment functionality with clear distinctions between aggregated and disaggregated worker roles.
51-56: Excellent type-safe approach for multinode deployment types.The new
MultinodeDeploymentTypeprovides type safety and prevents string literal errors. The "grove" constant directly supports the PR objective of adding grove multinode support, and the dedicated type enables better compile-time checking.deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)
407-412:backendFrameworkschema addition LGTMEnum + default are consistent (
vllm/sglang) and match the operator refactor.
No issues spotted.deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
47-52:backendFrameworkproperty LGTM – just verify operator defaultingThe enum + default definition is syntactically correct and aligns with the operator’s two supported back-ends. Confirm that the Go struct defining
BackendFrameworkhas the// +kubebuilder:default=vllm(or equivalent) tag so that both the CRD and webhook defaulting behave consistently.deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)
47-52: Addition aligns with new backend abstraction – looks goodThe
backendFrameworkfield is correctly added at the same level as other root-spec properties, its default (vllm) matches one of the enumerated values, and the enum is restrictive (sglang/vllm) which prevents typos.
No issues detected here.deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)
825-857: Test updates align well with the new backend framework and multinode support.The test correctly validates:
- New
BackendFrameworkfield propagationExtraPodMetadatamerging into pod templates- Resource requests alongside limits
- Proper structure for multinode deployments
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
101-127: Excellent design for flexible backend configuration.The
DynamoConfigstruct provides a clean, extensible approach that:
- Unifies configuration across different backends
- Allows fine-grained control via
FlagOverrides(including flag removal with nil values)- Supports both single-node and multinode deployments declaratively
- Simplifies the API by replacing complex backend-specific configs
This is a well-thought-out abstraction.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (3)
25-25: LGTM! Appropriate imports and constant additionThe new imports and
DeploymentTypeMultinodeGroveconstant are properly integrated and follow the existing patterns.Also applies to: 39-40, 80-80
513-513: Good use of maps.Copy for label mergingUsing
maps.Copyis more idiomatic and safer than manual map copying. It properly handles nil maps and ensures all entries are copied.Also applies to: 555-555
1215-1227: Excellent refactoring to centralize pod spec generationThe refactoring to use
dynamo.GenerateBasePodSpecForControllersignificantly simplifies the code and improves maintainability by:
- Removing duplicate logic for environment variables, volumes, and security contexts
- Centralizing backend-specific command and argument generation
- Maintaining consistency across different deployment types
The use of
maps.Copyfor merging annotations and labels is also a good improvement.Also applies to: 1304-1311
deploy/cloud/operator/internal/dynamo/graph.go (4)
150-150: Well-structured helper functions and framework propagationGood additions:
- Proper propagation of
BackendFrameworkfrom graph to component deployments- Safe
getNumberOfNodeshelper with nil checks and sensible default- Clean
mergeContainerCommandhelper following the user-override patternAlso applies to: 315-321, 464-470
491-501: Clean role expansion logicThe
expandRolesForServicefunction properly handles both single-node and multinode deployments with clear role assignments and replica counts.
682-748: Well-structured Grove deployment generationThe
GenerateGrovePodGangSetfunction excellently handles:
- Role-based pod clique generation for multinode deployments
- Proper scaling group configuration
- Metadata merging from extraPodMetadata
- Clean separation between single-node and multinode logic
757-795: Good adapter pattern for controller compatibilityThe controller-specific functions (
ConvertDynamoComponentDeploymentToSpecandGenerateBasePodSpecForController) provide a clean adapter pattern that allows the controller to leverage the centralized backend logic while maintaining its specific requirements.deploy/cloud/operator/internal/dynamo/backend_vllm.go (4)
13-13: LGTM!Empty struct is appropriate for stateless backend implementation that satisfies the Backend interface.
86-106: LGTM!The function correctly handles nil config, properly converts integer values to string flags, and uses the common helper function for applying overrides and extra arguments. Good separation of concerns.
1-13: LGTM!Package declaration, imports, and struct definition are well-organized and follow Go conventions.
86-106: LGTM!The helper function is well-structured with proper nil handling and follows good practices for building argument lists from configuration.
deploy/cloud/operator/internal/dynamo/backend_sglang_test.go (4)
3-12: LGTM!All imports are appropriate for the test functionality and are used within the file.
14-201: Excellent test coverage!The test function provides comprehensive coverage of different scenarios including component types, multinode configurations, roles, and various DynamoConfig options. The table-driven approach with clear test names makes it easy to understand what each test validates.
203-355: Comprehensive test coverage for argument handling!Both
TestSGLangBackend_MergeArgsandTestBuildSGLangArgsprovide excellent coverage:
- Argument merging logic with default/user args scenarios
- Multinode behavior validation
- Configuration option testing with both positive and negative assertions
- Proper handling of nil configurations
The table-driven approach with clear expectations makes the tests maintainable and easy to understand.
273-355: Excellent test coverage with both positive and negative assertions.The test function demonstrates thorough testing practices by validating both what should be present (
expectContains) and what should be absent (expectNotContains). This approach helps catch regression issues effectively.
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
Outdated
Show resolved
Hide resolved
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
Outdated
Show resolved
Hide resolved
fb0571b to
c3f0d37
Compare
c3f0d37 to
9ebeefa
Compare
nvrohanv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for initial PR. In follow ups I think Grove Unit tests similar to LWS would be good. I also think we should discuss some way to decouple the backend (sglang, vllm, trtllm) code a little more from the orchestrator (LWS, Grove) etc so that it can be extended or modified in a more modular way
| } | ||
|
|
||
| // Remove probes for multinode leader and worker | ||
| if role == RoleLeader || role == RoleWorker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no probes? Is this only until the PublishNotReadyAddresses is default true in grove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a miss, probes should only be removed for the workers
| } | ||
|
|
||
| // Generate the flags to add | ||
| flags := b.getMultinodeFlags(numberOfNodes, role, multinodeDeploymentType, serviceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users shouldnt provide it (and I'm not sure how theyd even figure out how to provide it) but I'm wondering if we should autodetect if the multinode flags were provided and either error out with a message or just override it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what would be the best solution ...
I guess maybe implements a user override might make sense ?
| if role == RoleLeader { | ||
| nodeRank = "0" | ||
| } else { | ||
| if multinodeDeploymentType == commonconsts.MultinodeDeploymentTypeGrove { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think its fine for now, at some point we probably want to decouple this a little bit so that adding in alternatives to Grove or LWS can be done without having to change the backend_sglang. We'll probably have to think up some interfaces for adding in a "super-pod" creator like Grove or LWS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes good idea, we can define an interface for deploymentType that the backend can also consume
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be put in a separate pr since this one is large, but will we have grove specific tests at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
graph_test.go extensively covers grove multinode podgagnset generation
Overview:
add grove multinode support
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores