Commit 7c62081

feat: install dynamo operator cluster-wide by default (#3199)
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
1 parent: 088295e

File tree

12 files changed: +278 -20 lines changed


deploy/cloud/helm/platform/README.md

Lines changed: 46 additions & 2 deletions

````diff
@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
 - Sufficient cluster resources for your deployment scale
 - Container registry access (if using private images)
 
+## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
+
+### Single Cluster-Wide Operator (Recommended)
+
+**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**
+
+- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
+- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster
+
+### Multiple Namespace-Scoped Operators (Advanced)
+
+If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:
+
+```yaml
+# values.yaml
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+    targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
+```
+
+### Validation and Safety
+
+The chart includes built-in validation to prevent operator conflicts:
+
+- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
+- **Prevents Multiple Cluster-Wide**: Installation fails if another cluster-wide operator exists
+- **Prevents Mixed Deployments (Type 1)**: Installation fails when installing a namespace-restricted operator while a cluster-wide operator exists
+- **Prevents Mixed Deployments (Type 2)**: Installation fails when installing a cluster-wide operator while namespace-restricted operators exist
+- **Safe Defaults**: Leader election uses a shared ID for proper coordination
+
+#### 🚫 **Blocked Conflict Scenarios**
+
+| Existing Operator | New Operator | Status | Reason |
+|-------------------|--------------|--------|--------|
+| None | Cluster-wide | ✅ **Allowed** | No conflicts |
+| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
+| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
+| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
+| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
+| Namespace-restricted A | Namespace-restricted B (different namespace) | ✅ **Allowed** | Different scopes |
+
 ## 🔧 Configuration
 
 ## Requirements
@@ -58,11 +100,13 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
 | dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
 | dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
 | dynamo-operator.modelExpressURL | string | `""` | URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true). |
-| dynamo-operator.namespaceRestriction | object | `{"enabled":true,"targetNamespace":null}` | Namespace access controls for the operator |
-| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces |
+| dynamo-operator.namespaceRestriction | object | `{"enabled":false,"targetNamespace":null}` | Namespace access controls for the operator |
+| dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict the operator to specific namespaces. By default the operator runs with cluster-wide permissions, and only one instance should be deployed per cluster. To deploy multiple operator instances, set this to true and specify the target namespace (by default, the target namespace is the Helm release namespace). |
 | dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
 | dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
 | dynamo-operator.controllerManager.affinity | list | `[]` | Affinity for controller manager pods |
+| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. |
+| dynamo-operator.controllerManager.leaderElection.namespace | string | `""` | Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election. |
 | dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
 | dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
 | dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
````
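For orientation, the two documented modes map onto plain Helm invocations. A minimal sketch, assuming a local checkout of the chart path shown in this commit; release names and namespaces are illustrative, while the `dynamo-operator.*` keys come from the values table above:

```bash
# Default: a single cluster-wide operator for the whole cluster
helm install dynamo-platform ./deploy/cloud/helm/platform \
  --namespace dynamo-system --create-namespace

# Advanced: one namespace-scoped operator per tenant namespace
helm install tenant-a ./deploy/cloud/helm/platform \
  --namespace tenant-a --create-namespace \
  --set "dynamo-operator.namespaceRestriction.enabled=true" \
  --set "dynamo-operator.namespaceRestriction.targetNamespace=tenant-a"
```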

deploy/cloud/helm/platform/README.md.gotmpl

Lines changed: 42 additions & 0 deletions

````diff
@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
 - Sufficient cluster resources for your deployment scale
 - Container registry access (if using private images)
 
+## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
+
+### Single Cluster-Wide Operator (Recommended)
+
+**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**
+
+- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
+- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster
+
+### Multiple Namespace-Scoped Operators (Advanced)
+
+If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:
+
+```yaml
+# values.yaml
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+    targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
+```
+
+### Validation and Safety
+
+The chart includes built-in validation to prevent operator conflicts:
+
+- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
+- **Prevents Multiple Cluster-Wide**: Installation fails if another cluster-wide operator exists
+- **Prevents Mixed Deployments (Type 1)**: Installation fails when installing a namespace-restricted operator while a cluster-wide operator exists
+- **Prevents Mixed Deployments (Type 2)**: Installation fails when installing a cluster-wide operator while namespace-restricted operators exist
+- **Safe Defaults**: Leader election uses a shared ID for proper coordination
+
+#### 🚫 **Blocked Conflict Scenarios**
+
+| Existing Operator | New Operator | Status | Reason |
+|-------------------|--------------|--------|--------|
+| None | Cluster-wide | ✅ **Allowed** | No conflicts |
+| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
+| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
+| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
+| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
+| Namespace-restricted A | Namespace-restricted B (different namespace) | ✅ **Allowed** | Different scopes |
+
 ## 🔧 Configuration
 
 {{ template "chart.requirementsSection" . }}
````
New file

Lines changed: 125 additions & 0 deletions

@@ -0,0 +1,125 @@

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

{{/*
Validation to prevent operator conflicts.
Prevents all conflict scenarios:
  1. Multiple cluster-wide operators (multiple cluster managers)
  2. Namespace-restricted operator when a cluster-wide operator exists (both would manage the same resources)
  3. Cluster-wide operator when namespace-restricted operators exist (both would manage the same resources)
*/}}
{{- define "dynamo-operator.validateClusterWideInstallation" -}}
{{- $currentReleaseName := .Release.Name -}}

{{/* Check for existing namespace-restricted operators (only when installing cluster-wide) */}}
{{- if not .Values.namespaceRestriction.enabled -}}
  {{- $allRoles := lookup "rbac.authorization.k8s.io/v1" "Role" "" "" -}}
  {{- $namespaceRestrictedOperators := list -}}

  {{- if $allRoles -}}
    {{- range $role := $allRoles.items -}}
      {{- if and (contains "-dynamo-operator-" $role.metadata.name) (hasSuffix "-manager-role" $role.metadata.name) -}}
        {{- $namespaceRestrictedOperators = append $namespaceRestrictedOperators $role.metadata.namespace -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}

  {{- if $namespaceRestrictedOperators -}}
    {{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}}
  {{- end -}}
{{- end -}}

{{/* Check for existing ClusterRoles that would indicate other cluster-wide installations */}}
{{- $existingClusterRoles := lookup "rbac.authorization.k8s.io/v1" "ClusterRole" "" "" -}}
{{- $foundExistingClusterWideOperator := false -}}
{{- $existingOperatorRelease := "" -}}
{{- $existingOperatorRoleName := "" -}}
{{- $existingOperatorNamespace := "" -}}

{{- if $existingClusterRoles -}}
  {{- range $cr := $existingClusterRoles.items -}}
    {{- if and (contains "-dynamo-operator-" $cr.metadata.name) (hasSuffix "-manager-role" $cr.metadata.name) -}}
      {{- $currentRoleName := printf "%s-dynamo-operator-manager-role" $currentReleaseName -}}
      {{- if ne $cr.metadata.name $currentRoleName -}}
        {{- $foundExistingClusterWideOperator = true -}}
        {{- $existingOperatorRoleName = $cr.metadata.name -}}
        {{- if $cr.metadata.labels -}}
          {{- if $cr.metadata.labels.release -}}
            {{- $existingOperatorRelease = $cr.metadata.labels.release -}}
          {{- else if index $cr.metadata.labels "app.kubernetes.io/instance" -}}
            {{- $existingOperatorRelease = index $cr.metadata.labels "app.kubernetes.io/instance" -}}
          {{- end -}}
        {{- end -}}

        {{/* Find the namespace by looking at ClusterRoleBinding subjects */}}
        {{- $clusterRoleBindings := lookup "rbac.authorization.k8s.io/v1" "ClusterRoleBinding" "" "" -}}
        {{- if $clusterRoleBindings -}}
          {{- range $crb := $clusterRoleBindings.items -}}
            {{- if eq $crb.roleRef.name $cr.metadata.name -}}
              {{- range $subject := $crb.subjects -}}
                {{- if and (eq $subject.kind "ServiceAccount") $subject.namespace -}}
                  {{- $existingOperatorNamespace = $subject.namespace -}}
                {{- end -}}
              {{- end -}}
            {{- end -}}
          {{- end -}}
        {{- end -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{- if $foundExistingClusterWideOperator -}}
  {{- $uninstallCmd := printf "helm uninstall %s" $existingOperatorRelease -}}
  {{- if $existingOperatorNamespace -}}
    {{- $uninstallCmd = printf "helm uninstall %s -n %s" $existingOperatorRelease $existingOperatorNamespace -}}
  {{- end -}}

  {{- if .Values.namespaceRestriction.enabled -}}
    {{- if $existingOperatorNamespace -}}
      {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
    {{- else -}}
      {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
    {{- end -}}
  {{- else -}}
    {{- if $existingOperatorNamespace -}}
      {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
    {{- else -}}
      {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{/* Additional validation for cluster-wide mode */}}
{{- if not .Values.namespaceRestriction.enabled -}}
  {{/* Fail on a custom leader election ID: all cluster-wide operators must share one ID */}}
  {{- $leaderElectionId := default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id -}}
  {{- if ne $leaderElectionId "dynamo.nvidia.com" -}}
    {{- fail (printf "VALIDATION WARNING: Using custom leader election ID '%s' in cluster-wide mode. For proper coordination, all cluster-wide Dynamo operators should use the SAME leader election ID. Different IDs will allow multiple leaders simultaneously (split-brain scenario)." $leaderElectionId) -}}
  {{- end -}}
{{- end -}}
{{- end -}}

{{/*
Validation for configuration consistency
*/}}
{{- define "dynamo-operator.validateConfiguration" -}}
{{/* Validate leader election namespace setting */}}
{{- if and (not .Values.namespaceRestriction.enabled) .Values.controllerManager.leaderElection.namespace -}}
  {{- if eq .Values.controllerManager.leaderElection.namespace .Release.Namespace -}}
    {{- printf "\nWARNING: Leader election namespace is set to the same as release namespace (%s) in cluster-wide mode. This may prevent proper coordination between multiple releases. Consider using 'kube-system' or leaving empty for default.\n" .Release.Namespace | fail -}}
  {{- end -}}
{{- end -}}
{{- end -}}
```
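One practical note on this guard: Helm's `lookup` only queries a live API server, so the checks fire during an actual `helm install` or `helm upgrade`; `helm template` renders with empty lookup results and cannot catch conflicts. A hypothetical sequence (release and namespace names are illustrative, and the error text is abridged from the `fail` messages above):

```bash
# First cluster-wide install succeeds
helm install dynamo-a ./deploy/cloud/helm/platform -n ns-a --create-namespace

# A second cluster-wide install is rejected at render time
helm install dynamo-b ./deploy/cloud/helm/platform -n ns-b --create-namespace
# Error: VALIDATION ERROR: Found existing cluster-wide Dynamo operator
# from release 'dynamo-a' ... Only one cluster-wide Dynamo operator
# should be deployed per cluster.

# On Helm >= 3.13, a server-side dry run evaluates lookups too, so the
# conflict check can be exercised without touching the cluster:
helm install dynamo-b ./deploy/cloud/helm/platform -n ns-b --dry-run=server
```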

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml

Lines changed: 7 additions & 1 deletion

```diff
@@ -12,6 +12,11 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
+{{/* Validate installation to prevent conflicts */}}
+{{- include "dynamo-operator.validateClusterWideInstallation" . -}}
+{{- include "dynamo-operator.validateConfiguration" . -}}
+
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -76,7 +81,8 @@ spec:
             - --leader-elect=false
           {{- else }}
             - --leader-elect
-            - --leader-election-id=dynamo.nvidia.com
+            - --leader-election-id={{ default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id }}
+            - --leader-election-namespace={{ default "kube-system" .Values.controllerManager.leaderElection.namespace }}
           {{- end }}
           {{- if .Values.natsAddr }}
             - --natsAddr={{ .Values.natsAddr }}
```
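Since the operator is a controller-runtime manager, `--leader-election-id` should become the name of a `coordination.k8s.io/v1` Lease in the election namespace, which makes leadership observable. A sketch assuming the defaults wired in above (`dynamo.nvidia.com` in `kube-system`) and controller-runtime's default Lease-based lock:

```bash
# Which controller-manager replica currently holds leadership?
kubectl get lease dynamo.nvidia.com -n kube-system \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
```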

deploy/cloud/helm/platform/components/operator/templates/leader-election-rbac.yaml

Lines changed: 11 additions & 4 deletions

```diff
@@ -12,8 +12,14 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+{{/*
+Only create leader election RBAC when leader election is enabled.
+When namespaceRestriction.enabled=true, leader election is disabled (--leader-elect=false),
+so these permissions are not needed.
+*/}}
+{{- if not .Values.namespaceRestriction.enabled }}
 apiVersion: rbac.authorization.k8s.io/v1
-kind: Role
+kind: ClusterRole
 metadata:
   name: {{ include "dynamo-operator.fullname" . }}-leader-election-role
   labels:
@@ -55,7 +61,7 @@ rules:
   - patch
 ---
 apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
+kind: ClusterRoleBinding
 metadata:
   name: {{ include "dynamo-operator.fullname" . }}-leader-election-rolebinding
   labels:
@@ -65,9 +71,10 @@ metadata:
     {{- include "dynamo-operator.labels" . | nindent 4 }}
 roleRef:
   apiGroup: rbac.authorization.k8s.io
-  kind: Role
+  kind: ClusterRole
   name: '{{ include "dynamo-operator.fullname" . }}-leader-election-role'
 subjects:
   - kind: ServiceAccount
     name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
-    namespace: '{{ .Release.Namespace }}'
+    namespace: '{{ .Release.Namespace }}'
+{{- end }}
```
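The switch from Role/RoleBinding to ClusterRole/ClusterRoleBinding follows from the deployment change above: the leases now default to `kube-system` rather than the release namespace, and a Role scoped to the release namespace could not grant access there. A quick, hedged way to confirm which mode rendered (the grep pattern assumes the release name is part of the object names):

```bash
# Cluster-wide mode: leader-election RBAC is cluster-scoped
kubectl get clusterrole,clusterrolebinding | grep leader-election

# Namespace-restricted mode skips this template entirely (leader election
# is disabled), so no leader-election RBAC should exist for the release
```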

deploy/cloud/helm/platform/components/operator/values.yaml

Lines changed: 13 additions & 0 deletions

```diff
@@ -27,6 +27,19 @@ namespaceRestriction:
   targetNamespace: ""
 controllerManager:
   tolerations: []
+
+  # Leader election configuration
+  leaderElection:
+    # Leader election ID for cluster-wide coordination
+    # WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain
+    # Different IDs would allow multiple leaders simultaneously
+    id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators)
+
+    # Namespace for leader election leases (only used in cluster-wide mode)
+    # If empty, defaults to kube-system for cluster-wide coordination
+    # All cluster-wide operators should use the SAME namespace for proper leader election
+    namespace: ""
+
 kubeRbacProxy:
   args:
     - --secure-listen-address=0.0.0.0:8443
```
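Teams that manage several releases may prefer to pin the shared coordination settings explicitly rather than rely on the empty-string defaults. A sketch using the platform chart's key prefix; the values shown are just the documented defaults, and the validation template above intentionally rejects any other `id` in cluster-wide mode:

```bash
helm upgrade --install dynamo-platform ./deploy/cloud/helm/platform \
  -n dynamo-system \
  --set "dynamo-operator.controllerManager.leaderElection.id=dynamo.nvidia.com" \
  --set "dynamo-operator.controllerManager.leaderElection.namespace=kube-system"
```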
