48 changes: 46 additions & 2 deletions deploy/cloud/helm/platform/README.md
@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)

## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment

### Single Cluster-Wide Operator (Recommended)

**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**

- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster

### Multiple Namespace-Scoped Operators (Advanced)

If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:

```yaml
# values.yaml
dynamo-operator:
namespaceRestriction:
enabled: true
targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
```
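Each namespace-restricted release manages resources only in its own namespace, so several releases can coexist on one cluster (see the conflict table below). As a minimal sketch — the namespace name here is a placeholder — a second tenant could get its own operator with values such as:

```yaml
# values-tenant-b.yaml (illustrative; adjust the namespace to your tenant)
dynamo-operator:
  namespaceRestriction:
    enabled: true
    targetNamespace: "tenant-b"   # this operator watches only the tenant-b namespace
```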

### Validation and Safety

The chart includes built-in validation that blocks conflicting operator installations:

- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists
- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists
- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist
- **Safe Defaults**: Leader election uses a shared ID and lease namespace for proper coordination

#### **Allowed and Blocked Scenarios**

| Existing Operator | New Operator | Status | Reason |
|-------------------|--------------|---------|--------|
| None | Cluster-wide | ✅ **Allowed** | No conflicts |
| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes |
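In the default cluster-wide mode, the leader election settings documented in the values table below can normally be left empty. If you set them explicitly, keep them identical across releases; here is a minimal sketch that simply spells out the documented defaults:

```yaml
# values.yaml (cluster-wide mode; values shown are the documented defaults)
dynamo-operator:
  namespaceRestriction:
    enabled: false
  controllerManager:
    leaderElection:
      id: "dynamo.nvidia.com"   # all cluster-wide operators must share the same ID
      namespace: "kube-system"  # shared namespace for leader election leases
```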

## 🔧 Configuration

## Requirements
@@ -58,11 +100,13 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
| dynamo-operator.modelExpressURL | string | `""` | URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true). |
| dynamo-operator.namespaceRestriction | object | `{"enabled":true,"targetNamespace":null}` | Namespace access controls for the operator |
| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces |
| dynamo-operator.namespaceRestriction | object | `{"enabled":false,"targetNamespace":null}` | Namespace access controls for the operator |
| dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). |
| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
| dynamo-operator.controllerManager.affinity | list | `[]` | Affinity for controller manager pods |
| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. |
| dynamo-operator.controllerManager.leaderElection.namespace | string | `""` | Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election. |
| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
42 changes: 42 additions & 0 deletions deploy/cloud/helm/platform/README.md.gotmpl
@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)

## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment

### Single Cluster-Wide Operator (Recommended)

**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**

- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster

### Multiple Namespace-Scoped Operators (Advanced)

If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:

```yaml
# values.yaml
dynamo-operator:
namespaceRestriction:
enabled: true
targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
```

### Validation and Safety

The chart includes built-in validation that blocks conflicting operator installations:

- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists
- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists
- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist
- **Safe Defaults**: Leader election uses a shared ID and lease namespace for proper coordination

#### **Allowed and Blocked Scenarios**

| Existing Operator | New Operator | Status | Reason |
|-------------------|--------------|---------|--------|
| None | Cluster-wide | ✅ **Allowed** | No conflicts |
| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes |

## 🔧 Configuration

{{ template "chart.requirementsSection" . }}
@@ -0,0 +1,125 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

{{/*
Validation to prevent operator conflicts
Prevents all conflict scenarios:
1. Multiple cluster-wide operators (multiple cluster managers)
2. Namespace-restricted operator when cluster-wide exists (both would manage same resources)
3. Cluster-wide operator when namespace-restricted exist (both would manage same resources)
*/}}
{{- define "dynamo-operator.validateClusterWideInstallation" -}}
{{- $currentReleaseName := .Release.Name -}}

{{/* Check for existing namespace-restricted operators (only when installing cluster-wide) */}}
{{- if not .Values.namespaceRestriction.enabled -}}
{{- $allRoles := lookup "rbac.authorization.k8s.io/v1" "Role" "" "" -}}
{{- $namespaceRestrictedOperators := list -}}

{{- if $allRoles -}}
{{- range $role := $allRoles.items -}}
{{- if and (contains "-dynamo-operator-" $role.metadata.name) (hasSuffix "-manager-role" $role.metadata.name) -}}
{{- $namespaceRestrictedOperators = append $namespaceRestrictedOperators $role.metadata.namespace -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{- if $namespaceRestrictedOperators -}}
{{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}}
{{- end -}}
{{- end -}}

{{/* Check for existing ClusterRoles that would indicate other cluster-wide installations */}}
{{- $existingClusterRoles := lookup "rbac.authorization.k8s.io/v1" "ClusterRole" "" "" -}}
{{- $foundExistingClusterWideOperator := false -}}
{{- $existingOperatorRelease := "" -}}
{{- $existingOperatorRoleName := "" -}}
{{- $existingOperatorNamespace := "" -}}

{{- if $existingClusterRoles -}}
{{- range $cr := $existingClusterRoles.items -}}
{{- if and (contains "-dynamo-operator-" $cr.metadata.name) (hasSuffix "-manager-role" $cr.metadata.name) -}}
{{- $currentRoleName := printf "%s-dynamo-operator-manager-role" $currentReleaseName -}}
{{- if ne $cr.metadata.name $currentRoleName -}}
{{- $foundExistingClusterWideOperator = true -}}
{{- $existingOperatorRoleName = $cr.metadata.name -}}
{{- if $cr.metadata.labels -}}
{{- if $cr.metadata.labels.release -}}
{{- $existingOperatorRelease = $cr.metadata.labels.release -}}
{{- else if index $cr.metadata.labels "app.kubernetes.io/instance" -}}
{{- $existingOperatorRelease = index $cr.metadata.labels "app.kubernetes.io/instance" -}}
{{- end -}}
{{- end -}}

{{/* Find the namespace by looking at ClusterRoleBinding subjects */}}
{{- $clusterRoleBindings := lookup "rbac.authorization.k8s.io/v1" "ClusterRoleBinding" "" "" -}}
{{- if $clusterRoleBindings -}}
{{- range $crb := $clusterRoleBindings.items -}}
{{- if eq $crb.roleRef.name $cr.metadata.name -}}
{{- range $subject := $crb.subjects -}}
{{- if and (eq $subject.kind "ServiceAccount") $subject.namespace -}}
{{- $existingOperatorNamespace = $subject.namespace -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{- if $foundExistingClusterWideOperator -}}
{{- $uninstallCmd := printf "helm uninstall %s" $existingOperatorRelease -}}
{{- if $existingOperatorNamespace -}}
{{- $uninstallCmd = printf "helm uninstall %s -n %s" $existingOperatorRelease $existingOperatorNamespace -}}
{{- end -}}

{{- if .Values.namespaceRestriction.enabled -}}
{{- if $existingOperatorNamespace -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
{{- else -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
{{- end -}}
{{- else -}}
{{- if $existingOperatorNamespace -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
{{- else -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/* Additional validation for cluster-wide mode */}}
{{- if not .Values.namespaceRestriction.enabled -}}
{{/* Fail if a non-default leader election ID is used */}}
{{- $leaderElectionId := default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id -}}
{{- if ne $leaderElectionId "dynamo.nvidia.com" -}}
{{- fail (printf "VALIDATION WARNING: Using custom leader election ID '%s' in cluster-wide mode. For proper coordination, all cluster-wide Dynamo operators should use the SAME leader election ID. Different IDs will allow multiple leaders simultaneously (split-brain scenario)." $leaderElectionId) -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Validation for configuration consistency
*/}}
{{- define "dynamo-operator.validateConfiguration" -}}
{{/* Validate leader election namespace setting */}}
{{- if and (not .Values.namespaceRestriction.enabled) .Values.controllerManager.leaderElection.namespace -}}
{{- if eq .Values.controllerManager.leaderElection.namespace .Release.Namespace -}}
{{- printf "\nWARNING: Leader election namespace is set to the same as release namespace (%s) in cluster-wide mode. This may prevent proper coordination between multiple releases. Consider using 'kube-system' or leaving empty for default.\n" .Release.Namespace | fail -}}
{{- end -}}
{{- end -}}
{{- end -}}
@@ -12,6 +12,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

{{/* Validate installation to prevent conflicts */}}
{{- include "dynamo-operator.validateClusterWideInstallation" . -}}
{{- include "dynamo-operator.validateConfiguration" . -}}

---
apiVersion: apps/v1
kind: Deployment
@@ -76,7 +81,8 @@ spec:
- --leader-elect=false
{{- else }}
- --leader-elect
- --leader-election-id=dynamo.nvidia.com
- --leader-election-id={{ default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id }}
- --leader-election-namespace={{ default "kube-system" .Values.controllerManager.leaderElection.namespace }}
{{- end }}
{{- if .Values.natsAddr }}
- --natsAddr={{ .Values.natsAddr }}
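With empty `leaderElection` values, the `default` fallbacks above mean a cluster-wide release renders the manager's leader election arguments roughly as follows (a sketch of the rendered output, assuming `namespaceRestriction.enabled=false`):

```yaml
args:
  - --leader-elect
  - --leader-election-id=dynamo.nvidia.com
  - --leader-election-namespace=kube-system
```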
@@ -12,8 +12,14 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
{{/*
Only create leader election RBAC when leader election is enabled.
When namespaceRestriction.enabled=true, leader election is disabled (--leader-elect=false),
so these permissions are not needed.
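A ClusterRole/ClusterRoleBinding is used (rather than a namespaced Role) because the lease namespace
defaults to kube-system, which is typically outside the release namespace.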
*/}}
{{- if not .Values.namespaceRestriction.enabled }}
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
kind: ClusterRole
metadata:
name: {{ include "dynamo-operator.fullname" . }}-leader-election-role
labels:
@@ -55,7 +61,7 @@ rules:
- patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
kind: ClusterRoleBinding
metadata:
name: {{ include "dynamo-operator.fullname" . }}-leader-election-rolebinding
labels:
@@ -65,9 +71,10 @@ metadata:
{{- include "dynamo-operator.labels" . | nindent 4 }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
kind: ClusterRole
name: '{{ include "dynamo-operator.fullname" . }}-leader-election-role'
subjects:
- kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
namespace: '{{ .Release.Namespace }}'
namespace: '{{ .Release.Namespace }}'
{{- end }}
13 changes: 13 additions & 0 deletions deploy/cloud/helm/platform/components/operator/values.yaml
@@ -27,6 +27,19 @@ namespaceRestriction:
targetNamespace: ""
controllerManager:
tolerations: []

# Leader election configuration
leaderElection:
# Leader election ID for cluster-wide coordination
# WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain
# Different IDs would allow multiple leaders simultaneously
id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators)

# Namespace for leader election leases (only used in cluster-wide mode)
# If empty, defaults to kube-system for cluster-wide coordination
# All cluster-wide operators should use the SAME namespace for proper leader election
namespace: ""

kubeRbacProxy:
args:
- --secure-listen-address=0.0.0.0:8443